Closed by wpli 10 years ago
This is so cool! Thanks for coming out of nowhere and tackling this -- it's such an important dataset, and so hard to work with.
So this repository has been pretty inactive the last couple years, and nothing here is directly used in production anywhere I'm aware of. I think it'd be a better fit in https://github.com/unitedstates/uscode, which is a parser for the US Code's own structure. It's used in production by Sunlight and by GovTrack in different places. This would fit right in with that repo's goal, which is to data-ify the US Code and empower any projects that are relevant to it.
Separately, I tried running the scraper, and had two issues. One is that I needed to figure out the requirements myself, and install them. Mind adding a requirements.txt file with simplejson, ipdb, BeautifulSoup4, lxml, and httplib2?
Lastly, when I installed all of that, I ran python table3_scraper.py, using Python 2.7.6, and got this error:
1950_2.htm
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in <module>()
171
172 if __name__ == '__main__':
--> 173 main()
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in main()
156 def main():
157 dataset = []
--> 158 x = add_release("http://uscode.house.gov/table3/table3years.htm") #Could also use "/alltable3statutesatlargevolumes.html"
159 for filename, html_string in x:
160 final_pagename = filename.split('/')[-1]
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in add_release(url)
78 sys.exit(1) #bomb out, non-zero return indicates error
79 #print content
---> 80 return mainscraper(content)
81
82 def add_subrelease(url): #function to grab sub page data
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in mainscraper(content)
41 #print unitext, url
42 #releases += [(unitext, url)]
---> 43 subreleases += add_subrelease(url)
44 #return subreleases
45 else:
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in add_subrelease(url)
87 sys.exit(1) #bomb out, non-zero return indicates error
88 #print content
---> 89 return subscraper(content)
90
91 def add_subsubrelease(url): #function to grab sub, sub page data
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in subscraper(content)
64 page_content = add_subsubrelease(url)
65 #ipdb.set_trace()
---> 66 parsed_content = _parse_legislative_changes_page( page_content )
67 parsed_content[ 'URL' ] = url
68 subsubreleases.append( parsed_content )
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in _parse_legislative_changes_page(page)
129 assert len( caption ) == 6
130
--> 131 _process_caption_span( 'congress', caption[0], caption_dict )
132 _process_caption_span( 'statutesatlargevolume', caption[1], caption_dict )
133 _process_caption_span( 'textdate', caption[2], caption_dict )
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in _process_caption_span(expected_class, caption_span, caption_dict)
100
101 def _process_caption_span( expected_class, caption_span, caption_dict ):
--> 102 assert caption_span.get('class') == expected_class
103
104 text_val = " ".join ( caption_span.itertext() )
AssertionError:
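One thing that made this hard to diagnose is that the bare assert fails with an empty message, so the traceback doesn't say which class the span actually had. Here's a minimal sketch of a more informative check, not the actual fix. It assumes the helper stores the joined text in caption_dict under the expected class (I'm guessing at that from the call sites), it uses the standard library's ElementTree as a stand-in for whatever parser the scraper uses, and the sample markup is made up rather than copied from uscode.house.gov.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for the scraper's HTML parser


def process_caption_span(expected_class, caption_span, caption_dict):
    """Record the span's text under expected_class, or fail with a useful message."""
    actual_class = caption_span.get('class')
    if actual_class != expected_class:
        # Report both values so a change in the page layout is obvious from the error.
        raise ValueError(
            "expected caption span with class %r, got %r"
            % (expected_class, actual_class)
        )
    # Assumption: the joined text is what ends up in caption_dict.
    caption_dict[expected_class] = " ".join(
        text.strip() for text in caption_span.itertext() if text.strip()
    )


if __name__ == '__main__':
    # Hypothetical markup for illustration only.
    span = ET.fromstring('<span class="congress">81st Congress</span>')
    captions = {}
    process_caption_span('congress', span, captions)
    print(captions)
```

With something like this, a page whose caption spans come in a different order or with different class names would fail with both the expected and the actual class in the message.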
Thanks for the feedback!
I've added a requirements.txt file. I removed some unnecessary dependencies like ipdb.
I also fixed the above bug, and added a README.
I'd also be happy to get your thoughts on where in the unitedstates/uscode repository it should go.
Nice! And yeah, a table3/ dir at the top level of https://github.com/unitedstates/uscode should be just fine. Mind if I merge this, and then move it over?
Sure, please go ahead!
table3_scraper.py downloads the HTML files corresponding to each of the public laws.
ParseTable.py gets the section and subsection references in the U.S. Code.
I've also included 111_148.htm as an example file.