sunlightlabs / crosslaws

Collection of code for parsing information related to the Code of Federal Regulations, the US Code, the US Statutes at Large, etc.

Actual table parser #21

Closed: wpli closed this issue 10 years ago

wpli commented 10 years ago

table3_scraper.py downloads the HTML files corresponding to each of the public laws.

ParseTable.py extracts the section and subsection references to the U.S. Code from those pages.

I've also included 111_148.htm as an example file.
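
To make the two-step pipeline concrete, here is a minimal sketch, assuming the httplib2 + BeautifulSoup + lxml stack in requirements.txt. The helper names (fetch_page, download_public_laws, extract_uscode_references) and the reference regex are illustrative stand-ins, not the actual code in these scripts.

import re
from urllib.parse import urljoin

import httplib2
from bs4 import BeautifulSoup

INDEX_URL = "http://uscode.house.gov/table3/table3years.htm"  # index the scraper starts from

def fetch_page(url):
    # Fetch a page and fail loudly on anything but a 200, as the scraper does.
    response, content = httplib2.Http(".cache").request(url, "GET")
    if response.status != 200:
        raise IOError("HTTP %d for %s" % (response.status, url))
    return content

def download_public_laws(index_url=INDEX_URL):
    # Walk the index page and yield (filename, html) for each linked law page.
    soup = BeautifulSoup(fetch_page(index_url), "lxml")
    for link in soup.find_all("a", href=True):
        if link["href"].endswith(".htm"):
            url = urljoin(index_url, link["href"])
            yield url.split("/")[-1], fetch_page(url)

def extract_uscode_references(html):
    # Pull "42 USC 18001"-style section references out of a law's HTML.
    # This regex is a hypothetical stand-in for ParseTable.py's real logic.
    text = BeautifulSoup(html, "lxml").get_text(" ")
    return re.findall(r"\b\d+\s+U\.?S\.?C\.?\s+\d+[\w()-]*", text)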

konklone commented 10 years ago

This is so cool! Thanks for coming out of nowhere and tackling this -- it's such an important dataset, and so hard to work with.

So this repository has been pretty inactive for the last couple of years, and nothing here is directly used in production anywhere I'm aware of. I think it'd be a better fit in https://github.com/unitedstates/uscode, which is a parser for the US Code's own structure. It's used in production by Sunlight and by GovTrack in different places. This would fit right in with that repo's goal, which is to data-ify the US Code and support any project that builds on it.

Separately, I tried running the scraper, and had two issues. One is that I needed to figure out the requirements myself, and install them. Mind adding a requirements.txt file with simplejson, ipdb, BeautifulSoup4, lxml, and httplib2?
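
For reference, the requested requirements.txt would just list those five packages, one per line (exact version pins, if any, left to the author; beautifulsoup4 is the pip name for BeautifulSoup4):

simplejson
ipdb
beautifulsoup4
lxml
httplib2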

Lastly, after installing all of that, I ran python table3_scraper.py under Python 2.7.6 and got this error:

1950_2.htm
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in <module>()
    171 
    172 if __name__ == '__main__':
--> 173         main()

/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in main()
    156 def main():
    157         dataset = []
--> 158         x = add_release("http://uscode.house.gov/table3/table3years.htm") #Could also use "/alltable3statutesatlargevolumes.html"
    159         for filename, html_string in x:
    160                 final_pagename = filename.split('/')[-1]

/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in add_release(url)
     78             sys.exit(1) #bomb out, non-zero return indicates error
     79         #print content
---> 80         return mainscraper(content)
     81 
     82 def add_subrelease(url): #function to grab sub page data

/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in mainscraper(content)
     41                                         #print unitext, url
     42                                         #releases += [(unitext, url)]
---> 43                                         subreleases += add_subrelease(url)
     44                                         #return subreleases
     45                                 else:

/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in add_subrelease(url)
     87             sys.exit(1) #bomb out, non-zero return indicates error
     88         #print content
---> 89         return subscraper(content)
     90 
     91 def add_subsubrelease(url): #function to grab sub, sub page data

/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in subscraper(content)
     64                                 page_content = add_subsubrelease(url)
     65                                 #ipdb.set_trace()
---> 66                                 parsed_content = _parse_legislative_changes_page( page_content )
     67                                 parsed_content[ 'URL' ] = url
     68                                 subsubreleases.append( parsed_content )

/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in _parse_legislative_changes_page(page)
    129         assert len( caption ) == 6
    130 
--> 131         _process_caption_span( 'congress', caption[0], caption_dict )
    132         _process_caption_span( 'statutesatlargevolume', caption[1], caption_dict )
    133         _process_caption_span( 'textdate', caption[2], caption_dict )

/home/eric/unitedstates/crosslaws/table3/table3_scraper.py in _process_caption_span(expected_class, caption_span, caption_dict)
    100 
    101 def _process_caption_span( expected_class, caption_span, caption_dict ):
--> 102     assert caption_span.get('class') == expected_class
    103 
    104     text_val = " ".join ( caption_span.itertext() )

AssertionError: 

wpli commented 10 years ago

Thanks for the feedback!

I've added a requirements.txt file. I removed some unnecessary dependencies like ipdb.

I also fixed the above bug, and added a README.
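
The failing check compares each caption span's class attribute against a hard-coded expected value, so any page whose caption markup differs trips the bare assert. One way to harden _process_caption_span against that class of failure (a sketch only, not necessarily the committed fix; caption_span is an lxml element, per the itertext() call in the traceback):

def _process_caption_span(expected_class, caption_span, caption_dict):
    # Replace the bare assert with an error that names the offending span,
    # so pages with unexpected caption markup are easy to diagnose.
    actual_class = caption_span.get('class')
    if actual_class != expected_class:
        raise ValueError("expected caption class %r, got %r"
                         % (expected_class, actual_class))
    caption_dict[expected_class] = " ".join(caption_span.itertext()).strip()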

wpli commented 10 years ago

I'd also be happy to get your thoughts on where in the unitedstates/uscode repository it should go.

konklone commented 10 years ago

Nice! And yeah, a table3/ directory at the top level of https://github.com/unitedstates/uscode should be just fine. Mind if I merge this, and then move it over?

wpli commented 10 years ago

Sure, please go ahead!