unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

New Data: Statutes at Large #34

Closed JoshData closed 11 years ago

JoshData commented 11 years ago

Just a heads-up that Gordon and I are working on pulling info out of the new Statutes at Large MODS files.

konklone commented 11 years ago

Ooh - using fdsys.py?

Curious how you intend to use them?

GPHemsley commented 11 years ago

Josh can probably explain better, but the plan is to extract as much of the standard metadata from them as possible.

GPHemsley commented 11 years ago

I've got a script now that extracts the basic Congress and bill numbers for all bills that became law. I'm going to attempt to squeeze some more data out of the MODS files in the coming days.
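A rough sketch of that kind of extraction, assuming hypothetical element and attribute names (the real FDsys MODS schema may differ):

```python
import xml.etree.ElementTree as ET

# Sample MODS fragment for illustration only: the <extension>/<bill>
# element and attribute names here are assumptions, not the actual
# FDsys MODS schema.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <extension>
    <congress>82</congress>
    <bill congress="82" type="hr" number="3"/>
  </extension>
</mods>"""

NS = {"mods": "http://www.loc.gov/mods/v3"}

def extract_bill_ids(mods_xml):
    """Return (congress, bill_type, number) for each bill cited in the MODS."""
    root = ET.fromstring(mods_xml)
    return [(b.get("congress"), b.get("type"), b.get("number"))
            for b in root.findall(".//mods:extension/mods:bill", NS)]

print(extract_bill_ids(SAMPLE_MODS))  # [('82', 'hr', '3')]
```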

I'm also thinking that at some point the bill scraping code should be separated from the bill format/metadata files, since there will be multiple sources for bill data but only one way to output for each format (json/xml). I may tackle that at the same time.

JoshData commented 11 years ago

@konklone: My primary goal is to turn it into bills on GovTrack (enacted only, of course). So we'll be generating output that looks similar to the bill output. Plus whatever other interesting metadata is in the MODS files. And something with the text layer in the PDF.

GPHemsley commented 11 years ago

I think the goal should also be to be able to do something like this and have it Just Work™:

./run bills --congress=82

(Even if it doesn't get all the bills in the 82nd Congress at first.)

GPHemsley commented 11 years ago

I've got a working version of the script here:

https://github.com/GPHemsley/congress/blob/historical-bills-1951/tasks/statutes.py

You can run it by running a command like this:

./run statutes --path=STATUTE/1951/STATUTE-65 --govtrack

(After you run the fdsys task to put the fdsys files in place.)

GPHemsley commented 11 years ago

So Josh pulled this into master, with a bunch of documentation, if you want to use it. (There may be more to do still, so I don't know if this should be closed just yet.)

konklone commented 11 years ago

Ah ha! I get it now (looked at the code). So this is intended to fill in gaps from 1951 to 1972. I like this a lot. I have a couple thoughts (surprise), related to keeping the system sane as we expand into more scripts.

Since the data we get from THOMAS is uniformly superior to the data from the Statutes (right? is there anything unique to the Statutes collection?), the statutes script should probably default to an end year of 1972. This could become moot if we make the bills script default to a Statutes-driven approach for years before 1973.

It'd be great to keep this down to running one command, instead of two, using fdsys.py as a support library that the statutes.py script uses (rather than running it as a script as a prerequisite). You can see how I did this in bill_versions.py, to generate JSON files for each version of bill text.
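A minimal sketch of that one-command idea — ensure_mirrored() and its download hook are hypothetical stand-ins for fdsys's real mirroring logic; only the data/fdsys cache layout mirrors the actual tasks:

```python
import os

# Where the fdsys task stores mirrored packages locally.
FDSYS_CACHE = os.path.join("data", "fdsys")

def local_path(collection_path):
    """Local cache location for an FDsys package, e.g. STATUTE/1951/STATUTE-65."""
    return os.path.join(FDSYS_CACHE, collection_path)

def ensure_mirrored(collection_path, download=lambda p: None):
    """Fetch the package only if it isn't already cached, then return its path.

    The download callable is a placeholder for fdsys's mirroring logic,
    so statutes.py could call this instead of requiring a prior
    './run fdsys' step."""
    path = local_path(collection_path)
    if not os.path.exists(path):
        download(collection_path)
    return path

print(ensure_mirrored("STATUTE/1951/STATUTE-65"))
```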

konklone commented 11 years ago

Separate thought - how much work would have to go into using scanned copies of the Statutes at Large for years prior to 1951? Since there's so much value just in getting the metadata, and scanning accurate text is not a concern, would it be worth it to engage in a manual (and one-time!) metadata collection effort using copies of the Statutes going back into antiquity, if we had them scanned?

GPHemsley commented 11 years ago

I think it would be good to separate the code that is related to all bills from the code related to a single source of bills. And on top of that, it might be good to have the scrapers be separate from the parsers. Then the source-specific scripts could import the generic processing and output methods (as I do in statutes.py). Consolidating fdsys might be a part of that.

The Statutes data turns gaping holes into slightly-less-gaping holes by reverse-engineering (sort of) the metadata related to bills that have become law. The metadata is provided by the LOC somewhat accidentally, as a byproduct of being needed for archiving. As it stands, this does not get any information about the many bills that were never passed/enacted during the period from 1951 to 1972, and the data that it does get sometimes suffers from poor OCR. So yes, the Statutes data is essentially only a fallback for the cases when THOMAS data is not available.

Prior to 1951, Statutes data is not even available for enacted bills (AFAIK—I could be wrong), at least not from FDsys. However, the LOC also has information available on its American Memory site, such as here: http://memory.loc.gov/ammem/amlaw/lwhbsb.html, which might be worth looking into. Some parts are in text form, which would be (relatively) easy to scrape, while others are in GIF/TIFF image format, which would be a little more difficult. However, this even provides different versions of a given bill, for the Congresses that are available.

But if by "manual" metadata collection, you mean human-read and -input, I would definitely advise against that. There is just way too much metadata to collect.

GPHemsley commented 11 years ago

It might be worth noting that they don't appear to have begun using codes like "H.R." to refer to bills until the 9th Congress (or later, depending on what you use as a reference).

konklone commented 11 years ago

I do mean human-read and -input, but it'd be one-time only. It seems like it might be a worthwhile project, if the metadata that you've gathered from GPO's work is useful enough to build around. There is no fully official set of scanned Statutes before 1951 that I'm aware of, but I've definitely seen official-looking unofficial sets of scanned Statutes PDFs going back a long way. Whether or not to consider them official enough for use would be an interesting question all on its own.

I don't think it's worth tearing up the way we've done bills in a big way yet. Right now, we have a solid scraper for bill metadata from 1973 to now, a scraper for useful-if-holey data from 1951-1972, and a downloader for bill text from 1989 to the present (bill_versions.py). They all do very different things, they're all straightforward to use, and there's no friction or wasted effort yet. My inclination is usually to refactor reactively rather than proactively, and each scraper being relatively autonomous and separate allows us all to experiment more easily. I like how things are working.


GPHemsley commented 11 years ago

If I'm understanding your intentions correctly, you're talking hundreds of thousands—if not millions—of bills, aren't you? I think a much more worthwhile project in the short term would be scraping American Memory. That has text data from the 6th through the 42nd Congresses which would be much easier to parse automatically.

Regarding refactoring, my original reason for suggesting it was that I needed to import bill_info.py into statutes.py in order to get the generic output methods, but along with them came the THOMAS-related methods that I didn't need. At the very least, I think those two should be separated.

konklone commented 11 years ago

I definitely don't want to micromanage anything here - if you think something can be improved, improve it. I would just be careful about adding any burden (requiring anyone writing a new script to know more about how other scripts work) just to make things feel cleaner.

One way to make things better might be this: utils.py is getting pretty weighty, and is a mix of project-meta helpers and congress-meta helpers. Making a congress.py file and moving bill_info.output_bill, utils.current_congress, utils.split_bill_id, etc. into it seems like a good idea - it separates them like you describe, while keeping the scripts following the same flat pattern of "I only depend on myself, plus there are a couple pools of utility methods I can dip into". I could see fdsys.py becoming its own pool of methods, and those files being put in their own directory.
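A sketch of what such a congress.py pool of helpers might look like; the function bodies here are simplified stand-ins for illustration, not the actual utils.py/bill_info.py implementations:

```python
import datetime

def current_congress(year=None):
    """Congress number for a given year (each Congress spans two years,
    starting from 1789)."""
    year = year or datetime.datetime.now().year
    return (year - 1789) // 2 + 1

def split_bill_id(bill_id):
    """Split a bill ID like 'hr3-82' into (bill_type, number, congress)."""
    bill_type_number, congress = bill_id.split("-")
    i = 0
    while i < len(bill_type_number) and not bill_type_number[i].isdigit():
        i += 1
    return (bill_type_number[:i], bill_type_number[i:], congress)
```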

Again, I do not want to nitpick, this is all going to work. We're hitting an awesome stride of growth in this project, and it probably does merit a bit of reorganization. I just think it will be easier in the long run for all of us if this all stays flat and simple and mostly non-systematized.


GPHemsley commented 11 years ago

Speaking from my experience in writing statutes.py, I think splitting things out would make it easier, not harder, to write new scripts. I spent most of my time trying to track down all the various *_for() methods and what they meant and did, to see which ones I needed or could use. If all the generic ones were in their own file, it would have been somewhat easier for me to understand what was going on, I think.

For the record, these are the bill_info methods I used:

And there are probably others that I just didn't need but could be split out alongside them.

Speaking of utils, we could probably use some consolidating of the congress and congress-legislators utils.py files, perhaps as a separate project/repo. I had to do a lot of hacky things for legacy conversion to make things work together happily.

But yeah, I'm not attempting to make any crazy hierarchies here. Just splitting the pie up into smaller slices so I can pick only exactly what I need (while also making sure I can actually get what I need).

GPHemsley commented 11 years ago

Pull request #39 is an important fix for making sure you get the right correspondence between bill number and bill text.

JoshData commented 11 years ago

Let's not refactor yet. The next thing is pulling out bill text from 1951-1993 (there are fewer years of bill text on GPO than bill metadata on THOMAS).

GPHemsley commented 11 years ago

I've tied in the Statute PDFs in pull request #41, so now you can actually see the bill/law associated with the often obscure titles.

Of course, pulling the text out of those PDFs is going to be quite an adventure unto itself. (Perhaps even one reserved for a separate issue.)

GPHemsley commented 11 years ago

Would it be appropriate to include a reference such as "STATUTE-72-Pg3" alongside the action of enactment? The references field seems geared specifically towards the Congressional Record, but I think it would be good to open it up a little more to allow for other sources of information.

konklone commented 11 years ago

I'd just add a new field. Even though it's called something general like "references", I think it should remain only CR refs, to keep assumptions when parsing that field simple. "source" might make sense.


GPHemsley commented 11 years ago

Yeah, you're probably right. There will always be a record about it in the Congressional Record (or equivalent), but we might not always get the information from there. How about I make it a list named "sources", in case we ever have to combine sources to make a single action entry?

konklone commented 11 years ago

Sure.


GPHemsley commented 11 years ago

Of course that would leave the citation format to be determined. Should each source have a code for the general document/organization and then a specific citation within it, or should it just be { ... "sources": [ "STATUTE-72-Pg3" ] ... }?

konklone commented 11 years ago

It's not too big a deal, since we can always regenerate it later, so how about just a URL to the original document for now? The other option is a full dict that's like [{source: "statutes", volume: "72", page: "3"}], which is also fine.


JoshData commented 11 years ago

Both would be really helpful. I was going to add a source_url to all of our task output anyway, pointing to the page closest to where the information was scraped, suitable for "see more" type links. I'd like to see source_url added to all of the tasks, and, for the Statutes-generated files, just something special for that, i.e. statute_citation: { "volume": 72, "page": 3 }, which would match the "72 Stat 3" type citations people actually use.

konklone commented 11 years ago

So the sources field would look like:

[{
  "source": "statutes",
  "source_url": "...",
  "volume": 72,
  "page": 3
}]
GPHemsley commented 11 years ago

What URL should I use for the source_url? MODS? PDF?

Also, I think it would be good to also include the access ID ("STATUTE-72-Pg3"), since that's the primary identifier of a particular statute at the GPO. (When multiple statutes appear on the same page, they get different access IDs; "72 Stat. 3" could be ambiguous, though I could also include a field containing the page position value.)

JoshData commented 11 years ago

I'd like a URL I can link to, so these sort of pages would be good: http://www.gpo.gov/fdsys/granule/STATUTE-118/STATUTE-118-Pg493/content-detail.html

I'm not sure the accessID is useful without also the package ID it's contained in (STATUTE-72). Feel free to include one or both.

The citation is ambiguous as a bill identifier, but it's what lawyers use sometimes, so it's useful.

konklone commented 11 years ago

I think the "source_url" field should probably literally be the URL that was used to get the data being output, for provenance's sake. But you could add other URLs - and like you said, you can use the GPO identifier to construct other kinds of detail URLs client-side, too.


GPHemsley commented 11 years ago

I currently have it outputting this:

  "sources": [
    {
      "access_id": "STATUTE-71-PgB6", 
      "page": "B6", 
      "position": "1", 
      "source": "statute", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-71/STATUTE-71-PgB6/content-detail.html", 
      "volume": "71"
    }
  ], 
JoshData commented 11 years ago

@konklone If that's different than what I was suggesting, then I'm just going to ask for yet another field for a human-readable page....

JoshData commented 11 years ago

Gordon- Looks great to me.

GPHemsley commented 11 years ago

Updated:

  "sources": [
    {
      "access_id": "STATUTE-73-Pg14-2", 
      "package_id": "STATUTE-73", 
      "page": "14", 
      "position": "2", 
      "source": "statutes", 
      "source_url": "http://www.gpo.gov/fdsys/granule/STATUTE-73/STATUTE-73-Pg14-2/content-detail.html", 
      "volume": "73"
    }
  ], 
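For what it's worth, those fields are enough to reconstruct both the "73 Stat. 14"-style citation lawyers use and the GPO detail URL client-side. A small sketch using the field names from the JSON above:

```python
def stat_citation(source):
    """Build the human-readable citation, e.g. '73 Stat. 14'."""
    return "%s Stat. %s" % (source["volume"], source["page"])

def gpo_url(source):
    """Build the GPO content-detail URL from the package and access IDs."""
    return ("http://www.gpo.gov/fdsys/granule/%s/%s/content-detail.html"
            % (source["package_id"], source["access_id"]))

# A sources entry matching the output shown above.
src = {
    "access_id": "STATUTE-73-Pg14-2",
    "package_id": "STATUTE-73",
    "page": "14",
    "position": "2",
    "source": "statutes",
    "volume": "73",
}

print(stat_citation(src))  # 73 Stat. 14
print(gpo_url(src))
```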
GPHemsley commented 11 years ago

I talk funny in my commit summaries. Pull request #43.

JoshData commented 11 years ago

I've got this new bill data from 1951-1972 up on GovTrack now (http://www.govtrack.us/congress/bills/browse). Nice work, Gordon.

For the text, I'm thinking we extract the text layer of the PDF into bills/x/xddd/text-versions/enr/document.txt. (That's where the fdsys --store command puts current bill text.) Thoughts?
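One way that extraction could be sketched, assuming the poppler pdftotext binary is available and a data/<congress>/bills/... prefix on the path Josh describes (the exact prefix is an assumption):

```python
import os
import subprocess

def output_path(congress, bill_type, number):
    # bills/x/xddd/text-versions/enr/document.txt, per the comment above;
    # the data/<congress> prefix is an assumption.
    return os.path.join("data", str(congress), "bills", bill_type,
                        "%s%s" % (bill_type, number),
                        "text-versions", "enr", "document.txt")

def extract_text_layer(pdf_path, congress, bill_type, number):
    """Pull the PDF's embedded text layer out with pdftotext."""
    out = output_path(congress, bill_type, number)
    os.makedirs(os.path.dirname(out), exist_ok=True)
    subprocess.check_call(["pdftotext", "-layout", pdf_path, out])
    return out
```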

konklone commented 11 years ago

That makes sense to me. bill_versions.py is near-identical, putting a file at bills/x/xddd/text-versions/enr.json. I'll change it to be enr/data.json instead.


GPHemsley commented 11 years ago

@tauberer It looks like you missed 1951–1957 (82–84). Also, you might want to make sure that the 85–88 files have been generated by the latest version of all files/scripts involved.

konklone commented 11 years ago

This looks done enough to close. Re-open if I'm wrong, of course.