unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

New Data: American Memory #37

Closed GPHemsley closed 10 years ago

GPHemsley commented 11 years ago

American Memory has dates, titles, keywords, and committee information for bills from the 6th through 42nd Congresses:

http://memory.loc.gov/ammem/amlaw/lwhblink.html
http://memory.loc.gov/ammem/amlaw/lwsblink.html
http://memory.loc.gov/ammem/amlaw/lwsrlink.html

It also has images of the bills, including multiple versions of the same bill in some cases (i.e. due to changes made to the bill).

See also:
http://memory.loc.gov/ammem/amlaw/lawhome.html
http://memory.loc.gov/ammem/amlaw/lwsp.html
http://memory.loc.gov/ammem/amlaw/lwss.html

GPHemsley commented 11 years ago

This might also be of use: http://avalon.law.yale.edu/subject_menus/statutes.asp

GPHemsley commented 11 years ago

American Memory also has captured the Congressional Record for the 43rd Congress (1873–1875): http://memory.loc.gov/ammem/amlaw/lwcrlink.html

It includes a text index, which gives bill numbers and titles, along with page numbers for various actions:
http://memory.loc.gov/cgi-bin/query/r?ammem/hlaw:@field(DOCID+@lit(cr002T000)):
http://memory.loc.gov/cgi-bin/query/r?ammem/hlaw:@field(DOCID+@lit(cr004T000)):

GPHemsley commented 11 years ago

The Congressional Globe also has some of the same for the 23rd through the 42nd (1833–1873), though not uniformly: http://memory.loc.gov/ammem/amlaw/lwcglink.html

GPHemsley commented 11 years ago

Constitution.org has scans of all Statutes at Large, including those omitted by American Memory and GPO:

http://constitution.org/uslaw/sal/sal.htm

Unfortunately, there is only one (huge) PDF per volume, but they have been OCR'd.

GPHemsley commented 11 years ago

Early Statutes (pre-1927) are also available OCR'd here: http://homepages.uc.edu/~armstrty/statutes.html

DanielSchuman commented 11 years ago

I have some additional resources, some of which you've found already, on the statutes at large.

This blogpost lists the latest efforts for GPO to digitize the statutes at large, with new resources available from the 1950s forward. It pulls together tons of info in one central place.

Also, Sunlight has published a mirror to the constitution.org's website, where we make their copies of the statutes available via Amazon services (and thus easier to download). Mirror.

Also, I've got a long-running wiki with tons of resources, but alas the host is down at the moment. I'll add it when I have a chance. Feel free to ping me, by the way. I've been looking at this for a couple of years now.

GPHemsley commented 11 years ago

We've already incorporated the data provided by GPO, unless more has been added since February?

I know I'd already read that article when it was published, but I guess I missed that link to Constitution.org at the time. (I found the link on Wikipedia today.)

Right now, the GPO provides near the minimum of what we need for a resource to be useful: separate PDFs for each bill, and machine readable metadata to describe what it's about. These other resources would need to be heavily processed in order to be helpful.

DanielSchuman commented 11 years ago

I think the February update is the latest, so there's no more there. As you wrote up-thread (or in a different thread), it may be possible to mine the material from American Memory, even if the images are awful TIFFs (http://memory.loc.gov/ammem/amlaw/lwsllink.html). We can also talk to the folks at the OLRC and Leg Counsel to see if they have some material they could provide. Would the metadata be useful?

GPHemsley commented 11 years ago

It would certainly be useful for all users of American Memory (which includes, e.g., Wikipedia) if there was better metadata attached to the TIFFs, even if the TIFFs themselves are not OCR'd. It would be equally helpful to encourage the GPO to publish more of the old volumes, with a similar amount of metadata.

JoshData commented 11 years ago

So the question is, what's doable next?

Btw, I have a machine running now, slowly mirroring all of the statute PDFs (granules).

DanielSchuman commented 11 years ago

(1) I've been encouraging GPO to digitize the rest of the statutes at large; and asking the Joint Committee on Printing to make it happen. Don't have a lot of hope for more digitization efforts from them on this while the sequester is going. (Or even without it.)

(2) We could ask some of our librarian friends if they have more information about the statutes at large. Do folks want a conversion between the Statute at Large citation and the bill number? The name of the Statute at Large? I guess I'm not aware of what metadata we already have prior to 1952, so I'm not sure what to ask for.

JoshData commented 11 years ago

I see two paths from here.

A) For the 1st-42nd Congresses, the best bill text would be the American Memory TIFF images. Maybe we can get whatever they have on that. The hard part is matching TIFFs to bill numbers (forget even OCRing for now). Might be possible to scrape from their website. Might be easier if we had their files in bulk.

B) From the 43rd to the 102nd Congresses, the only source of (partial) bill text is the Statutes. From the 82nd Congress forward, the metadata on FDSys is enough to get the text (more or less), and I'm working on downloading those PDFs now. So the non-GPO sources of the Statutes for 43rd-81st Congresses are the starting place. The hard part there is the same as AmMem, matching up page numbers to bill numbers. I think the best solution is to Turk this out. Gordon, want to learn Amazon Mechanical Turk?

JoshData commented 11 years ago

Or, we could do our own scanning effort and just do everything from scratch so we get every bill. I wonder if there's an academic institution that would want to do the scanning work if they got a grant to do it?

DanielSchuman commented 11 years ago

Speaking of, the Internet Archive apparently has a scanned version of the SAL as well. It's here. That link will get you some random other stuff, but it does appear to cover the early versions.

JoshData commented 11 years ago

FYI:

1) I've mirrored the GPO FDSys Statute at Large collection in the public AWS snapshot snap-4e4d0908. The only real value is if you want individual-statute PDFs, which is the only way to get statute-by-statute text. It's 30 GB.

2) Thanks to @DanielSchuman, I'm now attempting to mirror the entire American Memory collection (with previously hidden metadata). It includes a LOT of collections, and it is well over 30 GB [the point at which I ran out of disk space].

JoshData commented 11 years ago

Here's a copy of the American Memory Century of Lawmaking's complete set of metadata files:

http://www.govtrack.us/data/misc/am_mem_law_metadata.tgz (187 MB)

Some of the collections include transcribed text, but not for bills.

I'm still working on mirroring the images etc. for posterity.

GPHemsley commented 11 years ago

I've begun a script to extract this metadata to be more useful:

https://github.com/govtrack/american-memory

konklone commented 11 years ago

This is awesome. Could I convince you guys to move it into unitedstates?

JoshData commented 11 years ago

I have big plans for putting AmMem in unitedstates, don't worry. :)

GPHemsley commented 11 years ago

FYI: The script now outputs data in a format more closely resembling that used by unitedstates/congress.

GPHemsley commented 11 years ago

The American Memory metadata (at least for llhb, llsb, and llsr) has been mirrored here:

https://github.com/govtrack/american-memory

The txt file is the one that appears on the LOC's website. The information will be curated and corrected in the JSON files, which can now be used for easy data extraction (the fields are labeled). The updated JSON files will in turn be mirrored in the CSV files, which are intended to be (mostly) backwards-compatible with the original LOC files.

The process_metadata.py script handles the conversion process when the JSON or CSV files are changed.

And the bills.py script creates congress-compatible JSON files for each bill.
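For anyone wondering what the JSON-to-CSV mirroring step amounts to, here is a minimal sketch. It is not the actual process_metadata.py; the field names (`congress`, `bill_number`, `title`) and the column order are hypothetical placeholders, and the real LOC layout has its own columns.

```python
import csv
import io
import json

# Hypothetical column order; the real LOC files define their own layout.
COLUMNS = ["congress", "bill_number", "title"]


def json_to_csv(json_text: str, columns=COLUMNS) -> str:
    """Write a JSON list of labeled records as fully quoted CSV rows."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    # QUOTE_ALL makes the writer quote every field and double embedded quotes.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for record in records:
        # Missing fields become empty strings so every row has the same width.
        writer.writerow([record.get(name, "") for name in columns])
    return buffer.getvalue()
```

Letting the `csv` module do the quoting is the point of the sketch: embedded quotes in titles come out escaped, so the mirrored files stay parsable.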

Naturally, all of this is still a work in progress.

GPHemsley commented 11 years ago

I've added documentation, if anyone is interested in what it all means:

https://github.com/govtrack/american-memory/blob/master/README.md

GPHemsley commented 11 years ago

This is really shaping up, for those wondering.

The bills.py script now outputs some additional granularity in the status of a bill as it stood in the version that is captured, and it extracts titles for a good number of bills. (Amendments and alternate phrasing sometimes make this difficult. I also haven't yet attempted to extract sponsorship information, which is marginally available.)

The script also outputs some useful JSON files that list all the committees that appear with bills across all the Congresses available, as well as a calendar of events for each Congress. (I haven't yet documented these.)

The data itself has proven to be more riddled with errors than previously realized, however, and these errors have been introduced at various levels of digitization. Usually, I come across them by accident in a particular bill here or there, but I expect a lot of them to be systemic.

That being said, I've processed and presented the American Memory metadata in such a way that it is still an improvement over the existing files. Most importantly, the existing files are not actually parsable as CSV in many cases, because of unescaped quotes, and they were in an obscure file encoding that is ASCII-compatible enough to go unnoticed by a passive user of the data. Both issues have been corrected in the GovTrack mirror of the data (the files are now UTF-8), and some easily fixed typos have been fixed.
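To illustrate the two kinds of repair described above, here is a rough sketch, not the code used in the GovTrack mirror. The legacy encoding is not named in this thread, so `cp1252` is only a stand-in assumption. The quote repair is a heuristic that treats a quote as a field boundary only when it sits at the start or end of the line or next to a comma, and it assumes the input contains no quotes that are already escaped.

```python
import csv

# Assumption: the thread does not name the legacy encoding; cp1252 is a stand-in.
ASSUMED_SOURCE_ENCODING = "cp1252"


def to_utf8(raw: bytes, source_encoding: str = ASSUMED_SOURCE_ENCODING) -> bytes:
    """Decode bytes from the assumed legacy encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def escape_inner_quotes(line: str) -> str:
    """Double every quote that is not a field boundary (heuristic)."""
    out = []
    last = len(line) - 1
    for i, ch in enumerate(line):
        if ch == '"':
            opens = i == 0 or line[i - 1] == ","
            closes = i == last or line[i + 1] == ","
            if not (opens or closes):
                out.append('""')  # CSV escapes a quote by doubling it
                continue
        out.append(ch)
    return "".join(out)


# Usage: repair one line, then parse it with the standard csv module.
broken = '"H.R. 1","A bill "for relief" of John Doe","1820"'
fields = next(csv.reader([escape_inner_quotes(broken)]))
```

The heuristic will misjudge a stray quote that happens to sit next to a comma inside a field, so lines like that would still need correcting by hand.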

There is still a lot of work to be done, but the data is in a state where it can be tested in a non-production environment, I think.

JoshData commented 11 years ago

Very exciting!

JoshData commented 10 years ago

We actually finished this a while ago, so closing the issue. See https://github.com/unitedstates/am_mem_law/tree/master/bills.