New data: Bill versions, with text links

konklone commented 12 years ago

Sync version information with GPO using their sitemaps for all the years they have it available.

Write a new bill_versions.py task, which deposits a versions.json file for every bill that is available.

This file should contain an array of information on each version, including:

Version code (e.g. "ih", "enr")
Version name ("Introduced in House", "Enrolled")
Date issued
PDF text link
XML text link
Plain text link
GPO landing page link
MODS XML link
PREMIS XML link
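To make the list above concrete, here is a minimal sketch of what one entry in that array could look like. The key names (`version_code`, `issued_on`, `urls`, and so on) and the helper `make_version_entry` are assumptions for illustration only, not a settled schema, and the date is a placeholder.

```python
import json

# Hypothetical link kinds, mirroring the list above.
LINK_KINDS = ("pdf", "xml", "text", "landing", "mods", "premis")


def make_version_entry(code, name, issued_on, links):
    """Build one entry of the versions array for a bill (illustrative only)."""
    return {
        "version_code": code,
        "version_name": name,
        "issued_on": issued_on,
        # Keep a key for every link kind, with None where a link is unknown.
        "urls": {kind: links.get(kind) for kind in LINK_KINDS},
    }


entry = make_version_entry(
    "ih",
    "Introduced in House",
    "2013-01-03",  # placeholder date, not taken from GPO
    {"landing": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr30ih/content-detail.html"},
)
print(json.dumps([entry], indent=2))
```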

Some bills can appear in GPO first, or in THOMAS first, so neither bill_info.py nor bill_versions.py should depend on each other's output in any way.

notthatbreezy commented 11 years ago

Is someone doing this? I'd be willing to do it, but don't want to duplicate someone else's effort.

konklone commented 11 years ago

I was planning on tackling it in January, but not this year. If you'd like to jump on it, feel free. I've written a bulk downloader for bill text through GPO's sitemaps for Sunlight's Congress API already, in Ruby, and was going to port it to Python for this project.

I've also changed my thinking a bit since originally filing the ticket - instead of downloading the actual bill text (which is a large amount of data, and not particularly useful, since GPO has them in bulk at reliable URLs), the task should just fetch basic information for each version of each bill, along with links to bill text for each version.

So what I'd like to see for this project is a bill_versions.py task that iterates through GPO's sitemaps and produces a versions.json file for each bill (it'd sit alongside the data.json file for that bill, if it existed). That versions JSON would basically be an array of hashes, where each hash contains that version's version code and code name (I have a mapping for them), the date the version was issued, and then links to the bill's PDF, XML, and plaintext versions of its text - and also links to that version's landing pages on GPO, and its MODS and PREMIS metadata files.
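A rough sketch of the sitemap-reading half of that, assuming GPO's sitemaps follow the standard sitemaps.org XML format with `loc` and `lastmod` elements. The function name `sitemap_entries` is hypothetical and fetching the sitemap is left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_entries(xml_text):
    """Yield (loc, lastmod) pairs from one sitemap document."""
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            # lastmod is optional in the sitemap protocol, so allow None.
            yield loc.strip(), (lastmod.strip() if lastmod else None)


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.gpo.gov/fdsys/pkg/BILLS-113hr30ih/content-detail.html</loc>
    <lastmod>2013-01-09T05:54:00.347Z</lastmod>
  </url>
</urlset>"""

for loc, lastmod in sitemap_entries(SAMPLE):
    print(loc, lastmod)
```

Each `loc` would then be turned into a per-version record for the matching bill.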

Since I've done this all once before, have a specific idea of how it should be done, and would basically be porting a script over, I am very happy to do it. :) But you're welcome to start it too if you want, and I'll offer help however I can.

JoshData commented 11 years ago

I need the actual PDF, MODS, etc., so a command-line flag to mirror those files locally would help me. (It should be smart and only download files if a hash has changed, or something.) Or I can always write my own mirroring script later, of course.
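One possible shape for the "only download if changed" check. Note this sketch swaps in the sitemap's lastmod value instead of a content hash, since a hash would require fetching the file first; that substitution is an assumption about what "or something" could mean. The `.lastmod` sidecar file and both function names are made up for illustration.

```python
import os


def needs_download(local_path, lastmod):
    """Return True if the file is missing or was saved under a different lastmod."""
    stamp_path = local_path + ".lastmod"
    if not (os.path.exists(local_path) and os.path.exists(stamp_path)):
        return True
    with open(stamp_path) as f:
        return f.read().strip() != lastmod


def record_download(local_path, content, lastmod):
    """Save fetched bytes plus a sidecar file remembering their lastmod."""
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(content)
    with open(local_path + ".lastmod", "w") as f:
        f.write(lastmod)
```

A mirroring loop would call `needs_download` before each request and `record_download` after each successful one.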

There's also HTML from THOMAS, which is harder to scrape. I can post my old Perl code for that if anyone wants to tackle that.

konklone commented 11 years ago

Yeah, I'll need them too. I actually think that might be best implemented as another .py file, separate from bill_versions.py, whose sole goal is to use the data in any present versions.json files to download all the requested material to a local cache. That script would take in command line parameters dictating which kinds of URLs would be downloaded to disk, for instance, and maybe other useful parameters like rate limiting.
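A sketch of what those command line parameters might look like, using argparse. The flag names `--kinds` and `--delay`, their defaults, and the list of kinds are all hypothetical; the project's real option handling may work differently.

```python
import argparse

# Hypothetical names for the kinds of documents that could be mirrored.
URL_KINDS = ("pdf", "xml", "text", "mods", "premis")


def parse_args(argv=None):
    """Parse options for a hypothetical standalone download script."""
    parser = argparse.ArgumentParser(
        description="Download documents listed in versions.json files."
    )
    parser.add_argument(
        "--kinds",
        default="pdf,xml,text",
        help="comma-separated kinds of URLs to download",
    )
    parser.add_argument(
        "--delay",
        type=float,
        default=1.0,
        help="seconds to wait between requests (crude rate limiting)",
    )
    args = parser.parse_args(argv)
    args.kinds = [k for k in args.kinds.split(",") if k]
    unknown = sorted(set(args.kinds) - set(URL_KINDS))
    if unknown:
        parser.error("unknown kinds: " + ", ".join(unknown))
    return args
```

For example, `parse_args(["--kinds", "pdf,mods", "--delay", "0.5"])` would select only PDF and MODS files and pause half a second between requests.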

Is the HTML version on THOMAS useful in any way that's distinct from the value you get from what GPO has? GPO's XML versions of bills come with display stylesheets that render them at least as usefully as THOMAS' HTML (and much more official-looking).

JoshData commented 11 years ago

I got errors when I tried to run the XSLT stylesheet against the bill text XML. The HTML is also a simpler structure which is good for doing comparisons and other analysis. (If they gave us the raw GPO locator codes files.....)

notthatbreezy commented 11 years ago

Ok cool. I was just trying to look for a way to contribute - though I'm not sure I can get to it before January either (currently trying to get StateRepMe ready to launch the first week of next year).

I might leave this to you then, but if there's another task that I could help with or contribute to, let me know. I haven't contributed to many open source projects (trying to change that now), but over the course of graduate school I've gained a lot of experience writing web scrapers for THOMAS and more generally.

konklone commented 11 years ago

@tauberer - Just checked out Congress.gov, apparently they have a plaintext view, their own PDF copy, and an XML version with CSS. They may just be mirroring GPO's data exactly, I don't know. But either way, since THOMAS is closing next year sometime, it may not be a good idea to build display stuff around its HTML structure.

konklone commented 11 years ago

@notthatbreezy Hope my information blast wasn't discouraging, you just got me thinking about things. :) Besides this task, there are also a couple of bugs that need addressing in summary parsing and in handling THOMAS' instability, that you may have already handled in your own scraper.

notthatbreezy commented 11 years ago

@konklone Ha, not at all - but you're right, might be easier to work on some of the bugs to start out with.

I don't think I've had these issues with THOMAS yet, but I'll see what I can do. My web scrapers for THOMAS are actually some legacy code I wrote a couple of years ago, before Capitol Words was around, to grab the Congressional Record and parse it to identify speakers. That stuff is written in Perl, though I've been using Python for the last couple of years to do NLP stuff, so everything I write now is in that for the most part.

JoshData commented 11 years ago

@notthatbreezy Agreed. In the meantime I'll keep using my Perl scripts, and hopefully some day I'll figure out how to use the bill text XML in a useful way.

JoshData commented 11 years ago

I did a first pass in cfaafd3990e92542e532b1bb7679b43465ccc420. This adds a new task called fdsys, which has two parts. The first updates a local cache of the entire FDSys sitemap, which has value beyond bill text. The second part creates text-versions.json files next to each bill's data.json, which look like this:

{ "ih": { "lastmod": "2013-01-09T05:54:00.347Z", "url": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr30ih/content-detail.html" } }

I'll give this a second pass and extend this dict soonish. (Eric, feel free to jump in if you have particular logic you want to add, but I'll get to it too.)
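For reference, the package name in a URL like the one above can be split into its parts with a regular expression. This is only a sketch: the pattern assumes package names always look like `BILLS-<congress><type><number><version>`, and `parse_package_url` is a hypothetical helper, not something in the commit.

```python
import re

# Assumed layout: BILLS-<congress><bill type><number><version code>
PACKAGE_RE = re.compile(r"/pkg/BILLS-(\d+)([a-z]+)(\d+)([a-z]+\d*)/")


def parse_package_url(url):
    """Return the bill and version identified by an FDSys package URL, or None."""
    match = PACKAGE_RE.search(url)
    if match is None:
        return None
    congress, bill_type, number, version_code = match.groups()
    return {
        "congress": int(congress),
        "bill_type": bill_type,
        "number": int(number),
        "version_code": version_code,
    }


info = parse_package_url(
    "http://www.gpo.gov/fdsys/pkg/BILLS-113hr30ih/content-detail.html"
)
print(info)
```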

konklone commented 11 years ago

Oh, awesome. Yeah, I think I'll jump in after I grab some lunch, just to split this out a bit. I think there should be a bill_versions.py task that makes use of a fdsys.py file full of generic FDSys goodness, that other tasks can use. It will probably operate as a peer to bills.py, since it uses its own method of iteration, and if one were running regular syncs of this data, you might want to do different sync intervals to THOMAS and to GPO.

Unless you have strong feelings, I'll also rename text-versions.json to versions.json, just because "version" implies more than just a text change. There isn't always a text change between versions, and depending on the version code it has different legal significance independent of the text, the bill XML structure/style will be different, etc. The best is when GPO includes CSS in their XML to make engrossed bills' background look like 1700s-style parchment.

I'll also import this mapping of version codes to version names, which used to exist at GPO Access and I don't know if they ever reproduced it in FDSys anywhere after GPO Access closed down: https://github.com/sunlightlabs/congress/blob/master/tasks/utils.rb#L258

JoshData commented 11 years ago

I don't see the point in either, really. I don't mind the versions code being split off, but I think it fits naturally with the other routines that use the same sitemap files and that are updated from the sitemap files. Unless there's a functional difference or something besides aesthetics, let's just leave it for now?

For the naming: 'versions' seems ambiguous (what's being versioned? could be change tracking of data.json) and "version" isn't a term used in Congress. I don't think it's understandable unless "text" is in there. (I'm also not entirely sure it makes sense. There aren't really multiple versions of a bill. It's the same bill throughout.)

JoshData commented 11 years ago

Oh and on status. There have been weird ad hoc codes like eas2 for a 2nd print at EAS status. Most recently hr2608-112. I think any status code can be followed by an integer... assuming there are any rules at all.
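If that guess holds (a status code optionally followed by an integer), splitting the two apart could look like the sketch below. `split_version_code` is a hypothetical helper, and the pattern is only as good as the guess about the rules.

```python
import re


def split_version_code(code):
    """Split a code like "eas2" into its status part and optional print number."""
    match = re.fullmatch(r"([a-z]+)(\d*)", code)
    if match is None:
        raise ValueError("unrecognized version code: %r" % code)
    base, digits = match.groups()
    # A missing suffix means there is no explicit print number.
    return base, (int(digits) if digits else None)


print(split_version_code("eas2"))
print(split_version_code("ih"))
```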

konklone commented 11 years ago

Let me see if I can show you what I mean, about a bill_versions.py - it preserves fdsys.py and has it do all the FDSys-specific stuff, while letting a bill_versions task do things specific to bills. It's more than aesthetics, because it will let us more easily make other tasks that source their information from FDSys - or even to make fdsys.py its own standalone lib that other non-Congressional projects can use.

I don't feel super strongly about text-versions vs versions, since it really is a pretty murky relationship between the text and the version (and this is just aesthetics), but I do view what is being versioned as including, but not being limited to, the raw text.

That's real good to know about the ad hoc codes, I'll try to work that in.

konklone commented 11 years ago

Just FYI - I am mid-refactor, working on an fdsys branch. It doesn't actually disturb much of the code you wrote, it just layers a bit on top - for example, I just merged your recent commit without any real trouble.

konklone commented 11 years ago

Also, one of the things this refactor will let us do is make use of the process_set utils function you carved out that expects response status codes from a task for standard logging, which is helpful. It makes FDSys an implementation detail, rather than a path of its own - while preserving fdsys.py as a general purpose task and lib for whatever you or anyone else wants to do with it.

konklone commented 11 years ago

OK, it's done in 5082e62663a1d0161b0190e45fbb0ea42aad87ba, 75bd99e359e5bbf5fd6093ebc0b55f571e158f6a, and b249015e0fca1fb8deac397e45d924bacb2b4402. All the fdsys task commands work that you added, and I make use of some of the code in it to do bill_versions.py. bill_versions.py works a lot like bills.py, and can be filtered by a congress, a bill, or a specific version of a bill.

Because of how our set processing works, it actually was a lot easier for me to deposit a file per version, instead of a file per bill. So the version info is in the bill's data dir, at text-versions/[version_code].json. Someone who was reading those versions in would be able to sort them in order by their issued_on date (which is the only reasonable way to order them anyhow - that's how I'd have sorted them if I'd saved them as an array).
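A sketch of how a reader might pull those per-version files back into one ordered list. It assumes each file carries an `issued_on` field as an ISO date string (so plain string sorting works); `load_versions` is a made-up name.

```python
import glob
import json
import os


def load_versions(bill_dir):
    """Read every text-versions/*.json file for a bill, oldest version first."""
    versions = []
    pattern = os.path.join(bill_dir, "text-versions", "*.json")
    for path in glob.glob(pattern):
        with open(path) as f:
            versions.append(json.load(f))
    # ISO-formatted dates sort correctly as plain strings.
    return sorted(versions, key=lambda version: version["issued_on"])
```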

It doesn't have a --fast mode or equivalent yet -- I'd like to add a --since flag to limit it just to bills with their lastmod in the last X days, 7 days by default. I'll get to this as I work to integrate this data into my own system.
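The check behind a `--since` flag could be as small as the sketch below, which assumes lastmod values look like the sitemap timestamps shown earlier in this thread (ISO 8601 with a trailing "Z"). `modified_since` and its `now` parameter are hypothetical, the latter added only to make the function testable.

```python
from datetime import datetime, timedelta, timezone


def modified_since(lastmod, days=7, now=None):
    """Return True if a sitemap lastmod falls within the last `days` days."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Older Pythons' fromisoformat rejects a trailing "Z", so spell out UTC.
    stamp = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
    return stamp >= now - timedelta(days=days)
```

The task would then skip any sitemap entry for which `modified_since(lastmod, days)` is false.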

I also didn't add a --store flag, though I could - I was anticipating downloading the actual bill text files separately, but maybe it makes more sense to include that here. I guess in that case I'd probably change it from text-versions/[version_code].json to text-versions/[version_code]/data.json and put the documents in there too - very similar to how your generic FDsys mirror-er does it.

JoshData commented 11 years ago

Cool.

konklone commented 11 years ago

I can already tell I'm going to want to re-use some of your fdsys.py work in other contexts, like downloading Congressional reports and possibly even court opinions. Probably not worth jumping the gun and breaking it out into its own library yet, but I could see a unitedstates/fdsys someday.