unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

Switch to Congress.gov #57

Closed by konklone 8 years ago

konklone commented 11 years ago

It's not time yet, but I'm leaving this as an open issue as a reminder that sometime soon, this will happen. When Congress.gov was announced in September 2012, THOMAS was only given "approximately a year" to live.

We can switch as soon as Congress.gov's coverage table, which explicitly compares data availability between THOMAS and Congress.gov, shows that we won't lose information. Or, we switch because THOMAS shuts down and we have no choice.

konklone commented 11 years ago

Congress.gov has added "all actions", and they've actually done a ton of categorization work that should greatly simplify the work we have to do:

http://blogs.loc.gov/law/2013/04/all-actions-added-to-congress-gov-and-other-updates/

I think I'll start a new script that can live in parallel with our THOMAS-based one, that starts the process of learning how Congress.gov works and what information we can get out of it.

konklone commented 11 years ago

New news - the homepage of THOMAS will redirect to Congress.gov on Nov 19, 2013, and THOMAS will be retired in late 2014:

http://www.loc.gov/today/pr/2013/13-202.html?loclr=twtho

Nothing urgent yet for us, though obviously it'd be wise to transition gradually and early.

dwillis commented 11 years ago

Is there a branch or someplace where the congress.gov stuff is/will be?

JoshData commented 11 years ago

I would strongly suggest we don't start early so that we don't provide any excuses for them to avoid giving up the data.

konklone commented 11 years ago

Point well taken, @JoshData. And @dwillis, not yet, but if/when it begins, we'll start one.

chishaku commented 8 years ago

Just wondering if there has been any action towards the Congress.gov migration?

I imagine the retirement of THOMAS is still at least a year away. From the Congress.gov FAQ:

The THOMAS.gov homepage has redirected me here. Is THOMAS gone?

The Library of Congress is moving closer to retiring THOMAS. In September 2014, the Congress.gov beta website URL changed from beta.congress.gov to Congress.gov.

Thomas.loc.gov and www.thomas.gov direct visitors to Congress.gov. Researchers with a continued need to access THOMAS can bookmark http://thomas.loc.gov/home/thomas.php.

Why is THOMAS being replaced?

THOMAS is a comprehensive system that was launched in the mid-1990s, and which is approaching the end of its lifecycle. The new Congress.gov platform enhances access through features such as videos explaining the legislative process, compatibility with mobile devices, and a user-friendly presentation.

The new beta.congress.gov provides modern functionality, including -

  • Single search across all collections and all dates
  • Meaningful, persistent URLs
  • Faceted search results

When will THOMAS be retired?

A specific date for retiring THOMAS has not been determined, but the Library expects to be able to announce a date in the coming months. For the past year, the Library has been actively encouraging THOMAS users to begin using the Congress.gov site. Most of the outstanding content and features of THOMAS are expected to be incorporated into Congress.gov in a December 2015 release, and any remaining items to be part of a February/March release. As part of the future announcement, an ample grace period will be allowed for users to adjust their processes and systems to this change.

Also, updated link for Congress.gov coverage.

konklone commented 8 years ago

I think the general idea has been to see the extent of the upcoming release of bulk data on "bill status" by the LOC and GPO, and then see how much there is left to scrape from Congress.gov.

I personally think it's unlikely that LOC and GPO will manage to publish XML that covers everything we currently get by scraping THOMAS, but my understanding is that it's still being finalized. Let's leave this issue open and update it once we have a better idea, perhaps following the next meeting of Congress' Bulk Data Task Force.

dwillis commented 8 years ago

I agree with @konklone's description. I would add that, purely for work reasons, I've begun writing Ruby scrapers for congress.gov as part of a library I maintain (https://github.com/dwillis/hulse). In general it's only slightly less annoying than scraping THOMAS, and in some cases more so.

DanielSchuman commented 8 years ago

The next meeting of the Bulk Data Task Force is in two weeks. As @konklone suggested, the agenda includes: "sample Bill Status XML files and the XML User Guide that are scheduled to be released at the end of the year."

@konklone and Derek would know whether the status info, combined with the summaries and text which already are being published, comprises everything from THOMAS.

I suspect there may be a few things, like the Appropriations Tables, which are on THOMAS but will not be available in bulk and will have to be scraped from Congress.gov.

If there is anything you would like brought up at the meeting, please let me know.

Daniel


JoshData commented 8 years ago

Per #165 we no longer need to do this for data going forward, as bill status data is now available from the 113th Congress forward, but we might want to consider this anyway for historical data until Congress publishes historical bill status data.

swt83 commented 8 years ago

Have all the scrapers been migrated to Congress.gov? The LOC indicates THOMAS will be shut down on July 5.

JoshData commented 8 years ago

Now that there's bill status data in XML, the plan is to use that and stop screen scraping entirely. See #165.

Since I don't think anyone wants or intends to update things here to scrape Congress.gov, I'm going to close this issue in favor of #165.