Create data directory hierarchy if not present

unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.

https://github.com/unitedstates/congress/wiki

Creative Commons Zero v1.0 Universal


Create data directory hierarchy if not present #201

Open gregoryfoster opened 7 years ago

gregoryfoster commented 7 years ago

Hello, and thank you for sharing and maintaining such a valuable project. I'm just getting started by way of legis-graph and intend to become a frequent user and hopefully a helpful contributor.

I've set up a fresh installation and a Python 2.7 virtual environment. As a heads-up for potential future congress users, I ran into an SSL handshake issue sourced to scrapelib which prevents execution of the fdsys task (and likely others). That issue and its workaround are detailed here.

Currently, I'm attempting to run ./run bills --congress=115, and the task fails because there is no data hierarchy in the filesystem yet. Even after mkdir -p data/115, a subsequent os.listdir call fails because there are no bill types. This is easy enough to work around with some knowledge of the expected hierarchy, but it seems like something we could also easily fix.

I see there's a mkdir_p function in utils.py we could reuse - is there a good central place in the codebase to anticipate this edge case? I'd be happy to put together a pull request with a little guidance.
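
A minimal sketch of the kind of guard being asked about, assuming a helper that behaves like `mkdir -p`. The names `ensure_dir` and `list_subdirs` are hypothetical and are not the project's actual `utils.py` API; the `data/115` path is only an illustration:

```python
import errno
import os


def ensure_dir(path):
    """Create path and any missing parents; do nothing if it already exists."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Ignore "already exists" only when the existing entry is a directory.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise


def list_subdirs(path):
    """Return sorted subdirectory names of path, creating path first if needed."""
    ensure_dir(path)
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )
```

On a clean checkout, `list_subdirs("data/115")` would return an empty list rather than raising, which avoids the crash but still leaves nothing for the task to process.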

Thanks again for this very useful project!

gregoryfoster commented 7 years ago

Hmm, after messing around a bit more I'm starting to feel like I'm missing an important step between ./run fdsys --collections=BILLSTATUS and ./run bills. It looks to me like the bills task is expecting a bootstrapped dataset to already exist in the data hierarchy, but I don't see any mention of how to achieve that in the README or the wiki.

gregoryfoster commented 7 years ago

Re-opening, didn't mean to close the issue.

konklone commented 7 years ago

@gregoryfoster I filed a quick PR to fix the issue you identified: https://github.com/unitedstates/congress/pull/202

However, the bills task is defunct and unused. It was designed for thomas.gov, which is now :skull: in favor of congress.gov. The fdsys task is active, and would be easier for @JoshData to speak to, as he has it up in production.

gregoryfoster commented 7 years ago

Thanks, @konklone, for the quick fix. It does take care of creating the data hierarchy through a specified Congress.

I'm a little puzzled and honestly a little distressed to hear that the bills task is regarded as defunct, as that shines a different light on GovTrack's announcement that they'll no longer support bulk data access after the 2017 summer recess. Is this project winding down?

JoshData commented 7 years ago

No no no, I re-wrote the bills task last year to convert the new official bill XML (from fdsys) into the existing JSON data format. Since GovTrack relies on the JSON format and I don't have the capacity to re-write GovTrack's importer to use the fdsys XML directly, I'm still invested in keeping the bills task running.

JoshData commented 7 years ago

The mkdir issue probably stemmed from my rewrite last year, btw. Sorry about breaking it on clean directories (which I never test on).

gregoryfoster commented 7 years ago

Whew, glad to hear, @JoshData!

Returning to the original edge case of an absent and now clean data hierarchy - should I open a separate issue to tackle a clean load scenario? Meaning: while PR #202 avoids the os.listdir errors, the bills task as written doesn't take any action on a clean directory as it's compiling the list of bill types and bill IDs from an empty data hierarchy. That seems like a more substantial chunk of work that would require traversing the fdsys sitemap metadata files (or is there an easier route?).

Let me know if you want me to open a separate issue. And if you can sketch an outline of what needs to be done, I'd be happy to contribute a PR.

konklone commented 7 years ago

Apologies for confusing the issue! And I can verify what @gregoryfoster says -- #202 fixes the errors, but it still doesn't cause the bills task to do anything, it just stops with some messages about fetching 0 bills. I couldn't figure out why that was, and mistook the lack of network requests to mean it'd been retired.
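
To illustrate the behaviour described above: if the list of bills to process is compiled only from what is already on disk, a clean hierarchy yields an empty work list. The layout `data/<congress>/bills/<bill_type>/<bill_id>` and the function name `bills_on_disk` are assumptions made for this sketch, not confirmed details of the bills task:

```python
import os


def bills_on_disk(data_dir, congress):
    """Collect (bill_type, bill_id) pairs found under an assumed directory layout."""
    bills_root = os.path.join(data_dir, str(congress), "bills")
    if not os.path.isdir(bills_root):
        return []  # clean checkout: nothing has been downloaded yet
    found = []
    for bill_type in sorted(os.listdir(bills_root)):
        type_dir = os.path.join(bills_root, bill_type)
        if not os.path.isdir(type_dir):
            continue  # skip stray files at the bill-type level
        for bill_id in sorted(os.listdir(type_dir)):
            if os.path.isdir(os.path.join(type_dir, bill_id)):
                found.append((bill_type, bill_id))
    return found
```

Under these assumptions, a freshly created but empty hierarchy produces no errors and no work, which would match the "fetching 0 bills" messages.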

joec58 commented 5 years ago

Hello, I came to this issue report after attempting to run a clean installation of this scraper and got the error: "No such file or directory: 'data'"

This issue and #202 seem to be related to my error, even though this issue is over 2 years old and still Open. #202 says "This fixes #201 by using mkdirp as necessary when examining data paths on disk," but gives no specific directions on how or where that fix should be applied.

After reading the last 2 comments here, I have to ask if this scraper is still being maintained? If so, where can I find directions on how to fix this issue? Thanks.

JoshData commented 5 years ago

Hi.

At GovTrack we use this project extensively.

Unfortunately we don't have the resources to fix problems that we're not experiencing ourselves, though. This repository was created at a time when multiple well-funded organizations (besides us) were investing in creating a shared data ecosystem for legislative data, but now some of those organizations effectively don't exist anymore.

joec58 commented 5 years ago

Thanks for your quick response.

I started a project several years ago with GovTrack (GT) bulk data. When I came back to it last year the GT data was no longer online. I found parts of it on ProPublica and elsewhere but some parts I can’t find, like the set of Amendments.

I will spend some time over the next few days trying to figure this scraper out. If it can produce what I’m looking for I will post the fix. I might even try to fork it to Python3 since Python2 is due to be obsolete next year.

dwillis commented 5 years ago

@jox58 Can I ask specifically which scraper you're running that it doesn't create a data directory? I ask because I cloned the repository into a new directory and ran ./run govinfo --bulkdata=BILLSTATUS and it created a data directory.

joec58 commented 5 years ago

You are right. My mistake for not reading the instructions carefully. I ran ./run bills without first running ./run govinfo --bulkdata=BILLSTATUS