unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal
929 stars 202 forks

utils.py across repos #95

Closed wilson428 closed 11 years ago

wilson428 commented 11 years ago

Several times I've gone looking for a function I added in utils.py only to realize it was in a different UnitedStates repo. Some functions are repo-specific, but many, like the download caching, are not. In fact, I often use that function in unrelated projects.

Trying to abstract it out of the individual projects may be more trouble than it's worth, but thought I would raise the prospect. Another possibility is moving the download caching into scrapelib.

konklone commented 11 years ago

Yeah, I reuse it myself -- just did again for inspectors-general. I use like 8 methods at least that I copy/pasted out of the congress repo, and congress-legislators uses it too.

I've thought about it, and it always feels like trouble -- especially because as it stands, the download method in the congress repo is much more complicated than the others (optimized for peeking inside zip files, even). If we could refactor download to move repo-specific code out of it, it might be a good candidate for a tiny pip module. (The Node world has greatly influenced me to stop thinking of modules as large or significant -- even serving up one good function as a module is a great idea.)

wilson428 commented 11 years ago

Yeah, I just rewrote it for Node, in fact. And plenty of room for improvement (like cache expiration).
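For reference, a minimal sketch of download caching with expiration, using only the Python standard library. This is not the utils.py code from any of the repos, and it is not how downcache works; the names `cache_path`, `is_fresh`, `cached_download`, the `cache` directory, and the `max_age` and `fetch` parameters are all assumptions made for illustration.

```python
import hashlib
import os
import time
import urllib.request


def cache_path(url, cache_dir="cache"):
    """Map a URL to a stable file path inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


def is_fresh(path, max_age=None, now=None):
    """True if the cached file exists and is at most max_age seconds old.

    max_age=None means cached files never expire.
    """
    if not os.path.exists(path):
        return False
    if max_age is None:
        return True
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age


def cached_download(url, cache_dir="cache", max_age=None, fetch=None):
    """Return the body of `url` as bytes, reusing a cached copy when fresh."""
    path = cache_path(url, cache_dir)
    if is_fresh(path, max_age):
        with open(path, "rb") as f:
            return f.read()

    # `fetch` is injectable (hypothetical hook) so the cache logic can be
    # exercised without touching the network.
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u) as response:
                return response.read()

    body = fetch(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return body
```

Keeping the freshness check separate from the fetch is one way to keep repo-specific behavior (zip peeking and so on) out of the shared function: a caller would pass its own `fetch` instead of patching the cache code.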

wilson428 commented 11 years ago

(In case anyone's in need: https://github.com/wilson428/downcache)

GPHemsley commented 11 years ago

I'd like to tackle this, if no one else wants to.

wilson428 commented 11 years ago

All yours, far as I'm concerned!


konklone commented 11 years ago

Go for it. What approach do you think we should take for integrating the module? Pip, git...?

GPHemsley commented 11 years ago

My thinking was to make it a Python module.

konklone commented 11 years ago

Makes sense. The main utils, pickling and downloading and all that, aren't congress-specific, so maybe something like unitedstates-utils that scopes it more clearly to our work?

konklone commented 11 years ago

Then it becomes a place for any useful cross-repo code we want, even if it's not congress-related.

JoshData commented 11 years ago

Pickling could go in rtyaml module too (congress-legislators), that's a good candidate for pip.

GPHemsley commented 11 years ago

@konklone Indeed, unitedstates/congress-utils was precisely what I had in mind. (See my comments on #98.)

I also agree with @JoshData that rtyaml should be its own module (perhaps unitedstates/rtyaml, I don't know) to contain, as he put it privately, "all the annoying stuff we had to do to make YAML usable". Then, presumably, congress-utils would import that.

JoshData commented 11 years ago

If rtyaml gets its own repo, that will definitely be the repo description. :)

GPHemsley commented 11 years ago

FWIW, this would also allow a separation of powers between who works on the tools and who works on the data. (wink, wink)

GPHemsley commented 11 years ago

And, of course, the repo(s) would include extensive testing to ensure that nothing breaks unexpectedly downstream.

konklone commented 11 years ago

All awesome. Feel free to take it in stages. :)

Also, I was suggesting unitedstates-utils over congress-utils, since it's not congress-specific and I'd like to use it in unitedstates/inspectors-general too.

GPHemsley commented 11 years ago

@konklone Oh wow, I didn't even pick up on that. But unitedstates/unitedstates-utils seems long and redundant to me. Any particular preference for that over, say, unitedstates/utils?

konklone commented 11 years ago

/utils works just fine. For the Python module, unitedstates-utils or us-utils would be better. FWIW, I ended up publishing unitedstates/documents as a gem called us-documents.

sbma44 commented 11 years ago

+1 to avoiding "utils" by itself as any level of the namespace, particularly if the aim is to have this reused outside of the project. It's unfortunately common and not that descriptive.

FWIW, I am still slowly (slooooowly) working on district office collection, and over the weekend found myself needing to ditch my former reliance on YAML as part of the somewhat lengthy matching/automatic/manual review workflow I've built -- it's great for output and patches but loading, seeking and saving in a per-record review were all posing problems for me. dataset (https://dataset.readthedocs.org/en/latest/) has really impressed me so far, so I thought I'd mention it here and see what folks thought about incorporating sqlite as an intermediate datastore in some cases...


konklone commented 11 years ago

Incorporating sqlite as a transient datastore, where no sqlite file needs to be versioned in the repo, is probably fine. It's a tool that a script can use to get done what it needs to do. But, even though YAML is not ideal for everything, it's important to keep the data in one canonical place and format.
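A rough sketch of what "transient" could mean in practice: records are copied into an in-memory SQLite database for the review pass, and only reviewed records are handed back to be written out in the canonical format. This uses the standard-library `sqlite3` module rather than the dataset library mentioned above, and the `offices` table, its columns, and the function names are hypothetical. The list of dicts stands in for records already parsed from YAML.

```python
import sqlite3


def load_into_sqlite(records):
    """Copy a list of dict records into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE offices ("
        "id TEXT PRIMARY KEY, city TEXT, reviewed INTEGER DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO offices (id, city) VALUES (:id, :city)", records
    )
    return conn


def mark_reviewed(conn, office_id, city=None):
    """Flag one record as reviewed, optionally correcting its city."""
    if city is not None:
        conn.execute(
            "UPDATE offices SET city = ? WHERE id = ?", (city, office_id)
        )
    conn.execute("UPDATE offices SET reviewed = 1 WHERE id = ?", (office_id,))


def export_records(conn):
    """Return reviewed records as plain dicts, ready to serialize."""
    rows = conn.execute(
        "SELECT id, city FROM offices WHERE reviewed = 1 ORDER BY id"
    )
    return [dict(row) for row in rows]
```

Because the database lives in memory, there is no sqlite file to version, and the YAML stays the single canonical copy.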

(For everyone's reference, this is about the congress-legislators project, we just happen to be having the discussion here.)

I expect, very soon now, to start automatically publishing CSV and JSON versions of the underlying YAML data to this project's gh-pages branch, and advertising those at a theunitedstates.io landing page. The YAML files' relevance will become more of a background thing, a sane single dataset that can be transmuted on demand into whatever.
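As an illustration of the "transmuted on demand" idea, here is a small sketch that turns one list of records into both JSON and CSV with the standard library. It is not the project's actual publishing script. The functions `to_json` and `to_csv` and the field names in the usage note are assumptions, and the sketch assumes flat records already parsed from YAML; nested structures would need flattening before they fit a CSV.

```python
import csv
import io
import json


def to_json(records):
    """Serialize records as pretty-printed JSON with stable key order."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records, fieldnames):
    """Serialize records as CSV text with the given column order.

    Keys missing from a record become empty cells; keys not listed in
    `fieldnames` are ignored.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()
```

A call such as `to_csv(records, ["id", "name", "state"])` fixes the column order explicitly, which keeps the published CSV stable even if the source records gain new keys.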

GPHemsley commented 11 years ago

Does somebody want to set up unitedstates/rtyaml and/or unitedstates/utils (and add me to them?) so I can begin to tinker?

konklone commented 11 years ago

On it.

konklone commented 11 years ago

OK, made https://github.com/unitedstates/utils and https://github.com/unitedstates/rtyaml - closing this issue, we can take it up there.

GPHemsley commented 11 years ago

Great, thanks.

FYI: I've updated the rtyaml description accordingly.