Closed wilson428 closed 11 years ago
Yeah, I reuse it myself -- just did again for inspectors-general. I use like 8 methods at least that I copy/pasted out of the congress repo, and congress-legislators uses it too.
I've thought about it, and it always feels like trouble -- especially because as it stands, the download method in the congress repo is much more complicated than the others (optimized for peeking inside zip files, even). If we could refactor download to move repo-specific code out of it, it might be a good candidate for a tiny pip module. (The Node world has greatly influenced me to stop thinking of modules as large or significant -- even serving up one good function as a module is a great idea.)
Yeah, I just rewrote it for Node, in fact. And plenty of room for improvement (like cache expiration).
(In case anyone's in need: https://github.com/wilson428/downcache)
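For readers following along, here is a minimal sketch of what a repo-independent download cache with expiration might look like in Python. It is not the congress repo's download method or downcache's API; the names cached_download, fetch_url, cache_dir, and max_age are all hypothetical, and the fetcher is passed in as a parameter only so the caching logic can be exercised without a network.

```python
import hashlib
import os
import time
import urllib.request


def fetch_url(url):
    """Default fetcher (hypothetical helper): return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_download(url, cache_dir, max_age=None, fetch=fetch_url):
    """Return the body of `url`, reusing a copy on disk when it is fresh enough.

    max_age is in seconds; None means cached copies never expire.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the URL so any URL maps to a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)

    if os.path.exists(path):
        # Use the file's modification time as the moment it was cached.
        age = time.time() - os.path.getmtime(path)
        if max_age is None or age <= max_age:
            with open(path, "rb") as f:
                return f.read()

    # Cache miss or expired copy: fetch again and overwrite the cached file.
    body = fetch(url)
    with open(path, "wb") as f:
        f.write(body)
    return body
```

Keeping the function free of repo-specific behavior (zip handling, path conventions) is what would make it a plausible candidate for a tiny standalone module; anything project-specific could wrap it rather than live inside it.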
I'd like to tackle this, if no one else wants to.
All yours, far as I'm concerned!
Go for it. What approach do you think we should take for integrating the module? Pip, git...?
My thinking was to make it a Python module.
Makes sense. The main utils, pickling and downloading and all that, aren't congress-specific, so maybe something like unitedstates-utils that scopes it more clearly to our work?
Then it becomes a place for any useful cross-repo code we want, even if it's not congress-related.
Pickling could go in the rtyaml module too (from congress-legislators); that's a good candidate for pip.
@konklone Indeed, unitedstates/congress-utils was precisely what I had in mind. (See my comments on #98.)
I also agree with @JoshData that rtyaml should be its own module (perhaps unitedstates/rtyaml, I don't know) to contain, as he put it privately, "all the annoying stuff we had to do to make YAML usable". Then, presumably, congress-utils would import that.
If rtyaml gets its own repo, that will definitely be the repo description. :)
FWIW, this would also allow a separation of powers between who works on the tools and who works on the data. (wink, wink)
And, of course, the repo(s) would include extensive testing to ensure that nothing breaks unexpectedly downstream.
All awesome. Feel free to take it in stages. :)
Also, I was suggesting unitedstates-utils over congress-utils, since it's not congress-specific and I'd like to use it in unitedstates/inspectors-general too.
@konklone Oh wow, I didn't even pick up on that. But unitedstates/unitedstates-utils seems long and redundant to me. Any particular preference for that over, say, unitedstates/utils?
/utils works just fine. For the Python module, unitedstates-utils or us-utils would be better. FWIW, I ended up publishing unitedstates/documents as a gem called us-documents.
+1 to avoiding "utils" by itself as any level of the namespace, particularly if the aim is to have this reused outside of the project. It's unfortunately common and not that descriptive.
FWIW, I am still slowly (slooooowly) working on district office collection, and over the weekend found myself needing to ditch my former reliance on YAML as part of the somewhat lengthy matching/automatic/manual review workflow I've built -- it's great for output and patches, but loading, seeking and saving in a per-record review were all posing problems for me. dataset (https://dataset.readthedocs.org/en/latest/) has really impressed me so far, so I thought I'd mention it here and see what folks thought about incorporating sqlite as an intermediate datastore in some cases...
Incorporating sqlite as a transient datastore, where no sqlite file needs to be versioned in the repo, is probably fine. It's a tool that a script can use to get done what it needs to do. But, even though YAML is not ideal for everything, it's important to keep the data in one canonical place and format.
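As a rough illustration of the transient-datastore idea, the sketch below loads records into an in-memory sqlite database, edits one during a per-record review, and exports everything back out as plain dicts ready to be written to the canonical file. It uses the standard library's sqlite3 rather than dataset, and the offices table and the id, city, and reviewed fields are invented for the example, not taken from the congress-legislators schema.

```python
import sqlite3


def load_records(records):
    """Copy records into a throwaway in-memory database (nothing touches disk)."""
    db = sqlite3.connect(":memory:")
    db.row_factory = sqlite3.Row  # lets rows be read by column name
    db.execute(
        "CREATE TABLE offices ("
        "id TEXT PRIMARY KEY, city TEXT, reviewed INTEGER NOT NULL DEFAULT 0)"
    )
    db.executemany("INSERT INTO offices (id, city) VALUES (:id, :city)", records)
    return db


def mark_reviewed(db, office_id, city):
    """Record the outcome of reviewing one office."""
    db.execute(
        "UPDATE offices SET city = ?, reviewed = 1 WHERE id = ?",
        (city, office_id),
    )


def export_records(db):
    """Return all rows as plain dicts, in a stable order, for serialization."""
    rows = db.execute("SELECT id, city, reviewed FROM offices ORDER BY id")
    return [
        {"id": r["id"], "city": r["city"], "reviewed": bool(r["reviewed"])}
        for r in rows
    ]
```

Because the database lives only in memory, there is no sqlite file to version; the exported records are the only thing that would be written back to the repo.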
(For everyone's reference, this is about the congress-legislators project, we just happen to be having the discussion here.)
I anticipate that, very soon now, I'll start automatically publishing CSV and JSON versions of the underlying YAML data to this project's gh-pages branch, and advertising those at a theunitedstates.io landing page. The YAML files' relevance will become more of a background thing, a sane single dataset that can be transmuted on demand into whatever.
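A minimal sketch of that kind of conversion, assuming the YAML has already been loaded into a list of flat dicts (for example with a YAML parser's safe loader). The real legislator records are nested, so an actual exporter would need a flattening step that is not shown here; to_json, to_csv, and the field names in the usage note are hypothetical.

```python
import csv
import io
import json


def to_json(records):
    """Serialize records as pretty-printed JSON with stable key order."""
    return json.dumps(records, indent=2, sort_keys=True)


def to_csv(records, fieldnames):
    """Serialize records as CSV, keeping only the listed columns.

    Keys not in fieldnames are ignored; missing keys become empty cells.
    """
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return out.getvalue()
```

For instance, to_csv(records, ["id", "name"]) would emit a header row followed by one line per record, leaving the name cell blank for any record that lacks that key.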
Does somebody want to set up unitedstates/rtyaml and/or unitedstates/utils (and add me to them?) so I can begin to tinker?
On it.
OK, made https://github.com/unitedstates/utils and https://github.com/unitedstates/rtyaml - closing this issue, we can take it up there.
Great, thanks.
FYI: I've updated the rtyaml description accordingly.
Several times I've gone looking for a function I added in utils.py only to realize it was in a different UnitedStates repo. Some functions are repo-specific, but many, like the download caching, are not. In fact, I often use that function in unrelated projects.
Trying to abstract it out of the individual projects may be more trouble than it's worth, but thought I would raise the prospect. Another possibility is moving the download caching into scrapelib.