Closed daguar closed 10 years ago
sweet radical working on this
Is the intent to do this in parallel with the request to Netfile for a downloadable directory listing, for contingency?
Also, perhaps it is worth looking into connecting with a Socrata uploading API* so that step 4 becomes "upload to socrata with no human intervention"?
*assuming one exists
Is the intent to do this in parallel with the request to Netfile for a downloadable directory listing, for contingency?
Yes. I was mentioning these steps in a debrief, and @whatasunnyday was keen on it and has some bandwidth, so I said go for it. Worst case we retire this because we strike gold with Netfile.
Also, perhaps it is worth looking into connecting with a Socrata uploading API* so that step 4 becomes "upload to socrata with no human intervention"? *assuming one exists
Totally exists! In fact, Datasync is not much more than a GUI wrapper around the API. We'd need publish credentials, though, so I'm inclined to just dump to a server that the City can point Datasync to.
Hey Dave, I have Socrata publishing credentials - is that what you mean we need in order to use Datasync?
I think it makes more sense to separate downloading+converting the files from the upload step.
This is because, in the long term, Datasync will probably be how the City uploads the data, and so keeping them separate would be helpful to begin with.
But @lla2105 I do think two next steps are for you to:
Might be worth opening new issues for those.
Dave, I need some help drafting the email to request the CSV directory that SF has from Netfile. Can you possibly help with the technical language by drafting a few sentences on what SF has and what we want and are asking them for? I think I understand it but I do not want to screw up this opportunity. Whitney wants to review it before I send along an ask to Netfile.
The two things you should ask Netfile for:
Thank you Dave! I really appreciate it.
To answer your Twitter question, @whatasunnyday:
Lauren acquired a copy of Safe FME that someone in the Port had -- this is the GUI ETL software SF uses to get Netfile data up on Socrata. That said, we don't yet know the license status/cost/etc.
I think it's still worthwhile to write some Ruby code that pulls down the Netfile data, does the conversion to one-CSV-per-tab-across-all-years, and then saves the CSVs locally.
This is because (a) it's a relatively lightweight script, and (b) it would obviate the need to use any heavier software whatsoever.
What's more, I think the core of "given multiple Excel spreadsheets, consolidate tabs" is actually a potentially reusable tool for data munging (so I'd probably modularize that logic.)
Also, as I mentioned on Twitter, I think it makes sense to save to local disk, since I think we could potentially just run this script wherever Socrata Datasync is running.
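For illustration, the "given multiple Excel spreadsheets, consolidate tabs" core could look roughly like the sketch below. It is not the code from any repo in this thread. It assumes each workbook has already been read into plain rows (for example with csvkit or an Excel library), and the function names `consolidate_tabs` and `write_csvs` are made up for this example.

```python
import csv
from pathlib import Path


def consolidate_tabs(workbooks):
    """Merge same-named tabs across several workbooks.

    `workbooks` maps a label (e.g. a year) to {tab_name: [header_row, *data_rows]}.
    Returns {tab_name: [header_row, *all_data_rows]}, one entry per tab.
    """
    merged = {}
    for label in sorted(workbooks):
        for tab_name, rows in workbooks[label].items():
            if not rows:
                continue  # skip empty tabs
            header, data = rows[0], rows[1:]
            if tab_name not in merged:
                merged[tab_name] = [header]
            elif merged[tab_name][0] != header:
                # Assumption: a tab keeps the same columns from year to year.
                raise ValueError(f"header mismatch in tab {tab_name!r} for {label!r}")
            merged[tab_name].extend(data)
    return merged


def write_csvs(merged, out_dir):
    """Write one CSV per tab to local disk and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for tab_name, rows in merged.items():
        path = out_dir / f"{tab_name}.csv"
        with path.open("w", newline="") as f:
            csv.writer(f).writerows(rows)
        paths.append(path)
    return paths
```

Keeping the merge separate from the file writing is what would make the consolidation step reusable: anything that can produce rows per tab can feed it, and the output directory can be wherever Datasync is pointed.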
I have connections at FME/safe who offered a copy for nonprofit use, I'll follow up on it. Awesome tool btw, used to use it daily, powerful shiz.
Awesome @spjika!
I think there still remains the issue of what makes for the most robust long-term solution. I actually think a Ruby script + Socrata Datasync ain't a bad idea.
That said, if Oakland could use FME to bolster its ETL/data capacity across all departments I could see it being an invaluable get. (Note: Chicago CDO has been using Pentaho Kettle for this.)
Applied for the software, asked for 3 licenses.
Awesome Spike!
I gave FME a quick look, and it was actually more confusing to me than code.
So I've started a bit of ETL work here: https://github.com/daguar/netfile-etl
Super initial, but pretty easy to write. (Using Python because csvkit is badass and Ruby's roo was giving me troubles with big Excel files.)
Okay, the ETL scripts are basically done here: https://github.com/daguar/netfile-etl
Things remaining are:
But it basically does all the work to download, unzip, and merge the files. We'll use Datasync for the last 10 yards.
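As a rough picture of the download-and-unzip part, here is a minimal sketch using only the standard library. It is not taken from netfile-etl. It assumes the export arrives as a zip archive of Excel files, the URL is something the caller supplies, and `download_bytes` and `extract_spreadsheets` are hypothetical names.

```python
import io
import urllib.request
import zipfile
from pathlib import Path


def download_bytes(url, timeout=60):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_spreadsheets(zip_bytes, dest_dir, suffixes=(".xls", ".xlsx")):
    """Unzip only the spreadsheet members into dest_dir; return their paths."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Assumption: anything that is not .xls/.xlsx can be ignored.
            if name.lower().endswith(suffixes):
                extracted.append(Path(archive.extract(name, dest_dir)))
    return extracted
```

Usage would be along the lines of `extract_spreadsheets(download_bytes(export_url), "downloads")`, where `export_url` is a placeholder for whatever link Netfile provides.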
That looks great @daguar. Cygwin seems like a pretty good way to do that, and it's way more straightforward that way than using some FME thing that none of us have experience with nor actual desire to use.
(hey @whatasunnyday, any progress on the Ruby ETL thing?)
Hey Tom, I'm happy to make an equivalent Ruby version if you guys would like. It should take no time; I was just traveling and have only now settled in. But do you think it's worth it if we already have a Python version?
Ah okay, just wondering if you had anything. Probably not worth starting since @daguar's looks like it'll get the job done.
I ended up taking a step back and waiting for the link from Netfile that has direct links to the Excel documents, because the current site uses a JavaScript link to start the download, which would have required a headless browser setup like Selenium or Watir. I didn't do the combining step, which looks like what @daguar did. Since he mentioned that the Ruby library that handles Excel documents chokes on the size, it's probably not worth pursuing at all.
Yeah, I'd stick with this. Roo hung for like 2 minutes (and I ctrl C'd it) with the 2013 data, and 2014 will be larger.
If need be, I can make this a cron job on Heroku that presents the final CSVs as a web dir.
(Also WTF python, I can't cast ints as strings out of the box? Really?)
@daguar str(1337)
?
Goddamnit why doesn't
"String shit" + 1337
know what's up?
And don't get me started on dependency management.
Btw my stuff really isn't Python: it's pretty much shell scripting with Python as a wrapper for convenience.
Pip is god awful.
I think "this is a string " + str(1337) should work. Not in front of a PC, though.
Oh it does, I just like Ruby's default conversion. Strings being the least common denominator makes sense in a dynamic language.
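For anyone landing here later, a small sketch of the Python behavior being discussed: adding an int to a string raises `TypeError`, so the conversion has to be explicit or go through string formatting. The variable names are only for illustration.

```python
count = 1337

# Concatenating str and int directly raises TypeError in Python.
try:
    message = "Total filings: " + count
except TypeError:
    message = "Total filings: " + str(count)  # explicit conversion

# String formatting converts the value for you.
formatted = "Total filings: {}".format(count)
```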
Closing; this is done in https://github.com/daguar/netfile-etl
This essentially replicates SF's process with a script:
Then, we can work with City of Oakland IT to get these onto a server within the City where DataSync can nightly upload these to Socrata.