openoakland / opendisclosure

THIS PROJECT IS UNMAINTAINED - SEE: https://github.com/caciviclab/odca-jekyll AND https://github.com/caciviclab/disclosure-backend-static
http://opendisclosure.io/

Convert Netfile spreadsheets into per-sheet CSVs across all years #21

Closed · daguar closed this issue 10 years ago

daguar commented 10 years ago

This essentially replicates SF's process with a script:

  1. Get zipped files for each year (for Oakland, that's 2011-2013, soon to also have 2014) from the Netfile site
  2. Unzip the files
  3. Take each "schedule" sheet (example: "A-Contributions") from each year's Excel workbook and compile a single CSV per tab
    • Example: "A-Contributions.csv" will include all the rows from the "A-Contributions" sheets in the 2011, 2012, and 2013 Excel files
  4. [EDIT 022814: Changing from S3 to local disk] Store those CSVs on local disk

Then, we can work with City of Oakland IT to get these onto a server within the City where DataSync can nightly upload these to Socrata.
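
For illustration, here is a minimal Python sketch of steps 1-3 (hedged: the yearly zip URLs are placeholders, openpyxl is assumed to be available, and the exports are assumed to be .xlsx workbooks):

```python
# Rough sketch: download each year's zipped Netfile export, open every workbook
# inside, and pile each schedule tab's rows into one CSV per tab on local disk.
import csv
import io
import zipfile
from collections import defaultdict
from urllib.request import urlopen

from openpyxl import load_workbook

YEAR_URLS = {
    2011: "https://example.com/netfile/efile_COAK_2011.zip",  # placeholder URL
    2012: "https://example.com/netfile/efile_COAK_2012.zip",  # placeholder URL
    2013: "https://example.com/netfile/efile_COAK_2013.zip",  # placeholder URL
}

rows_by_sheet = defaultdict(list)  # e.g. "A-Contributions" -> rows from every year

for year, url in YEAR_URLS.items():
    # 1. Get the zipped files for the year
    archive = zipfile.ZipFile(io.BytesIO(urlopen(url).read()))
    # 2. Unzip and open each Excel workbook inside
    for name in archive.namelist():
        if not name.endswith(".xlsx"):
            continue
        workbook = load_workbook(io.BytesIO(archive.read(name)), read_only=True)
        # 3. Append each schedule tab's rows, keeping only the first year's header
        for sheet in workbook.sheetnames:
            rows = list(workbook[sheet].iter_rows(values_only=True))
            if rows_by_sheet[sheet]:
                rows = rows[1:]  # header already captured from an earlier year
            rows_by_sheet[sheet].extend(rows)

# 4. Store one CSV per tab on local disk (e.g. "A-Contributions.csv")
for sheet, rows in rows_by_sheet.items():
    with open(f"{sheet}.csv", "w", newline="") as out:
        csv.writer(out).writerows(rows)
```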

sunnyrjuneja commented 10 years ago

Sweet, radical. Working on this.

tdooner commented 10 years ago

Is the intent to do this in parallel with the request to Netfile for a downloadable directory listing, for contingency?

Also, perhaps it is worth looking into connecting with a Socrata uploading API* so that step 4 becomes "upload to socrata with no human intervention"?

*assuming one exists

daguar commented 10 years ago

Is the intent to do this in parallel with the request to Netfile for a downloadable directory listing, for contingency?

Yes. I was mentioning these steps in a debrief, and @whatasunnyday was keen on it and has some bandwidth, so I said go for it. Worst case we retire this because we strike gold with Netfile.

Also, perhaps it is worth looking into connecting with a Socrata uploading API* so that step 4 becomes "upload to socrata with no human intervention"? *assuming one exists

Totally exists! In fact, Datasync is not much more than a GUI wrapper around the API. We'd need publish credentials, though, so I'm inclined to just dump to a server that the City can point Datasync at.
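
For what it's worth, a scripted push to Socrata could look roughly like the sketch below (assumptions: publish credentials, an app token, and an already-created dataset; the domain, dataset id, and credentials are all placeholders):

```python
# Hypothetical sketch: upsert rows from a merged CSV into an existing Socrata
# dataset via the SODA endpoint. All identifiers and credentials are placeholders.
import csv

import requests

DOMAIN = "data.oaklandnet.com"                 # placeholder domain
DATASET_ID = "abcd-1234"                       # placeholder 4x4 dataset id
APP_TOKEN = "YOUR_APP_TOKEN"                   # placeholder app token
AUTH = ("publisher@example.org", "password")   # placeholder publish credentials

with open("A-Contributions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

response = requests.post(
    f"https://{DOMAIN}/resource/{DATASET_ID}.json",
    json=rows,
    headers={"X-App-Token": APP_TOKEN},
    auth=AUTH,
)
response.raise_for_status()
print(f"Upserted {len(rows)} rows")
```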

lla2105 commented 10 years ago

Hey Dave, I have Socrata publishing credentials; is that what you mean we need in order to use Datasync?

daguar commented 10 years ago

I think it makes more sense to separate downloading+converting the files from the upload step.

This is because, in the long term, Datasync will probably be how the City uploads the data, and so keeping them separate would be helpful to begin with.

But @lla2105, I do think the two next steps for you are to:

  1. Delete the old campaign finance data sets
  2. Create schemas + upload some of the key forms (e.g., "A-Contributions"), similar to what SF has; as Steven mentioned, you can actually copy their schema

Might be worth opening new issues for those.

lla2105 commented 10 years ago

Dave, I need some help drafting the email to request the CSV directory that SF has from Netfile. Can you possibly help with the technical language by drafting a few sentences on what SF has and what we want to ask them for? I think I understand it, but I don't want to screw up this opportunity. Whitney wants to review it before I send the ask along to Netfile.

daguar commented 10 years ago

The two things you should ask Netfile for:

  1. Provide a public web directory with direct links to the data files available via the "Export Amended" form at http://nf4.netfile.com/pub2/Default.aspx?aid=COAK . Netfile provided this for San Francisco at http://nf4.netfile.com/pub2/excel/SFOBrowsable/ , and it is how they integrate with Socrata, because the normal form uses a JavaScript postback. A public directory with direct links would let us download the files nightly to load into Socrata (a rough sketch of such a nightly pull is below).
  2. Provide individual CSV files with the data from each of the tabs in the Excel reports, spanning all the years. (For example, an "A-Contributions.csv" that has all the rows from the "A-Contributions" tabs in the 2011, 2012, and 2013 spreadsheets.) This would let us trivially put all the raw data up on Socrata and keep it always up to date. (This would be equally beneficial for any other city with a public data portal.)
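
A hedged sketch of that nightly pull, where the Oakland directory URL and the listing's HTML markup are assumptions (only the SF directory linked above is real):

```python
# Sketch: scrape a browsable Netfile directory for direct file links and download
# them to local disk. The directory URL below is a hypothetical Oakland analogue
# of the SF one; the href regex would need adjusting to the real listing markup.
import re
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

DIRECTORY_URL = "http://nf4.netfile.com/pub2/excel/COAKBrowsable/"  # hypothetical

listing = urlopen(DIRECTORY_URL).read().decode("utf-8", errors="replace")
links = re.findall(r'href="([^"]+\.(?:zip|xlsx))"', listing, flags=re.IGNORECASE)

Path("downloads").mkdir(exist_ok=True)
for link in links:
    url = urljoin(DIRECTORY_URL, link)
    urlretrieve(url, str(Path("downloads") / Path(link).name))
```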

lla2105 commented 10 years ago

Thank you Dave! I really appreciate it.

daguar commented 10 years ago

To answer your Twitter question, @whatasunnyday:

Lauren acquired a copy of Safe's FME that someone in the Port had -- this is the GUI ETL software SF uses to get Netfile data up on Socrata. That said, we don't yet know the license status/cost/etc.

I think it's still worthwhile to write some Ruby code that pulls down the Netfile data, does the conversion to one-CSV-per-tab-across-all-years, and then saves the CSVs locally.

This is because (a) it's a relatively lightweight script, and (b) it would obviate the need to use any heavier software whatsoever.

What's more, I think the core of "given multiple Excel spreadsheets, consolidate tabs" is actually a potentially reusable tool for data munging (so I'd probably modularize that logic.)

daguar commented 10 years ago

Also, as I mentioned on Twitter, I think it makes sense to save to local disk, since I think we could potentially just run this script wherever Socrata Datasync is running.

spjika commented 10 years ago

I have connections at Safe/FME who offered a copy for nonprofit use; I'll follow up on it. Awesome tool btw, used to use it daily, powerful shiz.

daguar commented 10 years ago

Awesome @spjika!

I think there still remains the issue of what makes for the most robust long-term solution. I actually think a Ruby script + Socrata Datasync ain't a bad idea.

That said, if Oakland could use FME to bolster its ETL/data capacity across all departments I could see it being an invaluable get. (Note: Chicago CDO has been using Pentaho Kettle for this.)

spjika commented 10 years ago

Applied for the software, asked for 3 licenses.

daguar commented 10 years ago

Awesome Spike!

I gave FME a quick look, and it was actually more confusing to me than code.

So I've started a bit of ETL work here: https://github.com/daguar/netfile-etl

Super initial, but pretty easy to write. (Using Python because csvkit is badass and Ruby's roo was giving me trouble with big Excel files.)
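
For anyone curious, the csvkit approach boils down to something like the sketch below (an illustration, not a copy of netfile-etl; the file and sheet names are examples): in2csv pulls a named sheet out of each yearly workbook, and csvstack merges the per-year CSVs into one.

```python
# Illustrative wrapper around the csvkit command-line tools. Workbook filenames
# and the sheet name are example values, not the real netfile-etl layout.
import subprocess

YEARS = [2011, 2012, 2013]
SHEET = "A-Contributions"

per_year_csvs = []
for year in YEARS:
    out_path = f"{SHEET}-{year}.csv"
    with open(out_path, "w") as out:
        # Extract one sheet from the year's workbook as CSV
        subprocess.run(
            ["in2csv", "--sheet", SHEET, f"efile_COAK_{year}.xlsx"],
            stdout=out,
            check=True,
        )
    per_year_csvs.append(out_path)

# Stack the per-year CSVs into a single file with one header row
with open(f"{SHEET}.csv", "w") as out:
    subprocess.run(["csvstack", *per_year_csvs], stdout=out, check=True)
```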

daguar commented 10 years ago

Okay, the ETL scripts are basically done here: https://github.com/daguar/netfile-etl

There are a few things remaining, but it basically does all the work to download, unzip, and merge the files. We'll use Datasync for the last 10 yards.

tdooner commented 10 years ago

That looks great @daguar. Cygwin seems like a pretty good way to do that, and it's way more straightforward than using some FME thing that none of us have experience with or any actual desire to use.

tdooner commented 10 years ago

(hey @whatasunnyday, any progress on the Ruby ETL thing?)

sunnyrjuneja commented 10 years ago

Hey Tom, I'm happy to make an equivalent Ruby version if you guys would like. It should take no time; I was just traveling but am settled now. But do you think it's worth it if we have a Python version?

tdooner commented 10 years ago

Ah okay, just wondering if you had anything. Probably not worth starting since @daguar's looks like it'll get the job done.

sunnyrjuneja commented 10 years ago

I ended up taking a step back and waiting for the link from Netfile that has direct links to the Excel documents, because the current site has a JavaScript link that starts a download, which would require a headless browser like Selenium or Watir. I didn't do the combining step, which looks like what @daguar did. Since he mentioned that the Ruby library that handles Excel documents chokes on the size, it's probably not worth pursuing at all.

daguar commented 10 years ago

Yeah, I'd stick with this. Roo hung for like 2 minutes (and I Ctrl-C'd it) with the 2013 data, and 2014 will be larger.

If need be, I can make this a cron job on Heroku that presents the final CSVs as a web dir.

daguar commented 10 years ago

(Also WTF python, I can't cast ints as strings out of the box? Really?)

sunnyrjuneja commented 10 years ago

@daguar str(1337)?

daguar commented 10 years ago

Goddamnit why doesn't

"String shit" + 1337

know what's up?

And don't get me started on dependency management.

daguar commented 10 years ago

Btw my stuff really isn't Python: it's pretty much shell scripting with Python as a wrapper for convenience.

sunnyrjuneja commented 10 years ago

Pip is god awful.

I think "this is a string " + str(1337) should work. Not in front of a PC though. On Feb 28, 2014 9:09 PM, "Dave Guarino" notifications@github.com wrote:

daguar commented 10 years ago

Oh it does, I just like Ruby's default conversion. Strings being the least common denominator makes sense in a dynamic language.
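
For anyone landing here later, the explicit-conversion point as a runnable two-liner:

```python
# Python requires an explicit cast (or formatting); "String shit" + 1337 raises TypeError.
print("String shit " + str(1337))   # explicit cast
print(f"String shit {1337}")        # or interpolation, closer to the Ruby habit
```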

daguar commented 10 years ago

Closing; this is done in https://github.com/daguar/netfile-etl