openoakland / opendisclosure

THIS PROJECT IS UNMAINTAINED - SEE: https://github.com/caciviclab/odca-jekyll AND https://github.com/caciviclab/disclosure-backend-static
http://opendisclosure.io/

Try out Python ETL scripts on Windows box in city #24

Closed daguar closed 10 years ago

daguar commented 10 years ago

Lauren -- Adding this issue because I think a next step is to see if you can run my scripts that do the Netfile data ETL on your Windows box:

https://github.com/daguar/netfile-etl

An alternative is to run them on an external Unix-y server (like Heroku or elsewhere) and then set up a job to download them every day to a computer within the city.

daguar commented 10 years ago

I can stop by and see if I can get this set up. Basically we'd just need to download/install:

- Cygwin: http://www.cygwin.com/
- Python: http://www.python.org/download/
- Pip: https://sites.google.com/site/pydatalog/python/pip-for-windows
- csvkit: just run `pip install csvkit` after Pip is installed

Ugh, writing this all out makes me a sad panda. Maybe we will use Heroku.

sunnyrjuneja commented 10 years ago

I'm not sure if you guys have used Cygwin before, but I personally did not have a good experience. I think it might be easier to set up PuTTY and a DO instance for $5/mo.


lla2105 commented 10 years ago

Hey Dave! I can download and install Cygwin, Python, and pip right now. We can experiment to see whether the script works and whether there's anything else we still need to do to get this all up and running. I'm free at work all day today and all day tomorrow, so what time works best for you to stop by? Thanks, everybody!

sunnyrjuneja commented 10 years ago

@daguar I can volunteer my personal server to do this.

tdooner commented 10 years ago

heroku++

tdooner commented 10 years ago

(As it turns out, Heroku is a difficult platform to do Unixy things on, like wget, unzip, etc. I'm working on getting it to run on my branch of netfile-etl, but it's proving to be an arduous process.)

sunnyrjuneja commented 10 years ago

@ted27 I think it honestly might be more work than it's worth, because that isn't exactly Heroku's use case. You could probably do it with a messaging queue and a worker dyno, but that's $30 a month. I think using a DigitalOcean instance or someone's personal server (like mine!) is the best way forward.

daguar commented 10 years ago

@whatasunnyday:

that isn't exactly heroku's use case. you could probably do it with a messaging queue and worker dyno but that's $30 bucks a month.

Agreed it's not really Heroku's use case, but you can do it for free with the job scheduler (https://devcenter.heroku.com/articles/scheduler); the dyno cost is simply the time the job takes to run, so a nightly 5-minute task like this one will be way under the limit, and we could throw up a simple Python single-page service that just displays the contents of the S3 bucket it's saving to.

@ted27: Thanks for getting started with an attempt on Heroku! I tried deploying and got wget and unzip not found, so, yeah, I think it's a problem with buildpack silliness.
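The "single page that displays the bucket contents" idea above could be sketched roughly as below. This is a minimal sketch: `render_index()` is a hypothetical helper name, and in a real deployment the key names would come from an S3 client (e.g. boto) rather than a hard-coded list.

```python
# Sketch of a single-page index of files saved to the S3 bucket.
# render_index() is a hypothetical helper; in production the key
# names would be fetched from S3 (e.g. via boto) instead.

def render_index(keys):
    """Return a minimal HTML page linking to each key in the bucket."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(key) for key in keys
    )
    return "<html><body><ul>\n{0}\n</ul></body></html>".format(items)

if __name__ == "__main__":
    print(render_index(["netfile-2014-03-04.csv", "netfile-2014-03-05.csv"]))
```

Serving that string from any tiny Python web framework would satisfy the "simple single-page service" idea without adding a real web app to maintain.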

sunnyrjuneja commented 10 years ago

I had no idea heroku had a free job scheduler. Cooool. Thanks for the share.


daguar commented 10 years ago

Yeah, it's pretty badass.

This is actually the exact use-case of Docker. But I'm a little more comfortable having it on a service we know could be there forever and be free, so I think futzing around with Heroku is the right call.

daguar commented 10 years ago

PS, @ted27: one of the issues with Heroku is the ephemeral, non-writeable disk. You can, however, write to /tmp.

This means that (a) the scripts can't write any files to the folder they're located in (which is how they're currently written), and (b) any data written to /tmp will not be there after the script completes.

So the job I'd probably set up would be:

Alternately, the scripts could be modified to always work in /tmp, but I think having the default be just writing to the current folder is the more scripty and reasonable-to-expect way for it to work.
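The "always work in /tmp" alternative could look something like this. This is a hedged sketch, not the actual netfile-etl code: `run_etl()` is a hypothetical name, and the real download/extract/transform steps are elided as a comment.

```python
# Sketch of an ETL run that does all of its file I/O under /tmp,
# since Heroku's dyno filesystem is ephemeral and /tmp is the only
# writable scratch space. run_etl() is a hypothetical stand-in for
# the real netfile-etl steps.
import os
import tempfile

def run_etl(workdir=None):
    """Do all file I/O in a throwaway directory under /tmp."""
    workdir = workdir or tempfile.mkdtemp(prefix="netfile-etl-")
    out_path = os.path.join(workdir, "netfile.csv")
    # ... download the zip, extract it, transform, then write the result ...
    with open(out_path, "w") as f:
        f.write("filer,amount\n")
    # Anything worth keeping must be pushed to S3 (or elsewhere) before
    # the process exits, because workdir disappears with the dyno.
    return out_path
```

The trade-off noted above still holds: writing to the current folder is the more "scripty" default, so this would only be worth doing for the Heroku deployment.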

tdooner commented 10 years ago

Agreed, @daguar, about the general methodology. By combining various buildpacks I was able to get wget and unzip to run. However, you have to compile everything yourself, and openssl (required to wget files from an SSL server) is not compiled in by default. So, yeah, lots of manual tweaking to get it working.

But the main benefit of Heroku (as I see it) is that we can have shared ownership of the project, i.e. you can invite people to collaborate on it with you. That way we don't rely on any single person's server or recurring attention for the data to populate!

daguar commented 10 years ago

@ted27: Oh boy. So does your Heroku instance have it running now? (Just `git clone`-ing and pushing to Heroku out of the box didn't work for me, so maybe your compilation was done manually in the console?)

If you'll be around tonight we can hack on this and get it working.

daguar commented 10 years ago

@migurski also pointed me to this, his notes on getting packaged binaries w/ Heroku: https://github.com/codeforamerica/heroku-buildpack-pygeo/blob/master/Build.md

migurski commented 10 years ago

Heroku has curl already, which can be a fine wget replacement.

migurski commented 10 years ago

Also, Python has baked-in support for zip files. It’s pretty easy to use so it’s possible you could skip compiling binaries altogether.
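The baked-in zip support mentioned above is the standard-library `zipfile` module; a minimal sketch (with a hypothetical helper name, `extract_all`) of using it in place of a compiled unzip binary:

```python
# Python's standard-library zipfile module can replace the unzip
# binary entirely -- no compiled dependencies needed on Heroku.
# extract_all() is a hypothetical helper name for illustration.
import io
import zipfile

def extract_all(zip_bytes, dest):
    """Extract an in-memory zip archive to dest; return the member names."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

Combined with curl (or urllib) for the download, this would let the ETL skip compiling binaries altogether, as suggested.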

daguar commented 10 years ago

@migurski -- Thanks; and, yeah, most of this is my laziness (I wrote these scripts super quickly, and I know all of it could actually be done in pure Python).

I used wget because it's simpler for 404s and 302s (which are happening here in the naive implementation), but it looks like I can do `curl -f -L` now that I've taken the 5 minutes to read up.
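For reference, the pure-Python route handles the same cases: in modern Python 3, `urllib.request` follows 3xx redirects automatically and raises `HTTPError` on 4xx/5xx instead of saving the error page, which mirrors `curl -f -L`. A minimal sketch (the helper name `fetch` is hypothetical, not part of netfile-etl):

```python
# urllib.request behaves much like `curl -f -L`: redirects (302) are
# followed automatically, and a 404 raises HTTPError rather than
# silently saving the error body. fetch() is a hypothetical helper.
import urllib.request
from urllib.error import HTTPError

def fetch(url, dest):
    """Download url to dest; return False on HTTP errors instead of saving them."""
    try:
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            out.write(resp.read())
        return True
    except HTTPError:
        return False
```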

migurski commented 10 years ago

Quick attempt to get unzip built on Heroku:

curl -L http://sourceforge.net/projects/infozip/files/UnZip%206.x%20%28latest%29/UnZip%206.0/unzip60.tar.gz/download | tar -xzvf -
cd unzip60
make -f unix/Makefile generic

…and unzip now works. At 156KB it should be fine to include in the Git repo. `ldd` suggests it's not linked too badly:

linux-vdso.so.1 =>  (0x00007fff6a1b9000)
libc.so.6 => /lib/libc.so.6 (0x00007f48da7c7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f48dab60000)

migurski commented 10 years ago

…and the result, which should Just Work™: http://dbox.teczno.com/unzip.gz

daguar commented 10 years ago

Okay, @ted27 I've replaced wget with curl in my repo if you want to rebase. Will take a look at Python vs. unzip buildpack shortly.

daguar commented 10 years ago

I think I can actually save us all from the Heroku+S3 steps and run this on Lauren's comp, which now has Vagrant + an Ubuntu VM!

Documenting (incomplete) setup here: https://github.com/daguar/netfile-etl/issues/2

daguar commented 10 years ago

We've got it working on Lauren's comp!!!

Next steps for this are:

Dave:

Lauren: