I can stop by and see if I can get this set up. Basically we'd just need to download/install:
- Cygwin: http://www.cygwin.com/
- Python: http://www.python.org/download/
- Pip: https://sites.google.com/site/pydatalog/python/pip-for-windows
- csvkit (just run `pip install csvkit` after Pip is installed)
Ugh, writing this all out makes me a sad panda. Maybe we will use Heroku.
I’m not sure if you guys have used Cygwin before, but I personally did not have a good experience. I think it might be easier to set up PuTTY and a DigitalOcean instance for $5/mo. -- Sunny Juneja
Hey Dave! I can download and install Cygwin, Python, and pip right now. We can experiment to see if the script works and whether there's anything else we need to do to get this all up and running. I'm free at work all day today and all day tomorrow, so when works best for you to stop by? Thanks everybody!!!
@daguar I can volunteer my personal server to do this.
heroku++
(as it turns out, Heroku is a difficult platform to do Unixy things on, like `wget`, `unzip`, etc. I'm working on getting it to run on my branch of netfile-etl, but it's proving to be an arduous process)
@ted27 I think it honestly might be more work than it's worth, because that isn't exactly Heroku's use case. You could probably do it with a message queue and worker dyno, but that's $30 a month. I think using a DigitalOcean instance or someone's personal server (like mine!) is the best way forward.
@whatasunnyday:

> that isn't exactly Heroku's use case. You could probably do it with a message queue and worker dyno, but that's $30 a month.

Agreed it's not really Heroku's use case, but you can do it for free with the job scheduler (https://devcenter.heroku.com/articles/scheduler); the dyno cost is simply the time it takes the job to run, so a nightly 5-minute task like this one will be way under the limit. We could then throw up a simple single-page Python service that just displays the contents of the S3 bucket it's saving to.
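Something like this totally untested sketch is all that page would take, assuming Flask and boto; the bucket name and `S3_BUCKET` variable are placeholders, not anything that exists yet:

```python
# Hypothetical sketch of the S3-listing page; the bucket name and the
# S3_BUCKET environment variable are placeholders for illustration.
import os

import boto
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    conn = boto.connect_s3()  # picks up AWS credentials from the environment
    bucket = conn.get_bucket(os.environ.get("S3_BUCKET", "netfile-data"))
    # Build a bare list of signed links, one per file the nightly job saved
    items = [
        '<li><a href="%s">%s</a></li>' % (key.generate_url(3600), key.name)
        for key in bucket.list()
    ]
    return "<ul>%s</ul>" % "".join(items)

if __name__ == "__main__":
    app.run()
```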
@ted27: Thanks for getting started with an attempt on Heroku! I tried deploying and got `wget` and `unzip` not found, so, yeah, I think it's a problem with buildpack silliness.
I had no idea Heroku had a free job scheduler. Cooool. Thanks for the share.
Yeah, it's pretty badass.
This is actually the exact use-case of Docker. But I'm a little more comfortable having it on a service we know could be there forever and be free, so I think futzing around with Heroku is the right call.
PS, @ted27: one of the issues with Heroku is the ephemeral and non-writeable disk. You can write to /tmp, however. This means that (a) the scripts can't write any files to the folder they're located in (which is how they're currently written), and (b) any data written to /tmp will not be there after the script completes.
So the job I'd probably set up would be:
`bash run_all.sh`
Alternatively, the scripts could be modified to always work in /tmp, but I think having the default be just writing to the current folder is the more scripty and reasonable-to-expect way for them to work.
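If we did want both behaviors, something like this could work (untested sketch; the `DATA_DIR` variable name is invented for illustration, the scripts don't read it today):

```python
# Sketch: keep "write to the current folder" as the default, but let an
# invented DATA_DIR environment variable point the scripts at /tmp when
# running on Heroku.
import os

data_dir = os.environ.get("DATA_DIR", os.getcwd())
if not os.path.isdir(data_dir):
    os.makedirs(data_dir)
os.chdir(data_dir)  # everything downstream writes relative to here
```

The Heroku scheduler job would then be `DATA_DIR=/tmp bash run_all.sh`, while local runs stay unchanged.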
Agreed, @daguar, about the general methodology. By combining various buildpacks I was able to get `wget` and `unzip` to run. However, you have to compile everything yourself, and OpenSSL (required to wget files from an SSL server) is not compiled in by default. So, yeah, lots of manual tweaking to get it working.
But the main benefit of Heroku (as I see it) is that we have shared ownership of a project - i.e. you can invite people to collaborate on the project with you. That way we don't rely on any single person's server or recurring attention for the data to populate!
@ted27: Oh boy. So does your Heroku instance have it running now? (Just `git clone`-ing and pushing to Heroku out of the box didn't work for me, so maybe your compilation was done manually in the console?)
If you'll be around tonight we can hack on this and get it working.
@migurski also pointed me to this, his notes on getting packaged binaries w/ Heroku: https://github.com/codeforamerica/heroku-buildpack-pygeo/blob/master/Build.md
Heroku has `curl` already, which can be a fine `wget` replacement.
Also, Python has baked-in support for zip files. It’s pretty easy to use, so it’s possible you could skip compiling binaries altogether.
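For example, a rough, standard-library-only sketch of the download-and-extract step (Python 3 syntax; the URL is a placeholder, not the real Netfile one):

```python
import io
import urllib.request
import zipfile

ZIP_URL = "https://example.com/data.zip"  # placeholder URL

# urlopen follows redirects (the 302s) and raises HTTPError on a 404,
# covering what wget / curl -f -L are being used for here.
with urllib.request.urlopen(ZIP_URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))

# Extract somewhere writeable -- /tmp on a Heroku dyno.
archive.extractall("/tmp/netfile")
```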
@migurski -- Thanks; and, yeah, most of this is my laziness. (I wrote these scripts super quickly, and I know all of it could actually be done in pure Python, even.)
I used `wget` because it's simpler for 404s and 302s (which are happening here in the naive implementation), but it looks like I can do `curl -f -L` (fail on HTTP errors, follow redirects) now that I've taken the 5 minutes to read the docs.
Quick attempt to get unzip built on Heroku:
curl -L http://sourceforge.net/projects/infozip/files/UnZip%206.x%20%28latest%29/UnZip%206.0/unzip60.tar.gz/download | tar -xzvf -
cd unzip60
make -f unix/Makefile generic
…and `unzip` now works. At 156KB it should be fine to include in the Git repo. `ldd` suggests it’s not linked too badly:
linux-vdso.so.1 => (0x00007fff6a1b9000)
libc.so.6 => /lib/libc.so.6 (0x00007f48da7c7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f48dab60000)
…and the result, which should Just Work™: http://dbox.teczno.com/unzip.gz
Okay, @ted27, I've replaced `wget` with `curl` in my repo if you want to rebase. Will take a look at Python vs. the unzip buildpack shortly.
I think I can actually save us all from the Heroku+S3 steps and run this on Lauren's comp, which now has Vagrant + an Ubuntu VM!
Documenting (incomplete) setup here: https://github.com/daguar/netfile-etl/issues/2
We've got it working on Lauren's comp!!!
Next steps for this are:
Dave:
Lauren:
Lauren -- Adding this issue because I think a next step is to see if you can run my scripts that do the Netfile data ETL on your Windows box:
https://github.com/daguar/netfile-etl
An alternative is to run them on an external Unix-y server (like Heroku or elsewhere) and then set up a job to download them every day to a computer within the city.