openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

Optimize conform CPU and disk IO #65

Open NelsonMinar opened 9 years ago

NelsonMinar commented 9 years ago

Now that the code is working and stable, it'd be nice to spend some time optimizing performance. Some large sources like nl are timing out, and in general no effort has been made for efficiency. This issue is specifically about optimizing what happens after the source is downloaded. Some ideas:

I'm happy to work on this when I'm back from vacation. Mostly opening the ticket to make some notes.

NelsonMinar commented 9 years ago

I've taken a quick look at performance of conform. Top CPU usage: parsing CSV.
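A hotspot like this can be confirmed with the standard-library profiler. A minimal sketch, using a hypothetical in-memory sample (the real sources and column names differ) and a loop that mimics conform's row-by-row `DictReader` pass:

```python
import cProfile
import csv
import io
import pstats

# Build an in-memory CSV sample (hypothetical columns; real sources differ).
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(['lon', 'lat', 'number', 'street'])
for i in range(10000):
    writer.writerow(['4.9', '52.3', str(i), 'Main St'])

def parse(text):
    # Mimic the conform inner loop: push every row through DictReader.
    for row in csv.DictReader(io.StringIO(text)):
        pass

profiler = cProfile.Profile()
profiler.enable()
parse(sample.getvalue())
profiler.disable()

# Print the five most expensive calls by cumulative time; csv internals
# should dominate, matching the observation above.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())
```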

I dug deeper into CSV parsing speed in Python and have three ideas for speeding it up. We could switch to Python 3 and drop unicodecsv, cutting CSV time to 50%. We could make our code more complicated by no longer using DictReader, cutting CSV time to 33%. Or we could switch to Pandas for CSV parsing, cutting CSV time to 25%. These are all significant optimizations, it'd make nl take maybe 90 minutes instead of 120, but they are not life-altering.
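The DictReader change is the mechanical one: `csv.DictReader` builds a dict per row, while plain `csv.reader` yields lists, so the code can look up each column's index once from the header and use positional access in the loop. A sketch of the two approaches on a hypothetical address-like sample (column names are illustrative, not the real schema):

```python
import csv
import io

# Build an in-memory sample CSV (hypothetical columns; real sources vary).
buf = io.StringIO()
w = csv.writer(buf)
w.writerow(['lon', 'lat', 'number', 'street'])
for i in range(100000):
    w.writerow(['4.9', '52.3', str(i), 'Main St'])
data = buf.getvalue()

def with_dictreader(text):
    # Simple but slower: a dict is allocated for every row.
    total = 0
    for row in csv.DictReader(io.StringIO(text)):
        total += len(row['street'])
    return total

def with_reader(text):
    # Faster: resolve the column index once, then index into plain lists.
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    street = header.index('street')
    total = 0
    for row in rows:
        total += len(row[street])
    return total

# Both paths must agree on the result; only the per-row cost differs.
assert with_dictreader(data) == with_reader(data)
```

The trade-off is readability: `row[street]` is more opaque than `row['street']`, which is presumably why the issue describes it as making the code "more complicated".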

The other option is to rearchitect the code to do less CSV parsing. Right now each source is parsed twice: once from the source format to produce an "extracted" CSV file, and then the extracted CSV file is parsed again to produce the output. Eliminating that second step would also speed things up so it takes maybe 50-75% of the original time, but at the cost of making the code more complicated and harder to test.
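The single-pass idea can be sketched with generators: parse each source row once and write it straight to the output, with no intermediate "extracted" CSV on disk. This is only an illustration of the shape, not machine's actual code; the field names and the `.title()` normalization are hypothetical stand-ins for the real conform transforms:

```python
import csv
import io

def extract_rows(source_file):
    # Parse the source format into dicts (plain CSV here for illustration;
    # the real extract step also handles shapefiles, GeoJSON, etc.).
    for row in csv.DictReader(source_file):
        yield row

def conform_row(row):
    # Map source fields to the output schema (hypothetical transforms).
    return {'LON': row['lon'], 'LAT': row['lat'],
            'NUMBER': row['number'], 'STREET': row['street'].title()}

def convert(source_file, out_file):
    # Chain the generators: each row is parsed once and written once,
    # eliminating the second CSV parse of the intermediate file.
    writer = csv.DictWriter(out_file,
                            fieldnames=['LON', 'LAT', 'NUMBER', 'STREET'])
    writer.writeheader()
    for row in extract_rows(source_file):
        writer.writerow(conform_row(row))

src = io.StringIO('lon,lat,number,street\n4.9,52.3,1,main st\n')
out = io.StringIO()
convert(src, out)
```

The testing cost mentioned above is visible even in the sketch: with the intermediate file gone, there is no longer a stable on-disk artifact to inspect between the extract and conform stages.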

I'm not enthusiastic about doing any of this work just now. Of all the options, switching to Python 3 is my favorite. Moving away from DictReader is also probably not so hard.

More notes on my blog: https://nelsonslog.wordpress.com/2015/02/25/openaddresses-optimization-some-baseline-timings/ https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/