openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

Optimize conform CPU and disk IO #65

Open NelsonMinar opened 9 years ago

NelsonMinar commented 9 years ago

Now that the code is working and stable, it'd be nice to spend some time optimizing performance. Some large sources like nl are timing out, and in general no effort has been made for efficiency. This issue is specifically about optimizing what happens after the source is downloaded. Some ideas:

I'm happy to work on this when I'm back from vacation. Mostly opening the ticket to make some notes.

NelsonMinar commented 9 years ago

I've taken a quick look at performance of conform. Top CPU usage: parsing CSV.
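A hotspot like this can be confirmed with the standard-library profiler. A minimal sketch, using a hypothetical in-memory sample (the real sources and column names differ) and a loop that mimics conform's row-by-row `DictReader` pass:

```python
import cProfile
import csv
import io
import pstats

# Build an in-memory CSV sample (hypothetical columns; real sources differ).
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(['lon', 'lat', 'number', 'street'])
for i in range(10000):
    writer.writerow(['4.9', '52.3', str(i), 'Main St'])

def parse(text):
    # Mimic the conform inner loop: push every row through DictReader.
    for row in csv.DictReader(io.StringIO(text)):
        pass

profiler = cProfile.Profile()
profiler.enable()
parse(sample.getvalue())
profiler.disable()

# Print the five most expensive calls by cumulative time; csv internals
# should dominate, matching the observation above.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
print(buf.getvalue())
```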

I dug deeper into CSV parsing speed in Python and have three ideas for speeding it up. We could switch to Python 3 and drop unicodecsv, cutting CSV time to 50%. We could make our code more complicated by no longer using DictReader, cutting CSV time to 33%. Or we could switch to Pandas for CSV parsing, cutting CSV time to 25%. These are all significant optimizations, it'd make nl take maybe 90 minutes instead of 120, but they are not life-altering.
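The DictReader change is the mechanical one: `csv.DictReader` builds a dict per row, while plain `csv.reader` yields lists, so the code can look up each column's index once from the header and use positional access in the loop. A sketch of the two approaches on a hypothetical address-like sample (column names are illustrative, not the real schema):

```python
import csv
import io

# Build an in-memory sample CSV (hypothetical columns; real sources vary).
buf = io.StringIO()
w = csv.writer(buf)
w.writerow(['lon', 'lat', 'number', 'street'])
for i in range(100000):
    w.writerow(['4.9', '52.3', str(i), 'Main St'])
data = buf.getvalue()

def with_dictreader(text):
    # Simple but slower: a dict is allocated for every row.
    total = 0
    for row in csv.DictReader(io.StringIO(text)):
        total += len(row['street'])
    return total

def with_reader(text):
    # Faster: resolve the column index once, then index into plain lists.
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    street = header.index('street')
    total = 0
    for row in rows:
        total += len(row[street])
    return total

# Both paths must agree on the result; only the per-row cost differs.
assert with_dictreader(data) == with_reader(data)
```

The trade-off is readability: `row[street]` is more opaque than `row['street']`, which is presumably why the issue describes it as making the code "more complicated".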

The other option is to rearchitect the code to do less CSV parsing. Right now each source is parsed twice: once from the source format to produce an "extracted" CSV file, and then the extracted CSV file is parsed again to produce the output. Eliminating that second step would also speed things up so it takes maybe 50-75% of the original time, but at the cost of making the code more complicated and harder to test.
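The single-pass idea can be sketched with generators: parse each source row once and write it straight to the output, with no intermediate "extracted" CSV on disk. This is only an illustration of the shape, not machine's actual code; the field names and the `.title()` normalization are hypothetical stand-ins for the real conform transforms:

```python
import csv
import io

def extract_rows(source_file):
    # Parse the source format into dicts (plain CSV here for illustration;
    # the real extract step also handles shapefiles, GeoJSON, etc.).
    for row in csv.DictReader(source_file):
        yield row

def conform_row(row):
    # Map source fields to the output schema (hypothetical transforms).
    return {'LON': row['lon'], 'LAT': row['lat'],
            'NUMBER': row['number'], 'STREET': row['street'].title()}

def convert(source_file, out_file):
    # Chain the generators: each row is parsed once and written once,
    # eliminating the second CSV parse of the intermediate file.
    writer = csv.DictWriter(out_file,
                            fieldnames=['LON', 'LAT', 'NUMBER', 'STREET'])
    writer.writeheader()
    for row in extract_rows(source_file):
        writer.writerow(conform_row(row))

src = io.StringIO('lon,lat,number,street\n4.9,52.3,1,main st\n')
out = io.StringIO()
convert(src, out)
```

The testing cost mentioned above is visible even in the sketch: with the intermediate file gone, there is no longer a stable on-disk artifact to inspect between the extract and conform stages.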

I'm not enthusiastic about doing any of this work just now. Of all the options, switching to Python 3 is my favorite. Moving away from DictReader is also probably not so hard.

More notes on my blog: https://nelsonslog.wordpress.com/2015/02/25/openaddresses-optimization-some-baseline-timings/ https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/