openaddresses / machine

Scripts for running OpenAddresses on a complete data set and publishing the results.
http://results.openaddresses.io/
ISC License

900 MB CSV file = 18 GB of RAM, fails to complete (GNAF) #310

Status: Closed (NelsonMinar closed 8 years ago)

NelsonMinar commented 8 years ago

Trying to process the new Australian GNAF source is failing on the production machines. I ran it manually via openaddr-process-one on my development machine and suspect it's related to memory usage. I tested with python2 on 64-bit Ubuntu with the default install options.

The source is a 900MB CSV file. The process job takes up to 18.2GB of virtual memory and as much resident memory as it can get. My development machine has 16GB of RAM and 16GB of swap. When I ran the job, it quickly ballooned to a 13GB resident size. It also kept swapping, so it's actively touching that memory. There's some oddness in memory usage I don't fully understand, but it's definitely Too Much.

I gave up running it after about an hour, in which the process only used 7 minutes of actual CPU time. Presumably the rest of the time went to swapping. The converted directory inside the process_one temporary directory had no files in it, which suggests it never got to actually writing any output. (edit: my mistake, there is an intermediate output file written in /tmp.)

It's been a year since I looked at the code, but my memory was that CSV sources processed in constant memory of like 100MB, nothing proportional to input size. I think this memory use needs more investigation.

iandees commented 8 years ago

I wonder if we can do all of machine's work using python generators instead of storing much of anything in memory. Is there anything we do that needs to know about more than a single line of CSV at a time?
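To make the generator idea concrete, here is a minimal sketch of a row-at-a-time CSV pipeline. It is not machine's actual code: the function names (`read_rows`, `convert_rows`, `write_rows`), the column names, and the upper-casing transform are all hypothetical, chosen only to show that each stage holds a single row at a time.

```python
import csv
import io

def read_rows(fileobj):
    """Yield one CSV row (as a dict) at a time; nothing is accumulated."""
    for row in csv.DictReader(fileobj):
        yield row

def convert_rows(rows):
    """Hypothetical per-row transform: trim whitespace, upper-case the street."""
    for row in rows:
        yield {
            "NUMBER": row["number"].strip(),
            "STREET": row["street"].strip().upper(),
        }

def write_rows(rows, out):
    """Write rows as they arrive and return how many were written."""
    writer = csv.DictWriter(out, fieldnames=["NUMBER", "STREET"])
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count

# Tiny in-memory stand-in for a large source file (made-up data).
source = io.StringIO("number,street\n1, Main St\n22,High Rd \n")
output = io.StringIO()
written = write_rows(convert_rows(read_rows(source)), output)
```

Because every stage is a generator, memory use in a pipeline like this should depend on the size of one row rather than on the size of the input file.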

NelsonMinar commented 8 years ago

I believe I found the source of the memory problem: in Python 2, zip(reader, itertools.count(1)) builds a complete list, so it is reading the entire contents of the file into memory. Working on a fix and test.
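For readers following along, the difference can be sketched as below. In Python 2 the built-in `zip` returns a list, so it drains the reader before returning; `itertools.izip` (and the built-in `zip` in Python 3) pairs items lazily. This is only an illustration of the technique, not necessarily what the eventual fix does; the `lazy_zip` alias, the `lines` helper, and the sample data are made up for the example, which is written to run on Python 3.

```python
import csv
import itertools

try:
    from itertools import izip as lazy_zip  # Python 2: lazy variant of zip
except ImportError:
    lazy_zip = zip  # Python 3: the built-in zip is already lazy

consumed = []  # records which input lines have actually been read

def lines():
    """Stand-in for a large file: yields lines and notes each one pulled."""
    for line in ["a,b", "1,2", "3,4"]:
        consumed.append(line)
        yield line

reader = csv.reader(lines())

# Pair each row with a 1-based row number without materialising a list.
pairs = lazy_zip(reader, itertools.count(1))

first = next(pairs)          # pulls exactly one row from the reader
pulled_so_far = len(consumed)
```

With the lazy pairing, asking for the first `(row, number)` pair reads only one line of input; an eager list-building `zip` would have read every line before returning anything.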

migurski commented 8 years ago

Hoped-for fix from https://github.com/openaddresses/machine/pull/311.

migurski commented 8 years ago

Keeping an eye on this in https://github.com/openaddresses/openaddresses/pull/1588

NelsonMinar commented 8 years ago

After the fix, I was able to run the entire GNAF dataset in 45 minutes and 30MB of RAM.

migurski commented 8 years ago

Looks like it worked!