Closed: NelsonMinar closed this issue 8 years ago.
I wonder if we can do all of machine's work using Python generators instead of storing much of anything in memory. Is there anything we do that needs to know about more than a single line of CSV at a time?
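For what it's worth, here is a minimal sketch of the shape that could take, with a hypothetical convert_row() standing in for whatever per-row work machine does. csv.reader is already an iterator, so as long as every stage stays a generator, only one row is in memory at a time:

```python
import csv

def converted_rows(path, convert_row):
    # Stream rows straight from disk; nothing here holds more than one row.
    with open(path, 'rb') as handle:  # 'rb' for the Python 2 csv module
        reader = csv.reader(handle)
        header = next(reader)
        for row in reader:
            yield convert_row(header, row)

def write_converted(in_path, out_path, convert_row):
    # The writer consumes the generator one row at a time, so memory use
    # stays constant regardless of input size.
    with open(out_path, 'wb') as handle:
        writer = csv.writer(handle)
        for out_row in converted_rows(in_path, convert_row):
            writer.writerow(out_row)
```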
I believe I found the source of the memory problem: zip(reader, itertools.count(1)) is mapping the entire contents of the file into memory. Working on a fix and test.
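If that's the cause, it would be because Python 2's zip() builds a full list before iteration starts. A sketch of the lazy alternatives (I haven't checked which one the PR actually uses, and the file path is just for illustration):

```python
import csv
import itertools

with open('source.csv', 'rb') as handle:  # hypothetical path
    reader = csv.reader(handle)

    # Eager (Python 2): zip() returns a list, so every row is read into
    # memory before the loop body ever runs.
    # numbered = zip(reader, itertools.count(1))

    # Lazy equivalents; each yields one (row, line_number) pair at a time.
    numbered = itertools.izip(reader, itertools.count(1))  # Python 2 only
    # numbered = ((row, n) for n, row in enumerate(reader, 1))  # 2 and 3

    for row, line_number in numbered:
        pass  # process the row here
```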
Hoped-for fix from https://github.com/openaddresses/machine/pull/311.
Keeping an eye on this in https://github.com/openaddresses/openaddresses/pull/1588
After the fix, I was able to run the entire GNAF dataset in 45 minutes and 30MB of RAM.
Looks like it worked!
Trying to process the new Australian GNAF source is failing on the production machines. I ran it manually via openaddr-process-one on my development machine and suspect it's related to memory usage. I tested with Python 2 on 64-bit Ubuntu with the default install options.

The source is a 900MB CSV file. The process job takes up to 18.2GB of virtual memory and as much resident memory as it can get. My development machine has 16GB of RAM and 16GB of swap; running the job, it quickly ballooned to 13GB resident and seemed to keep swapping, so it's actively touching that memory. There's some oddness in the memory usage I don't fully understand, but it's definitely Too Much.
I gave up running it after about an hour, in which the process only used 7 minutes of actual CPU time. Presumably the rest of the time went to swapping. The converted directory inside the process_one temporary directory had no files in it, which suggests it never got to actually writing any output. (Edit: my mistake, there is an intermediate output file written in /tmp.)

It's been a year since I looked at the code, but my memory was that CSV sources processed in constant memory of around 100MB, nothing proportional to input size. I think this memory use needs more investigation.
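One cheap way to investigate would be to run the conversion over growing prefixes of the CSV and watch whether peak RSS scales with row count: flat means streaming, roughly linear means something is holding every row. A rough sketch, with a hypothetical process_rows() standing in for machine's conversion step and a made-up file name:

```python
import csv
import itertools
import resource

def peak_rss_kb():
    # ru_maxrss is kilobytes on Linux; it only ever grows, which is
    # exactly what we want to watch here.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def process_rows(rows):
    # Placeholder for the real conversion; here we just drain the iterator.
    for _ in rows:
        pass

for limit in (10000, 100000, 1000000):
    with open('gnaf.csv', 'rb') as handle:  # hypothetical path
        process_rows(itertools.islice(csv.reader(handle), limit))
    print('%d rows -> peak RSS %d KB' % (limit, peak_rss_kb()))
```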