**Open** · mbarlow-rmi opened 5 days ago
I've only reviewed the (my) code briefly, but without expending much effort on this, I'd suggest trying the `--batch-size` argument, which causes CSV files to be written after a given number of results are returned. Otherwise, elements of the results structure are appended onto lists (in `manager.py`), which probably keeps the entire results structure in memory.
Note that there's also a `--collect` argument that concatenates the numbered batch result files into a single CSV.
Alternatively, we could copy values out of the results structures rather than retaining references to them.
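For illustration, here is a minimal sketch of the batching idea described above: flush accumulated rows to a numbered CSV file every `batch_size` results so the full result set never sits in memory at once. The function and file-naming scheme are hypothetical, not the project's actual API.

```python
import csv
import os


def write_batches(rows, batch_size, out_dir):
    """Write rows to numbered CSV batch files (hypothetical sketch).

    Clearing the buffer after each flush drops the references,
    so completed batches can be garbage-collected instead of
    accumulating for the whole run.
    """
    batch, batch_num, paths = [], 0, []

    def flush():
        nonlocal batch_num
        path = os.path.join(out_dir, f"results_{batch_num:04d}.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(batch)
        paths.append(path)
        batch.clear()  # drop references so rows can be freed
        batch_num += 1

    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush()
    if batch:  # flush any remainder
        flush()
    return paths
```

A `--collect` step would then just concatenate the returned `paths` in order into one CSV.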
Encountered when running multiple fields, setting `--cluster-type=local`, and testing various values for `--num-tasks`. Some examples:
The leak was observed in both instances and grew faster with more workers. The input XML contained ~3000 fields in total, and I recall observing slightly different behavior in the memory footprint between the different chunks. IIRC, by only changing the `--start-with` value and keeping all other options the same, some runs would crash and some wouldn't. Some potential sources:
Python has a built-in `tracemalloc` module for auditing memory allocations, so that could be a good place to start.
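As a starting point, a `tracemalloc` session like the following can show which lines are responsible for the largest allocations; the loop here only simulates the suspected append-onto-a-list pattern, it is not the project's code:

```python
import tracemalloc

tracemalloc.start()

# Simulate the suspected pattern: results appended onto a
# long-lived list, keeping every result object in memory.
results = []
for i in range(10_000):
    results.append({"field": i, "value": str(i) * 10})

# Rank allocation sites by total size; the append loop above
# should dominate the top entries.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

Comparing two snapshots taken some time apart (`snapshot_b.compare_to(snapshot_a, "lineno")`) is also useful for confirming which allocation sites keep growing between batches.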