msmasnadi / OPGEEv4

OPGEE v4

Memory Leak occurs when running with multiple workers #12

Open mbarlow-rmi opened 5 days ago

mbarlow-rmi commented 5 days ago

Encountered when running multiple fields, setting --cluster-type=local, and testing various values for --num-tasks.

Some examples:

opg run --analyses trace \
    --model inputs/rmi_opgee_inputs.xml \
    --output-dir outputs/trial-num-fields5 \
    --result-type detailed \
    -c local \
    --num-fields 50 \
    --start-with "Field 6" \
    --num-tasks 20

opg run --analyses trace \
    --model inputs/rmi_opgee_inputs.xml \
    --output-dir outputs/fields-1-1000 \
    --result-type detailed \
    -c local \
    --num-fields 1000 \
    --num-tasks 10

The leak was observed in both instances and grew faster with more workers. The input XML contained ~3000 fields in total, and I recall observing slightly different memory-footprint behavior between the different chunks. IIRC, changing only the --start-with value and keeping all other options the same, some chunks would crash and some wouldn't.

Some potential sources:

Python has a built-in tracemalloc module for auditing memory allocation, so that could be a good place to start.
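For example, a minimal tracemalloc harness (standard library only; run_chunk is a placeholder for whatever drives a batch of field runs, not an OPGEE function) could look like:

import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

run_chunk()  # placeholder for the code path suspected of leaking

snapshot = tracemalloc.take_snapshot()
# Print the ten call sites whose allocations grew the most since the baseline
for stat in snapshot.compare_to(baseline, "lineno")[:10]:
    print(stat)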

rjplevin commented 2 days ago

I've only reviewed the (my) code briefly, but without expending any real effort on this, I'd say try the --batch-size argument, which causes CSV files to be written after a given number of results are returned. Otherwise, elements of the results structure are appended onto lists (in manager.py), which probably keeps the entire results structure in memory.
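For what it's worth, the batching idea is roughly the pattern below (a sketch only; the names handle_result, pending, and the dict-like result shape are assumptions, not the actual manager.py API):

import pandas as pd

BATCH_SIZE = 100      # analogous to --batch-size
pending = []          # holds only the current batch, not every result
batch_num = 0

def handle_result(result, output_dir):
    # Buffer the result; once the batch is full, write it out and drop the references
    global batch_num
    pending.append(result)
    if len(pending) >= BATCH_SIZE:
        pd.DataFrame(pending).to_csv(f"{output_dir}/results_{batch_num}.csv", index=False)
        batch_num += 1
        pending.clear()   # allow the flushed results to be garbage-collected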

Note that there's also a --collect argument that concatenates the numbered batch result files into a single CSV.
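The collection step is roughly equivalent to the following (illustrative only; the batch-file naming pattern and paths are assumptions):

import glob
import pandas as pd

# Concatenate the numbered batch files into a single CSV
parts = sorted(glob.glob("outputs/fields-1-1000/results_*.csv"))
pd.concat((pd.read_csv(p) for p in parts), ignore_index=True) \
    .to_csv("outputs/fields-1-1000/results.csv", index=False)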

Alternatively, we could copy the needed values out of the results structures to avoid retaining references to the larger objects.
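Something along these lines, purely as an illustration (the result attributes here are made up, not the real OPGEE result classes):

def extract_values(result):
    # Copy only the small scalar values we need, so the full result object
    # (and whatever model objects it references) can be freed
    return {
        "field": str(result.field_name),   # hypothetical attribute
        "ci": float(result.ci),            # hypothetical attribute; plain floats hold no back-references
    }

summaries = [extract_values(r) for r in results]
del results   # release the heavyweight objects once only the summaries are kept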