Running into memory issues when simulating many models with different initial conditions #1225

dalbabur commented 4 months ago

I'm running simulations in parallel on Hyak using ipyparallel. I'm able to load and simulate models on many engines, but eventually I run out of memory. After doing a couple of tests, I believe the memory leak is related to roadrunner and not ipyparallel.

Here is what I'm seeing. Initial load: [image]. Load after some iterations: [image].

I'm doing something like this in a loop with different parameter sets:

    for r in many_r:  # iterate through models
        # first, set parameters (models have ~1k parameters, this takes a while)
        for l, v in zip(parameter_labels, parameter_values):
            r.setValue('init(' + l + ')', v)

        # now, iterate over conditions and set variables
        # across conditions, models have the same structure but a few different variables (~20)
        rb = r.saveStateS()  # keeps the newly set parameters, instead of resetting the model to how it was first loaded
        for condition in conditions:
            r2 = RoadRunner()
            r2.loadStateS(rb)  # this has the new parameters
            # set variables on the restored copy
            for l, v in zip(variable_labels, variable_values):
                r2.setValue('init(' + l + ')', v)
            results[condition] = r2.simulate()

    # what I've tried to deal with the memory issues
    del r2, rb

    return all_results

And I have these config flags:

from roadrunner import Config, RoadRunner
Config.setValue(Config.LOADSBMLOPTIONS_RECOMPILE, False) 
Config.setValue(Config.LLJIT_OPTIMIZATION_LEVEL, 4)

hsauro commented 4 months ago

What version are you running? Can this be reproduced on a desktop machine?

dalbabur commented 4 months ago

I'm using libroadrunner 2.5.0

I haven't tried running it locally yet.

luciansmith commented 4 months ago

I can say that the saveState/loadState functions had a bug that was fixed with 2.7.0. It was causing crashes, not memory leaks, though. But it can't hurt to try the latest version, at least?

dalbabur commented 4 months ago

Sure, I'll update and report back.

dalbabur commented 4 months ago

Same thing with 2.7.0...

hsauro commented 4 months ago

We'll probably need a desktop example that shows the effect in order to pin down the leak.

luciansmith commented 4 months ago

Thanks for checking!

I just ran all of roadrunner's C-based tests through valgrind and there were no errors/leaks there, so the problem must lie either in Python directly or in the Python bindings. If you could manage to get something that illustrated the problem and could be run locally, that would be ideal.

dalbabur commented 4 months ago

Thanks for checking that, Lucian. Working on a minimal example that will show the issue locally...

What are some ways I could check for memory leaks in Python or in the bindings?

luciansmith commented 4 months ago

It's possible to run valgrind on Python, but that's going to find issues on even the blandest of scripts. It should also find the leak we're looking for, though. The main thing I can think of is to just run the exact same script that ran on Hyak, but locally (and maybe simpler), and watch it eat memory?
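One way to "watch it eat memory" from inside the script is to sample memory after each iteration. The sketch below is only an illustration: `watch_memory` and `leaky_workload` are hypothetical names, and the workload is a stand-in that deliberately retains memory, not a roadrunner call. It assumes a Unix system (the `resource` module is not available on Windows) and records both the process's peak resident set size and the Python-level heap seen by `tracemalloc`. If peak RSS keeps climbing while the Python heap figure stays flat, that would point toward native code rather than Python objects.

```python
import gc
import resource  # Unix-only standard-library module
import tracemalloc


def watch_memory(workload, iterations=5):
    """Run `workload` repeatedly and record memory after each call.

    Returns a list of (iteration, peak_rss, python_heap_bytes) tuples.
    peak_rss is ru_maxrss: kilobytes on Linux, bytes on macOS.
    """
    samples = []
    tracemalloc.start()
    try:
        for i in range(iterations):
            workload()
            gc.collect()  # rule out garbage that is merely uncollected
            peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            python_heap, _peak = tracemalloc.get_traced_memory()
            samples.append((i, peak_rss, python_heap))
    finally:
        tracemalloc.stop()
    return samples


# Hypothetical stand-in for one "load a model, set values, simulate" cycle.
_retained = []


def leaky_workload():
    _retained.append(bytearray(5_000_000))  # deliberately never released


if __name__ == "__main__":
    for i, rss, heap in watch_memory(leaky_workload):
        print(f"iter {i}: peak RSS {rss}, Python heap {heap} bytes")
```

To try this on the real case, replace `leaky_workload` with a function that does one load/set/simulate cycle. `RUSAGE_SELF` covers only the current process, so ipyparallel engines or other child processes would need to be sampled separately.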

VivianeKlingel commented 4 weeks ago

I've also had this (or a similar) problem when simulating many times in parallel. My model is of a population of cells, so each model simulation contains ~500 simulations of individual cells. For parameter optimization this is then simulated again 10,000 times. I have this issue on my local machine but also on a cluster (where I first noticed the problem, because it used up all the memory, ~90 GB). libroadrunner version is 2.7.0.

The only way I found to prevent this memory leak was to use joblib to dump the loaded model and only load it within each child process (short of reloading it every time, which just takes way too long). It's been some time since I looked into it, but I tested many different ways of resetting/clearing the model and simulation settings, as well as different multiprocessing setups, and none worked. At best, memory went down after one process finished, but went up again immediately when the next simulation started.

I made two minimal examples, one with and one without joblib. I examined and plotted the memory usage with Memory Profiler.
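The "dump once, load inside the child process" pattern can also be sketched with the standard library alone. This is not the code used in this thread: it swaps in `pickle` and the `initializer` argument of `ProcessPoolExecutor` in place of joblib, and the "model" is a hypothetical stand-in (a plain dict with a toy `simulate` function), so whether it has the same effect on roadrunner's memory use is untested. All function names here are made up for illustration.

```python
import pickle
from concurrent.futures import ProcessPoolExecutor

_worker_model = None  # set once per worker process by init_worker


def dump_model(model, path):
    """Serialize the (stand-in) model to disk once, in the parent."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def simulate(model, start, end, steps):
    """Toy 'simulation': linear growth at model['rate']."""
    dt = (end - start) / steps
    return [model["rate"] * (start + k * dt) for k in range(steps + 1)]


def init_worker(model_path):
    """Runs once in each worker: load the dumped model from disk."""
    global _worker_model
    with open(model_path, "rb") as fh:
        _worker_model = pickle.load(fh)


def simulate_one(_index):
    """Simulate with the worker-local model; return only a small summary."""
    result = simulate(_worker_model, 0, 250, 2500)
    return result[-1]


def run_population(model_path, n_sims, max_workers=2):
    # Leaving the `with` block shuts the pool down and ends its workers.
    with ProcessPoolExecutor(
        max_workers=max_workers,
        initializer=init_worker,
        initargs=(model_path,),
    ) as pool:
        return list(pool.map(simulate_one, range(n_sims)))
```

In a script, one would call `dump_model({"rate": 2.0}, "model.pkl")` and then `run_population("model.pkl", n_sims=480)` under an `if __name__ == "__main__":` guard. The design choice is that only a file path crosses the process boundary for each task, and each worker deserializes the model once rather than once per simulation.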

mprof run --multiprocess --include-children -o "./mprofile_$(date +"%F-%H%M").dat" ./
mprof plot -o  "./MemPlot_Standard_$(date +"%F-%H%M").png"

Standard Parallel Simulation: [image: MemPlot_Standard_2024-09-04-1317]

Simulation where loaded model is dumped and loaded in child process: [image: MemPlot_Dump_2024-09-04-1337]


Classic Simulation

```python
from memory_profiler import profile
from roadrunner.tests import TestModelFactory as tmf
from joblib import dump, load
import time
import tellurium as te
from concurrent.futures import ProcessPoolExecutor, as_completed


def SimModel(m):
    m.resetAll()
    start_t = 0
    end_t = 250
    steps = 250*10
    result = m.simulate(start_t, end_t, steps)
    return


def pSim(r):
    # Population Simulation - Many individual simulations
    nSims = 480
    executor = ProcessPoolExecutor()
    Results = []
    futures = (executor.submit(SimModel, r) for n in range(nSims))
    for future in as_completed(futures):
        Results.append(future.result())
    future = []
    Results = []
    return


@profile
def run_sim(r):
    # Represents Optimization with many simulations of the model
    for i in range(10):
        pSim(r)


def main():
    sbml = tmf.Brown2004().str()
    r = te.loadSBMLModel(sbml)
    t1 = time.perf_counter()
    run_sim(r)
    elapsed_time = time.perf_counter() - t1
    print('Time:', elapsed_time, 'sec')


if __name__ == '__main__':
    main()
```

Simulation With Dumping

```python
from memory_profiler import profile
from roadrunner.tests import TestModelFactory as tmf
from joblib import dump, load
import time
import tellurium as te
from concurrent.futures import ProcessPoolExecutor, as_completed


def SimModel(r_loc):
    m = load(r_loc)
    start_t = 0
    end_t = 250
    steps = 250*10
    result = m.simulate(start_t, end_t, steps)
    return


def pSim(r_loc):
    # Population Simulation - Many individual simulations
    nSims = 480
    executor = ProcessPoolExecutor()
    Results = []
    futures = (executor.submit(SimModel, r_loc) for n in range(nSims))
    for future in as_completed(futures):
        Results.append(future.result())
    future = []
    Results = []
    return


@profile
def run_sim(r_loc):
    # Represents Optimization with many simulations of the model
    for i in range(10):
        pSim(r_loc)


def main():
    sbml = tmf.Brown2004().str()
    r = te.loadSBMLModel(sbml)
    r_loc = 'rrmodel_test.joblib'
    dump(r, r_loc)
    r = []
    t1 = time.perf_counter()
    run_sim(r_loc)
    elapsed_time = time.perf_counter() - t1
    print('Time:', elapsed_time, 'sec')


if __name__ == '__main__':
    main()
```