sys-bio / tellurium

Python Environment for Modeling and Simulating Biological Systems
http://tellurium.analogmachine.org/
Apache License 2.0

Parallelization issues with pocoMC #591

Open lisaotten opened 1 month ago

lisaotten commented 1 month ago

When employing the pocoMC package for Bayesian inference runs that use tellurium for modeling, we have encountered issues with parallelization. Using multiprocess(ing), we noticed a large discrepancy between the results obtained by parallelized and non-parallelized runs (see also the attached corner plots). Both runs complete smoothly, without any error messages or other notable differences. I have been able to reproduce the results of both runs separately multiple times, both on an HPC cluster and on my personal notebook. Apart from switching parallelization on or off, all other parameters are kept exactly the same. Changing the number of parallel kernels does not seem to affect these results. Based on experience, the results of the non-parallelized run are what we would expect as the correct results.

I have attached self-contained code, including the environment I am running it in. The config file has an option under bayesian_inference to turn parallelization on/off as well as to specify the number of kernels.

parallel.pdf not_parallel.pdf Bayesian_Transporter.zip
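For context, a minimal sketch of what the parallel/serial toggle amounts to in our setup. The toy model, data, and names below are placeholders for illustration only, not the code in the attached zip, and the commented pocoMC call is just where the pool would be handed to the sampler:

```python
import multiprocess as mp   # same pattern applies to multiprocessing / pathos
import numpy as np
import tellurium as te

# Toy one-parameter model standing in for the real transporter model.
ANTIMONY = "S1 -> S2; k1*S1; k1 = 0.1; S1 = 10; S2 = 0"
DATA = np.array([10.0, 6.1, 3.7, 2.2, 1.4])   # made-up "observations" of S1
_rr = None                                     # per-process roadrunner instance

def _init_worker():
    """Build the model once per worker so no simulator state is shared across processes."""
    global _rr
    _rr = te.loada(ANTIMONY)

def log_likelihood(theta):
    """Gaussian log-likelihood of S1(t) given rate k1 = theta[0] (illustrative only)."""
    _rr.resetToOrigin()
    _rr["k1"] = float(theta[0])
    sim = _rr.simulate(0, 20, 5, selections=["time", "S1"])
    return -0.5 * np.sum((sim[:, 1] - DATA) ** 2)

def make_pool(parallel, n_kernels):
    """Return a worker pool (or None for a serial run) to hand to the sampler."""
    if not parallel:
        _init_worker()               # serial run still needs the model loaded
        return None
    return mp.Pool(n_kernels, initializer=_init_worker)

if __name__ == "__main__":
    pool = make_pool(parallel=True, n_kernels=4)
    # sampler = pocomc.Sampler(..., likelihood=log_likelihood, pool=pool)
    # sampler.run()
    if pool is not None:
        pool.close()
        pool.join()
```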

matthiaskoenig commented 1 month ago

Hi @lisaotten,

I had a quick look at the plots. I'm not sure the plots are really different. You have to set the axes of the two plots to identical ranges to make a proper visual comparison. Often you get a few rare samples that lie far outside, resulting in very different axis ranges (if the axes are scaled automatically). I assume you have just 1-2 outliers (very unlikely samples) in one of the runs, which makes the distributions look very narrow only because of the changed axis limits.
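For a fairer visual comparison you could fix a common range per parameter before plotting, e.g. with the corner package; the sample arrays below are placeholders for your actual chains:

```python
import numpy as np
import corner

# Placeholders for the posterior samples of the two runs, shape (n_samples, n_dim).
samples_serial = np.random.randn(1000, 3)
samples_parallel = 1.5 * np.random.randn(1000, 3)

# One common (min, max) per dimension so outliers cannot rescale either plot.
lims = [(min(a.min(), b.min()), max(a.max(), b.max()))
        for a, b in zip(samples_serial.T, samples_parallel.T)]

fig1 = corner.corner(samples_serial, range=lims)
fig2 = corner.corner(samples_parallel, range=lims)
fig1.savefig("not_parallel_fixed_axes.png")
fig2.savefig("parallel_fixed_axes.png")
```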

You should run an actual test of whether the distributions are different, e.g. by comparing the modes of your multi-dimensional distributions, or by using something like: EFECT – A Method and Metric to Assess the Reproducibility of Stochastic Simulation Studies. T.J. Sego, Matthias König, Luis L. Fonseca, Baylor Fain, Adam C. Knapp, Krishna Tiwari, Henning Hermjakob, Herbert M. Sauro, James A. Glazier, Reinhard C. Laubenbacher, Rahuman S. Malik-Sheriff. arXiv:2406.16820 (preprint). doi:10.48550/arXiv.2406.16820
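As a simpler stand-in for such a metric, a per-dimension two-sample KS test plus a comparison of marginal widths already quantifies whether the runs differ; the array names and shapes below are assumptions:

```python
import numpy as np
from scipy import stats

# Posterior draws from the two runs, shape (n_samples, n_dim); placeholders here.
samples_serial = np.random.randn(2000, 4)
samples_parallel = np.random.randn(2000, 4)

for d in range(samples_serial.shape[1]):
    ks, p = stats.ks_2samp(samples_serial[:, d], samples_parallel[:, d])
    print(f"dim {d}: KS = {ks:.3f}, p = {p:.3g}")

# Ratio of marginal widths: values well above 1 mean the parallel run is broader.
print("std ratio (parallel / serial):",
      samples_parallel.std(axis=0) / samples_serial.std(axis=0))
```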

Hope this helps. TL;DR: most likely a plotting issue, not a sampling issue.

lisaotten commented 1 month ago

Hi @matthiaskoenig,

Thanks for your reply! Both plots actually have the exact same axis ranges. The parallel results really are much broader than the results from the non-parallel run, which is exactly where my problem lies. I have reproduced these results multiple times, with very similar outcomes, both on an HPC cluster and on my personal notebook.

luciansmith commented 1 month ago

The first thing I can think of is that if the seeds are being set from the system clock (which they are by default), you might be getting the same seed on multiple runs? I can imagine reasons for both the parallel and non-parallel runs to end up this way, so you might want to try explicitly setting the seed for each run manually, to ensure that they're all unique.

You could also examine the individual results to see if this is actually happening or not.
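If it helps, one way to make the seeding explicit and verifiable is to seed each worker in a pool initializer and log the seed it received. Whether this reaches the random streams pocoMC uses internally depends on its implementation, so treat this as a sketch:

```python
import os
import numpy as np
import multiprocess as mp   # same idea with multiprocessing / pathos

def _seed_worker(base_seed):
    """Give each worker its own seed and print it so duplicates are easy to spot.

    Note: pids differ between runs, so for exact reproducibility derive the
    per-worker seeds from np.random.SeedSequence(base_seed).spawn(...) instead.
    """
    seed = base_seed + os.getpid()
    np.random.seed(seed)     # covers code that uses NumPy's legacy global RNG
    print(f"worker pid={os.getpid()} seeded with {seed}")

if __name__ == "__main__":
    pool = mp.Pool(4, initializer=_seed_worker, initargs=(12345,))
    pool.close()
    pool.join()
```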

lisaotten commented 1 month ago

I played around with the seed quite a bit while looking for a possible source of the errors, but even setting the seed manually resulted in the same distributions.

lisaotten commented 1 day ago

I have attached a combined corner plot that may illustrate the issue a little better: the black line corresponds to a non-parallel run, while the other three lines correspond to parallel runs using the multiprocess, multiprocessing, and pathos packages. They were all created using the same fixed seed at the start of the runs.

The last parameter in the corner plot corresponds to the deviation of the parameter fits from the data points we are trying to analyze, and it is much larger for the parallel runs. This clearly indicates that the parallel runs fit the data much worse.

corner_16Dpococheck_3
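For anyone wanting to reproduce such a combined plot, one way to overlay several runs on shared axes with the corner package looks like this; the arrays and colors below are placeholders, not the real chains:

```python
import numpy as np
import corner

# Placeholder chains; in practice these would be the four posterior sample sets.
runs = {
    "serial (black)":  ("black", np.random.randn(2000, 3)),
    "multiprocess":    ("C0", 1.4 * np.random.randn(2000, 3)),
    "multiprocessing": ("C1", 1.4 * np.random.randn(2000, 3)),
    "pathos":          ("C2", 1.4 * np.random.randn(2000, 3)),
}

fig = None
for name, (color, samples) in runs.items():
    # Passing fig= overlays each run on the same axes; a fixed range keeps them aligned.
    fig = corner.corner(samples, fig=fig, color=color, range=[(-6, 6)] * 3,
                        plot_datapoints=False, plot_density=False)
fig.savefig("combined_corner_sketch.png")
```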