PathogeNish opened 8 months ago
Yeah, good idea. I think it makes sense to allow the `medium` argument in `grow` to be a list.
There is a way to make this work right now if you exploit the sample-specific media and are fine with storing the models multiple times. The trick is to create one sample with a unique id for each medium and to have a matching `sample_id` column in your medium.
Here is an example with the test data included in MICOM. We will simulate an E. coli community with a normal and anaerobic medium.
```python
import pandas as pd

import micom as mm
from micom.workflows import build, grow

tax = mm.data.test_data(1)  # only one sample
tax["sample_id"] = "aerobic"
tax_noo2 = tax.copy()
tax_noo2["sample_id"] = "anaerobic"
# We now have a table with the same 3 taxa under two sample ids;
# build the models for both
multi_tax = pd.concat([tax, tax_noo2])
manifest = build(multi_tax, mm.data.test_db, "models")

# Now we build a sample-specific medium
med = mm.qiime_formats.load_qiime_medium(mm.data.test_medium)
med_noo2 = med[med.reaction != "EX_o2_m"].copy()
med["sample_id"] = "aerobic"
med_noo2["sample_id"] = "anaerobic"
multi_med = pd.concat([med, med_noo2])

# The rest is straightforward. This will run in parallel and
# will make sure to free the RAM.
results = grow(manifest, "models", multi_med, 1.0, threads=2)

# And let's have a look
print(results.growth_rates)
```
Which gives you something like this:
```
                     abundance  growth_rate  reactions  metabolites               taxon  tradeoff  sample_id
compartments
Escherichia_coli_2    0.609524     0.235910         95           72  Escherichia_coli_2       1.0  anaerobic
Escherichia_coli_3    0.347253     0.134401         95           72  Escherichia_coli_3       1.0  anaerobic
Escherichia_coli_4    0.043223     0.016729         95           72  Escherichia_coli_4       1.0  anaerobic
Escherichia_coli_2    0.609524     0.987139         95           72  Escherichia_coli_2       1.0    aerobic
Escherichia_coli_3    0.347253     0.562384         95           72  Escherichia_coli_3       1.0    aerobic
Escherichia_coli_4    0.043223     0.070001         95           72  Escherichia_coli_4       1.0    aerobic
```
But that is pretty inefficient in terms of storage.
Hi @cdiener. This is interesting! I didn't know one could give specific tags to media as well!
Yeah... the process is quite wasteful because one is forced to create multiple duplicate models. But it is better than the alternative multiprocessing wrapper I wrote, which suffered from an inexplicable RAM buildup (likely due to termination issues with the created worker processes).
The RAM buildup is because of https://github.com/opencobra/cobrapy/issues/568, which affects multiple solvers. You can get around that by setting `maxtasksperchild` to something small (preferably 1) in `multiprocessing.Pool`. Or you can use the MICOM workflow wrapper, which already does that and avoids some other problems as well.
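To illustrate the `maxtasksperchild` workaround in isolation, here is a minimal sketch; `simulate` is a hypothetical placeholder for the per-medium solver call, not a MICOM function:

```python
import multiprocessing as mp


def simulate(medium_id):
    # Hypothetical stand-in for the per-medium simulation; in practice
    # this is where the solver would run (and potentially leak memory).
    return medium_id * 2


def run_all(media, threads=2):
    # maxtasksperchild=1 replaces each worker process after a single
    # task, so any memory leaked by the solver inside a task is
    # released when that worker exits.
    with mp.Pool(processes=threads, maxtasksperchild=1) as pool:
        return pool.map(simulate, media)


if __name__ == "__main__":
    print(run_all(range(4)))
```

Recycling workers this often adds process-startup overhead, but for long-running solver calls that cost is negligible compared to the leak it avoids.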
Checklist
Is your feature related to a problem? Please describe it.
I want to analyse the microbial growth rates for a given microbiome profile (single sample) when run on multiple different diets (or media). The idea is to understand how individual growth rates are affected by changes in the medium. Running multiple samples on the same medium through `grow` with the appropriate models has built-in multithreading to take advantage of a multi-core CPU, but growing a single sample on multiple media doesn't have a multithreading option.

Describe the solution you would like.
Given that `grow` for a single sample uses only 1 thread, I want to be able to run growth calculations on different media (100-200 types of media) using multithreading, so that all available threads are used simultaneously.

Describe alternatives you considered
I currently use the `pool.starmap` function in the standard Python multiprocessing library. I am facing an issue of memory pile-up (likely due to worker processes that are not deallocated/terminated despite the completion of the calculation) that causes the program to crash due to heavy RAM usage.
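A minimal sketch of that `pool.starmap` setup with the `maxtasksperchild` fix suggested earlier in the thread applied; `grow_on_medium` is a hypothetical placeholder, not the actual MICOM call:

```python
import multiprocessing as mp


def grow_on_medium(sample, medium_id):
    # Hypothetical stand-in for a single-sample growth simulation on
    # one medium; replace with the real per-medium computation.
    return (sample, medium_id)


def run_media(sample, n_media, threads=2):
    # One (sample, medium) argument tuple per medium for starmap.
    args = [(sample, i) for i in range(n_media)]
    # maxtasksperchild=1 recycles each worker after one task, so
    # solver memory that is not freed cannot pile up across tasks.
    with mp.Pool(processes=threads, maxtasksperchild=1) as pool:
        return pool.starmap(grow_on_medium, args)
```

`starmap` preserves the input order, so the i-th result corresponds to the i-th medium regardless of which worker computed it.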