multiple biomass reactions identified in model with single biomass reaction

BenjaSanchez commented 4 years ago

Problem description

From the definition of test_biomass_presence:

"Implementation: Identifies possible biomass reactions using two principal steps:

1. Return reactions that include the SBO annotation "SBO:0000629" for biomass.

If no reactions can be identified this way:

1. Look for the buzzwords "biomass", "growth" and "bof" in reaction IDs.
2. Look for metabolite IDs or names that contain the buzzword "biomass" and obtain the set of reactions they are involved in.
3. Remove boundary reactions from this set.
4. Return the union of reactions that match the buzzwords and of the reactions that metabolites are involved in that match the buzzword."

However, when the report is run on this model, which has a single reaction, BIOMASS_yeastGEM, with the SBO term SBO:0000629, it returns 214 reactions (it also matches component pseudoreactions, e.g. the protein pseudoreaction, plus other reactions in which the biomass components are involved). This does not sound like intended behavior to me.
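For reference, here is a minimal sketch of the documented steps quoted above. It uses a plain Python stand-in for reactions rather than memote's or cobra's actual classes; the `Reaction` dataclass, the function name, and the toy metabolite assignments are assumptions made for illustration. Only the reaction IDs and the SBO term come from this report.

```python
from dataclasses import dataclass

BIOMASS_SBO = "SBO:0000629"
BUZZWORDS = ("biomass", "growth", "bof")


@dataclass
class Reaction:
    """Hypothetical stand-in for a model reaction (not cobra's class)."""
    id: str
    sbo: str = ""
    metabolites: tuple = ()  # metabolite IDs taking part in the reaction
    boundary: bool = False


def find_biomass_as_documented(reactions):
    """SBO matches win outright; buzzwords are only a fallback."""
    by_sbo = {r.id for r in reactions if r.sbo == BIOMASS_SBO}
    if by_sbo:
        return by_sbo
    # Fallback step 1: buzzwords in reaction IDs.
    by_id = {r.id for r in reactions
             if any(word in r.id.lower() for word in BUZZWORDS)}
    # Fallback steps 2 and 3: reactions of "biomass" metabolites,
    # with boundary reactions removed from that set.
    by_met = {r.id for r in reactions
              if not r.boundary
              and any("biomass" in m.lower() for m in r.metabolites)}
    # Fallback step 4: union of both sets.
    return by_id | by_met


# Toy model: the metabolite assignments are invented for this example.
model = [
    Reaction("BIOMASS_yeastGEM", sbo=BIOMASS_SBO,
             metabolites=("biomass_c", "protein_c")),
    Reaction("BIOMASS_yeastGEM_PROT", metabolites=("protein_c",)),
    Reaction("GROWTH", metabolites=("biomass_c",), boundary=True),
]
print(sorted(find_biomass_as_documented(model)))  # ['BIOMASS_yeastGEM']
```

Under this reading of the documentation, a model with one SBO-annotated biomass reaction yields exactly one candidate, which is the behavior expected in this report.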

Code Sample

memote report snapshot yeastGEM.xml --solver-timeout 30

Context

System Information
==================
OS          Windows
OS-release  10
Python      3.7.7

Package Versions
================
Jinja2             2.11.2
click              7.1.2
click-configfile   0.2.3
click-log          0.3.2
cobra              0.18.1
cookiecutter       1.7.2
depinfo            1.5.3
equilibrator-api   0.1.26
future             0.18.2
gitpython          3.1.3
goodtables         2.5.0
importlib-resources 3.0.0
lxml               4.5.1
memote             0.11.0
numpydoc           1.1.0
pandas             1.0.5
pip                20.1.1
pylru              1.2.0
pytest             5.4.3
requests           2.24.0
ruamel.yaml        0.16.10
setuptools         47.3.1.post20200622
six                1.15.0
sqlalchemy         1.3.18
sympy              1.6
travis-encrypt     1.1.2
wheel              0.34.2

Midnighter commented 4 years ago

This is partly due to the model definition and partly due to how eagerly memote tries to find the biomass reaction. I have recorded how these components are matched:

BIOMASS_yeastGEM_LIP identifier match by buzzword.
GROWTH growth name match by buzzword.
BIOMASS_yeastGEM match by SBO.
BIOMASS_yeastGEM_PROT identifier match by buzzword.
BIOMASS_yeastGEM_CARB identifier match by buzzword.
BIOMASS_yeastGEM_RNA identifier match by buzzword.
BIOMASS_yeastGEM_DNA identifier match by buzzword.
BIOMASS_yeastGEM_LIPBACK identifier match by buzzword.
BIOMASS_yeastGEM_LIPCHAIN identifier match by buzzword.
BIOMASS_yeastGEM_COFACTOR identifier match by buzzword.
BIOMASS_yeastGEM_ION identifier match by buzzword.
pail_cho_c match by SBO.
biomass_c match by SBO.
dag_hs_r match by SBO.
ergstest_c match by SBO.
fa_c match by SBO.
mip2c_g match by SBO.
ipc_g match by SBO.
lipid_c match by SBO.
mipc_g match by SBO.
ps_cho_c match by SBO.
pchol_cho_c match by SBO.
pe_hs_c match by SBO.
pe_hs_r match by SBO.
tag_cho_c match by SBO.
ergstest_rm match by SBO.
ps_cho_rm match by SBO.
pchol_cho_rm match by SBO.
pe_hs_rm match by SBO.
tag_cho_rm match by SBO.
protein_c match by SBO.
carbohydrate_c match by SBO.
rna_c match by SBO.
dna_c match by SBO.
cer_g match by SBO.
cer_c match by SBO.
mip2c_c match by SBO.
ipc_c match by SBO.
mipc_c match by SBO.
lcb_r match by SBO.
lcb_c match by SBO.
lcbp_r match by SBO.
lcbp_c match by SBO.
pa_rm match by SBO.
pa_c match by SBO.
dag_rm match by SBO.
dag_c match by SBO.
lpi_rm match by SBO.
lpi_c match by SBO.
pg_mm match by SBO.
pg_c match by SBO.
cl_mm match by SBO.
cl_c match by SBO.
c160chain_c match by SBO.
c161chain_c match by SBO.
c180chain_c match by SBO.
c181chain_c match by SBO.
c240chain_c match by SBO.
c260chain_c match by SBO.
lipidbackbone_c match by SBO.
lipidchain_c match by SBO.
cofactor_c match by SBO.
ion_c match by SBO.

As you can see, most of the matches are by SBO term and only a few by 'buzzword'. The main problem is that, in addition to the reactions, we also search through the metabolites (as some models only have a biomass compound within an unspecified reaction). For metabolites we use https://www.ebi.ac.uk/sbo/main/SBO:0000649, which is not very specific, so many metabolites match in the yeast model (the model's annotation is correct here). Then, for each matching metabolite, we add all of its reactions to the list of candidates. That is why so many reactions are found.
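To make the expansion concrete, here is a small sketch of the metabolite step described above, written with plain dictionaries instead of memote's real data structures. The function name, the `r_hypothetical_*` reactions, and the metabolite-to-reaction assignments are invented for illustration; only SBO:0000649 and the metabolite and biomass reaction IDs appear in this thread.

```python
METABOLITE_SBO = "SBO:0000649"  # term used for metabolites, per the comment above


def reactions_via_metabolites(metabolite_sbo, reaction_metabolites):
    """Return every reaction touching a metabolite annotated with METABOLITE_SBO."""
    flagged = {met for met, sbo in metabolite_sbo.items()
               if sbo == METABOLITE_SBO}
    return {rxn for rxn, mets in reaction_metabolites.items()
            if flagged & set(mets)}


# Toy annotations: "" marks a metabolite without the biomass-related term.
metabolite_sbo = {
    "biomass_c": METABOLITE_SBO,
    "protein_c": METABOLITE_SBO,
    "lipid_c": METABOLITE_SBO,
    "atp_c": "",
}
# Toy stoichiometry, invented for this example.
reaction_metabolites = {
    "BIOMASS_yeastGEM": ["biomass_c", "protein_c", "lipid_c"],
    "BIOMASS_yeastGEM_PROT": ["protein_c"],
    "r_hypothetical_lipid_use": ["lipid_c", "atp_c"],
    "r_hypothetical_atp": ["atp_c"],
}
candidates = reactions_via_metabolites(metabolite_sbo, reaction_metabolites)
print(sorted(candidates))
# ['BIOMASS_yeastGEM', 'BIOMASS_yeastGEM_PROT', 'r_hypothetical_lipid_use']
```

Even in this four-reaction toy, a single annotated component such as `lipid_c` pulls in a reaction that has nothing to do with the biomass objective, which is the same effect that inflates the candidate list in the real model.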

1. One possible change in logic is that if we have found a reaction by SBO, we should assume that the model is well annotated and use only that reaction (or multiple with SBO:0000629). This is the quickest change that we can implement.
2. Another possible change is to try and match the name of metabolites with SBO:0000649 to the buzzwords and thus exclude most of them.
3. Any other solution would probably require building a sub network of reactions for split biomass reactions and only use leaf nodes but that's more involved to implement.

If you agree that those are solutions, I can implement 1. and 2. rather quickly.
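A rough sketch of what options 1 and 2 could look like, assuming the candidate sets have already been computed elsewhere. Both function names and their signatures are hypothetical and do not reflect memote's actual code; the buzzword tuple reuses the words from the test's documentation, and the metabolite names in the usage lines are invented.

```python
BUZZWORDS = ("biomass", "growth", "bof")


def select_candidates(sbo_reactions, metabolite_reactions, buzzword_reactions):
    """Option 1: if any reaction carries the biomass SBO term, use only those."""
    if sbo_reactions:
        return set(sbo_reactions)
    # Otherwise keep the existing, more eager behavior.
    return set(metabolite_reactions) | set(buzzword_reactions)


def filter_metabolites_by_buzzword(metabolites, buzzwords=BUZZWORDS):
    """Option 2: keep SBO-matched metabolites only if ID or name has a buzzword.

    `metabolites` maps metabolite ID to its name.
    """
    return {
        met_id for met_id, name in metabolites.items()
        if any(word in met_id.lower() or word in name.lower()
               for word in buzzwords)
    }


# Usage with toy data (names invented for this example).
print(select_candidates({"BIOMASS_yeastGEM"},
                        {"BIOMASS_yeastGEM_PROT"},
                        {"GROWTH"}))  # {'BIOMASS_yeastGEM'}
print(filter_metabolites_by_buzzword(
    {"biomass_c": "biomass", "protein_c": "protein", "lipid_c": "lipid"}
))  # {'biomass_c'}
```

Option 1 leaves poorly annotated models on the current code path, so it only changes results for models that already carry SBO:0000629 on a reaction.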

BenjaSanchez commented 4 years ago

@Midnighter thanks for getting back to me. I think solution 1 would indeed be the simplest one, and it would certainly solve our issue (we have just one reaction with SBO:0000629). Additionally, it would make the test more consistent with its documentation :)
