klamt-lab / straindesign

StrainDesign is a python package for the computational design of metabolic networks and based on COBRApy
Apache License 2.0
36 stars 6 forks source link

StrainDesign compatibility issues with enzyme-constrained model #18

Open JakubChromy opened 9 months ago

JakubChromy commented 9 months ago

Hi, StrainDesign has worked very well for me when used on basic stoichiometric genome-scale models (e.g. iML1515). However, I am encountering problems when I try to use it on an enzyme-constrained model, specifically eciML1515 from this ECMpy study https://doi.org/10.3390/biom12010065. Unlike standard cobrapy library functions, StrainDesign seems to not be able to pick up on the additional enzyme constraints included in the model and, as a result, the sd.fba, sd.fva and sd.plot_flux_space functions simulate flux distributions identical to those obtained with the basic stoichiometric model, ignoring the enzyme constraints.

I have not been able to overcome this issue on my own. Looking into the source code, it seems to me these constraints get ignored during the "A_eq_base = sparse.csr_matrix(create_stoichiometric_matrix(model))" step within the fva, fba and yopt functions in the straindesign.lptools module, but I might be wrong. I tried providing the enzyme constraints obtained from "model.constraints" directly into the constraints parameter within the "sd.fba(model, constraints=...)" function, all in a straindesign-friendly format (e.g. '0.00860581989817056 13PPDH2 + 0.00805572834477629 13PPDH2_reverse + 0.00134590026565838 23PDE2pp + ... <= 0.227'), but that didn't work either.

It would be great if there was a way to consider additional enzyme-constraints in straindesign simulations and strain optimization methods. # Below, I included a simple test to reproduce the different treatment of enzyme-constrained model eciML1515 by standard cobrapy and by straindesign, which can be launched from the ECMpy repository folder here https://github.com/tibbdc/ECMpy/tree/master. # import straindesign as sd import cobra import sys sys.path.append(r'./code/') from cobrapy_ec_model_function import *

#load enzyme-constrained model json_model_path="./model/iML1515_irr_enz_constraint_adj_round2.json" enz_model=get_enzyme_constraint_model(json_modelpath) **#load stoichiometric model_** norm_model=cobra.io.json.load_json_model(json_model_path)

#FBA via cobra - differentiates between enzyme-constrained and basic model fba_solcobra_enz = enz_model.optimize() print(fba_solcobra_enz.fluxes['BIOMASS_Ec_iML1515_core_75p37M'], fba_solcobra_enz.fluxes['EX_ac_e']) fba_sol__cobra_norm = norm_model.optimize() print(fba_solcobra_norm.fluxes['BIOMASS_Ec_iML1515_core_75p37M'], fba_sol__cobra_norm.fluxes['EX_ac_e'])

**#0.6802441208853646 7.711579625195041

0.8697726420320148 0.0**

#FBA via straindesign - simulates enzyme-constrained model identically to basic model fba_solsd_enz = sd.fba(enz_model) print(fba_solsd_enz.fluxes['BIOMASS_Ec_iML1515_core_75p37M'], fba_solsd_enz.fluxes['EX_ac_e']) fba_sol__sd_norm = sd.fba(norm_model) print(fba_solsd_norm.fluxes['BIOMASS_Ec_iML1515_core_75p37M'], fba_sol__sd_norm.fluxes['EX_ac_e'])

**#0.869772642032 0.0

0.869772642032 0.0**

#straindesign production envelopes also look identical for both enzyme-constrained and basic model sd.plot_flux_space(enz_model,('BIOMASS_Ec_iML1515_core_75p37M','EX_ac_e')); sd.plot_flux_space(norm_model,('BIOMASS_Ec_iML1515_core_75p37M','EX_ac_e'));

VonAlphaBisZulu commented 9 months ago

Thank you very much for the details. This was very helpful for finding fixing the issue.

### === issue ===
## Add enzyme constraint to StrainDesign unchanged (issue)
enzyme_constraint = [c for c in enz_model.constraints if c.ub>0.1][0]
coeff_dict = {k.name : v for k,v in enzyme_constraint.expression.as_coefficients_dict().items() if hasattr(k,'name')}
rhs = enzyme_constraint .ub

fba_sol__sd_enz_1 = sd.fba(enz_model, constraints=[coeff_dict,'<=',rhs])

print("upper bound enzyme constraint: "+str(rhs))
print("solution value enzyme constraint: ",sum([fba_sol__sd_enz_1.fluxes[k]*v for k,v in coeff_dict.items()]))
print("Growth rate: "+str(fba_sol__sd_enz_1.fluxes["BIOMASS_Ec_iML1515_core_75p37M"]))

# upper bound enzyme constraint: 0.227
# solution value enzyme constraint: 0.226999999999998
# Growth rate: 0.8697726420320167

# Is there any term in the enzyme expression constraint that produces a negative value?
[print(k+": "+str(v)+" produces: "+ str(coeff_dict[k]*v)) for k,v in fba_sol__sd_enz.fluxes.items() if k in coeff_dict and coeff_dict[k]*v <0]

# SHK3Dr_reverse: -999.6685131204414 produces: -2.11611977448844

The problem is that reactions may get a negative sign and "produce" enzymes. The code below should help you fix this. Until enzyme constraints are integrated in StrainDesign, just use this.

### === fix ===
## Add constraint to StrainDesign, but with negative coefficients for reverse reactions
coeff_dict = {k.name : v for k,v in enzyme_constraint.expression.as_coefficients_dict().items() if hasattr(k,'name')}
coeff_dict2 = {k: (-v if enz_model.reactions.get_by_id(k).upper_bound <= 0 else v) for k, v in coeff_dict.items()}
rhs = enzyme_constraint.ub

fba_sol__sd_enz_2 = sd.fba(enz_model, constraints=[coeff_dict2,'<=',rhs])

print("upper bound enzyme constraint: "+str(rhs))
print("solution value enzyme constraint: ",sum([fba_sol__sd_enz_2.fluxes[k]*v for k,v in coeff_dict.items()]))
print("Growth rate: "+str(fba_sol__sd_enz_2.fluxes["BIOMASS_Ec_iML1515_core_75p37M"]))

# upper bound enzyme constraint: 0.227
# solution value enzyme constraint:  0.227000000000000
# Growth rate: 0.6802441208853683
VonAlphaBisZulu commented 9 months ago

Something is going on between ECMpy and cobrapy.

  1. ECMpy splits reactions into forward and backward reactions and makes sure boundaries are positive, all fine.
  2. ECMpy uses positive-only coefficients for the enzyme constraint, which is also great.
  3. At some point, reversible reactions get flipped and become negative in cobrapy. I don't know why. This leads to optlang flipping them once again, but not keeping them in the enzyme constraint. So for every reaction, there are now 4 reactions in optlang: (1) forward (in enz. constraints), (2) reverse of forward (not in enz. constr.) (3) reverse with negative bounds (in enzyme constraint, but bounds fixed to zero in optlang, because the cobrapy bounds were negative) (4) the reverse of the reverse (that has the right bounds but is not in the enzyme constraint)
  4. You end up with a model that has somewhat chaotic bounds. And two problems - 1 for StrainDesign, 1 for ECMpy.

For StrainDesign, we keep the reactions from cobrapy (1) and (3) with correct bounds. Since the reverse reaction has a negative value, reactions (3) can "buy" us enzyme. We can fix this by using a negative enzyme coefficient for the reverse sense, as suggested above.

For ECMpy, the reactions kept are (1) and (4). This does not buy us enzyme, but we get reaction (4) for free. I have raised an issue on their github. (https://github.com/tibbdc/ECMpy/issues/1#issue-2043064996)

For now, I will probably wait to see how they fix this. I advise you to use the solution I posted before, since you won't get the reverse reactions for free. Should be more accurate.

VonAlphaBisZulu commented 9 months ago

If you want to use ECMpy with StrainDesign, also try my fork of ECMpy: https://github.com/VonAlphaBisZulu/ECMpy/tree/master which implements enzyme constraints as metabolites and reactions, mathematically equivalent to ECMpy

JakubChromy commented 9 months ago

Hi Philipp, thank you very much for your extremely prompt response and help. I now have got the chance to test out both of your suggested solutions:

  1. including the enzyme constraints explicitly as constraints=[coeff_dict2,'<=',rhs] works well with sd.fba(), sd.fva() and sd.plot_flux_space() functions, however, it breaks sd.compute_strain_designs() when this constraint is included during the definition of optimization problem module within sd.SDModule() function. It seems that during sd.compute_strain_designs() workflow, the initial FVA(s) to identify essential reactions and model compression steps work fine, but the error occurs during the FVA(s) in compressed model to identify essential reactions step - it appears the error is due to some of the reaction IDs figuring in the constraints=[coeff_dict2,'<=',rhs] being no longer present in the final, compressed model.

  2. your ECMpy code modification (i.e. including enzyme constraints as one extra reaction and metabolite) overcomes even this issue and sd.SDModule() runs properly.

I haven't encountered any other problems, so thank you once again for your massive help with this and the working solutions. Best wishes, Jakub

VonAlphaBisZulu commented 8 months ago

Hi Jakub, that's great!

  1. Would you mind sharing a code snippet that helps me to reproduce the constraint-compression issue? Constraints are usually compressed alongside the model, and this should be error-free. If there is an issue, I should definitely take a look.

  2. I offered my suggestions to the ECMpy people. They revised the issue, and it seems to work now, but they haven't added the pseudo reaction and metabolite. Some background story: After suggesting my changes to ECMpy, they told me that ECMpy was discontinued, and I should use ECMpy2.0 instead, which ended up having the same issue. Thus, I made my suggestions for ECMpy2.0. Then they merged ECMpy2.0 into ECMpy and removed the ECMpy2.0 from GitHub, so I raised the issue again for the updated ECMpy. They ended up fixing it there, but didn't implement it as pseudo reactions and metabolites. The changes needed for that are in https://github.com/VonAlphaBisZulu/ECMpy, and they might still accept the pull request and merge in the future. If they don't, feel free to use my repository for your computation. Everything else there is identical to the most recent version of ECMpy. (Or continue to use the version that seems to work for you)

JakubChromy commented 8 months ago

Hi Philipp, thank you for the information about the ECMpy update. I now tried using StrainDesign with the updated ECMpy code and model, but it seems to have the same original problem of not registering the enzyme constraints. So for now I'm going to stick to using your repository with the modified original ECMpy code that already works for me.

Below is a code that should reproduce the error ("ValueError: 'HMPK1_num1' is not in list") I'm getting in the fix scenario 1 discussed above (i.e. using constraints=[coeff_dict2,'<=',rhs]):

import cobra
import straindesign as sd
import logging
import sys
sys.path.append(r'./ECMpy_files/code/') ### CHANGE AS NEEDED
from cobrapy_ec_model_function import *

### Load enzyme-constrained model
json_model_path="./ECMpy_files/model/iML1515_irr_enz_constraint_adj_round2.json" ### CHANGE AS NEEDED
enz_model=get_enzyme_constraint_model(json_model_path)

### Add constraint to StrainDesign, but with negative coefficients for reverse reactions
enzyme_constraint = [c for c in enz_model.constraints if c.ub>0.1][0]
coeff_dict = {k.name : v for k,v in enzyme_constraint.expression.as_coefficients_dict().items() if hasattr(k,'name')}
coeff_dict2 = {k: (-v if enz_model.reactions.get_by_id(k).upper_bound <= 0 else v) for k, v in coeff_dict.items()}
rhs = enzyme_constraint.ub

fba_sol = sd.fba(enz_model, constraints=[coeff_dict2,'<=',rhs])

print("upper bound enzyme constraint: "+str(rhs))
print("solution value enzyme constraint: ",sum([fba_sol.fluxes[k]*v for k,v in coeff_dict.items()]))
print("Growth rate: "+str(fba_sol.fluxes["BIOMASS_Ec_iML1515_core_75p37M"]))

# upper bound enzyme constraint: 0.227
# solution value enzyme constraint:  0.227000000000000
# Growth rate: 0.6802441208853683

### SETTING ALL THE PARAMETERS
# Calculate maximum growth rate
sol1 = sd.fba(enz_model, constraints=[coeff_dict2,'<=',rhs], obj='BIOMASS_Ec_iML1515_core_75p37M', obj_sense='max')
max_growth_rate = sol1.objective_value
print(f"Maximum possible growth rate: {max_growth_rate}.")
print(f"Acetate synthesis rate at maximum growth rate: {sol1.fluxes['EX_ac_e']}.")
# Calculate maximum acetate production
sol2 = sd.fba(enz_model, constraints=[coeff_dict2,'<=',rhs], obj='EX_ac_e', obj_sense='max')
max_production_rate = sol2.objective_value
print(f"Maximum possible acetate synthesis rate: {max_production_rate}.")
print(f"Growth rate at maximum possible acetate synthesis rate: {sol2.fluxes['BIOMASS_Ec_iML1515_core_75p37M']}.")
# Set protect thresholds
growth_protect_threshold = 0.1   # = 10% of max growth rate
production_protect_threshold = 0.01   # = 1% of max production rate

### DEFINE OPTKNOCK MODULE/OPTIMIZATION PROBLEM
module_optknock = sd.SDModule(enz_model,sd.names.OPTKNOCK,
                             inner_objective='BIOMASS_Ec_iML1515_core_75p37M',
                             outer_objective='EX_ac_e',  
                             constraints=[[{'BIOMASS_Ec_iML1515_core_75p37M':1},'>=',max_growth_rate*growth_protect_threshold],
                                          [{'EX_ac_e':1},'>=',max_production_rate*production_protect_threshold], 
                                          [coeff_dict2,'<=',rhs]
                                         ])  

### RUN OPTKNOCK
gko_cost = {g.id:1 for g in enz_model.genes}
gko_cost.pop('s0001')    # removing spontaneous reactions' "gene" from the list of allowed gene KOs
logging.basicConfig(level=logging.INFO)
sols = sd.compute_strain_designs(enz_model,
                                 sd_modules = module_optknock, 
                                 time_limit = 3600,
                                 max_cost = 30,
                                 gko_cost = gko_cost,
                                 gene_kos = True,   # CSO is based on gene KOs
                                 solution_approach = sd.names.ANY   
                                )
VonAlphaBisZulu commented 8 months ago

Thanks, this is very helpful. Hopefully, I can fix this some time soon.

ASchooneveld commented 7 months ago

Hello both :) thank you very much for this elucidating discussion on the issue. I've been thinking about the second fix you proposed (where you implement enzyme constraints as metabolites and reactions).

I have been wondering, a key motivating factor in the development of ECMpy is the fact that for other enzyme constraint models like GECKO "introducing the enzyme constraints into the original metabolic model using GECKO needs to be extensively revised by modifying every metabolic reaction with a pseudo-metabolite representing an enzyme and adding hundreds of exchange reactions for enzymes, which is complex and significant increases the model size." (quote from the paper). The idea was that: "we propose a simpler workflow called ECMpy by explicitly introducing an enzyme constraint without modifying existing metabolic reactions or adding new reactions". My question is, am I right in thinking that this workaround essentially undoes that advantage? And make this method more like GECKO (probably specifically GECKO light https://www.nature.com/articles/s41596-023-00931-7 )?

Thanks for any help, Anna

VonAlphaBisZulu commented 7 months ago

Hey Anna,

That is a great question. TLDR: Proposed fix 2 is equivalent to GECKO/ECM, and the key difference is merely a technical one. Performance differences should not exist.

Here is why:

"hundreds of exchange reactions for enzymes" Your quote refers to an implementation of enzyme constraints with n enzymes, that would require n pseudo metabolites and n pseudo reactions. Let's break it down with an example of two reactions ICL and PFL (numbers are completely made up): ICL: icit_c → glx_c + succ_c PFL: coa_c + pyr_c → accoa_c + for_c

If the reactions are enzyme constraint, these reactions 'consume' the enzymes (icl and pfl) and then read:

ICL: 2 enz_icl + icit_c → glx_c + succ_c PFL: 3 enz_pfl + coa_c + pyr_c → accoa_c + for_c These enzymes are produced by reactions R_enz_icl: → enz_icl R_enz_pfl: → enz_pfl And the total amount of enzyme produced is capped by: 5 R_enz_icl+ 4 R_enz_pfl+ ... ≤ 10

so that's n more metabolites and n more reactions and 1 more inequality

Here, the magic of GECKO and ECM comes in. They combine all coefficients reaction-wise and only introduce a single inequality to the model, which represents the enzyme-availability. The pre-determined factors that are multiplied to the reactions, take into account the enzyme cost to keep the reaction running per unit of reaction rate (I'm not familiar with the details). The inequality then reads: 2.5 ICL + 1.33 PFL ≤ 10 If the model is represented in LP-form that (kinda) reads:

          ICL    PFL  
      Mass balance: S * v = 0         
coa_c      0     -1            =  0
pyr_c      0     -1            =  0
accoa_c    0      1            =  0
for_c      0      1            =  0
icit_c    -1      0      * v   =  0
glx_c      1      0            =  0
succ_c     1      0            =  0
      Flux bounds lb ≤ v ≤ ub
lb_icl     1      0            ≥  0
lb_pfl     0      1            ≥  0
lb_pfl     1      0            ≤  inf
lb_pfl     0      1            ≤  inf
   Additional constraint A*v ≤ b
enz_const  2.5    1.33         ≤ 10

S: stoichiometric matrix
v: flux vector
lb, ub: upper and lower flux bound
A: Matrix used for the inequality

Why implement the enzyme-constraint as a pseudo-reaction instead of a constraint: Traditionally, constraint-based models consist only of two parts, mass balance and flux bounds. Most of the COBRA methods conform with this paradigm. The support of constraints has been introduced with optlang, something like an optimization-backend for COBRA that is not straightforward to use and which is ignored by StrainDesign and other packages because it is so deeply interwoven with solving models and not just their mathematical representation. ECMpy, however, decided to use it, which led to the various problems we discussed and resolved. To comply with the COBRA standard, I proposed an implementation of the inequality constraint through a pseudo metabolite and a pseudo reaction. The additional constraint almost reads like a metabolite already, so not too much to adapt. In place of the inequality, you introduce a metabolite 'enz' that reads almost exactly like the enzyme inequality, and a supply reaction 'R_enz' whose boundaries simulate the enzyme cap. I then reads:

          ICL    PFL    R_enz
      Mass balance: S * v = 0         
coa_c      0     -1      0          =  0
pyr_c      0     -1      0          =  0
accoa_c    0      1      0          =  0
for_c      0      1      0          =  0
icit_c    -1      0      0    * v   =  0
glx_c      1      0      0          =  0
succ_c     1      0      0          =  0
enz       -2.5   -1.33   1          =  0
      Flux bounds lb ≤ v ≤ ub
lb_icl     1      0      0          ≥  0
lb_pfl     0      1      0          ≥  0
lb_r_enz   0      0      1          ≥ -inf
ub_pfl     1      0      0          ≤  inf
ub_pfl     0      1      0          ≤  inf
ub_r_enz   0      0      1          ≤  10

Yes, they look different, for sure. But since inequalities, like the one used by ECM, are often resolved with slack variables by LP solvers (which read just like an unbound variable in the LP matrix), complexity should not change. You actually don't add any real vertices or degrees of freedom, and after preprocessing by the LP solver, the matrices of both problems might even look 100% identical. Even if the solver fails to iron it out, a genome-scale model with one more reaction should be still fine. Definitely a price I'm willing to pay to make EC models usable within the full COBRA ecosystem.

I still have a pull request open (https://github.com/tibbdc/ECMpy/pull/5) that I hope the ECM people accept. That doesn't seem to happen anytime soon. So if you'd like to use ECM with StrainDesign, feel free to use my fork of ECM (https://github.com/VonAlphaBisZulu/ECMpy) that has is implemented that way.

ASchooneveld commented 7 months ago

Thank you so much for that detailed explanation!