opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0

Extracting fluxes using slim_optimize #942

Closed jccvila closed 1 year ago

jccvila commented 4 years ago

I thought I would share an extension to slim_optimize that has proven helpful for what I suspect is a fairly common task when performing FBA. I don't know if this is redundant, and I am certain there are ways to speed this up, but I can't find anything similar in the existing cobrapy code.

As it currently stands, there are two ways to run an optimization in cobrapy:

  1. model.slim_optimize(), which is extremely fast because it only returns the objective value and does not extract the flux through every reaction.
  2. model.optimize(), which returns everything (fluxes, shadow prices, etc.) but is much slower, because get_solution() takes time to retrieve the data from the solver.

It would be helpful to have something in between, for example when we only want the fluxes through a defined subset of reactions, such as the exchange reactions. This is especially important when we do repeated rounds of optimization, each time extracting the same subset of fluxes (say, repeatedly optimizing the same model while changing the bounds on one reaction). This is what I came up with as an extension to slim_optimize:

import optlang
from pandas import Series
from cobra.util.solver import assert_optimal


def slim_optimize_v2(self, error_value=float('nan'), message=None, rxn_id=None):
    """Optimize the model without creating a solution object.

    Creating a full solution object implies fetching shadow prices and
    flux values for all reactions and metabolites from the solver
    object. This necessarily takes some time, especially in cases where
    only a subset of values are of interest. This function accepts a
    list of variable ids for which the flux (primal value) is of
    interest and uses efficient means to fetch only those values.

    Parameters
    ----------
    error_value : float, None
        The value to return if optimization failed due to e.g.
        infeasibility. If None, raise `OptimizationError` if the
        optimization fails.
    message : string
        Error message to use if the model optimization did not succeed.
    rxn_id : iterable
        An iterable of `cobra.Reaction` variable ids (.id and .reverse_id).
    """
    self.solver.optimize()
    if self.solver.status == optlang.interface.OPTIMAL and rxn_id is None:
        return self.solver.objective.value
    elif self.solver.status == optlang.interface.OPTIMAL and rxn_id is not None:
        rxn_variables = [self.variables._dict[i] for i in rxn_id]
        fluxes = [variable.primal for variable in rxn_variables]
        return Series(index=rxn_id, data=fluxes)
    elif error_value is not None:
        return error_value
    else:
        assert_optimal(self, message)

A few benchmarks

For the E. coli model, if you want the flux through every reaction, slim_optimize_v2() is slightly faster than optimize() (0.046 s vs. 0.07 s), but it might still be worth using model.optimize() because it combines the forward and reverse primal values.
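The forward/reverse combination that optimize() performs can be sketched in plain Python (toy primal values for illustration, not cobrapy internals):

```python
# cobrapy represents each reaction as a forward and a reverse solver
# variable; the reported net flux is forward primal minus reverse primal.
forward_primals = [10.0, 0.0, 3.5]  # toy values
reverse_primals = [0.0, 2.0, 0.5]

net_fluxes = [f - r for f, r in zip(forward_primals, reverse_primals)]
print(net_fluxes)  # [10.0, -2.0, 3.0]
```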

Testing code:

#default
model = cobra.io.read_sbml_model('../Data/iJO1366.xml')
model.optimize()  # The first optimization always takes the longest, so discount that effect
model.reactions.EX_ac_e.lower_bound = -10.0  # Change bound to new environment for testing
rxn_id = [rxn.id for rxn in model.reactions] + [rxn.reverse_id for rxn in model.reactions]
start = time.process_time()
solution = model.optimize()
print(time.process_time() - start)

#slim optimize v2
model = cobra.io.read_sbml_model('../Data/iJO1366.xml')
model.optimize()  # The first optimization always takes the longest, so discount that effect
model.reactions.EX_ac_e.lower_bound = -10.0  # Change bound to new environment for testing
rxn_id = [rxn.id for rxn in model.reactions] + [rxn.reverse_id for rxn in model.reactions]
start = time.process_time()
solution = slim_optimize_v2(model, rxn_id=rxn_id)
print(time.process_time() - start)

For the E. coli model, if you want only a subset of the fluxes (for example, all exchange reactions), then it's much faster (0.008 s):

#Extension: exchanges only
model = cobra.io.read_sbml_model('../Data/iJO1366.xml')
model.optimize()  # The first optimization always takes the longest, so discount that effect
model.reactions.EX_ac_e.lower_bound = -10.0  # Change bound to new environment for testing
rxn_id = [rxn.id for rxn in model.exchanges] + [rxn.reverse_id for rxn in model.exchanges]
start = time.process_time()
solution = slim_optimize_v2(model, rxn_id=rxn_id)
print(time.process_time() - start)

If someone has a quicker or more elegant way of doing this, let me know, as it would be very helpful. Otherwise, I hope this is helpful in some way.

Midnighter commented 4 years ago

If you use the get_solution function (it's admittedly hidden pretty deeply), you can extract fluxes for exactly the reactions and/or metabolites that you want. Maybe take a look if that serves your purpose.

On a side note: when you benchmark, you should use the timeit functionality. It will run your code multiple times and give you average values. IPython and Jupyter expose this in the very convenient magic function %timeit and cell magic %%timeit.
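Outside IPython, the standard-library timeit module gives the same multi-run behaviour; a minimal sketch (the summation is just a stand-in for a call such as model.optimize()):

```python
import timeit

# Run the statement many times and repeat the measurement 5 times;
# the minimum of the repeats is usually the most stable number to compare.
times = timeit.repeat(
    stmt="sum(x * x for x in range(1000))",  # stand-in for the code under test
    repeat=5,
    number=100,
)
print(f"best of 5: {min(times) / 100 * 1e6:.1f} µs per call")
```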

jccvila commented 4 years ago

The problem with the get_solution function is that, on top of spending time extracting dual values and shadow prices (which are unnecessary here), it also runs:

var_primals = model.solver.primal_values 

which gets the primal value for every reaction (digging into optlang's interface.py, you'll see that this involves running):

[variable.primal for variable in self.variables]

which can be less efficient than only collecting the primal values for the subset of reactions you are interested in:

rxn_variables = [model.variables._dict[i] for i in rxn_id]
fluxes = [variable.primal for variable in rxn_variables]
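For intuition, the difference can be mocked with a plain dict of toy variable objects (a hypothetical Variable class, not optlang's): extracting a named subset costs one dict lookup per requested id, regardless of how many variables the model holds:

```python
class Variable:
    """Toy stand-in for an optlang variable with a .primal attribute."""
    def __init__(self, primal):
        self.primal = primal

# A large "model" with many variables...
variables = {f"rxn_{i}": Variable(float(i)) for i in range(10000)}

# ...but we only pay for the lookups we actually need.
wanted = ["rxn_3", "rxn_42", "rxn_9999"]
fluxes = [variables[i].primal for i in wanted]
print(fluxes)  # [3.0, 42.0, 9999.0]
```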

PS: yeah, agreed on %timeit. I was being lazy :).

cdiener commented 4 years ago

Model._get_primal_values is overridden by the specific interfaces in optlang. For instance, for GLPK it uses a low-level C function we added to swiglpk which gets all primals directly (https://github.com/biosustain/optlang/blob/13673ac26f6b3ba37a2ef392489722c52e3c5ff1/optlang/glpk_interface.py#L638 and https://github.com/biosustain/swiglpk/pull/22). When we benchmarked this, simply calculating the differences forward_flux - reverse_flux was actually slower than getting all the primals from the model, so this is probably where your speedup comes from (since your version does not include that step). CPLEX has a similar optimization, but Gurobi does not, which is why getting a solution from Gurobi is the slowest in optlang.

There may be faster ways of getting only a handful of fluxes you are interested in, but it's hard to say at what number the advantage of _get_primal_values kicks in. This also depends on the time needed to build up the data structures or calculate the final fluxes, so I would include those steps in the benchmarks.

jccvila commented 4 years ago

I was using Gurobi, which does not have its own _get_primal_values and so defaults to the one used by optlang. This probably explains the big time difference. In my case, calculating forward_flux - reverse_flux (using numpy) is definitely not the limiting step. I'm not using GLPK because I've been getting some inconsistent solutions with very large models. Below is an updated version which works for both CPLEX and Gurobi and combines forward with reverse flux:

import cobra
import numpy as np
import optlang
from pandas import Series
from cobra.util.solver import assert_optimal


def slim_optimize_v3(self, error_value=float('nan'), message=None, rxn_id_foward=None, rxn_id_reverse=None):
    """Optimize the model without creating a solution object.

    Creating a full solution object implies fetching shadow prices and
    flux values for all reactions and metabolites from the solver
    object. This necessarily takes some time, especially in cases where
    only a subset of values are of interest. This function accepts
    lists of variable ids for which the flux (primal value) is of
    interest and uses efficient means to fetch only those values.

    Parameters
    ----------
    error_value : float, None
        The value to return if optimization failed due to e.g.
        infeasibility. If None, raise `OptimizationError` if the
        optimization fails.
    message : string
        Error message to use if the model optimization did not succeed.
    rxn_id_foward : iterable
        An iterable of forward variable ids (`cobra.Reaction.id`).
    rxn_id_reverse : iterable
        An iterable of reverse variable ids (`cobra.Reaction.reverse_id`).
    """
    self.solver.optimize()
    if self.solver.status == optlang.interface.OPTIMAL and (rxn_id_foward is None or rxn_id_reverse is None):
        return self.solver.objective.value
    elif self.solver.status == optlang.interface.OPTIMAL and rxn_id_foward is not None and rxn_id_reverse is not None:
        if cobra.util.solver.get_solver_name(self) == 'cplex':
            foward_flux = self.solver.problem.solution.get_values(rxn_id_foward)
            reverse_flux = self.solver.problem.solution.get_values(rxn_id_reverse)
        else:
            foward_flux = [self.variables._dict[i].primal for i in rxn_id_foward]
            reverse_flux = [self.variables._dict[i].primal for i in rxn_id_reverse]
        return Series(index=rxn_id_foward, data=np.subtract(foward_flux, reverse_flux))
    elif error_value is not None:
        return error_value
    else:
        assert_optimal(self, message)

For Gurobi, let's do a proper comparison of getting the fluxes through all the exchange reactions using slim_optimize_v3 versus get_solution.

For all tests I start with:

model = cobra.io.read_sbml_model('../Data/iJO1366.xml')
model.optimize()  # The first optimization always takes the longest, so discount that effect
exchange_reactions = model.exchanges
foward_ids = [x.id for x in exchange_reactions]
reverse_ids = [x.reverse_id for x in exchange_reactions]

get_solution version:

%%timeit
model.solver.optimize()
solutions = get_solution(model,reactions = model.exchanges)

84.1 ms ± 4.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

slim_optimize_v3 version

%%timeit
solutions = slim_optimize_v3(model, rxn_id_foward=foward_ids, rxn_id_reverse=reverse_ids)

6.43 ms ± 90.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repeating this for CPLEX, the difference is marginal:

get_solution takes 14.4 ms ± 586 µs; slim_optimize_v3 takes 12.4 ms ± 154 µs.

Any suggestions on how to improve this would be appreciated. I agree that whether this is worth doing more widely will depend on how many fluxes you want to retrieve.

cdiener commented 4 years ago

Reactions carry around their own forward and reverse variables, so iterating over reactions directly and doing something like rxn.forward_variable.primal - rxn.reverse_variable.primal may be faster. There is also rxn.flux, so something like the following might be shorter to write:

import pandas as pd
from cobra.test import create_test_model

mod = create_test_model("ecoli")  # similar size as yours
mod.slim_optimize()
ex_fluxes = pd.Series({r.id: r.flux for r in mod.exchanges})

If you really want the fastest way to get the solution, then GLPK is the way to go. What do you mean by inconsistent solutions? Does that persist if you adjust the solver tolerance, for instance with model.tolerance = 1e-9?

jccvila commented 4 years ago

Thanks for the suggestion. I tested it using the same approach as above, and your solution is actually still slightly slower than slim_optimize_v3():

7.06 ms vs. 21.6 ms for Gurobi; 14 ms vs. 30.8 ms for CPLEX.

I think there are a few reasons for this. One is that rxn.forward_variable.primal is less efficient than model.variables._dict[rxn_id].primal (for Gurobi) and is also less efficient than model.solver.problem.solution.get_values(rxn_id) (for CPLEX). Another possible reason is that some error checks are avoided by directly accessing the primal value from the optlang model interface rather than going through the reaction class.

As for GLPK, the inconsistency arises in a very specific context, and I haven't yet been able to nail down the source. Basically, for an extremely large model, GLPK (when run iteratively in the context manager) is not able to reach a solution that Gurobi and CPLEX can, regardless of whether the tolerance is 1e-7 or 1e-9. You can reproduce the error by running the code below and noting that len(U_mets2) is much shorter than len(U_mets) when they should be the same. Sorry I couldn't supply a simpler version, but that's basically why I've been focusing on CPLEX or Gurobi. The input model is the universal model from CarveMe, found here: https://github.com/cdanielmachado/carveme/blob/master/carveme/data/generated/universe_bacteria.xml.gz. I have not been able to reproduce this with a smaller model such as the E. coli model.

%%capture
#CarveMe models don't include charge in the right location and so give a bunch of warnings, which I prefer to suppress
#Load models
def set_minimal_media(mod):
    default_exchanges = ['EX_' + x for x in ['ca2_e', 'cl_e', 'cobalt2_e', 'cu2_e', 'fe2_e', 'fe3_e', 'h_e', 'h2o_e', 'k_e', 'mg2_e', 'mn2_e', 'mobd_e', 'nh4_e',
                         'o2_e', 'pi_e', 'so4_e', 'zn2_e']]
    for k in mod.exchanges:
        if k.id in default_exchanges:
            mod.reactions.get_by_id(k.id).lower_bound = -1000.0
        else:
            mod.reactions.get_by_id(k.id).lower_bound = 0.0
    return None

def irreversabilize(self):
    '''Convert the model into an equivalent model where all non demand/sink reactions are forward only
    (i.e. lower bound of 0, upper bound of 1000).
    '''
    for i in range(0, len(self.reactions)):
        if self.reactions[i].lower_bound < -999:
            self.reactions[i].lower_bound = -1000.0
        if self.reactions[i].upper_bound > 999:
            self.reactions[i].upper_bound = 1000.0
        # Only split reversible, non-boundary, non-objective reactions
        if not (self.reactions[i].lower_bound < 0 and len(self.reactions[i].metabolites) != 1 and self.reactions[i].objective_coefficient == 0):
            continue

        # Copy the reversible reaction and split it
        temp_rxn = self.reactions[i].copy()
        self.reactions[i].lower_bound = 0

        # Change the bounds of the new reaction
        temp_rxn.upper_bound = -1 * temp_rxn.lower_bound
        temp_rxn.lower_bound = 0

        # Flip the metabolite signs of the new reaction
        new_mets = {k: -v for (k, v) in temp_rxn.metabolites.items()}
        temp_rxn.subtract_metabolites(temp_rxn.metabolites)
        temp_rxn.add_metabolites(new_mets)

        # Add "_rev" to the new reaction's id and add it to the model
        temp_rxn.id = temp_rxn.id + '_rev'
        self.add_reactions([temp_rxn])

#Measure growth rate with each exchange reaction as a carbon source using Gurobi
U = cobra.io.read_sbml_model('../Data/universe_bacteria.xml.gz')
U.solver = 'gurobi'
irreversabilize(U)  # Convert model into an equivalent one where all metabolic reactions are forward only
U.tolerance = 1e-7
set_minimal_media(U)
U_mets = []
for i in U.exchanges:
    with U as temp_U:
        temp_U.reactions.get_by_id(i.id).lower_bound = -10.0
        if temp_U.slim_optimize() > 0.01:
            U_mets.append(i.id)

#Repeat using GLPK
U = cobra.io.read_sbml_model('../Data/universe_bacteria.xml.gz')
U.solver = 'glpk'  # I found this solver to be much faster than the commercial ones
irreversabilize(U)  # Convert model into an equivalent one where all metabolic reactions are forward only
U.tolerance = 1e-7
set_minimal_media(U)
U_mets2 = []
for i in U.exchanges:
    with U as temp_U:
        temp_U.reactions.get_by_id(i.id).lower_bound = -10.0
        if temp_U.slim_optimize() > 0.01:
            U_mets2.append(i.id)
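As a side note, the splitting step in irreversabilize can be sketched in isolation with plain dicts (hypothetical helper name and toy stoichiometry, not cobrapy API): the reverse copy gets negated metabolite coefficients and an upper bound equal to the magnitude of the original lower bound.

```python
def split_reversible(stoich, lb, ub):
    """Toy split of a reversible reaction (lb < 0 < ub) into two
    forward-only reactions, mirroring what irreversabilize() does."""
    forward = {"stoich": dict(stoich), "lb": 0.0, "ub": ub}
    reverse = {
        # Negate every metabolite coefficient for the reverse direction
        "stoich": {met: -coef for met, coef in stoich.items()},
        "lb": 0.0,
        "ub": -lb,  # reverse capacity = magnitude of the original lower bound
    }
    return forward, reverse

fwd, rev = split_reversible({"A": -1.0, "B": 1.0}, lb=-1000.0, ub=1000.0)
print(fwd["ub"], rev["ub"])  # 1000.0 1000.0
print(rev["stoich"])         # {'A': 1.0, 'B': -1.0}
```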

cdiener commented 4 years ago

I think this would have to be tested with larger models, since in those ms ranges you are mostly benchmarking attribute lookups of Python objects. Reaction.flux also uses model.variables[rxn_id], so the _dict access just circumvents a single attribute lookup. Reaction.flux will also be slightly slower since it checks the validity of the solution (i.e., whether the model has been modified since the last solve), which generates a slight overhead. Caching indices over multiple calls, so you don't have to look up reaction ids each time, might be even faster.
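A toy sketch of that caching idea (hypothetical names, not cobrapy API): resolve the id-to-variable lookups once, then reuse the object references across repeated solves.

```python
class Variable:
    """Toy stand-in for a solver variable with a .primal attribute."""
    def __init__(self):
        self.primal = 0.0

variables = {f"rxn_{i}": Variable() for i in range(1000)}
wanted_ids = ["rxn_1", "rxn_7"]

# Pay the id -> variable lookup cost once, before the loop...
cached = [variables[i] for i in wanted_ids]

results = []
for step in range(3):  # stand-in for repeated optimizations
    for i, var in enumerate(cached):
        var.primal = float(step + i)  # stand-in for the solver updating primals
    # ...then read primals straight from the cached references
    results.append([v.primal for v in cached])
print(results)  # [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
```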

As for your example code, there is a lot of stuff going on. First, in cobrapy the actual model solved by the solver is already irreversible (this is why we have forward and reverse fluxes). Your code actually just duplicates all reactions but retains the same model (now with redundant rows in the stoichiometric matrix). Solvers tend to eliminate those, so this might create some problems that could result in your observed discrepancy. Also, the Gurobi solver in cobrapy currently has a bug where changing the ID may not work, so this might be another problem here. I checked the CarveMe model with some common workflows where one may expect unique solutions, and they were always within solver tolerance between GLPK and Gurobi, so I would suspect the problem lies in the model modification.

Additionally, checking against an exact optimal cutoff like 0.01 will lead to somewhat arbitrary list lengths. What happens if the optimal value is 0.01 + 1e-10 in Gurobi and 0.01 - 1e-10 in GLPK? The solver tolerance is larger than that difference, so those results would not be incompatible, but they would lead to different lengths of the two lists.
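The cutoff problem is easy to demonstrate with plain floats: two optima that agree to within solver tolerance can still land on opposite sides of a strict cutoff (the solver values below are hypothetical):

```python
tolerance = 1e-7
cutoff = 0.01

glpk_value = cutoff - 1e-10    # hypothetical GLPK optimum
gurobi_value = cutoff + 1e-10  # hypothetical Gurobi optimum

# Strict comparison: the two solvers disagree...
print(glpk_value > cutoff, gurobi_value > cutoff)  # False True

# ...even though the difference is far below solver tolerance.
print(abs(glpk_value - gurobi_value) < tolerance)  # True

# A tolerance-aware test classifies both the same way:
print(glpk_value > cutoff - tolerance, gurobi_value > cutoff - tolerance)  # True True
```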

cdiener commented 1 year ago

This has mostly been improved in optlang. model.solver.primal_values is now guaranteed to use the fastest strategy compatible with cobrapy.