opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0
460 stars 217 forks source link

Elementary Flux Modes and Minimum Cut Sets #293

Closed pstjohn closed 1 year ago

pstjohn commented 7 years ago

So I'll use this issue to kick off the discussion on EFMs and MCSs

The two methods I've had the most luck with in calculating EFMs and MCSs are efmtool, doi:10.1093/bioinformatics/btn401, and mhsCalculator, doi:10.1186/1471-2105-14-318.

efmtool is distributed under the Simplified BSD Style License in two formats, a compiled matlab extension (which includes java .jar files which can be run from the command line), and the java source code. It requires java 1.7. Also worth noting are the more recent modifications of efmtool by Christian Jungreuthmayer, including tEFMA and [regEfmtool)[http://www.biotec.boku.ac.at/en/arbeitsgruppenresearch-groups/research-group-mattanovich-gasser-sauer/staff/associated-group-metabolic-modelling/regefmtool/].

Is anyone on the cobrapy team familiar at all with java? (I'm not 😄 ) My current solution is a fairly quick and dirty wrapper that writes input files from a cobrapy model to a temporary directory, calls the compiled java code, and reads out the output files into a pandas dataframe. Its reliable enough for what I've used it for, but I'm not sure how that would best fit into the general cobrapy ecosystem.

mhsCalculator is a set of C codes which function similarly to efmtool, processing input reaction + stoichiometry files and outputting the minimum cut sets. This is also under the Simplified BSD License.

The wrapper functions seem to work fine, but I think the issues will come down to either providing install directions for the third-party tools or appropriately packaging the different code libraries. (And whether or not this is worth including in the core library or as a sub-library)

aebrahim commented 7 years ago

My $.02 as a former developer is that it's much easier to have c dependencies.

You can make an extension in cython and link directly against the C API, and bundle the whole thing as an independent wheel with no additional dependencies.

On Oct 16, 2016 4:39 PM, "Peter St. John" notifications@github.com wrote:

So I'll use this issue to kick off the discussion on EFMs and MCSs

The two methods I've had the most luck with in calculating EFMs and MCSs are efmtool http://www.csb.ethz.ch/tools/software/efmtool.html, doi:10.1093/bioinformatics/btn401 http://dx.doi.org/10.1093/bioinformatics/btn401, and mhsCalculator http://www.biotec.boku.ac.at/en/arbeitsgruppenresearch-groups/research-group-mattanovich-gasser-sauer/staff/associated-group-metabolic-modelling/mhscalculator/, doi:10.1186/1471-2105-14-318 http://www.biomedcentral.com/1471-2105/14/318.

efmtool is distributed under the Simplified BSD Style License in two formats, a compiled matlab extension (which includes java .jar files which can be run from the command line), and the java source code. It requires java 1.7. Also worth noting are the more recent modifications of efmtool by Christian Jungreuthmayer, including tEFMA https://github.com/mpgerstl/tEFMA and [regEfmtool)[http://www. biotec.boku.ac.at/en/arbeitsgruppenresearch-groups/ research-group-mattanovich-gasser-sauer/staff/associated- group-metabolic-modelling/regefmtool/].

Is anyone on the cobrapy team familiar at all with java? (I'm not 😄 ) My current solution is a fairly quick and dirty wrapper that writes input files from a cobrapy model to a temporary directory, calls the compiled java code, and reads out the output files into a pandas dataframe. Its reliable enough for what I've used it for, but I'm not sure how that would best fit into the general cobrapy ecosystem.

mhsCalculator is a set of C codes which function similarly to efmtool, processing input reaction + stoichiometry files and outputting the minimum cut sets. This is also under the Simplified BSD License.

The wrapper functions seem to work fine, but I think the issues will come down to either providing install directions for the third-party tools or appropriately packaging the different code libraries. (And whether or not this is worth including in the core library or as a sub-library)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/opencobra/cobrapy/issues/293, or mute the thread https://github.com/notifications/unsubscribe-auth/ABUPSIEJ8ZZ6mmhLdtE2vK9mreXERg2Yks5q0rXHgaJpZM4KYH6S .

pstjohn commented 7 years ago

That was my thought as well. Sadly though it seems like efmtool and its various offshoots are all written in java. To do the minimum cut set calculation, you'll need to start with the calculated EFMs, so just including mhsCalculator alone doesn't add much functionality.

cdiener commented 7 years ago

To get extreme rays of polytopes there is also the classic ccd which has python bindings. In C there is also lrs. Both use the double description method as efmtool but do not use the bit vector optimization. When I tried the command line versions ccd+ was actually faster than lrs for me. However, both take a long time even for small models (60 reversible reactions, 80.000 basis vectors, 1-2 hours)...

Many of those packages are compatible with Fukuda's original input format (no idea about efmtool), so we could also just write a function that outputs the problem definition and the user just uses whatever he likes...

pstjohn commented 7 years ago

Another way that efmtool seems to speed up the calculation is by a reversible compression of the metabolic network to the simplest possible representation. That might also have utility in other cobrapy functions, so that could be something worth looking into implementing

cdiener commented 7 years ago

Does that mean deleting redundancies (linearly dependent rows) in the H-representation? I think all of the methods do/recommend that. Nevermind, just skimmed the paper :) Something we could also think about is to use flux sampling and than calculate a convex hull for the sample which gives you an approximate basis. Scipy already has a convex hull function so we would not need additional dependencies. However, I have no idea how good this basis would be.

pstjohn commented 7 years ago

Having used scipy's convex hull function, I'd probably avoid that method 😄 . It definitely struggles with more than just a few dimensions. There's also a lot of research in this area, I'd shy away from trying to come up with a new method. ccd and lrs seem promising, I'll look into them more.

I do think the network expansion / compression would be an important tool, I might create a new issue to ask about it.

I'll use this issue to keep track of some alternative algorithms and implementations:

cdiener commented 7 years ago

Yeah I have encountered the same problem with convex hulls. Sometimes reducing dimensions by SVD and reprojecting after calculating the hull helps. Either way I would not really implement it explicitly since it's just plugging the samples into the convex hull function of scipy. Maybe as an example in the docs...

pstjohn commented 7 years ago

So one option here could be a docker container.. that would obviously require a separate package, but it might be a relatively easy way of shipping a java 7 runtime for EFMtool and a compiled MCScalculator package. Would anyone find that useful? Or is the docker dependency a nonstarter?

phantomas1234 commented 7 years ago

Ok, we already have functions for calculating shortest EFMs and cut sets here https://github.com/biosustain/cameo/blob/devel/cameo/flux_analysis/structural.py that do not have any special dependencies. Two reasons why I think this should go in a separate package: (1) Enumerating all EFMs has no practical value for genome-scale models and I think cobrapy is mainly intended for genome-scale model. Furthermore, everything that has dependencies that go beyond C extensions should not go into cobrapy. Why not make a separate package?

pstjohn commented 7 years ago

Do you happen to have a good reference for the shortest EFMs and MCSs? It would be cool to compare and see whats left out for core-carbon models.

phantomas1234 commented 7 years ago

@pstjohn http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003378

pstjohn commented 7 years ago

Cool, I had actually just read through that. Definitely seems like a promising approach, even for smaller models. We should do another cobrapy hangout to talk about the optlang interface, it would help me figure out what you guys were doing in cameo/cameo/flux_analysis/structural.py

LuisFF commented 7 years ago

The paper on shortest EFMs is available here. The initial code was written in Java and needs rewriting. I'm happy to help.

Newer methods to calculate EFMs in genome-scale networks involved sampling of EFM either using genetic algorithms or a tree like search, e.g. CASOP-GS or TreeEFM. In order to use them in genome-scale networks one can formulate the EFM calculation problem as an LP.

I am being a bit bias here, but I do see value in enumerating not all but a subset of EFMs for genome-scale models, for example, for data integration. I am sure there are other areas where the enumeration of EFMs can be applied to.

phantomas1234 commented 7 years ago

Hey Luis (what a nice surprise!), sorry for not providing the original publication earlier. We actually have a shortest EFM implementation here in python https://github.com/biosustain/cameo/blob/devel/cameo/flux_analysis/structural.py.

pstjohn commented 7 years ago

Hey Luis, thanks for commenting!

I looked into TreeEFM, seemed like a neat tool! Its written in C++ too, so that could be fairly easy to incorporate. Do you know if the source code might be available? -- the original publication only distributes binaries.

@phantomas1234, I had a tough time figuring out how you implemented the shortest EFM algorithm, or how easy it would be to port to the optlang interface. Do you know how much of a speed detriment there is for the python over a java/c++ approach? I know the original algorithm made pretty heavy use of CPLEX's populate, does your version work with other solvers too? (and if so, do we loose the benefit of populate?)

LuisFF commented 7 years ago

@phantomas1234 Nice implementation of the K-shortest! Another approach we explored was to use a more restrictive exclusion_constraint by bounding it by len(exclusion_list) - w, where w can be any integer higher than 1. The higher w is the more distinct (containing less reactions in common) the newly calculated EFMs are.

@pstjohn Fell free to contact Francis Planes at CEIT, I am sure he will be happy to collaborate.

cdiener commented 7 years ago

The k-shortest variations seem very cool. That is definitely something we should consider!

KristianJensen commented 7 years ago

@pstjohn The shortest EFM implementation in cameo is already fully compatible with optlang and can be used with any MILP solver. In addition it implements cplex-specific subroutines that make use of the populate function to improve performance. This performance gain will indeed be lost with solvers that don't have a feature similar to populate. A very large majority of the running time of shortest EFM is spent solving MILP's (optimized solver functions in C) so I don't think there will be a noticeable difference between a python and java/c++ implementation.

hredestig commented 7 years ago

@pstjohn can we close this leaving EFM to the cameo implementation / other packages? Or was there still something that you think should go in cobrapy?

pstjohn commented 7 years ago

I still think it would be great if we could port it to cobrapy eventually, its a feature I certainly would use frequently.

Maybe not as urgent as the 0.6.0 merge, probably better to let the optlang interface in cobrapy stabilize a bit.

phantomas1234 commented 7 years ago

I would support moving the shortest EFM/minimal cut sets to cobrapy eventually.

pstjohn commented 7 years ago

It would also be awesome to add elementary flux vectors at some point as well, in particular to be able to integrate them with arbitrary optlang constraints: S. Klamt et al., PLoS Comput Biol. 13, e1005409–22 (2017)

phantomas1234 commented 7 years ago

It would be great if we could get people from the EFM crowd involved in cobrapy development.

pstjohn commented 7 years ago

Agreed! Although for EFVs specifically, its actually fairly easy to implement: I have it more or less working in my pyefm package. For cobrapy, we would probably want to do the k-shortest method... but we also need to implement some sort of network compression / decompression as well.

pstjohn commented 6 years ago

So for the k-shortest method, the original authors use integer fluxes only and mention that the method is more expensive using non-integer fluxes.

Do we have an opinion as to whether or not we want the fluxes to be allowed to take non-integer values? I don't have an immediate insight into whether or not that makes EFV or MCS calculation harder or easier..

pstjohn commented 6 years ago

Will depend on #612. WIP at feat/elementary_flux_modes

Midnighter commented 6 years ago

I wonder though if EFMs and related plus the tooling surrounding it would be a good target for a small cobrapy dependent (sub-) package?

pstjohn commented 6 years ago

There's already pstjohn/pyefm :). But i think if we had a k-shortest implementation using pure optlang it would fit nicely in the existing package.

Certainly open to discussion though

KristianJensen commented 6 years ago

@pstjohn Requiring fluxes to be integers only works when all the coefficients of the S-matrix are integers. Since this is usually not the case for genome-scale models I think it makes most sense to allow non-integer fluxes by default. If it is significantly faster to have a pure integer problem, then I guess it would be fairly easy to add a switch argument to the function/object or automatically check if the S-matrix is integer

cdiener commented 1 year ago

Closing because that discussion is 5 years old. Feel free to open a thread in the Discussion section. Either way it seems that any solution (part of the core, or separate packages) seem to have support, though it's more a lack of contributors willing to implement it.