opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0
467 stars 218 forks source link

Extracting core functionality into a separate package #967

Open Midnighter opened 4 years ago

Midnighter commented 4 years ago

Hi,

I'm trying to gauge the general interest in and start a discussion around extracting some of cobrapy's core functionality into a separate package. The goal of the new package would be a faithful Python representation of an SBML model, as well as covering seamless input and output of models; importantly without requiring mathematical solvers.

I have written down my motivation and tried to explain it further in a Google Doc that anyone can comment on.

As I explain in the document, I also hope that this may provide a common foundation for the various Python constraint-based modeling frameworks so that we can solve and maintain the boring parts in one place.

This should be of interest to @opencobra/cobrapy-core and especially @Hemant27031999 due to his GSoC project with @matthiaskoenig and @draeger. I am particularly interested in your opinions @bgoli and @cdanielmachado. Do you see the same value in this as I do? Can you see yourself helping to maintain such a package and adjusting your own frameworks to build on top of it?

With best intentions and hopes of a stronger COBRA community, Moritz

cdanielmachado commented 4 years ago

Hi @Midnighter,

In general, I think this is a good idea and it might be easier to convince the community to extend a common core library than to convince everyone to adopt cobrapy.

Here are a few comments/concerns:

Midnighter commented 4 years ago

There is already one Model class that is "a faithful Python representation of an SBML model", that is the SBML model class from the libsbml API. Anything else we develop would be a not-so-faithful representation of the SBML model and I can imagine it will be hard to make everyone agree on what should be the core functionality and what can be discarded from the libsbml Model object.

While I agree that the libSBML Model class is, of course, the authoritative source, it's purpose is to provide functionality for exchanging information. It is not the right place to implement your 'business logic' and in fact, none of the modeling frameworks use it that way.

The purpose I had in mind for this class is to be a parent to a constraint-based model class, as a hypothetical example, Parent :arrow_right: Model :arrow_right: ConstraintBasedModel. I think it would be sub-optimal at best, actually bad design, to inherit directly from the libSBML Model. However, it should be faithful in the sense that one can round trip a defined subset of the information from an SBML document to a Model and back without loss of information. This subset should fulfil constraint-based modeling needs and encompass current model encoding practices.

It will certainly be a struggle to agree on class designs but that is the question I'm posing here, is the effort worth the gains for the community?

After reading your document proposal, I disagree with the Serialization part (sorry, that was fast, smile ). I don't see the purpose of serializing models to JSON or YAML. In fact, I don't like the fact that some cobra projects (cobrapy, memote, Bigg database) have implemented the JSON format. It is not a community standard like SBML, it is a representation that was adopted in an ad hoc fashion by a subset of the community.

I firmly stand on the side of practicality here. JSON is the de facto standard for data exchange via HTTP and YAML has proven helpful for line-based diffs. Since it is less verbose than XML, JSON has also proven to be a more efficient format which is an important consideration for multi step workflows involving metabolic models. I agree that BiGG should probably only distribute models in SBML format as the source of truth but that is a different discussion.

Again, in my opinion the goal should be the possibility to round trip from SBML to JSON and back to SBML without loss of relevant information. This is certainly not the case with the current JSON format. By ironing out the current problems and getting more buy-in for the format from a larger community, we can make the format less ad hoc.

draeger commented 4 years ago

There are many excellent ideas here, and I am happy to see such eagerness in implementing this:

  1. Separating the data model from the application-driven library COBRApy makes a lot of sense and will improve its reuse.
  2. Being able to serialize SBML to file formats other than XML is very handy, given the use-cases outlined above. @cdanielmachado please note that SBML is mainly concerned about the data structures, i.e., objects, but uses XML for its serialization. If other file formats represent identical objects, it will still be SBML. It seems that the SBML editors even want to more prominently highlight that XML is a recommended way of writing SBML files, but not necessarily the only one. So far, the specifications say that SBML is primarily serialized in XML, hence principally also allowing other serialization formats (see quote below).
  3. However, @Midnighter should consider that starting this effort might become an immense endeavor. Therefore, it is essential to have support from the community and plan this project properly, considering that it will require ongoing maintenance and development.

Appendix (quote from the specification of SBML Level 3 Version 2 Release 2, page 3):

SBML is defined neutrally with respect to programming languages and software encoding; however, it is oriented primarily towards allowing models to be encoded using XML, the eXtensible Markup Language (Bray et al., 2004). This document contains many examples of SBML models written in XML. Formal schemas describing the syntax of SBML, as well as other materials and software, are available from the SBML project web site, http://sbml.org/.

matthiaskoenig commented 4 years ago

Hi all,

  1. SBML is and should be always considered the truth for the model. So having all features from the SBML in the cobrapy core classes which are relevant for FBC should be a priority.
  2. The core is already separated very well from the rest of cobrapy by having it in a separate module. I don't see any additional need for separation.
  3. Simulation is part of the checking of the core classes, i.e. to be able to test if the model representation is correct the SBML test suite for the FBC models should be run, which at least requires FBA as an implementation. So packages for running FBA and creating the LP problem like optlang must be pat of the core (at least for testing that the correct LP problem with the correct solutions is generated).
  4. Basic simulation algorithms such as FBA and FVA, gene/reaction deletions and sampling should be part of cobrapy. I don't see any advantage of splitting these from cobrapy. All the advantages you mention in your docs can be achieved with having a good core module instead of a separate core package.
  5. Personally I think the other way around would be much more attractive: I.e. all the dependent packages should provide core functionality like basic algorithms to cobrapy instead of reinventing the wheel in their packages. It is not trivial to generate correct LP problems from SBML and additional constraints. Having every dependent package implement these will result in many problems.

In summary: I highly agree we need the core data objects to be as compatible as possible to SBML, but I think this can be done in the core module. I don't think creating a core data structure package is a good idea, because we need the simulation to test the correct implementation of the core data structure (e.g. SBML test suite).

cdanielmachado commented 4 years ago

I think @Midnighter's suggestion for having a core metabolic modeling class as an independent package rather than an independent module inside cobrapy is an attempt to bring together all the python community developing metabolic modeling tools in python which for whatever reason prefer not to use cobrapy.

I compiled a list (already 2 years ago, so it might be incomplete) of constraint-based metabolic modeling packages in python:

If you are a cobrapy user it is easy to say "let's all just agree on using cobrapy", but after developing framed/reframed for the past 7 years, I am not easily willing to give up on it, and I suppose @bgoli also has no plans to give up on cbmpy, which he also has been developing for many years :)

Sharing a core modeling package that we all agree to write from scratch would be a nice way to bring the community together, but I am not overly optimistic that we will come to a consensus.

pstjohn commented 4 years ago

It might be a good chance to merge some of the ideas from #394 as well, specifically having a more hierarchical view of reactions, metabolites, and models that allow components to be re-used. But the thought that it would be one package to rule them all does feel a bit like the standards problem...

matthiaskoenig commented 4 years ago

Yes, it makes sense if some of the other packages/tools would adopt such a common data structure (but I don't see this happening). SBML is the existing exchange standard for models. I don't think there will be a common standard for the internal data structures (and don't see any need for that). Many of the tools are 1 man shows which have not been updated in years (e.g. https://github.com/linsalrob/PyFBA). Is there any other actively developed tool with more then 1 developer in python which would be willing to adopt such a core data structure?

Midnighter commented 4 years ago

@matthiaskoenig some of the points I would have answered were already made by @cdanielmachado :slightly_smiling_face: Additionally, I think you highlight an important design choice with the following point:

Simulation is part of the checking of the core classes, i.e. to be able to test if the model representation is correct the SBML test suite for the FBC models should be run, which at least requires FBA as an implementation. So packages for running FBA and creating the LP problem like optlang must be pat of the core (at least for testing that the correct LP problem with the correct solutions is generated).

In how far is this really necessary and how much validation is already done by SBML itself for us? I certainly see a few use-cases for a package without any solver implementation:

  1. The above mentioned possibility for different frameworks to implement optimization on top of such a core package.
  2. I hear many complaints in the SBML community that there are no good tools to do the kind of annotation that everyone would love to see. Annotation only tools could be built on top of this.
  3. Format conversion tools as you yourself are interested in @matthiaskoenig could be much more light weight without the added solver dependencies and work faster because they don't need to create optimization problems, too.
  4. Interfacing with other packages becomes easier. At the moment, cobrapy is a bit of a heavy dependency. Other packages could more easily return such core objects or know how to manipulate them without burdening their users with the full list of dependencies of cobrapy. As a concrete example, I'm working with @eladnoor on equilibrator and annotating thermodynamic information on a metabolic model would be a breeze that way.

But the thought that it would be one package to rule them all does feel a bit like the standards problem...

@pstjohn what I have in mind is rather a bottom-up and not a top-down approach. As I'm sure you would agree, the sustainability of research software is in serious doubt. And the incentives for implementing SBML well and thoroughly are basically non-existent. Most of us get rewarded for publishing new shiny methods, not for creating an implementation of the SBML specification down to the letter.

And I can understand that when we innovate, we want to do it in the tool that we're most familiar with or that enables us to best do what we need to do. But what I'm asking is if we can agree enough on these basics so that we can pool our limited time to create this core that we all benefit from. SBML is just such an effort itself. Can we build a sub community for SBML in Python? Or do we continue to work in isolation?

cdanielmachado commented 4 years ago

Maybe we could have a survey to see how many of the "one-man shows" would be willing to jump into such a common project.

Speaking for myself, I started (re)framed in 2012 because I was not happy with the messy structure of the cobrapy modeling classes (and multiple other design decisions), which in my opinion could only be solved by writing something from scratch.

I have discussed with the cobrapy folks once every few years about a possible reconciliation, which has never happened, one large reason being the insistence on keeping this legacy cobrapy model structure.

Therefore, the idea of having a brand new model structure sounds very appealing to me. Regardless of this going forward or not, I plan to keep developing reframed for the near future.

Also, despite the larger community, cobrapy has very limited functionality, and many of these dozen other tools implement methods not available in cobrapy (for instance, in reframed alone there are 15 different simulation methods). Hence the cobrapy community would also benefit from such cooperation.

As an attempt to establish a bridge, I have recently implemented round-trip conversion between the CobraPy and ReFramed model objects:

https://github.com/cdanielmachado/reframed/blob/9862739604f61ac0d06dcb3717fb38493131d12b/reframed/external/cobrapy.py

As you can see this only takes a few lines of code since the model structures are already quite similar.

Best regards, Daniel

Midnighter commented 4 years ago

I intended this as a clean slate discussion because I think that an implementation of a core package that closely follows SBML leaves plenty of creative freedom to build on top of. This is not conducive to this discussion but since you had to bring it up, I cannot let your framing here go uncommented.

I have discussed with the cobrapy folks once every few years about a possible reconciliation, which has never happened, one large reason being the insistence on keeping this legacy cobrapy model structure.

In the time that I have been part of this project, the initiative to discuss potential ways to bring framed and cobrapy closer together has come from our side. In particular, during the first discussion that I was present at, we suggested a handful of approaches. One of them was a very similar idea to this one. Create a base package of compatible classes that both cobrapy and framed could inherit from. You declined all options that we had to offer, leaving us with the impression that the only choice you would consider was something that looked like framed. One reason that also seemed very important to you, was that you preferred to make design choices and changes as you saw fit without being encumbered by a larger community. We certainly have to be more conservative with API changes since there are several hundred users whose code will break if we do. However, our discussion was largely decided before even considering too specific technical details.

As an attempt to establish a bridge, I have recently implemented round-trip conversion between the CobraPy and ReFramed model objects

So you are painting a picture of yourself here where you are the one trying to establish relationships when my experience so far has been that you have been closing the doors offered to you. Maybe you have since changed your mind and up until your last comment, I thought you made valuable and helpful contributions to the discussion. I don't mind differences of opinion and it is certainly your right to work as you please but kindly do not neglect your own role in past decisions; positioning us as the party unwilling to compromise.

As you can see this only takes a few lines of code since the model structures are already quite similar.

Going forward, I would like to focus on those similarities that you so rightly point out and on the mutual benefits that we could enjoy.

cdiener commented 4 years ago

For me @draeger summarized my own sentiments perfectly. I think having cobrapy interact with an intermediate data model would fix a lot of issues. And being able to chose different serialization back-ends would be a game changer for a lot of our projects. I would probably keep this layer as a purely data-oriented one though with the leanest modification abilities possible. Model modification can often not be isolated from the solver as @matthiaskoenig pointed out, so this would just lead to duplicated code.

I agree that this is a big endeavor that might fragment the contributor base and might lead to less contributors maintaining cobrapy itself. So it would be important to know how many people would be willing to actually maintain that split in the long term.

matthiaskoenig commented 4 years ago

Hi all, great discussion. Perhaps we should have an online meeting/survey about this.

@Daniel Machado Your code looks like a duplication of the cobra core model with the exception of the first class Compartment object which I want to see forever in cobrapy (but did not happen so far). So definitely one candidate use case (especially if you get all the model IO for free).

I did not want to sound too negative before, but there have to be some actual adopters of the core structure by other tools. Otherwise I don't see much advantage of separating the core out in a separate package (perhaps having some lean install options which only pull in the the most essential packages could be an alternative solution). Probably a survey what the reasons by other cobra tools are for not adopting the cobra core model could be a good start. Just separating the modules will not solve the underlying issues, but we have to also fix what is keeping others from using the cobra core data model.

Best Matthias

On Tue, Jul 14, 2020 at 12:01 AM Christian Diener notifications@github.com wrote:

For me @draeger https://github.com/draeger summarized my own sentiments perfectly. I think having cobrapy interact with an intermediate data model would fix a lot of issues. And being able to chose different serialization back-ends would be a game changer for a lot of our projects. I would probably keep this layer as a purely data-oriented one though with the leanest modification abilities possible. Model modification can often not be isolated from the solver as @matthiaskoenig https://github.com/matthiaskoenig pointed out, so this would just lead to duplicated code.

I agree that this is a big endeavor that might fragment the contributor base and might lead to less contributors maintaining cobrapy itself. So it would be important to know how many people would be willing to actually maintain that split in the long term.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/opencobra/cobrapy/issues/967#issuecomment-657815230, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAG33ORLXMTCT3WLO7YLKYLR3N74HANCNFSM4OV7HONQ .

-- Matthias König, PhD. Junior Group Leader LiSyM - Systems Medicine of the Liver Humboldt Universität zu Berlin, Institute of Biology, Institute for Theoretical Biology https://livermetabolism.com konigmatt@googlemail.com https://twitter.com/konigmatt https://github.com/matthiaskoenig Tel: +49 30 2093 98435

bdelepine commented 4 years ago

Hello all,

Thanks for such a nice discussion!

As a COBRApy user, I lean on the side of @matthiaskoenig:

Basically, this would mean that current "cobra" package would be intended by design to be a core package for other fancier simulations packages (which is actually what I always believed before reading this thread).

From what was discussed so far, I noted two epic points that were mentioned in favour of a new package but that are, IMO, not incompatible with keeping everything within COPRApy:

  1. "we want faithful SBML I/O to other file format than XML (JSON, YAML, Pickle)": @draeger was very convincing, and I can see why this is a desirable feature. However I don't understand why it would be better in another package 🤔
  2. "we want a core package with light dependencies": Yes!.. and the solvers are heavy, but critical for many uses. I believe we should consider to declare heavy dependencies as setuptools "extras", which would make things as easy as pip install cobra for lightweight uses vs. pip install cobra[all] for the full bundle (see ref, idea mentioned by @matthiaskoenig). Is it reasonable?
dotPiano commented 4 years ago

Hi everyone,

Nice to see such an active discussion! I'll add my view as well:

The bottom line is that, if this goes forward I'll happily adapt my code to use the core model, in the hope that users can smoothly use my tools while doing model manipulation and FBA with their favorite library 😄

Best, Mattia

eladnoor commented 4 years ago
2. "we want a core package with light dependencies": Yes!.. and the solvers are heavy, but critical for many uses. I believe we should consider to declare heavy dependencies as setuptools "extras", which would make things as easy as `pip install cobra` for lightweight uses vs. `pip install cobra[all]` for the full bundle (see [ref](https://setuptools.readthedocs.io/en/latest/setuptools.html#declaring-extras-optional-features-with-their-own-dependencies), idea mentioned by @matthiaskoenig). Is it reasonable?

I think that the question of whether to have the core as a separate package, or as part of the same package with "extras", is a minor design issue. After all, why can't two packages be supported by exactly the same people (and even be maintained in a single git repository). On the other hand, wouldn't making it a separate package make it more likely to be used by other projects that only need the core functionality? I can definitely say that was the case for me when I needed something like a Model class for equilibrator and opted not to depend on cobrapy because it comes with too much baggage. Maybe @bdelepine 's suggestion covers this scenario, sorry if I don't fully understand how pip works.

Midnighter commented 4 years ago

Maybe @bdelepine 's suggestion covers this scenario, sorry if I don't fully understand how pip works.

Baudoin's suggestion does indeed cover this because a normal installation of cobra would in that scenario only install the minimal dependencies. I must admit that I am not in favour of this, though, because many users have enough trouble with Python and package installation already. If we additionally tell them, "oh, FBA only works if you also install xyz" I'm afraid it will add to the confusion and frustrate users further who expect everything to work right away. Conversely, if cobrapy depends directly on a cobra-core (or whatever name) package, its use is not noticeable for the majority of users while developers have more options.

  • (while it's a beautiful idea) I'm sceptical too in the feasibility of getting "one-man shows" onboard,

I actually think that there are a few advantages even just for cobrapy.

  1. SBML compatibility and other file formats are a source of continuous trouble. I believe that the designs presented in the Google Doc can help alleviate some of the pain points by separating functionality more clearly.
  2. Having one package that ensures smooth model conversions and have it be fully tested and reliable is a boon to method developers who simply don't need to care about that aspect. Classes both in a core package and in cobrapy would be smaller and thus easier to understand, and more focused as they would have clearer, separate responsibilities (one is representational, the other to add optimization functionality). They would thus be easier to reason about and easier to modify without error.
  3. Creating such a separation of packages is also an open invitation to future developments. I believe that we would create new opportunities for developers to get creative even when not considering all of the packages who could benefit from this already today.

A potential downside is that bugs in the core package will likely be noticed by cobrapy users but since the two projects will be tightly linked, I think we can handle that.

  • and the priority should be to list the reasons why "one-man shows" did not inherited from COBRApy classes, so that COBRApy could be improved to cover those needs.

This is a very important point. And it is indeed part of the reason why I invited everyone to comment here. Cobrapy has a lot of ugly sides. It started as a port of the COBRA Toolbox, then slowly grew into something more Pythonic with true object-oriented design and expressive definition of the underlying optimization problems. It did grow very organically, though, without a common design and guidance so it is quite messy under the hood. I believe that the difficulty to modify cobrapy is impeding innovation and I hope that such a separation will help. But of course we should not repeat the same mistakes that prevented others from adopting cobrapy as a basis in the first place. If we can make the necessary design changes to get broader adoption and support for it, then that's worth the trouble of finding a consensus.

Midnighter commented 3 years ago

Hey all,

For those interested in continuing this discussion and planning concrete next steps, @bgoli and I will meet online at 11:00 CET on Wednesday March 24 in the context of HARMONY. I'll post a link here when I know which platform we'll use to connect.

cdiener commented 3 years ago

I think this is something that should be voted on by the general community. Summarizing the comments here the general opinion is pretty mixed and pretty much negative in some cases (reframed).

So I would put that up to a vote and also make it clear that this could lead to slightly broken cobrapy for a while since that is a big change that won't just happen without bugs (similar to the optlang transition). Also it definitely needs input from @phantomas1234.

Midnighter commented 3 years ago

While I welcome any input on whether or not this is a wanted/needed/desirable approach, I don't see how it would leave cobrapy in a slightly broken state? I would envision development to start on a new package completely independently. When that is in a stable state, work would start on cobrapy to inherit from that new package. When that is complete a new release would seal the deal. Of course, it is a bunch of work and bugs are likely (but hopefully many caught by our test suite), I see zero impact on users.

cdiener commented 3 years ago

Sorry, of course I don't know if that will happen. It's just based on the experience with the optlang transition that used the same strategy. Even though all tests passed in the first release there were still a lot of new bugs and performance issues that took a while to make the package stable again. So I feel that this should be tackled after a 1.0 release and fixing the already reported issues.

Midnighter commented 3 years ago

I agree that there is an element of risk. I do see the risk as being much lower compared to introducing a completely new solver backend, though. It means for cobrapy that the io module would be moved out and all of the core classes need to be slimmed down in as much as that functionality will be provided to cobrapy by the other package. In general the classes have remained fairly stable so I don't see any surprises there.

phantomas1234 commented 3 years ago

Can we wait with any larger refactoring until there is a dedicated maintainer for cobrapy hired again? It looks like I'll be able to do that soon. Right now I would favor not taking any risks here.

matthiaskoenig commented 3 years ago

After some discussions and thoughts I see this plan much more positive. A big advantage would be a much smaller core package with limited dependencies. The smaller code base would allow for much higher quality code, e.g. full type annotations and static type checking among others.

cdanielmachado commented 3 years ago

Hi Niko (@phantomas1234), any follow up on this?

"Can we wait with any larger refactoring until there is a dedicated maintainer for cobrapy hired again? It looks like I'll be able to do that soon. Right now I would favor not taking any risks here."

I don't have time to maintain reframed anymore, and might start using cobrapy for new projects, but I don't want to do it before any major refactoring or a change of lead developer.

cdiener commented 3 years ago

Not Niko, but I may have some insights. I can't really speak for everybody but based on the last dev meetings and discussions the plan seems to be not to change anything big for now. In terms of the API it looks like everybody feels strongly that it should remain stable so I wouldn't expect any large changes there. Current plan is rather to stabilize everything for a 1.0 release/ a new publication. In terms of the lead developer, cobrapy has transformed into a community project a while ago. So it's mainly a set of core developers that manage the repo and that set can change. It does not depend on a single lead as much as before I would say, but I let others chime in. The advantage is that things keep moving even if @phantomas1234 or @Midnighter have less time. Also major props to Moritz for continuing to contribute a lot in his free time 🙏

Midnighter commented 3 years ago

Hi Niko (@phantomas1234), any follow up on this?

The idea behind this issue is still very much on my mind but I simply do not have the time to spare to even start. I did kick off some email conversations and a Google Doc that I think you've seen already.

Last time I talked with Niko, none of the resources promised to him for COBRApy actually materialized and his focus also shifted slightly as far as I can tell. So for the foreseeable future, there isn't anybody who will work on COBRApy full (or even part) time. I see many places where we could improve things but given the situation, consolidating what we have and pushing out a version 1.0 + publication seems like the best path forward.

What would you miss the most from reframed in terms simulation/analysis methods (I'm sure you'll miss plenty regarding how things are implemented)? Since you already switched to optlang, maybe adding a few missing pieces wouldn't be too much work.

cdanielmachado commented 3 years ago

Thank you @Midnighter and @cdiener !

I would like to establish a long term roadmap (5+ years) for CarveMe and other tools I plan to develop in my group, and I would like to make that in a way that is sustainable and aligned with the roadmap for the rest of the cobra ecosystem (cobrapy, optlang, bigg, escher, cameo, MEWpy, etc).

Regarding cobrapy/reframed I see two use-cases:

  1. As libraries to build other tools upon, they both do the job they are supposed to (read/write sbml, implement solver interfaces, etc). Switching to cobrapy would bring the benefit of not worrying about maintenance.
  2. As end-user interface / teaching tool, like you say, I would miss the way things are implemented. Not to say they are better, but at least in a way that feels more natural to me and more aligned with the way I would like to teach (more clear separation between the representation of a biological system and the mathematical implementation of that representation, between models and a simulation methods, etc). But this is a problem for me to worry about.

Your feedback already helps me have a better perspective. Now I know that things will be stable for a while, so thanks again for the detailed replies.