opencobra / cobrapy

COBRApy is a package for constraint-based modeling of metabolic networks.
http://opencobra.github.io/cobrapy/
GNU General Public License v2.0
461 stars 217 forks source link

Refactor some core funtionality to ease the extension of cobrapy by third parties #1087

Open carrascomj opened 3 years ago

carrascomj commented 3 years ago

Checklist

Related to the discussion on https://github.com/opencobra/cobrapy/issues/967, in the sense that it is about libraries building upon cobrapy.

Is your feature related to a problem? Please describe it.

One of the pains of extending cobrapy is the composition pattern of cobra.Model. For instance, if a package wants to merely extend cobra.Reaction, the library can implement the following:

def CustomReaction(cobra.Reaction):
    def __init__(self, custom_attr, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.custom_attr = custom_attr

    def some_extended_functionality():
        pass

However, even though the new package doesn't add new functionality to the cobra.Model class, it has to rewrite all of the methods of cobra.Model that involve a cobra.Reaction (well, in this case, just add_boundary). Not only that, but it has to rewrite the I/O monolithic functions, replacing cobra.Reaction with CustomReaction. In the same fashion, extending cobra.Metabolite requires reimplementing some methods inside cobra.Reaction.

Describe the solution you would like.

This is not a problem of cobrapy itself, but a common problem when extending python libraries in general that leads to avoiding inheritance through shallow inheritance and code duplication. Nonetheless, I'd like to propose some changes that could facilitate the extension of cobrapy:

  1. Refactoring of the I/O functions. Dividing _sbml_to_model into smaller functions (e.g., parse_annotation, parse_metabolites, parse_reactions, parse_groups) would lead to less code duplication (this is directly related to #967, although I don't think that a core sbml module is needed for this).
  2. Optional type arguments. This is a bit controversial because it is not a python pattern (trying to emulate generics of typed languages), so I don't know if it should be introduced. For example:

    from typing import Type
    
    class Model(Object):
        def __init__(
            self, id_or_model = None, name = None, 
            reaction_type: Type[cobra.Reaction] = cobra.Reaction, Type[cobra.Metabolite] = metabolite_type):
        # ...
        self.Reaction = reaction_type
        self.Metabolite = metabolite_type
    
        # invoked where needed
        def add_boundary(self,
            # ...
        ):
            # ...
            return self.Reaction()

    and the same with Reaction.

Describe alternatives you considered

Shallow inheritance, straight up rewriting the relevant functions and classes, etc.

Additional context

I am writing a package that inherits from cobrapy and involves additional parsing of the SBML document.

Midnighter commented 3 years ago

This is exactly the kind of thing that I'd like to address when tackling #967. I don't know how much time you have for achieving your task but it probably does not make sense to start a separate package yet.

What extra information do you need to parse?

carrascomj commented 3 years ago

More or less, I don't need to make extra libSBML calls, but I am heavily relying on Groups, so that's something that I need to do in _sbml_to_model, which makes me rewrite it just for that, besides using specialized children classes of cobra.Metabolite and cobra.Reaction.

cdiener commented 3 years ago

I think the composition in Model is fine for that. You don't need to store the type of the derived class just use Model.add_reactions/Model.add_metabolites with you derived classes and it will work. Groups are already supported. What feature doesn't work for you there?

carrascomj commented 3 years ago

Groups work, but I add functionality based on the groups at the moment of parsing. In fact, everything is working for me after rewriting _sbml_to_model, inheriting from cobra.Model (because of add_boundary) and inheriting from cobra.Reaction (this would be required for my case anyways, but I could imagine that maybe another package would like to extend cobra.Metabolite and ends up inheriting and rewriting the methods of cobra.Reaction just to replace the Metabolite constructor calls).

What I am trying to highlight is that having to duplicate _sbml_to_model (which is huge) just to add inherited classes hinders extensibility, and that also applies to the core classes of cobrapy. The same applies for _write_sbml_model and I guess the rest I/O functions (I am only using SBML).

As I said, I understand that adding arguments for inherited types everywhere is not the best possible pattern, but these are my afterthoughts after writing a package that inherits from cobrapy.

cdiener commented 3 years ago

Could you expand on what you are trying to achieve a little more?

I agree that the hierarchical structure can make it harder to derive classes, but the type injection would probably create too much overhead and overwriting just the relevant methods seems a bit more clear to me. That said there aren't that many methods you have to overwrite. For Model it's add_boundary and only if you want to support it. For Metabolite you only need to rewrite the Reaction constructor if you don't add the metabolites to the model first. I think this is already much more modular than other packages. For instance, substituting Series with a custom class in pandas would be much harder. Depending on your use case there may be other patterns to achieve what you want. For instance, if your behavior depends on groups it might make sense to store it centrally in Model and not in the Reactions. Or parse first and then transform the model to fit your structure. Or inject attributes into reactions (only works if you wrap model reading, writing and serialization though).

Making the parser more modular is a separate issue and I can see use cases for that. It would also fit well with the design of lisbml where you get a doc and then extract info from part of it. And I could imagine use cases where one only wants to get the metabolites and their annotations from a model for instance.

carrascomj commented 3 years ago

Since the refactoring of IO is more straightforward, that should be prioritized.

It is true that there aren't many methods to overwrite and that this is common in python packages, so we can leave the discussion there. Again, I was just highlighting the pain points I have found of extending cobrapy as a third party package, hoping that can initiate a discussion to ease those points.

Once you know what to overwrite, it is fine, the problem comes when, for instance, someone naively implements a cobra.Reaction children class and, months after that, an user tries to add_boundary(), which leads to the raise of an error or silent unexpected behavior. Of course, someone with experience in making these kinds of extensions in python would know to look up for those cases in the codebase from the start.

cdiener commented 3 years ago

I see, make sense. Thanks for the additional info.