opencobra / schema

xml/rdf schemas for annotating cobra models
Apache License 2.0

Proper encoding of GPR associations #11

Open cdanielmachado opened 6 years ago

cdanielmachado commented 6 years ago

I would like to suggest replacing the boolean representation of GPR associations with a true Gene-Protein-Reaction association schema.

Here are a few reasons for this:

  1. Gene expression is not boolean (the boolean notation is a legacy of the first methods used to try to integrate regulation with metabolism, such as rFBA and srFBA).

  2. Many new methods are arising (e.g. transcriptomics-based methods, and methods that apply enzyme usage constraints) that encode gene/protein expression as continuous values.

  3. The current schema is a recursively nested tree of boolean AND and OR operators, which allows the implementation of arbitrary formulas that might not have a trivial interpretation in terms of isozymes and protein complexes.

  4. Things are not called by their proper names, i.e. an AND denotes a complex and an OR denotes the presence of alternative isozymes, so why not call things by their name instead of by the particular mathematical notation that is used to represent them? (Metabolites and reactions are rows and columns in a stoichiometric matrix, but we call them "species" and "reaction", not "row" and "column").

  5. The boolean format is not able to account for the stoichiometry of subunits in a protein complex. This information would be useful for methods that account for enzyme usage constraints [1,2].

  6. An explicit GPR representation would always allow the generation of the respective boolean formula (for applications where the GPR is interpreted in a purely boolean fashion, e.g. gene knockout simulation).

  7. An explicit representation of Gene and Protein objects would allow annotating each object with the respective attributes (e.g.: gene labels for genes, UniProt ids for proteins).

So, to give an example, here is what I propose:

A GPR that was encoded in the legacy cobra format as ((A and B) or (C and D)), and that is currently encoded as:

<fbc:geneProductAssociation>
    <fbc:or>
        <fbc:and>
            <fbc:geneProductRef fbc:geneProduct="A"/>
            <fbc:geneProductRef fbc:geneProduct="B"/>
        </fbc:and>
        <fbc:and>
            <fbc:geneProductRef fbc:geneProduct="C"/>
            <fbc:geneProductRef fbc:geneProduct="D"/>
        </fbc:and>
    </fbc:or>
</fbc:geneProductAssociation>

would, under the proposed schema, be encoded as:

<fbc:GPRAssociation>
    <fbc:protein id="Complex_AB"> 
        <fbc:gene id="A" stoichiometry="1"/>
        <fbc:gene id="B" stoichiometry="1"/> 
    </fbc:protein> 
    <fbc:protein id="Complex_CD"> 
        <fbc:gene id="C" stoichiometry="1"/>
        <fbc:gene id="D" stoichiometry="1"/> 
    </fbc:protein> 
</fbc:GPRAssociation>
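
Point 6 above (the boolean formula can always be regenerated from the explicit form) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the GPRAssociation, protein and gene elements are the hypothetical names from this proposal, not part of any released FBC version, and the function name explicit_gpr_to_boolean is made up for the example. The namespace URI is the one used by FBC version 2.

```python
import xml.etree.ElementTree as ET

# Namespace URI of FBC version 2. The GPRAssociation, protein and gene
# elements below are the proposal's hypothetical names, not released FBC.
FBC_NS = "http://www.sbml.org/sbml/level3/version1/fbc/version2"

PROPOSED_GPR = f"""
<fbc:GPRAssociation xmlns:fbc="{FBC_NS}">
    <fbc:protein id="Complex_AB">
        <fbc:gene id="A" stoichiometry="1"/>
        <fbc:gene id="B" stoichiometry="1"/>
    </fbc:protein>
    <fbc:protein id="Complex_CD">
        <fbc:gene id="C" stoichiometry="1"/>
        <fbc:gene id="D" stoichiometry="1"/>
    </fbc:protein>
</fbc:GPRAssociation>
"""


def explicit_gpr_to_boolean(xml_text: str) -> str:
    """Return the legacy boolean rule implied by an explicit GPR."""
    root = ET.fromstring(xml_text)
    ns = {"fbc": FBC_NS}
    clauses = []
    # Each protein is one alternative (OR); its genes are its subunits (AND).
    for protein in root.findall("fbc:protein", ns):
        genes = [gene.get("id") for gene in protein.findall("fbc:gene", ns)]
        clause = " and ".join(genes)
        # Parenthesise only true complexes (more than one subunit gene).
        clauses.append(f"({clause})" if len(genes) > 1 else clause)
    return " or ".join(clauses)


print(explicit_gpr_to_boolean(PROPOSED_GPR))  # (A and B) or (C and D)
```

Because the explicit form has a fixed depth (proteins, then genes), the conversion needs no recursion. The stoichiometry attribute is simply dropped, which is the information the boolean form cannot carry.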

Furthermore, the GPR association for aldehyde dehydrogenase in E. coli, which is a tetramer of 4 identical subunits and can currently only be encoded as:

<fbc:geneProductAssociation> 
    <fbc:geneProductRef fbc:geneProduct="aldB"/>
</fbc:geneProductAssociation>

could then correctly represent the fact that 4 copies of the gene are required to make one protein:

<fbc:GPRAssociation>
    <fbc:protein id="AldB" fbc:label="Uniprot:P37685"> 
        <fbc:gene id="aldB" fbc:label="b3588" stoichiometry="4"/>
    </fbc:protein> 
</fbc:GPRAssociation>

For convenience, I am setting the Gene and Protein attributes directly inside the GPR, but the correct way to do this would be to define genes and proteins inside listOfGenes and listOfProteins, and just make references inside the GPR (as is already done with fbc:geneProductRef).
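
To illustrate why subunit stoichiometry matters for enzyme usage methods (point 5), here is a minimal Python sketch. The Protein class and the subunit_demand function are hypothetical names introduced only for this example. They mirror the proposed fbc:protein element and are not an existing API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Protein:
    """Hypothetical in-memory form of a proposed fbc:protein element."""

    id: str
    # gene id -> number of subunits of that gene per protein molecule
    subunits: Dict[str, int] = field(default_factory=dict)


def subunit_demand(protein: Protein, protein_amount: float) -> Dict[str, float]:
    """Gene-product copies needed to assemble protein_amount of the protein."""
    return {gene: n * protein_amount for gene, n in protein.subunits.items()}


# The aldehyde dehydrogenase example: a tetramer of four aldB subunits.
aldb = Protein(id="AldB", subunits={"aldB": 4})
print(subunit_demand(aldb, 2.5))  # {'aldB': 10.0}
```

With the current boolean encoding the factor of 4 is not available, so a tool would have to assume one subunit per enzyme or look the value up elsewhere.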

[1] http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005140
[2] http://msb.embopress.org/content/13/8/935

draeger commented 6 years ago

Interesting ideas! Are you suggesting to remove the current geneProductAssociation in the next version of FBC entirely or do you intend to have both?

cdanielmachado commented 6 years ago

The idea would be to replace it, there is no need for redundancy, and like I said, one can always build the boolean representation from the explicit GPR Association.

matthiaskoenig commented 6 years ago

Why are you not just defining a complex-forming reaction, i.e. the parts of the complex react to form the complex, and then use the complex in your GPR? This solves all your issues and you can use it today.

draeger commented 6 years ago

@matthiaskoenig's suggestion would undoubtedly lead to a perfect solution. Unfortunately, it would still require us to update the FBC specification. The reason is that Version 2 states on page 16 that GeneProductAssociation can have only one of the three possible child nodes, namely And, Or, or GeneProductRef, where the latter can exclusively refer to instances of GeneProduct within the ListOfGeneProducts of the same SBML document. If we want to address complexes that can build up or degrade in reactions, the FBC specification needs to change so that GeneProductRef becomes a more general class. Right now, using it to refer to instances of Species would correctly result in a validation error.

@cdanielmachado, if you have the impression that next generation models will need more involved and more biochemically motivated GPRs, we should follow @matthiaskoenig's suggestion and put the replacement (or generalization) of the GeneProductRef class in FBC on the list of change requests for Version 3 of the package. To this end, we need first to figure out which kinds of objects you think should be usable within GPRs, beyond GeneProduct.

tpfau commented 6 years ago

I agree with @matthiaskoenig that you could then essentially have a reaction to form the Proteins using combinations of the different Gene Products. Finally, you use modifierSpecies with a "Catalyst" tag (SBOTerm) to associate the formed products with the reactions, OR you could even just use Constraints that indicate how much of protein A has to be present to allow a specific flux through Reaction Y.

This essentially comes back to my comment in sbml-flux about encoding of things that could be done with existing SBML properties, but which are not encoded that way.

One thing to remember, when adapting this: We will need some way to be able to convert these new stoichiometric GPRs into boolean GPRs to be able to still use older methods with newer models.

cdanielmachado commented 6 years ago

@matthiaskoenig @tpfau @draeger Thank you for your comments, but I disagree that the proposed solution (adding complex-forming reactions to the model) is the right way to go.

This may solve points 5 and 7 from my initial comment (I've just enumerated the points to make them easier to mention), but does not solve points 1, 3 and 4.

Also, I would like to add a few extra points:

  8. The existence of a possible workaround for some of the problems I mentioned, one that does not involve changing the current specification, is not sufficient to claim that the current specification is better than the proposed one.

  9. Since these models are metabolic models, I would avoid changing the network structure by adding extra reactions to encode something that should be properly encoded by the GPR associations. That could lead to a lot more trouble with stoichiometric matrices not being interpreted correctly by many simulation tools.

(I recently published a method that does exactly that, i.e. a stoichiometric representation of GPR associations, and reformulation of several methods based on that representation: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005140 But this is something that can be done optionally, to enable new simulation features, not as a default way of encoding models).

  10. The current GeneProductAssociation is something that sounds a bit strange to me (and maybe to others). As the name says, it is not a Gene-Protein-Reaction association (as described in the literature), it is something different. Also, a GeneProduct is a strange concept. It is not really a gene and not really a protein. I think that having Gene objects and Protein objects, and a true GeneProteinReactionAssociation, would make things much clearer.

tpfau commented 6 years ago

@cdanielmachado I don't see how additional reactions would 'not' allow GPRs to be formulated in a non-boolean fashion. For 1: Stoichiometries could be added just as you propose, with their respective coefficients, e.g.:

<reaction>
  <listOfReactants>
    <speciesReference species="aldB" stoichiometry="4"/>
  </listOfReactants>
  <listOfProducts>
    <speciesReference species="AldB" stoichiometry="1"/>
  </listOfProducts>
</reaction>

where AldB is a species that is annotated by SBO as being a protein (e.g. SBO:0000014), and aldB is annotated as being e.g. an mRNA (SBO:0000312). The reaction itself would have to be encoded either as a translation, SBO:0000184 (as in the instance above), or as a complex formation, SBO:0000526 (if multiple proteins are forming a complex).
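
To make concrete what this encoding does to the network structure (the concern raised in point 9), the following Python sketch appends the reaction above to a toy stoichiometric matrix. Everything here is an assumption for illustration: the matrix is stored as a plain dict of rows, the metabolites M1 and M2 are placeholders, and add_reaction is a made-up helper rather than a function of any COBRA tool.

```python
from typing import Dict, List


def add_reaction(
    matrix: Dict[str, List[float]],
    n_reactions: int,
    reactants: Dict[str, float],
    products: Dict[str, float],
) -> int:
    """Append one reaction column in place and return the new column count."""
    # Species not seen before get a new all-zero row.
    for species in list(reactants) + list(products):
        matrix.setdefault(species, [0.0] * n_reactions)
    # Every row gains one entry: products positive, reactants negative.
    for species, row in matrix.items():
        row.append(float(products.get(species, 0) - reactants.get(species, 0)))
    return n_reactions + 1


# Hypothetical one-reaction metabolic network: M1 -> M2.
S = {"M1": [-1.0], "M2": [1.0]}
n = add_reaction(S, 1, reactants={"aldB": 4}, products={"AldB": 1})

for species, row in S.items():
    print(f"{species:5s} {row}")
```

The matrix grows by two rows (aldB, AldB) and one column. A tool that is unaware of the convention would treat these as ordinary metabolites and an ordinary reaction, which is the trade-off debated in this thread.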

For 3: Yes, this would stay, but in general it's up to whoever is encoding the GPR to use a "proper" format (e.g. for the boolean case, a DNF form of the boolean rule). Any clause in such a form would correspond either to a single protein or to a functional protein complex. The corresponding species from the reactions above could be annotated accordingly, and fulfill exactly this role.
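
As a rough sketch of the DNF idea, the Python below expands an arbitrarily nested AND/OR tree into a set of clauses, each of which would correspond to one protein or complex. The tuple-based tree format and the names to_dnf and format_dnf are assumptions made for this example. No simplification is attempted, so a redundant clause such as (A and B) next to A is kept.

```python
from itertools import product
from typing import FrozenSet, Set, Tuple, Union

# A node is a gene id, or a tuple ("and" | "or", child, child, ...).
Node = Union[str, Tuple]


def to_dnf(node: Node) -> Set[FrozenSet[str]]:
    """Expand a nested boolean GPR tree into a set of AND-clauses."""
    if isinstance(node, str):
        return {frozenset([node])}
    op, *children = node
    child_dnfs = [to_dnf(child) for child in children]
    if op == "or":
        # OR: the alternatives of all children are pooled.
        return set().union(*child_dnfs)
    if op == "and":
        # AND: pick one clause from every child and merge the picks.
        return {frozenset().union(*combo) for combo in product(*child_dnfs)}
    raise ValueError(f"unknown operator: {op!r}")


def format_dnf(dnf: Set[FrozenSet[str]]) -> str:
    """Render clauses in a deterministic, sorted textual form."""
    clauses = sorted("(" + " and ".join(sorted(c)) + ")" for c in dnf)
    return " or ".join(clauses)


nested = ("and", "A", ("or", "B", "C"))
print(format_dnf(to_dnf(nested)))  # (A and B) or (A and C)
```

Here the nested rule A and (B or C) becomes two clauses, i.e. two alternative complexes AB and AC, which is the fixed-depth shape discussed in this thread.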

About 4: With "proper" annotation as above, this is solved, each species has its "correct" name, annotations etc pp. It even has the advantage that the fbc:label is not used in one way by one tool and another by another tool.

Overall: From a COBRA modelling perspective, I agree with you that an "easy" encoding of COBRA-specific information would be convenient. BUT: If we want to bridge between different fields and connect models, the more specialised the FBC package gets, the more problematic it will be to merge models, and we essentially create our own standard that is no longer SBML. Currently, if I want to provide a visualisation tool to give an overview of a network, I need to implement code that handles fbc in order to get GPR information, while this could, in theory, be encoded with just basic SBML features (that could be read and understood by any tool that interprets the full information available in the SBML, ignoring things like annotations for now and considering only those fields and data explicitly in the SBML specification, like SBOTerms). Yes, this would require effort from the side of the tool development community, BUT the same would be true for an extension of FBC.

About 9: I understand this point of view completely. The thing is, there are 2 options: Either a tool, that does not understand it, completely ignores the information (which would be the case in an update to fbc), or it handles them in an odd way (potentially adding things to the stoichiometric matrix, that should not be there). On the other hand: Even in the variant suggested by @matthiaskoenig or myself, the relations would just be ignored by most tools, as modifierSpecies are commonly just ignored. Thus the information would be lost (the same as for a change in fbc).

About 10: The SBML idea is that Protein, Gene, mRNA, and metabolite are all species. Different types of species, but species. With fbc this was broken up, and (given the current discussion about non-stoichiometric constraints) I'm not sure whether this was a good idea. True, it's easier to read from the SBML (and easier to have a look at the SBML as a human), but it leads to a conceptual separation that is completely specific to a given modelling field and that is not interpretable by those outside it (in my opinion contradicting the SBML idea).

cdanielmachado commented 6 years ago

I am not saying that this solution to add extra reactions would not allow encoding the GPRs. I agree it would allow, but I don't think it is the best approach.

If anyone wants to add extra reactions to a model to encode the GPRs, then I would prefer to keep things as they are.

My proposal is actually quite minimalistic, it only requires minor changes to the current fbc specification of GPR associations:

Also, if I understand your comment correctly, geneProduct currently means mRNA? Then the current specification is actually implementing Transcript-Reaction associations. I don't have a problem with that, I just think that since they are called GPRs (Gene-Protein-Reaction associations), we should stick to the definition as closely as possible.

tpfau commented 6 years ago

I don't think there is a clear meaning to geneProduct, it could be anything from mRNA to Protein, but I would claim from the name that it should not be the gene itself. My use of mRNA here was just an illustration of how it could be encoded in a non-GPR fashion. Thinking about it, this is another indicator that it might be good to drop this...

As I said, from the discussion on non additional constraints on the fbc mailing list, I get ever more convinced, that we should drop as much specialisation from our SBML IO as possible, because the more specialised it gets, the less likely it is, that other tools can understand it (and this includes visualisation tools), and the more information is lost when importing into a more basic tool.

cdanielmachado commented 6 years ago

The fact that no one seems to understand what geneProduct really is only reinforces my point that the current GPR specification is confusing.

The goal of the sbml-fbc standard is to suit the cobra community. I don't think that modifying the current GPR specification is gonna hamper or improve the compatibility with other communities.

tpfau commented 6 years ago

I need to backtrack here, given the current discussion on sbml-flux. You can't currently properly encode GPRs as reactions, since SBO-terms are completely optional, and might or might not be interpreted without the model result being allowed to change.

Under this light: I would advocate four conceptual items that can be used to form GPRs: Gene, Transcript, Protein, and Protein Complex, with the following relationships:

  1. 1 Gene -> N Transcripts (i.e. one gene, multiple transcripts; one way to do this is to have a transcript attribute fbc:Gene).

  2. 1 Transcript -> 1 Protein (conversion from the nucleic acid code to the amino acid code; Proteins can be used directly as catalysts on reactions).

  3. N Proteins -> 1 Protein Complex (protein complexes defined e.g. by the GPR rules), including stoichiometry.

A reaction could have either Proteins or Protein Complexes associated as catalysts.

cdanielmachado commented 6 years ago

I think that sounds very reasonable, although it becomes a bit more complex than I had initially proposed. I had not considered the case of alternative splicing (one gene -> N transcripts), good that you point it out, as it might cover the needs of people working on human models (or eukaryotes in general).

Since 1 Transcript directly encodes 1 Protein, maybe we could skip the intermediate Transcript and reduce this to 3 layers:

  1. 1 Gene -> N Protein (alternative splicing)

  2. N Protein -> 1 Complex (complex formation)

  3. N (Protein or Complex) -> 1 Reaction (alternative isozymes)

Anyway, at the end of the day, I am happy if we replace boolean operators with more biologically meaningful constructions, and have an association tree with a fixed depth (2 or 3 layers), instead of an arbitrarily nested tree.
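
The fixed-depth association tree described here could look roughly like the following Python data model. All class and function names (Protein, Complex, ReactionGPR, genes_required, to_boolean) are hypothetical and chosen only for this sketch. They do not correspond to existing FBC classes.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Protein:
    id: str
    gene_id: str  # layer 1: one gene may be named by several proteins


@dataclass
class Complex:
    id: str
    subunits: Dict[str, int]  # layer 2: protein id -> copies per complex


@dataclass
class ReactionGPR:
    reaction_id: str
    catalysts: List[Union[Protein, Complex]]  # layer 3: alternative isozymes


def genes_required(catalyst: Union[Protein, Complex],
                   proteins: Dict[str, Protein]) -> List[str]:
    """Genes needed for one catalyst, resolved through the protein layer."""
    if isinstance(catalyst, Protein):
        return [catalyst.gene_id]
    return sorted({proteins[pid].gene_id for pid in catalyst.subunits})


def to_boolean(gpr: ReactionGPR, proteins: Dict[str, Protein]) -> str:
    """Derive the legacy boolean rule; no recursion is needed."""
    clauses = []
    for catalyst in gpr.catalysts:
        genes = genes_required(catalyst, proteins)
        if len(genes) > 1:
            clauses.append("(" + " and ".join(genes) + ")")
        else:
            clauses.append(genes[0])
    return " or ".join(clauses)


proteins = {p.id: p for p in (Protein("P_A", "A"), Protein("P_B", "B"),
                              Protein("P_C", "C"))}
complex_ab = Complex("Complex_AB", {"P_A": 1, "P_B": 1})
gpr = ReactionGPR("R1", [complex_ab, proteins["P_C"]])
print(to_boolean(gpr, proteins))  # (A and B) or C
```

Since the depth is fixed at three layers, a boolean rule for knockout simulation falls out of two plain loops, while the subunit counts remain available for methods that need them.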

tpfau commented 6 years ago

As mentioned, I would reduce Gene to an attribute of the transcript anyway, but yes, you could also have 1 Gene -> multiple proteins instead of the transcript. Essentially, I would like to have the possibility to properly annotate transcripts/genes in some way that can be interpreted, and that would allow users to provide either genes or transcripts for their models without the software having to figure out which is which. In the end, the main caveat would anyway be to have people properly annotate the Proteins/Transcripts/Genes.

bgoli commented 6 years ago

This is a general response to some comments in the thread

There already is a mechanism for associating geneProducts with metabolites/species using the associatedSpecies attribute (FBC spec page 11, section 3.5), which allows for the Manchester style of encoding GPRs, as used in the Yeast consensus model.

The current FBC GPR format was designed for interoperability, thus encoding all existing models. In essence, the GPR association concept is a phenomenological link between the concept of genes and the activity of a reaction, one that exists in the absence of a proper mechanistic process description: essentially an annotation that allows modellers the flexibility to implement this linking in their own way.

cdanielmachado commented 6 years ago

@bgoli Can you please elaborate on "absence of a proper mechanistic process description"?

As far as I understand, GPR associations provide a mechanistic description of the association between gene function and enzymatic reactions: OR relations represent isozymes, and AND relations represent complex formation.

So, I repeat my initial question: why not treat things by their name?

We don't call metabolites and reactions rows and columns just because they are represented in a stoichiometric matrix.

bgoli commented 6 years ago

They are not mechanistic, as they do not represent the entire transcription, translation and post-translational modification processes.

In the wild various interpretations of these processes were included in the GPR so a more generic name was chosen by the people involved in the V2 specification. All those discussions are archived online.

cdanielmachado commented 6 years ago

Meaning there is no point in further discussing the topic?

tpfau commented 5 years ago

@bgoli As this is currently coming up again: Do you by chance have a link to the relevant discussions?

bgoli commented 5 years ago

@tpfau They are spread over a few months (Dec 2014 - April 2015) of the FBC v2 discussion. The gene encoding discussion started here.