opencobra / memote

memote – the genome-scale metabolic model test suite
https://memote.readthedocs.io/
Apache License 2.0

Missing GPRs #214

Open cdanielmachado opened 6 years ago

cdanielmachado commented 6 years ago

memote gives an error for all intracellular reactions without GPRs. However, it is not clear to me why a GPR should be mandatory. Maybe there could be evidence for the presence of a biochemical reaction occurring inside the cell and the respective mechanism is not known (could be spontaneous or enzymatic) or the respective enzyme was not yet identified.

ChristianLieven commented 6 years ago

Niko and I had started discussing this on the manuscript, but I will just paste the conversation here, so that the main discussion can happen here now.

what about spontaneous, uncatalyzed reactions?

  • Niko's comment on the same test.

My response:

No, right now there is no good way of identifying reactions which aren't supposed to have GPR. At least it is not possible without declaring some sort of tag or suffix to mark them, or by exempting 'known' spontaneous, uncatalyzed reactions. There is a possibility to lower the strictness of the test to accept a certain amount or percentage of reactions without GPR, would that be a better solution? I'd prefer though if we had a way to explicitly mark 'spontaneous' reactions as such. Perhaps that is something worth discussing in the openCOBRA Schema. That way we don't have to assume anything and leave it to the reconstructor to act on the available evidence during reconstruction.

cdanielmachado commented 6 years ago

I think spontaneous does not cover all cases. There could simply be reactions that are known (or supposed) to occur, but the respective protein has not been identified. I guess one could call it a gap-filling reaction. Anyway, my point was that having no GPR should not be considered wrong. I think this is something that could just be given as part of a summary report of the model.

ChristianLieven commented 6 years ago

In the manuscript Ines mentioned that:

The E. coli model used to use a fake gene s0001.

My response was:

That's a bit cryptic in my opinion. Ideally it'd be something verbose, so as to avoid confusion and make it immediately clear. Moritz suggested using SBO terms as a means of namespace-agnostic detection of the biomass reaction. I think it could work here too, although I am a big fan of having human-readable tags.

And I agree @cdanielmachado, we'd need to be more precise than just naming every reaction 'spontaneous', but I disagree with moving this check to the summary/statistics category. After all, the majority of metabolic reactions are catalysed enzymatically, and thus we'd lose this level of control.

Here's what I propose:

  1. Check for GPR rules
  2. If none are present check for the following annotation or SBO term to be present:
    • known to be spontaneous
    • known reaction, but unidentified mechanism
    • gap-filled for one reason or another
    • pseudo-reaction necessary for modeling
    • lumped reaction
    • .. (potentially more criteria the community comes up with)
  3. Assert that a reaction has either of the two above
  4. If not, tally up all the reactions that aren't clearly defined and fail the test
  5. If the origin of all reactions in the model is clearly defined the test is passed.
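
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Reaction` class is a minimal stand-in (not the cobrapy class), and the justification labels are hypothetical names for the categories in the list, since no agreed annotation or SBO vocabulary exists for them yet.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the categories listed above; the community has
# not agreed on any such vocabulary.
ACCEPTED_JUSTIFICATIONS = {
    "spontaneous",
    "unidentified_mechanism",
    "gap_filled",
    "pseudo_reaction",
    "lumped",
}


@dataclass
class Reaction:
    """Minimal stand-in for a model reaction (not the cobrapy class)."""
    id: str
    gpr: str = ""
    justifications: set = field(default_factory=set)


def find_undefined_reactions(reactions):
    """Return reactions with neither a GPR nor an accepted justification."""
    return [
        rxn for rxn in reactions
        if not rxn.gpr.strip()
        and not (rxn.justifications & ACCEPTED_JUSTIFICATIONS)
    ]


reactions = [
    Reaction("R1", gpr="gene_a"),                    # has a GPR
    Reaction("R2", justifications={"spontaneous"}),  # justified, no GPR
    Reaction("R3"),                                  # neither
]
undefined = find_undefined_reactions(reactions)
test_passes = not undefined  # fails as soon as one reaction is undefined
```

Here `R3` is the only reaction tallied as undefined, so the test would fail until it receives either a GPR or an explicit justification.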

ChristianLieven commented 6 years ago

I should add that I don't know if SBO terms exist for any of these follow-up checks.

tpfau commented 6 years ago

I doubt that SBO terms exist for those, and I highly doubt that most models contain this information. One additional check you could do besides GPR, as @draeger pointed out elsewhere, is to check for an EC-number annotation.

draeger commented 6 years ago

You can always request new SBO terms as needed using this form: https://sourceforge.net/p/sbo/term-request/

ChristianLieven commented 6 years ago

I doubt that SBO terms exist for those, and I highly doubt that most models contain this information.

I see memote primarily as a tool to bring models to the same level of quality. While we do have the opportunity to influence the trends or even set new conventions with it, I don't want to implement something that isn't backed by the community. So what do you think: as a way to classify reactions without a GPR by giving them an explicit justification, would the above suggestion be a sensible addition?

We do check for an EC-number annotation, but in a different context.

tpfau commented 6 years ago

I would:

  1. Check for GPR Rule

  2. Check for EC-Number. Potentially: check other annotations (e.g. a UniProt link would indicate an enzymatic reaction).

  3. Check SBO Term - With additional requested Terms.

If all reactions are defined or clearly associated: Succeed, otherwise fail that quality test.

If I get your intention, your tool should serve as a quality check for models. As such you can set some standards, and I think it makes sense to have reaction types associated/properly annotated. And now the but: SBO terms are very often set to a specific "basic" term without that term being correct (not in the COBRA Toolbox, where it's just unassigned by default, but I have seen this in multiple models). So a model that passes this test only tells you that the type is set, not that it makes sense (which is problematic imo). What I would suggest in general: if a reaction has no GPR rule but an SBO term that says enzyme-catalysed, then something is off. An SBO term "unknown mechanism" would be fine, and even "enzymatic with unknown enzyme" makes sense, but if you check these things, make sure that the different fields are consistent, e.g.:

  • No GPR -> should not be enzymatic, or should at least indicate SBO: unknown enzyme.
  • SBO: unknown enzyme -> not allowed to have a GPR (otherwise people will use unknown enzyme as the default SBO term).
  • etc.
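
The two consistency rules could be checked along these lines. This is a minimal sketch: the reaction-type labels (`"enzymatic"`, `"unknown_enzyme"`, `"unknown_mechanism"`) are hypothetical stand-ins for SBO terms that would still have to be requested, and `consistency_problems` is an illustrative helper, not part of memote.

```python
def consistency_problems(gpr, reaction_type):
    """List contradictions between a reaction's GPR and its declared type.

    `reaction_type` is a hypothetical label standing in for an SBO term.
    """
    problems = []
    has_gpr = bool(gpr.strip())
    if not has_gpr and reaction_type == "enzymatic":
        # Declared enzymatic, but no gene association backs that up.
        problems.append("enzymatic reaction without GPR")
    if has_gpr and reaction_type == "unknown_enzyme":
        # Prevents "unknown enzyme" from being used as a default term.
        problems.append("GPR present but enzyme marked as unknown")
    return problems


no_gpr_enzymatic = consistency_problems("", "enzymatic")
gpr_unknown_enzyme = consistency_problems("gene_a or gene_b", "unknown_enzyme")
no_gpr_unknown_mechanism = consistency_problems("", "unknown_mechanism")
```

Only the first two calls report a problem; a reaction without a GPR that is declared as having an unknown mechanism is consistent.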

draeger commented 6 years ago

  1. Just in case, also check for the presence of a modifierSpeciesReference on that reaction whose sboTerm attribute is a child of the term catalyst (see http://www.ebi.ac.uk/sbo/main/tree?open=14&nodeId=14#SBO_14_0).
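
Such a check might look like the sketch below, which uses only the standard library on a toy SBML Level 3 snippet. The set of accepted term IDs is an assumption supplied by the caller: the two IDs shown are assumed to be the catalyst term and one of its children and should be verified against the SBO tree linked above. A real implementation would derive the children from the ontology itself.

```python
import xml.etree.ElementTree as ET

SBML_NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

# Assumption: IDs treated as "catalyst or a child of it". Verify these
# against the SBO tree before relying on them.
CATALYST_TERMS = {"SBO:0000013", "SBO:0000014"}

# Toy document invented for this example.
SNIPPET = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy">
    <listOfReactions>
      <reaction id="R1" reversible="false">
        <listOfModifiers>
          <modifierSpeciesReference species="E1" sboTerm="SBO:0000014"/>
        </listOfModifiers>
      </reaction>
      <reaction id="R2" reversible="false"/>
    </listOfReactions>
  </model>
</sbml>"""


def reactions_with_catalyst(sbml_text, catalyst_terms):
    """Return IDs of reactions that carry a modifier with a catalyst term."""
    root = ET.fromstring(sbml_text)
    found = []
    for rxn in root.iterfind(".//sbml:reaction", SBML_NS):
        modifiers = rxn.iterfind(
            "sbml:listOfModifiers/sbml:modifierSpeciesReference", SBML_NS
        )
        if any(mod.get("sboTerm") in catalyst_terms for mod in modifiers):
            found.append(rxn.get("id"))
    return found


catalysed = reactions_with_catalyst(SNIPPET, CATALYST_TERMS)
```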

ChristianLieven commented 6 years ago

In response to @tpfau: Yeah, that seems reasonable. I haven't really worked with SBO terms much yet myself, because, as you said, usually they are either not included in models or are simply just a "basic" term. So they generally don't carry any additional relevant information that cannot otherwise be obtained from the model itself.

I am still a bit reluctant to rely on them because when skimming through a model, one has to 'decode' the terms. I'm a big fan of explicit 'tags' or extra attributes because then any user (with a bit of a background in biology) can look at a model's components and immediately know what's going on. However, I don't want to reinvent the wheel either, and sboTerms seem to have all cases covered and, if not, are extensible.

It may be redundant, but I'm fond of having an overview as opposed to storing the information in several containers (GPR in fbc:geneProductAssociation -> EC in annotation -> SBO terms in sboTerm). Perhaps it may be better to re-establish and extend the confidence score attribute/annotation into the system we have been discussing above, and only use the SBO terms as a second sanity check.

But then again, I'm not convinced of that either, because the confidence score, to me, should mark the 'trustworthiness' of a reaction based on hard evidence with literature references, something like (Score: "4" translates to "Purified and characterised enzyme" with attribute DOI: xyz)

@draeger: Would this point back to the genes from the GPR? The modifierSpeciesReference that is.

draeger commented 6 years ago

It is definitely possible to annotate a modifierSpeciesReference using appropriate qualifiers to express how it is formed. There could be, for instance, a relationship between the modifier and genes using isEncodedBy (see a full list of qualifiers here: http://co.mbine.org/standards/qualifiers). So you could essentially say what forms that modifier, but it is not expressive enough to define logical rules as given by GPRs that would tell you how this is being formed (or under which conditions).

tpfau commented 6 years ago

The problem with confidenceScores is that they are barely existent for anything that's not a model organism, as we often enough just "don't know". About SBO terms: yes, you have to decode them, but honestly, I wouldn't want to skim through the contents of a larger SBML file (I only do this when I run into modelling issues that I can't believe stem from the SBML, to check whether this is actual data from the SBML...). And if you have a GUI, the translation would be quite easy.

wrt the modifierSpecies: You won't have a direct SBML-specific relation, as a GeneProduct in fbc cannot serve as a modifierSpecies. That's my main issue. Yes, you can form the connection via annotations, but it would be a very different use from the common use of annotations (i.e. instead of referring to outside databases, you would now point to a specific species/GeneProduct). However, if you have a modifierSpecies and no GPR, that modifierSpecies is either a pseudo-enzyme (for which no GPR is known) or a non-enzymatic catalyst (e.g. some metal ion, or some small compound).

draeger commented 6 years ago

Well, you can, of course, write any resource in a MIRIAM annotation that you like, such as a gene identifier within the model. However, the idea here is also to point to external databases. Sorry if this wasn't clear.

tpfau commented 6 years ago

@draeger Essentially, two items in the SBML would indicate that they are identified by the same external object. And, in contrast to the current situation, it would be necessary to check these cross-references. That's why I would rather suggest having an additional optional fbc field in the species (that the modifierSpecies refers to), which indicates that this species is a specific gene product. This would be a direct link within the SBML specification, instead of an indirect link that we have to detect.

ChristianLieven commented 6 years ago

The problem with confidenceScores is that they are barely existent for anything that's not a model organism, as we often enough just "don't know".

I suppose requiring them in a well-defined format through tests in memote could mitigate this i.e. make users aware and encourage them to start providing information wherever they can. The lowest category could just be "don't know" or "no data available" by default, but then it is at least explicit that this is the case.
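
An explicit default could be modelled as in the sketch below. It is only an illustration: the class and field names are hypothetical, using level 0 as the lowest category is an assumption, and only the two levels mentioned in this thread are labelled, since the remaining scale has not been defined here.

```python
from dataclasses import dataclass
from typing import Optional

# Only the levels mentioned in this thread are labelled. Using 0 as the
# lowest level is an assumption made for this example.
SCORE_LABELS = {
    0: "no data available",
    4: "Purified and characterised enzyme",
}


@dataclass
class Confidence:
    """Hypothetical confidence annotation for a single reaction."""
    score: int = 0             # explicit "no data available" by default
    doi: Optional[str] = None  # literature reference backing the score

    @property
    def label(self):
        return SCORE_LABELS.get(self.score, "unlabelled level")


unknown = Confidence()
characterised = Confidence(score=4, doi="xyz")  # placeholder DOI, as above
```

With this shape, a reaction that was never assessed still states explicitly that no data is available, rather than leaving the field empty.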

I wouldn't want to skim through the contents of a larger SBML file.

That issue I didn't consider, but I agree. File sizes are exploding enough already. I shall try to approach this issue as we've discussed above, then, by piecing together the information from different places. SBO seems like a powerful way of doing it, and once we've extended it I may be able to primarily rely on that. With regards to that, I was wondering, what is the most effective way currently to add SBO terms to a model? I'm not sure if cobrapy internally supports that, but I assume the COBRA Toolbox does?

That's why I would rather suggest having an additional optional fbc field in the species (that the modifierSpecies refers to), which indicates that this species is a specific gene product. This would be a direct link within the SBML specification, instead of an indirect link that we have to detect.

I like this idea. Do I understand it correctly that in the context of reactions this could be used for cofactors that aren't consumed in the actual reaction, or would it be for regulation, or both?

tpfau commented 6 years ago

SBML modifierSpecies are only indicators that a specific reaction is in some way modified by the indicated modifierSpecies (which itself has to point to a species in the model). So this includes regulatory events as well as catalysts which are not consumed. You can add an SBO term to the species to indicate its function. But I don't think anyone in the COBRA field is using them in this way yet. There are models (primarily HMR) which provide their GPR associations as modifierSpecies, but that leaves them able to indicate only a gene association, not a real rule (because it's unclear whether the modifiers are connected by AND or by OR). As for confidence scores: Yes, but this will lead to very low confidence scores on many models. Admittedly this might be a good thing given the quality of a lot of models around :)
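
The loss of information when a GPR is stored as a flat set of modifiers can be seen in a small sketch. The helper and the gene IDs are hypothetical, and the tokenizer assumes gene IDs made of letters, digits and underscores.

```python
import re


def genes_in_rule(rule):
    """Reduce a GPR rule to the flat set of genes it mentions.

    This mirrors what remains when each gene is stored as a modifier:
    the AND/OR structure of the rule is discarded.
    """
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", rule)
    return {tok for tok in tokens if tok.lower() not in {"and", "or"}}


complex_rule = "gene_a and gene_b"  # both products needed (a complex)
isozyme_rule = "gene_a or gene_b"   # either product suffices (isozymes)

# Two biologically different rules collapse to the same set of modifiers.
same_modifiers = genes_in_rule(complex_rule) == genes_in_rule(isozyme_rule)
```

Here `same_modifiers` is true, so the original rule cannot be recovered from the modifiers alone.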

ChristianLieven commented 6 years ago

As for confidence Scores: Yes, but this will lead to very low confidence scores on many models. Admittedly this might be a good thing given the quality of a lot of models around :)

Yeah, a decent way to express confidence scores is definitely something that should find its way into the test. If not to improve model confidence, then to raise awareness.

ChristianLieven commented 5 years ago

Seems like a decent solution resorting to the use of ECO terms has been proposed and generally accepted here:

https://github.com/draeger-lab/ModelPolisher/issues/5

I believe it only needs to be implemented in COBRApy for memote to start checking for this. @matthiaskoenig will this already be part of the new SBML I/O functionality that you've built for COBRApy?