opencobra / schema

xml/rdf schemas for annotating cobra models

Apache License 2.0


In cobra.io neither sbml.py nor sbml3.py seem to import or export notes. #4

Open Midnighter opened 7 years ago

Midnighter commented 7 years ago

From @ChristianLieven on July 6, 2017 16:6

Problem description

I am currently reconstructing a metabolic model, for which I am adding confidence scores, comments, and literature references in the notes attribute of reactions, metabolites and genes. The importance of confidence scores and related qualitative annotation parameters is discussed in the publications linked above.

I tried importing simple notes by adding the following notes field to the RECON1 model from BiGG.

    <notes>
      This is a TEST
      I am wondering if COBRApy is able to import this.
    </notes>

I was quite surprised that the RECON1 model did not contain the confidence scores on which some of the results of this research are based.

I was not able to find the keywords 'confidence', 'score' or 'confidence_score' in either cobra.io.sbml or cobra.io.sbml3. If I read it correctly, the legacy import looks specifically for charge, GPR, and subsystem in the notes field but doesn't account for the confidence score.

Code Sample

You can find my modified example SBML3+FBC RECON1 file here. The modification is at R_EX_dopa_e.

Discussion

It seems like the community hasn't decided yet what exactly the notes field should contain and how it should be formatted. Personally, I'd find it most useful if there were a clever way of allowing both short, human-readable comment entries and optional, specifically related, machine-readable DOI-style literature references. In the model object, I suppose this could be a nested dictionary looking something like this: some_model.reactions.SOME_RXN.notes = {"confidence_score":{"value":4, "reference":"some_doi"}}

Based on the referenced publications above, another useful key of the notes-field/attribute would be a simple 'comment' option, which would be limited in length (50 chars? 70 chars? 80 chars?).

some_model.metabolites.some_metabolite.notes = {"comment":{"value":"Short string outlining a hypothesis or specific decision for this metabolite", "optional_reference":"some_doi"}}
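A minimal sketch of the nested dictionary proposed above, using plain Python dictionaries rather than any cobrapy object. The helpers `make_note` and `make_comment`, the 80-character limit, and the key names are assumptions for illustration; the thread has not settled on any of them.

```python
# Hypothetical limit; the proposal only floats 50, 70 or 80 characters.
MAX_COMMENT_LENGTH = 80


def make_note(value, reference=None):
    """Return one notes entry: a value plus an optional DOI-style reference."""
    entry = {"value": value}
    if reference is not None:
        entry["reference"] = reference
    return entry


def make_comment(text, reference=None):
    """Return a comment entry, rejecting text longer than the assumed limit."""
    if len(text) > MAX_COMMENT_LENGTH:
        raise ValueError(
            f"comment has {len(text)} characters, limit is {MAX_COMMENT_LENGTH}"
        )
    return make_note(text, reference)


# Notes for one model component, keyed by the kind of information stored.
# The comment text is a made-up placeholder.
notes = {
    "confidence_score": make_note(4, reference="some_doi"),
    "comment": make_comment("Short remark on this component", reference="some_doi"),
}
print(notes["confidence_score"]["value"])
```

Keeping the reference optional lets a short human-readable remark stand on its own while still allowing a machine-readable DOI to be attached when one exists.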

I don't doubt that there could be a feasible, simple implementation on the Python side of things; however, I am unfamiliar with the options on the XML side, specifically SBML. According to the SBML specification, a notes field is allowed to contain...

Almost any wellformed content permitted in XHTML subject to a few restrictions

...which seem pretty straightforward, namely the notes field...

must not contain an XML declaration or a DOCTYPE declaration.

Hence, I think a solution here could be to use <ul> from HTML?

What do you think?

Copied from original issue: opencobra/cobrapy#541

Midnighter commented 7 years ago

From @cdiener on July 6, 2017 18:54

That is a good point and one that pops up every once in a while for discussion. There is some ongoing discussion about the meaning of the SBML spec regarding the notes field. SBML only says:

It is intended to serve as a place for storing optional information intended to be seen by humans.

and comparing to annotation:

Whereas Notes is a container for content to be shown directly to humans, Annotation is a container for optional software-generated content not meant to be shown to humans.

The interpretation of the cobrapy maintainers in the past was that since notes should not be "consumed by a machine", they would not be written or read by cobrapy except for supporting the SBML 2 cobra annotations. The argument was that all annotation should go into the annotation tag as described in the spec. For the particular use case of DOI annotations, this is the recommended solution. There is a MIRIAM tag for DOIs, so you can just use that. For instance, the following is valid SBML and would be read into model.metabolites.h_c.annotation in cobrapy:

<annotation>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#"
    xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
    xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
    <rdf:Description rdf:about="#M_h_c">
      <bqbiol:is>
        <rdf:Bag>
          <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00080"/>
          <rdf:li rdf:resource="http://identifiers.org/doi/10.1038/nbt1156"/>
        </rdf:Bag>
      </bqbiol:is>
    </rdf:Description>
  </rdf:RDF>
</annotation>
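To illustrate how such an annotation can be consumed, here is a small sketch that pulls the resource URIs out of the RDF using only the standard library. It is not how cobrapy reads annotations; the function name `resources_by_qualifier` is hypothetical, and the unused namespace declarations are dropped for brevity.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL_NS = "http://biomodels.net/biology-qualifiers/"

ANNOTATION = """
<annotation>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
    <rdf:Description rdf:about="#M_h_c">
      <bqbiol:is>
        <rdf:Bag>
          <rdf:li rdf:resource="http://identifiers.org/kegg.compound/C00080"/>
          <rdf:li rdf:resource="http://identifiers.org/doi/10.1038/nbt1156"/>
        </rdf:Bag>
      </bqbiol:is>
    </rdf:Description>
  </rdf:RDF>
</annotation>
"""


def resources_by_qualifier(annotation_xml):
    """Map each bqbiol qualifier (e.g. 'is') to its list of resource URIs."""
    root = ET.fromstring(annotation_xml.strip())
    result = {}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for qualifier in description:
            # Skip children that are not biology qualifiers.
            if not qualifier.tag.startswith(f"{{{BQBIOL_NS}}}"):
                continue
            name = qualifier.tag.split("}", 1)[1]
            uris = [
                li.get(f"{{{RDF_NS}}}resource")
                for li in qualifier.iter(f"{{{RDF_NS}}}li")
            ]
            result.setdefault(name, []).extend(uris)
    return result


parsed = resources_by_qualifier(ANNOTATION)
print(parsed["is"])
```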

However, that only works for direct annotations and not for adding data. For instance, if I want to add some other quantity to the species or reaction (confidence scores or charge in various conditions, etc.), there is no way to do that with annotations. This is a shortcoming of SBML IMHO. So I would be in favour of reading and writing the notes field. It could be just raw text, or it could be a dictionary that is read from and written to <ul> tags as you specified, and written into a <p> tag if it's just a string. But that would depend on how others interpret the SBML spec here.
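The <ul>/<p> idea could look roughly like the sketch below. It is only an illustration of the round trip: the "key: value" layout inside each <li>, and the function names, are assumptions rather than anything cobrapy implements.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def notes_to_xhtml(notes):
    """Serialize a string as <p>, or a flat dict as <ul> with 'key: value' items."""
    if isinstance(notes, str):
        element = ET.Element(f"{{{XHTML_NS}}}p")
        element.text = notes
    else:
        element = ET.Element(f"{{{XHTML_NS}}}ul")
        for key, value in notes.items():
            item = ET.SubElement(element, f"{{{XHTML_NS}}}li")
            item.text = f"{key}: {value}"
    return element


def xhtml_to_notes(element):
    """Invert notes_to_xhtml; all dictionary values come back as strings."""
    if element.tag == f"{{{XHTML_NS}}}p":
        return element.text or ""
    notes = {}
    for item in element.findall(f"{{{XHTML_NS}}}li"):
        key, _, value = (item.text or "").partition(": ")
        notes[key] = value
    return notes


as_list = notes_to_xhtml({"confidence_score": 4, "reference": "some_doi"})
as_paragraph = notes_to_xhtml("This is a TEST")
print(xhtml_to_notes(as_list), xhtml_to_notes(as_paragraph))
```

Note that the round trip loses type information (the score 4 comes back as the string "4"), which is one of the reasons the notes field is an awkward place for data.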

Midnighter commented 7 years ago

From @ChristianLieven on July 11, 2017 14:24

#534: Referencing this issue because @draeger, @Midnighter and @hredestig came up with this solution, which I consider quite optimal:

We are not aware of any existing schema or documentation of the annotation tags used in cobra. Our suggestion is to create a new repository under the opencobra organization. That way, any member of the opencobra community (most importantly of the Matlab COBRA Toolbox) can feel free to contribute to the schema, there can be versioned releases of the schema, and for the time being it can be hosted on https://opencobra.github.io/annotations/schema or whatever is decided for the name and URL.

We would then implement in cobrapy whatever is dictated by the schema and there's a chance for other tools in the opencobra community to do the same.

Midnighter commented 7 years ago

From @draeger on July 19, 2017 14:0

Well, there is, of course, another way of storing confidence scores for reactions in a standard-compliant form. You could use Parameter objects for this. These are objects in the listOfParameters directly within the model and have an id, an optional name, and a value. You could prefix each parameter's id with the id of the reaction that the confidence score refers to. However, this would again not be the best way of storing that sort of information because it is not obvious what these parameters are.
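A rough sketch of the Parameter approach, built with the standard library instead of libSBML. The naming convention (`<reaction id>__confidence_score`) and the function names are hypothetical, and the SBML namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical convention: the reaction id is the prefix of the parameter id.
SUFFIX = "__confidence_score"


def scores_to_parameters(scores):
    """Build a listOfParameters element with one parameter per reaction score."""
    parameters = ET.Element("listOfParameters")
    for reaction_id, score in scores.items():
        ET.SubElement(
            parameters,
            "parameter",
            id=f"{reaction_id}{SUFFIX}",
            value=str(score),
            constant="true",
        )
    return parameters


def parameters_to_scores(parameters):
    """Recover {reaction id: score} from parameters that follow the convention."""
    scores = {}
    for parameter in parameters.findall("parameter"):
        parameter_id = parameter.get("id")
        if parameter_id.endswith(SUFFIX):
            scores[parameter_id[: -len(SUFFIX)]] = float(parameter.get("value"))
    return scores


element = scores_to_parameters({"R_EX_dopa_e": 4})
print(ET.tostring(element, encoding="unicode"))
```

The weakness mentioned above is visible here: nothing but the id suffix tells a reader, or a tool, that these parameters are confidence scores.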

luciansmith commented 7 years ago

I fully support the idea of coming up with your own schema to store information in the 'annotation' child of SBML objects; I think this is a great idea. However, there are a couple things you've mentioned wanting that you could store in SBML packages:

Groups of objects (i.e. metabolites) could be stored using the 'groups' package (This was discussed in https://github.com/opencobra/schema/issues/3).
Confidence intervals could be stored using the 'distrib' package for distributions.

The 'groups' package is released and ready to use today. The 'distrib' package has not yet been finalized, so if there's anything you need that is not yet there, it would be relatively straightforward to add it (I've been in charge of shepherding that package to completion; email me and/or the package working group at sbml-distrib@lists.sourceforge.net if you have questions or requests.)

cdanielmachado commented 6 years ago

I see pros and cons of having the notes field and the annotations field, and the fact that one is supposed to be human-readable and the other machine-readable.

The thing is... what if you want to have something that is both human-readable and machine-readable? It is very nice and convenient just to have the best of both worlds.

I have added support for an extended set of metabolite and reaction attributes in framed and carveme.

When reading/writing an SBML file I parse attributes in the form of "key: value" pairs which are stored in the notes field. These are then stored inside the Metabolite and Reaction objects, using an attribute called "metadata" which is just a python dictionary.

This metadata includes things like formulas, ec numbers, manual curation notes, etc. I frequently use these attributes to implement different kinds of methods (e.g.: delta G values for thermodynamic FBA).
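The "key: value" convention described above can be sketched as follows. This is not the code used in framed or carveme; the function names, the key names, and the example values are made up for illustration.

```python
def parse_key_value_notes(lines):
    """Collect 'key: value' lines into a dict, skipping lines without a colon."""
    metadata = {}
    for line in lines:
        key, separator, value = line.partition(":")
        if not separator:
            continue  # free text, not a key-value pair
        metadata[key.strip()] = value.strip()
    return metadata


def format_key_value_notes(metadata):
    """Turn the metadata dict back into 'key: value' lines for the notes field."""
    return [f"{key}: {value}" for key, value in metadata.items()]


# Hypothetical notes content for one reaction.
metadata = parse_key_value_notes(
    ["FORMULA: C6H12O6", "just a remark", "EC_NUMBER: 2.7.1.1"]
)
print(metadata)
```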

I think that constantly extending SBML with new attributes every time someone needs a new attribute is not very sustainable in the long term. You need to wait for a new release of the fbc package, which takes a lot of time, and in the meantime, people have already come up with their own workarounds.

One possible solution (not ideal, I know) is to have these dictionaries of extended attributes, and the subset of people who want to use a particular attribute (like delta G value), or implement support for it in their simulation libraries, just come together and agree on a suitable identifier name.

draeger commented 6 years ago

@luciansmith: One comment about the confidence scores. These are not confidence intervals from a distribution. These are typically discrete numbers (often from 0 to 4) indicating the level of knowledge the model creator has that the component should be in the model. The numbers correspond to categories such as "read in a paper," "experimentally verified," "from a related organism," "computationally inferred," or similar. I, therefore, believe that the distrib package is not the right recommendation for storing this kind of information.

@cdanielmachado: I think it would also be good to create a dedicated new package for adding additional properties to model components. The SBML extension would only extend SBase so that you can add a triple: an ontology term, a value (either qualitative or quantitative), and a third attribute giving the data type of the value. For instance, an ontology term for Gibbs free energy would be one attribute, and the value would be stored as a string. The third attribute would indicate that the value is a floating-point number so that a software package could parse it. The ontology could be continuously extended and improved, independently of the SBML extension package. In this way, we could systematically add many kinds of values. The package's specification should give best practices so that information that belongs in other, more specific fields is not stored there. For instance, EC numbers should go into MIRIAM annotations.
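The proposed triple (ontology term, string value, data type) might be modelled like this. No such SBML package exists in this thread; the class name, the set of data types, the placeholder term `EXAMPLE:0000001`, and the numeric value are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical set of data types the third attribute could name.
PARSERS = {"float": float, "int": int, "string": str}


@dataclass(frozen=True)
class Property:
    term: str      # ontology term identifying what the value means
    value: str     # always stored as a string, as suggested above
    datatype: str  # tells software how to parse the string

    def parsed(self):
        """Return the value converted according to the declared data type."""
        return PARSERS[self.datatype](self.value)


# Placeholder term and made-up number standing in for a Gibbs free energy.
gibbs = Property(term="EXAMPLE:0000001", value="-12.5", datatype="float")
print(gibbs.parsed())
```

Because the meaning lives in the ontology term rather than in the attribute name, new kinds of values would not require a new release of the package.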

matthiaskoenig commented 6 years ago

The confidence score is basically an evidence annotation. Personally, I would just annotate this with an evidence ontology, which has much more fine-grained evidence handling (and especially the tree relationship between the different confidence/evidence terms): http://www.evidenceontology.org/browse/

This is a much more universal and reusable solution than using an arbitrary evidence category of 0-4. You could easily map your 0-4 to the respective terms, but at the same time it would allow others to work with your confidence and use it for inferences.

Basically you have everything you need, and

Term id: ECO:0005549

Term name: biological system reconstruction evidence based on homology evidence

Definition: A type of biological system reconstruction where the evidence is inferred by homology based on conservation of sequence, function, and composition from an existing experimentally supported model to a process, pathway, or complex. [ECO:SN, PMID:15660128]

Comment: Inference may be based on paralogy and/or orthology of the genome-encoded components and is made primarily on functional conservation between the two systems. The sequences and number of genome-encoded components are fairly conserved but some divergence is observed. Evidence may originate from a combination of several experiments in the same or another species.

is much cleaner than writing "2 from related organism".

Matthias
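A sketch of the score-to-term mapping suggested above. Only the entry for score 2 follows from the comment; any further entries would have to be chosen by the curators. The identifiers.org URI pattern for ECO terms and the function name are assumptions.

```python
# Only score 2 ("from a related organism") is paired with a term in the
# comment above; the other scores are deliberately left unmapped.
SCORE_TO_ECO = {2: "ECO:0005549"}


def eco_uri(score):
    """Return an assumed identifiers.org URI for a legacy score, or None."""
    term = SCORE_TO_ECO.get(score)
    if term is None:
        return None
    return f"http://identifiers.org/eco/{term}"


print(eco_uri(2))
```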


ChristianLieven commented 6 years ago

Personally I would just annotate this to an evidence ontology, which has a much more fine grained evidence handling (and especially the tree relationship between the different confidence/evidence): http://www.evidenceontology.org/browse/

I can get behind using an evidence ontology instead of the rather arbitrary confidence scores that are floating around.

Just to get us back on track, however, my initial question was more aimed at finding the best way of connecting any annotation-information with both a human-readable note AND a machine-readable DOI. So through this schema, I'd like to consolidate a way that this can be done consistently for COBRA models. The whole reason for this is: Using memote, I want to be able to not only gather information on the number of annotations for any given model component but also provide information on the amount and quality of evidence backing up these annotations.

To take up Matthias's suggestion of ECO again, I could imagine a possible metric being the ratio of experimental evidence to genomic context evidence for a given metabolic model. Or I could simply provide an overview of evidence types.
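Such a metric could be computed along these lines. This is not memote code; the function names and the category labels `experimental` and `genomic_context` are placeholders standing in for whatever ECO terms end up being used.

```python
from collections import Counter


def evidence_overview(evidence_by_component):
    """Count how many components carry each evidence category."""
    return Counter(evidence_by_component.values())


def evidence_ratio(evidence_by_component, numerator, denominator):
    """Ratio of two evidence categories, or None if the denominator is absent."""
    counts = evidence_overview(evidence_by_component)
    if counts[denominator] == 0:
        return None
    return counts[numerator] / counts[denominator]


# Hypothetical evidence assignments for three reactions.
evidence = {
    "R1": "experimental",
    "R2": "experimental",
    "R3": "genomic_context",
}
print(evidence_overview(evidence))
print(evidence_ratio(evidence, "experimental", "genomic_context"))
```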

Edit: Ignore my comment above, I'm retracing all the things said back in July to get back into the discussion, and found that in #3 @draeger has already pointed out a suitable solution for this.

You can do something like this using MIRIAM annotations. This gives you a method to specify an online resource (such as a publication identifier) and state the relationship between the model component and that online resource. For instance, you can say IS_DESCRIBED_BY and then add the resource http://identifiers.org/pubmed/25562137 which is exactly the publication you cited above. For more information, please see http://identifiers.org or http://www.ebi.ac.uk/miriam/main/collections/MIR:00000113
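For illustration, the IS_DESCRIBED_BY relationship can be written out as RDF with the standard library. The function name `described_by` and the use of R_EX_dopa_e as the meta id are assumptions; in practice a library such as libSBML would normally generate this.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL_NS = "http://biomodels.net/biology-qualifiers/"

# Make the serializer emit the conventional prefixes instead of ns0, ns1.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("bqbiol", BQBIOL_NS)


def described_by(meta_id, resources):
    """Return an <annotation> string linking a component to its references."""
    annotation = ET.Element("annotation")
    rdf = ET.SubElement(annotation, f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": f"#{meta_id}"}
    )
    qualifier = ET.SubElement(description, f"{{{BQBIOL_NS}}}isDescribedBy")
    bag = ET.SubElement(qualifier, f"{{{RDF_NS}}}Bag")
    for uri in resources:
        ET.SubElement(bag, f"{{{RDF_NS}}}li", {f"{{{RDF_NS}}}resource": uri})
    return ET.tostring(annotation, encoding="unicode")


xml_text = described_by("R_EX_dopa_e", ["http://identifiers.org/pubmed/25562137"])
print(xml_text)
```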

ChristianLieven commented 6 years ago

Looks like the discussion at https://github.com/SBRG/ModelPolisher/issues/5 provided an excellent solution for this issue without necessarily needing to reinvent the wheel with a new schema.

bdelepine commented 6 years ago

Hi all,

From what I read above, in associated issues, and in the SBML L3V2 documentation, I understand that in <annotation> we can annotate pretty much anything that refers to a concept or an external resource, using the right combination of relation element (bqmodel:is, bqbiol:isDescribedBy, etc.) and ontologies (SBO terms, evidence ontology, etc.) defined in external namespaces (rdf, bqmodel, vcard4, etc.).

But I still can't find a way to encode data, such as Gibbs free energy, in <reaction>. Other examples mentioned earlier in this thread can benefit from a readily available ontology (confidence score) or already have their own dedicated SBML feature (curator name, date of modification, etc.; see the History section, 6.6).

In my opinion, COBRA should not parse anything within <notes>, in order to respect the "human-only" SBML specification, but should still import/export whatever is in <notes> as a blob to make it available to users. This would allow them to hack their way around when they don't want to use a separate file to store data.
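The "blob" approach could be as simple as the following sketch, which copies the <notes> child verbatim without interpreting it. The function names are hypothetical, and the SBML namespace is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET


def read_notes_blob(element):
    """Return the <notes> child as an uninterpreted string, or None if absent."""
    notes = element.find("notes")
    if notes is None:
        return None
    return ET.tostring(notes, encoding="unicode").strip()


def write_notes_blob(element, blob):
    """Replace the <notes> child of element with the stored blob, if any."""
    old = element.find("notes")
    if old is not None:
        element.remove(old)
    if blob is not None:
        # Insert as the first child, where the source file had it.
        element.insert(0, ET.fromstring(blob))


source = ET.fromstring(
    '<reaction id="R_EX_dopa_e"><notes><p>This is a TEST</p></notes></reaction>'
)
blob = read_notes_blob(source)

target = ET.fromstring('<reaction id="R_EX_dopa_e"/>')
write_notes_blob(target, blob)
print(ET.tostring(target, encoding="unicode"))
```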

Note that @draeger proposed creating an SBML package that would be generic enough to solve this kind of problem.

draeger commented 6 years ago

@bdelepine, thanks for pointing out that notes aren't the right place to store machine-readable information. A few additional comments from my side:

Confidence Scores. This discussion has come to a good solution already in a separate thread: https://github.com/draeger-lab/ModelPolisher/issues/5
Additional Data, such as Gibbs energy values: @bgoli and @fbergmann suggested extending the fbc package for SBML with an additional key-value-pair list where this could go. Maybe they can provide a link to their proposal?

bgoli commented 6 years ago

@draeger here we go: http://pysces.sourceforge.net/KeyValueData/

Note that in the "practical example" some of the terms can be written as MIRIAM URIs or are now encoded in FBC; the keys are arbitrary.

I've been using this for a few years in my tools, and it is simple to parse as an SBML annotation and extremely flexible. In general, I've found the "type" attribute to be practically redundant. One extension I'm considering is to add a "url" attribute to the element that will act as an optional/supplementary controlled key.
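To give a feel for how little code such a key-value annotation needs, here is a sketch with the standard library. The element and attribute names (`listOfKeyValueData`, `data`, `id`, `value`, `url`), the example keys, and the example values are assumptions for illustration and may differ from the linked proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation content; names and values are placeholders.
EXAMPLE = """
<annotation>
  <listOfKeyValueData>
    <data id="delta_g" type="float" value="-12.5"/>
    <data id="curator_note" type="string" value="checked by hand"
          url="http://example.org/terms/note"/>
  </listOfKeyValueData>
</annotation>
"""


def read_key_value_data(annotation_xml):
    """Return {key: {'value': ..., 'url': ...}} for every data element."""
    root = ET.fromstring(annotation_xml.strip())
    entries = {}
    for item in root.iter("data"):
        entries[item.get("id")] = {
            "value": item.get("value"),
            "url": item.get("url"),  # None when the optional attribute is absent
        }
    return entries


entries = read_key_value_data(EXAMPLE)
print(entries["delta_g"])
```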