On linking fundamental information like taxonomy, genome sequence and phenotypical data in SBML

Problem description

With the memote API it would be quite simple to integrate memote into a pipeline that automatically addresses erroneous test results. There is a plethora of tools available which may help in fixing certain issues (GlobalFit and other functions in the COBRA toolbox come to mind), but with the diversity of metabolic models that might be run through memote, it is difficult to make that choice for our users. As an example, gap-filling eukaryotic metabolic models ought to draw from different resources than gap-filling prokaryotic ones. As for memote transitioning from a passive suite to an active polisher, it could certainly be possible if fundamental information like taxonomy, genome sequence and phenotypical data were more readily available or linked within the SBML format.

An excerpt from our response to the reviewers

I often find myself returning to this point. Should we kick-start or investigate the state of the discussion in the SBML community?

@Midnighter

Yeah, but I do remember having short discussions with Matthias König and vaguely remember him commenting on ways of achieving this already using existing SBML plugins. Perhaps it is worth reviewing these before we kick off a call for updates? +brettolivier@gmail.com and Frank Bergmann would know best where to start, I believe.

@ChristianLieven

I'll try to help out a bit; Frank can jump in if I forget something. SBML has two generic annotation mechanisms. The first uses RDF encoded in XML, where any element (reaction, species, gene) can be annotated. This MIRIAM-compliant mechanism uses controlled qualifiers (http://co.mbine.org/freelinking/standards/qualifiers) that point to a registry of resources (http://co.mbine.org/standards/miriam_uris). For older, but still relevant, information see here: http://sbml.org/Community/Wiki/About_annotations_in_Level_2 So, for example, the taxonomy of an organism (the model), Acidipila rosea, could easily be annotated with the statement model --> hasTaxon --> Acidipila rosea, which in SBML is encoded as:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
<rdf:Description rdf:about="#MyModelElement">
<bqbiol:hasTaxon>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/taxonomy/768535" />
</rdf:Bag>
</bqbiol:hasTaxon>
</rdf:Description>
</rdf:RDF>
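
To illustrate how a tool could read such a statement back, here is a minimal sketch using only the Python standard library. The function names `resources` and `taxonomy_ids` are hypothetical, and the sketch assumes the RDF block above is available as a string; a real pipeline would more likely go through libSBML's annotation handling.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL_NS = "http://biomodels.net/biology-qualifiers/"

# The RDF block from the comment above, kept as a plain string.
EXAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
<rdf:Description rdf:about="#MyModelElement">
<bqbiol:hasTaxon>
<rdf:Bag>
<rdf:li rdf:resource="http://identifiers.org/taxonomy/768535" />
</rdf:Bag>
</bqbiol:hasTaxon>
</rdf:Description>
</rdf:RDF>"""


def resources(rdf_xml, qualifier):
    """Return all rdf:resource URIs listed under one bqbiol qualifier."""
    root = ET.fromstring(rdf_xml)
    uris = []
    for element in root.iter(f"{{{BQBIOL_NS}}}{qualifier}"):
        for item in element.iter(f"{{{RDF_NS}}}li"):
            uri = item.get(f"{{{RDF_NS}}}resource")
            if uri is not None:
                uris.append(uri)
    return uris


def taxonomy_ids(rdf_xml):
    """Return the NCBI taxonomy identifiers found under hasTaxon."""
    # Assumption: only the older path-style identifiers.org URIs are handled.
    prefix = "http://identifiers.org/taxonomy/"
    return [
        uri[len(prefix):]
        for uri in resources(rdf_xml, "hasTaxon")
        if uri.startswith(prefix)
    ]


print(taxonomy_ids(EXAMPLE))
```

The same `resources` helper works for any other biology qualifier, since only the qualifier name changes.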

Similarly, any element (gene, reaction, etc.) can be linked to unique identifiers, independent of what its "sbml:id" or "sbml:name" is. In addition to the above, SBML has an XML "annotation" mechanism which allows the standard to be extended with tool-specific (or otherwise custom) XML-based annotation. This can also be used to encode "non-standard" information in a semi-controlled way. HTH

@bgoli

Thank you very much. +clie@biosustain.dtu.dk to me this looks like another discussion issue, what do you think? Probably something we should tackle for a next release and not for the revision.

@Midnighter

Yes, this is potentially quite a complex implementation issue. A recent paper looked into this stuff: https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bby087/5164345

@bgoli

Very interesting. There's also this pre-print which could be interesting, although manual annotation upon model reconstruction makes so much more sense. https://www.biorxiv.org/content/10.1101/532473v1

@Midnighter

Indeed, but I would suggest that this is a version 2 or version 3 exercise as the whole community will have to decide whether you can "trust" an automated method (vs manual annotation) even if encoding is currently possible.

@bgoli

I've copied this conversation from our response Google Doc during the review phase of memote. I'm curious to hear what the community here could add to this. @draeger and @matthiaskoenig, your seasoned input is much appreciated.

Should those be three separate discussions?

I talked to @matthiaskoenig: annotating a model with taxonomy information can already be done, although it is neither mandatory nor widely practiced, and how it should be done is not well documented. This is an issue we can push by introducing a scored model annotation in memote.
I've also talked to @zakandrewking about links to genome sequences. He mentioned that @draeger has a working first version that is used by the E. coli ME model.
For phenotypical data, I suppose we could look into enforcing something like a COMBINE archive or similar FAIR practices.
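
As a rough idea of what a taxonomy check could look like, here is a minimal sketch in plain Python. It is not memote's actual test API: the function names, the two example documents, and the set of accepted URI prefixes are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
BQBIOL = "{http://biomodels.net/biology-qualifiers/}"
# Assumption: both identifiers.org URI styles should count as valid.
TAXONOMY_PREFIXES = (
    "http://identifiers.org/taxonomy/",
    "https://identifiers.org/taxonomy:",
)


def local_name(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def child(parent, name):
    """Return the first direct child with the given local name, or None."""
    return next((el for el in parent if local_name(el.tag) == name), None)


def model_taxon_uris(sbml_xml):
    """Collect hasTaxon resources from the annotation on the model element."""
    root = ET.fromstring(sbml_xml)
    model = child(root, "model")
    # Only the model's own annotation counts, not those of species or genes.
    annotation = child(model, "annotation") if model is not None else None
    if annotation is None:
        return []
    return [
        li.get(RDF + "resource")
        for has_taxon in annotation.iter(BQBIOL + "hasTaxon")
        for li in has_taxon.iter(RDF + "li")
        if li.get(RDF + "resource")
    ]


def has_taxonomy_annotation(sbml_xml):
    """True if the model carries at least one taxonomy URI."""
    return any(
        uri.startswith(TAXONOMY_PREFIXES) for uri in model_taxon_uris(sbml_xml)
    )


# Two tiny, made-up documents to exercise the check.
ANNOTATED = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model metaid="meta_model" id="example">
    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
        <rdf:Description rdf:about="#meta_model">
          <bqbiol:hasTaxon>
            <rdf:Bag>
              <rdf:li rdf:resource="http://identifiers.org/taxonomy/9606"/>
            </rdf:Bag>
          </bqbiol:hasTaxon>
        </rdf:Description>
      </rdf:RDF>
    </annotation>
  </model>
</sbml>"""

BARE = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="example"/>
</sbml>"""

print(has_taxonomy_annotation(ANNOTATED), has_taxonomy_annotation(BARE))
```

A boolean like this could feed a scored test; whether a missing taxon should lower the score or only raise a warning is a separate design decision.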

The taxon can be readily encoded via a model annotation using the hasTaxon biological model qualifier. Tissues and cell types can also be easily encoded using the bqbiol:is qualifier in combination with tissue ontologies like BTO or OMIT. Below is an example of how I handle such information (in combination with provenance) via an SBML annotation on the model element.

    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:vCard="http://www.w3.org/2001/vcard-rdf/3.0#" xmlns:vCard4="http://www.w3.org/2006/vcard/ns#" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/" xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
        <rdf:Description rdf:about="#meta_caffeine_pkpd_v13">
          <dcterms:creator>
            <rdf:Bag>
              <rdf:li rdf:parseType="Resource">
                <vCard:N rdf:parseType="Resource">
                  <vCard:Family>Koenig</vCard:Family>
                  <vCard:Given>Matthias</vCard:Given>
                </vCard:N>
                <vCard:EMAIL>koenigmx@hu-berlin.de</vCard:EMAIL>
                <vCard:ORG rdf:parseType="Resource">
                  <vCard:Orgname>Humboldt-University Berlin, Institute for Theoretical Biology</vCard:Orgname>
                </vCard:ORG>
              </rdf:li>
            </rdf:Bag>
          </dcterms:creator>
          <dcterms:created rdf:parseType="Resource">
            <dcterms:W3CDTF>2018-04-19T16:05:32Z</dcterms:W3CDTF>
          </dcterms:created>
          <dcterms:modified rdf:parseType="Resource">
            <dcterms:W3CDTF>2018-04-19T16:05:32Z</dcterms:W3CDTF>
          </dcterms:modified>
          <bqbiol:hasTaxon>
            <rdf:Bag>
              <rdf:li rdf:resource="http://identifiers.org/taxonomy/9606"/>
            </rdf:Bag>
          </bqbiol:hasTaxon>
          <bqbiol:is>
            <rdf:Bag>
              <rdf:li rdf:resource="http://identifiers.org/bto/BTO:0001489"/>
              <rdf:li rdf:resource="http://identifiers.org/omit/0003300"/>
            </rdf:Bag>
          </bqbiol:is>
        </rdf:Description>
      </rdf:RDF>
    </annotation>

Sequence information can also be referenced via an annotation (as long as it is located in an external database). There is a BioModels.net qualifier for that (http://co.mbine.org/standards/qualifiers): isEncodedBy, encoder.

The biological entity represented by the model element is encoded, directly or transitively, by the subject of the referenced resource (biological entity B). This relation may be used to express, for example, that a protein is encoded by a specific DNA sequence.

One could combine this with information located in external files and use the rdf:resource to reference it (the rdf:resource is not limited to identifiers.org links but can also reference information in files located alongside the SBML file). All files should be bundled in a COMBINE archive for simple exchange:

          <bqbiol:isEncodedBy>
            <rdf:Bag>
              <rdf:li rdf:resource="./gene_sequence.xml/sequence1234"/>
            </rdf:Bag>
          </bqbiol:isEncodedBy>

If you want to encode the sequence directly in the SBML file, a good solution would be an annotation. In the SBML-fbc-v3 draft we made a proposal for a general-purpose KeyValuePair annotation (working like an advanced Python dictionary), which could work for recurring information such as gene sequences or protein sequences. This would allow for easily parsable key:value data in annotations. The same mechanism could apply to phenotypical data.

As @matthiaskoenig points out, taxonomy is no problem. All models in the BiGG database should make use of that feature.

A more natural way of encoding sequence information is to use the specialized format SBOL (Synthetic Biology Open Language). As @matthiaskoenig mentioned, it is possible to address external files. When we worked with @zakandrewking and the team from UC San Diego on a proposal for how ME models (which require the explicit inclusion of sequence information) can be encoded in SBML, we used a similar approach. The main idea was to ship not only an SBML file but a COMBINE archive comprising SBML, SBOL, and further files as needed (for instance, the model in JSON or MAT format, SBGN-ML files, or SED-ML files for execution).
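
To give an idea of the packaging step, here is a minimal sketch that bundles an SBML file and an SBOL file into a COMBINE archive (a zip file with a `manifest.xml`) using only the Python standard library. The function name `build_archive`, the file names, and the placeholder payloads are hypothetical, and the format URIs follow the identifiers.org `combine.specifications` pattern as I understand it; please check them against the COMBINE archive specification before relying on them.

```python
import io
import zipfile
from xml.sax.saxutils import quoteattr

# Assumption: format URIs follow this identifiers.org pattern.
SPEC = "http://identifiers.org/combine.specifications/"


def build_archive(files):
    """Build a COMBINE archive in memory.

    files maps an archive path to a (format key, payload bytes) pair,
    for example {"model.xml": ("sbml", b"...")}.
    """
    # The archive itself and its manifest are listed alongside the content.
    entries = [(".", "omex"), ("./manifest.xml", "omex-manifest")]
    entries += [(f"./{name}", fmt) for name, (fmt, _) in files.items()]

    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<omexManifest xmlns="{SPEC}omex-manifest">',
    ]
    lines += [
        f"  <content location={quoteattr(loc)} format={quoteattr(SPEC + fmt)}/>"
        for loc, fmt in entries
    ]
    lines.append("</omexManifest>")

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.xml", "\n".join(lines))
        for name, (_, payload) in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Placeholder payloads; real files would hold the model and the sequences.
omex = build_archive({
    "model.xml": ("sbml", b"<sbml/>"),
    "gene_sequence.xml": ("sbol", b"<sequences/>"),
})
print(len(omex) > 0)
```

Writing the returned bytes to a file with the `.omex` extension would give an archive in which relative `rdf:resource` references from the SBML file can point at the bundled sequence file.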

Please have a look at the SBMLme project. There you can also find an example COMBINE archive. For the first version, we suggested a customized annotation, for example:

<annotation>
  <sbmlme:meSpeciesPlugin sequence="http://cobramens.url/sbol/RNA_b0001" genomePosition="2042572" />
</annotation>

so that we could also store the position within the sequence where the relevant information starts. This was a requirement of the ME model (encoded as a JSON file).

opencobra / memote

On linking fundamental information like taxonomy, genome sequence and phenotypical data in SBML #621