Update storage of annotations

matthiaskoenig commented 6 years ago

Hi all, there are currently some issues with how annotations are stored in cobrapy.

[1] MIRIAM qualifiers are dropped RDF annotations are triples of the form (subject - predicate - object), with the object being the database which the subject (element in cobrapy) is annotated to. An crucial part of the annotation is the predicate which defines the relationship between the annotated element and the annotation.

http://co.mbine.org/standards/qualifiers

It makes a big difference (for all algorithms working with annotations) if something has a is relationship or hasPart, or isHomologTo.

So instead of storing things like

{
  'ImaginaryCompDB':'SpecificCompIdentifier',
  "uniprot": ["Q12345", "P12345"]
}

We should also store the predicates (otherwise a round trip will change the annotations).

{
  ''ImaginaryCompDB': ('is', 'SpecificCompIdentifier'),
  "uniprot": [('hasPart', "Q12345"), ('hasPart', "P12345")]
}

[2] Database ids should be identifiers.org collection identifiers https://www.ebi.ac.uk/miriam/main/collections http://identifiers.org/documentation Basically most of the annotations reference a database collections from identifiers.org. The lookup key should be the collection namespace. This is the case for most databases, but is not the case for SBO currently which should be sbo. This is a problem because not all users use the sboTerm for sbo annotations, but the alternative use of a RDF annotation to sbo collections is also valid. It should not matter if one wrote an SBO term or a SBO annotation, these should be treated the same.

Solution: change SBO to sbo key in annotation map.

This would affect code working with annotations.

Midnighter commented 6 years ago

Sounds very reasonable to me. We want to make a push for a cobrapy 1.0 release after this summer. There will be the GSoC projects, you want to work on a better SBML parser, and we will also contribute in a more focused manner. So then it is okay to introduce breaking changes.

cdiener commented 6 years ago

I also think that sounds reasonable. Might also think of a pandas DataFrame for annotations here with columns database, relation and id.

matthiaskoenig commented 5 years ago

There needs to be a decision on how annotations should look as a data structure.

Important points for the annotations are:

easily accessible, i.e. via Object.annotation field on first-class cobra objects, i.e., Model, Gene, Metabolite, Reaction and to come Compartment and Group
encoding of full annotation information, i.e., triples of the form (provider, qualifier, entity). Currently, the qualifiers are not supported yet, but are crucial for annotations and the serialization into RDF. I.e. the predicates must be fully supported.
backwards compatibility (current annotation should be read and easily converted to new format)
reading and writing annotation to SBML (and other formats)

My suggestion is:

cobra annotations will be dictionaries of the form:
    object.annotation = {
        'provider' : [(qualifier, entity), ...]
    }
A concrete example for a metabolite would look like the following
    metabolite.annotation = {
        'chebi': [(isVersionOf, "CHEBI:17234), (is, "CHEBI:4167),],
        'kegg.compound': [(is, "C00031")]
    }
The providers are hereby MIRIAM registry keys for collections
https://www.ebi.ac.uk/miriam/main/collections
The qualifiers are biomodel qualifiers
https://co.mbine.org/standards/qualifiers

The allowed qualifier types are hereby (ignore the libsbml part, this is just for serialization)

QUALIFIER_TYPES = {
     "is": libsbml.BQB_IS,
     "hasPart": libsbml.BQB_HAS_PART,
     "isPartOf": libsbml.BQB_IS_PART_OF,
     "isVersionOf": libsbml.BQB_IS_VERSION_OF,
     "hasVersion": libsbml.BQB_HAS_VERSION,
     "isHomologTo": libsbml.BQB_IS_HOMOLOG_TO,
     "isDescribedBy": libsbml.BQB_IS_DESCRIBED_BY,
     "isEncodedBy": libsbml.BQB_IS_ENCODED_BY,
     "encodes": libsbml.BQB_ENCODES,
     "occursIn": libsbml.BQB_OCCURS_IN,
     "hasProperty": libsbml.BQB_HAS_PROPERTY,
     "isPropertyOf": libsbml.BQB_IS_PROPERTY_OF,
     "hasTaxon": libsbml.BQB_HAS_TAXON,
     "unknown": libsbml.BQB_UNKNOWN,
     "bqm_is": libsbml.BQM_IS,
     "bqm_isDescribedBy": libsbml.BQM_IS_DESCRIBED_BY,
     "bqm_isDerivedFrom": libsbml.BQM_IS_DERIVED_FROM,
     "bqm_isInstanceOf": libsbml.BQM_IS_INSTANCE_OF,
     "bqm_hasInstance": libsbml.BQM_HAS_INSTANCE,
     "bqm_unknown": libsbml.BQM_UNKNOWN,
}

All existing annotations without qualifier/predication are assumed to have a is relationship, because only this subset was read from SBML so far (and BiGG and others only write is annotations (even if these are not completely correct).

This is very easy to implement on the SBML read and write part, but in addition probably other parsers have to be updated (json, mat, ?) and especially the tests have to be updated (because things like writing old annotations to SBML and reading them will fail tests due to the changed format).

matthiaskoenig commented 5 years ago

Hi all, @Midnighter @cdiener @ChristianLieven We need a decision on this as soon as possible. This is blocking all remaining SBML issues in cobrapy and blocks important tests in memote (many tests are based on annotations, without good and correct representation such tests can not be performed correctly). There has to be an agreement/decision to move forward. Best M

Hemant27031999 commented 4 years ago

Hi all, @matthiaskoenig @draeger @Midnighter @cdiener @ChristianLieven and other members of cobrapy. As part of one of the possible GSoC 2020 project "Implementation of SBML-JSON converter and SBML-JSON scheme" upon which I am working, one of the important tasks is to add the complete support of meta-information (annotation and notes) to SBML and then further to JSON representation of SBML models in cobrapy. This thread seems best to discuss the issues which exist with the current representation of annotation and also the other issues which are difficult to handle by the above-proposed format for annotation by @matthiaskoenig. Please have a look at it as this issue is existing for a long time and it is blocking other SBML issues in cobrapy and some tests also (marked as xfail).

First of all, comes the biological relational qualifiers which define the kind of relationship between annotated component and annotation resources. This has already been discussed in the above thread and the proposed format for handling it is as follows:
```
object.annotation = {
    'provider' : [(qualifier, entity), ...]
}
```
Now, though it handles the biological qualifier, it makes the storing of the following annotation format very complex.

Consider the below example, given in the SBML L3V2 specification doc (page 110), where the specie presents two alternative annotations of the human hemoglobin, the first with ChEBI heme and the second with KEGG heme

<species id="heme" metaid="heme" compartment="C"
hasOnlySubstanceUnits="false" boundaryCondition="false"
constant="false">
<annotation>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
>
  <rdf:Description rdf:about="#heme">
    <bqbiol:hasPart>
      <rdf:Bag>
        <rdf:li rdf:resource="https://identifiers.org/uniprot/P69905"/>
        <rdf:li rdf:resource="https://identifiers.org/uniprot/P68871"/>
        <rdf:li rdf:resource="https://identifiers.org/chebi/CHEBI:17627"/>
      </rdf:Bag>
    </bqbiol:hasPart>
    <bqbiol:hasPart>
      <rdf:Bag>
        <rdf:li rdf:resource="https://identifiers.org/uniprot/P69905"/>
        <rdf:li rdf:resource="https://identifiers.org/uniprot/P68871"/>
        <rdf:li rdf:resource="https://identifiers.org/kegg.compound/C00032"/>
      </rdf:Bag>
    </bqbiol:hasPart>
  </rdf:Description>
</rdf:RDF>
</annotation>
</species>

Now if we parse it in the above format, it will look as follows:

specie.annotation = {
    'uniport' : [(hasPart, P69905), (hasPart, P68871), (hasPart, P69905), (hasPart, P68871)],
    'chebi' : [(hasPart, 17627)],
    'kegg.compound' : [(hasPart, C00032)]
}

The "uniport" provider contains two same pairs of resources, and now there is no way to divide these alternative sets of annotations to get back the original format. All the resources are listed corresponding to the "provider" name and do not seem to have any relationship with each other, and hence hard to distinguish.

Moving to the next issue i.e., when we have nested annotations for storing the evidence score corresponding to the resource provided. Consider the following example, again from doc (pg 111):

<species id="S1" metaid="_000004" compartment="lysosome"
hasOnlySubstanceUnits="false" boundaryCondition="false"
constant="false">
<annotation>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
xmlns:bqmodel="http://biomodels.net/model-qualifiers/">
  <rdf:Description rdf:about="#_000004">
    <bqbiol:occursIn>
      <rdf:Bag>
        <rdf:li rdf:resource="https://identifiers.org/go/GO:0005764"/>
          <bqbiol:isDescribedBy>
            <rdf:Bag>
              <rdf:li rdf:resource="https://identifiers.org/pubmed/1111111"/>
            </rdf:Bag>
          </bqbiol:isDescribedBy>
          <bqbiol:isDescribedBy>
            <rdf:Bag>
              <rdf:li rdf:resource="https://identifiers.org/eco/ECO:0000004"/>
            </rdf:Bag>
          </bqbiol:isDescribedBy>
      </rdf:Bag>
    </bqbiol:occursIn>
  </rdf:Description>
</rdf:RDF>
</annotation>
</species>

So here also, the parsing is not so easy to above format because the nested annotation have to be accommodated.

One more issue that exists is not storing the prefix part of the resource URI. The current implementation only handles those annotations resources which have "http://identifiers.org/" as the prefix name. Now though, "http://identifiers.org/" is a majorly used and most recommended resolver, but, if one provides a direct URL of the resource or a valid MIRIAM URI, support for handling them should also be enabled. For eg. "https://www.kegg.jp/entry/C00001" is a valid MIRIAM URI (also a URL) having the structure of the form "resource: identifier" but it will not get parsed to the cobra model because cobrapy parse only those resources which have "http://identifiers.org/" as the prefix name. It is hence important to handle this case also, otherwise, the information will be lost. One simple way to handle this is by storing the "resolver+provider" part as one single string instead of only storing the "provider". It won't require many changes and can be made even backward compatible by just adding prefix "http://identifiers.org/" to all those providers which do not have one.

So, according to me, simple dictionaries for handling the annotation with providers as key and array of pairs of qualifier and identifier as value are insufficient for handling these complex annotation structures. We need proper dedicated classes to handle the annotation and special methods to verify its internal content. Talking about the JSON representation, I would go with the thought of releasing a version 2 for JSON representation of SBML models. It will have support for handling the model history and the notes part also (along with the incorporation of additional SBML packages features like groups and fbc-v3) and a new representation of the annotation part. I really can't figure out why "providers" are used as keys instead of biological qualifiers (which is a better option because than the tree structure would be the same as that of SBML). With qualifiers as keys, it will be very easy to handle all the above issues listed and also to incorporate any further addons to annotations in the future. Its structure would be somewhat like follows:

component.annotation = {
  "qualifier" : [
    {
      "resources" : [
          ("resolver + provider", "identifier"), 
           ... 
        ],
        [NESTED_CONTENT]
    },
    ...
  ],
  ...
}

The string ‘ ... ’ is a placeholder for zero or more elements of the same form as the immediately preceding element and [NESTED_CONTENT] represents the nested annotation and has the same format and restrictions as the "qualifier". So, suppose we have the following SBML annotation that is to be represented in JSON:

<species id="heme" metaid="heme" compartment="C"
hasOnlySubstanceUnits="false" boundaryCondition="false"
constant="false">
  <annotation>
    <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bqbiol="http://biomodels.net/biology-qualifiers/"
    >
      <rdf:Description rdf:about="#heme">
        <bqbiol:hasPart>
          <rdf:Bag>
            <rdf:li rdf:resource="https://identifiers.org/uniprot/P69905"/>
            <rdf:li rdf:resource="https://www.uniprot.org/uniprot/P68871"/>
            <rdf:li rdf:resource="https://identifiers.org/chebi/CHEBI:17627"/>
              <bqbiol:isDescribedBy>
                <rdf:Bag>
                  <rdf:li rdf:resource="https://identifiers.org/pubmed/1111111"/>
                </rdf:Bag>
              </bqbiol:isDescribedBy>
              <bqbiol:isDescribedBy>
                <rdf:Bag>
                  <rdf:li rdf:resource="https://identifiers.org/eco/ECO:0000004"/>
                </rdf:Bag>
              </bqbiol:isDescribedBy>
          </rdf:Bag>
        </bqbiol:hasPart>
        <bqbiol:hasPart>
          <rdf:Bag>
            <rdf:li rdf:resource="https://identifiers.org/uniprot/P69905"/>
            <rdf:li rdf:resource="https://identifiers.org/uniprot/P68871"/>
            <rdf:li rdf:resource="https://identifiers.org/kegg.compound/C00032"/>
          </rdf:Bag>
        </bqbiol:hasPart>
        <bqbiol:hasVersion>
          <rdf:Bag>
            <rdf:li rdf:resource="https://identifiers.org/uniprot/Q9UQM7"/>
            <rdf:li rdf:resource="https://identifiers.org/uniprot/Q13554"/>
          </rdf:Bag>
        </bqbiol:hasVersion>
      </rdf:Description>
    </rdf:RDF>
  </annotation>
</species>

Its JSON representation would be as follows:

annotation = {
  "hasPart" : [
    {
      "resources" : [
          ["https://identifiers.org/uniprot", "P69905"],
          ["https://www.uniprot.org/uniprot", "P68871"],
          ["https://identifiers.org/chebi", "CHEBI:17627"],
        ],
        "isDescribedBy" : [
          {
            "resouces" : [
              ["https://identifiers.org/pubmed", "1111111"]
            ]
          },
          {
            "resouces" : [
              ["https://identifiers.org/eco", "000000"]
            ]
          }
        ]
    },
    {
      "resources" : [
          ["https://identifiers.org/uniprot", "P69905"],
          ["https://identifiers.org/uniprot", "P68871"],
          ["https://identifiers.org/kegg.compound", "C00032"],
        ]
    },
  ],
  "hasVersion" : [
    {
      "resources" : [
          ["https://identifiers.org/uniprot", "Q9UQM7"],
          ["https://identifiers.org/uniprot", "Q13554"],
      ]
}

Now though this structure will be completely different from the existing format, it will be the most general format handling all the existing issues and can also incorporate any future add-ons. To add support for handling the already existing JSON-SBML models, what best we can do is to enable the reading of both formats from JSON but write only in the above format. Default values for qualifiers (i.e "is") and resolver (i.e "http://identifiers.org/") will be used to parse the current JSON format (version1 ) to the above format (version 2). Members, please give your valuable suggestions and your views regarding the above subject. I tried to find a way to handle the above issues in the currently existing format, but could not find one, so I came up with this new more general format.

draeger commented 4 years ago

Thanks for working this out, @Hemant27031999!

Question: Would splitting the identifier from the rest of the URI give much of an advantage? Otherwise, one could directly add the URIs in a list. The data structure could then be slightly simpler and the API could provide direct access to the ID part by splitting the URI when needed.

Hemant27031999 commented 4 years ago

I was also thinking about it. Separating the identifier from the main URI is just bringing extra overhead and I really can't figure out what is the purpose of doing so? I Did it because, in the current version, they are separated first and then put into the JSON. I thought that maybe, the members wanted to be more verbose about the provider and identifier. If there is no special purpose of separating them, I would go with the idea of putting the complete URI as a single string instead of separating them and then putting them in an array. It will be a lot simpler and straight forward in this way.

Midnighter commented 4 years ago

First of all, thank you for a well worked-out proposal. This looks very promising!

Separating the identifier from the main URI is just bringing extra overhead and I really can't figure out what is the purpose of doing so?

To me, this has always been about mapping data. A very typical format for metabolite concentrations, gene expression, or protein abundances is a simple table where one column is an identifier in a particular namespace and a second column contains values with a particular unit. I want it to be as easy as possible to decide which components of the model are affected by a given data set.

@Hemant27031999 can you provide a link to your actual proposal as well, please? I do have some thoughts on a software design for handling input/output. I would like to see what you have planned and discuss it with you and @matthiaskoenig.

Hemant27031999 commented 4 years ago

Sure @Midnighter, here is the link to the google doc: https://docs.google.com/document/d/1f0PknoPUpcWRHS8jljEOX5439us4ZD9qjNlbbfGk8Y8/edit?usp=sharing

cdiener commented 4 years ago

I think the separation of provider and identifier is historically just there to push it into a key-value format. The key-value format was there since SBML is just one of the supported file formats and JSON and YAML do not impose as many restrictions on annotations as SBML. Many other libraries (BIGG, ME models, MICOM soon) use the JSON format because it is faster but also because it allows to add custom annotations to the cobra objects and this is usually done in a key-value store way. So I would keep that in mind and try to find a schema that is compatible with all file formats and not only SBML. The nested format definitely looks complicated and I am not sure what to do about that. The scheme suggested by @Hemant27031999 would probably work since it would basically be a dictionary but I would not limit it to the bqbiol tags.

draeger commented 4 years ago

@cdiener: Just a very small comment on the side; custom annotations have been part of SBML since its inception. Here is an example from the SBML Level 1 Version 1 specification from March 2nd 2001 showing how to implement arbitrarily complex key-value pairs and more:

...
<annotations xmlns:mysim="http://www.mysim.org/ns">
 <mysim:nodecolors mysim:bgcolor="green" mysim:fgcolor="white"/>
 <mysim:timestamp>2000-12-18 18:31 PST</mysim:timestamp>
</annotations>
...

(Note: SBML Level 1 is, of course, outdated and should not be used anymore, but this feature, custom annotations, has been part of every SBML release.)

Now, with the next version of the FBC package, key-value pairs will become even more straightforward to use.

But @Hemant27031999's project aims at developing a lossless representation of SBML in COBRApy-compliant JSON so that in the end, it won't matter anymore if the serialization goes to XML or JSON as long as all required data structures are adequately represented.

cdiener commented 4 years ago

@draeger Oh, that is very cool. Could you point me to a doc link how to do that in SBML 3?

Hemant27031999 commented 4 years ago

@cdiener, regarding the key-value storing way, we are going to have that also in the JSON format as this is one of the newly introduced feature in fbc-v3 package, the support of which is to be added in the JSON as part of milestone 1.1 discussed in the proposal. What I have discussed in this thread is only that part of annotation which refers to the external resources (basically the CVTerm class of libSBML).

The scheme suggested by @Hemant27031999 would probably work since it would basically be a dictionary but I would not limit it to the bqbiol tags.

I took the reference of SBML L3V2 specification doc for this part which only says that the nested content has the same format and restrictions as RELATION_ELEMENT (page 101).

draeger commented 4 years ago

Sure. In SBML Level 3 the element is called <annotation> (singular), but the principle is still identical. You need to define a namespace for your tool, here mysim is the tool and then some arbitrary URL-like identifier, here http://www.mysim.org/ns. And then you can write any key-value pair or other attributes into the annotation that you need.

Here is an example from SBML Level 3 Version 2 Release 2, page 17, lines 9-12 that corresponds to the one above:

<annotation>
  <mysim:molecule xmlns:mysim="http://www.mysim.org/ns" mysim:weight="18.02" mysim:atomCount="3"/>
</annotation>

cdiener commented 4 years ago

Thank you! Will play around with that for MICOM a bit...

draeger commented 4 years ago

@cdiener and @Hemant27031999: It is always the preferred way to use a data structure that is defined in the specification directly. The example for using custom annotations is only recommended to use in two fundamental cases:

You intend to express information that really can't go anywhere else at this time.
You want to try out how a future SBML extension package could store certain pieces of information.

In the any case, one should write some sort of a documentation what exactly goes there. Using the new key-value pair data structure from the upcoming FBC Version 3 package is going to be the preferred way of storing that kind of information.

danolson1 commented 3 years ago

Is there any description of best practices regarding annotations in the cobrapy documentation? I looked and couldn't find any.

ChristianLieven commented 3 years ago

@danolson1 not in the cobrapy documentation as far as I'm aware (it's been a while that I've read it comprehensively) but Memote has an entire module dedicated to best-practise annotations.

We don't mention it in-depth in the publication but the source code gives you some indication: https://github.com/opencobra/memote/blob/develop/src/memote/support/annotation.py

To summarize:

Annotations are dictionaries with
MIRIAM-styled database prefix as keys and
LUI-conform patterns for each identifier as the values.

MIRIAM => http://www.ebi.ac.uk/miriam/ now called https://identifiers.org/

Hope that helps!

cdiener commented 3 years ago

Yep exactly like @ChristianLieven said. MEMOTE is awesome for checking if your model is well annotated and will tell you what should be added as well.

dtmvandenberg commented 2 years ago

I have a related problem with the SBML file output in cobrapy 0.25.0 which ensures that MEMOTE does not recognize the annotations which are stored in the SBML file as such:

species id="M_atp_c" name="ATP" compartment="C_c" hasOnlySubstanceUnits="false" boundaryCondition="false" constant="false" fbc:charge="-4" fbc:chemicalFormula="C10H12N5O13P3">

    <notes>
      <html xmlns="http://www.w3.org/1999/xhtml">
        <p>BioCyc: META:ATP</p>
        <p>BioPath Molecule: Adenosine-5-prime-triphosphate</p>
        <p>FORMULA: C10H12N5O13P3</p>
        <p>Human Metabolome Database: HMDB00538</p>
        <p>KEGG Compound: C00002;D08646</p>
        <p>MetaNetX (MNX) Chemical: MNXM3</p>
        <p>Reactome: 211579;389573</p>
        <p>SEED Compound: cpd00002</p>
        <p>UniPathway Compound: UPC00002</p>
      </html>
    </notes>
  </species><

Now when loading this model all the annotations are in the notes of the metabolites, so I ran the following code: ` for m in model.metabolites: m.annotation.update(m.notes) m.notes = {}

cobra.io.write_sbml_model(model, 'test_annotation.xml')`

And ran a snapshot with memote and it still does not recognize it and I found out why:

< There is a / instead of a : after each database entry and the file (which was generated with CarveMe) has some misnomers for the database identifiers. I can fix the database identifiers but not the way that cobrapy converts these key:value pairs to key/value in the SBML file.

dtmvandenberg commented 2 years ago

I figured out why it converts the key:value pairs to key/value. My values are strings and not lists. Why is this a thing? And if it is necessary, why is this not in the documentation? And while the documentation is updated (either in cobrapy or memote) maybe also include a link to https://registry.identifiers.org/registry so people use the proper MIRIAM code for their database of interest?

cdiener commented 2 years ago

@dtmvandenberg I think this would have been better in a discussion since it isn't really related to the issue which discusses the new annotation interface...

Both forms are valid on identifiers.org, .../provider:id or .../provider/id, the latter allowing IDs with colons in them. See for instance https://identifiers.org/BioCyc/META:ATP which resolves just fine. It's not the in the docs because the SBML writer takes care of it for you. As long as your provider-id pair is MIRIAM-compliant it will create a working URL.

The reason MIRIAM was not mentioned explicitly is that we wanted to support all kinds of annotations, not just MIRIAM-compliant ones, but honestly that did not work very well.

The new interface will change annotations quite a bit and will clearly separate standardized from custom annotations. I agree that the docs for the new annotations should explain what MIRIAM-compliant annotations are and link to the registry.

SBML is quite adamant that annotations should never be stored in the notes. Which software/tool created the notes like this? I thought CARVEME propagates annotations from BIGG which stores them in the right format...

dtmvandenberg commented 2 years ago

@cdiener Thanks for the quick and clear reply. My apologies for posting it in the wrong place, I thought the suggestions in my comments were most suited to this topic. I did use Carveme 1.50 to generate the original model and did not change anything about the annotations yet. It was also only modified in cobrapy. I checked the original model and see the same problem.

cdiener commented 2 years ago

No problem, this was just so it does not get ignored. That is interesting with the notes in CARVEME. Maybe that is for compatibility with reframed...

opencobra / cobrapy

Update storage of annotations #684