identifiers-org / identifiers-org.github.io

Inconsistency between how {$id} and the ID pattern are related in the case of CHEBI #127

Closed JervenBolleman closed 4 years ago

JervenBolleman commented 4 years ago

CHEBI has the ID pattern CHEBI:\d+, but all the IRI patterns end with CHEBI:{$id}. For consistency that should not be the case, because it yields IRIs of the form CHEBI:CHEBI:{$id}, which of course don't work.

This makes the API results unusable :(

JervenBolleman commented 4 years ago

This is a general problem and affects other records as well, such as GO.

mbdebian commented 4 years ago

Thanks for pointing this out to us @JervenBolleman

I just wanted to let you know that we are looking into it.

Kind regards, Manuel

mbdebian commented 4 years ago

I've looked at the details of this issue.

As you mention, for both the CHEBI and GO prefixes the compact identifier would strictly have to look like CHEBI:CHEBI:{$id} or GO:GO:{$id}. These are special cases, though: the LUI (CHEBI:{$id} or GO:{$id}) already looks like a compact identifier, and the corresponding communities didn't like how it would end up looking in the identifiers.org registry, e.g. chebi:CHEBI:{$id}. So we decided to add a beautifying feature that covers this corner case, without making it the mainstream behaviour.

We call these cases "namespace embedded in LUI", and we allow for their LUIs to hit the resolution services straight away, instead of using their compact identifier version.

This information is made available to our users through our resolution API, e.g. for CHEBI:36927

$ curl -i https://resolver.api.identifiers.org/CHEBI:36927
HTTP/2 200 
content-type: application/json;charset=UTF-8
date: Thu, 15 Oct 2020 10:44:38 GMT
via: 1.1 google
alt-svc: clear

{
    "apiVersion": "1.0",
    "errorMessage": null,
    "payload": {
        "resolvedResources": [
            {
                "id": 3,
                "mirId": "MIR:00100009",
                "providerCode": "ebi",
                "compactIdentifierResolvedUrl": "https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:36927",
                "description": "ChEBI (Chemical Entities of Biological Interest)",
                "institution": {
                    "id": 2,
                    "name": "European Bioinformatics Institute",
                    "homeUrl": "https://www.ebi.ac.uk",
                    "description": "At EMBL-EBI, we make the world’s public biological data freely available to the scientific community via a range of services and tools, perform basic research and provide professional training in bioinformatics. \nWe are part of the European Molecular Biology Laboratory (EMBL), an international, innovative and interdisciplinary research organisation funded by 26 member states and two associate member states.",
                    "location": {
                        "countryCode": "GB",
                        "countryName": "United Kingdom"
                    }
                },
                "location": {
                    "countryCode": "GB",
                    "countryName": "United Kingdom"
                },
                "official": true,
                "resourceHomeUrl": "https://www.ebi.ac.uk/chebi/",
                "recommendation": {
                    "recommendationIndex": 100,
                    "recommendationExplanation": "Function based recommendation"
                },
                "namespacePrefix": "chebi",
                "deprecatedNamespace": false,
                "namespaceDeprecationDate": null,
                "deprecatedResource": false,
                "resourceDeprecationDate": null
            },
            {
                "id": 4,
                "mirId": "MIR:00100158",
                "providerCode": "ols",
                "compactIdentifierResolvedUrl": "https://www.ebi.ac.uk/ols/ontologies/chebi/terms?obo_id=CHEBI:36927",
                "description": "ChEBI through OLS",
                "institution": {
                    "id": 2,
                    "name": "European Bioinformatics Institute",
                    "homeUrl": "https://www.ebi.ac.uk",
                    "description": "At EMBL-EBI, we make the world’s public biological data freely available to the scientific community via a range of services and tools, perform basic research and provide professional training in bioinformatics. \nWe are part of the European Molecular Biology Laboratory (EMBL), an international, innovative and interdisciplinary research organisation funded by 26 member states and two associate member states.",
                    "location": {
                        "countryCode": "GB",
                        "countryName": "United Kingdom"
                    }
                },
                "location": {
                    "countryCode": "GB",
                    "countryName": "United Kingdom"
                },
                "official": false,
                "resourceHomeUrl": "https://www.ebi.ac.uk/ols/ontologies/chebi",
                "recommendation": {
                    "recommendationIndex": 40,
                    "recommendationExplanation": "Function based recommendation"
                },
                "namespacePrefix": "chebi",
                "deprecatedNamespace": false,
                "namespaceDeprecationDate": null,
                "deprecatedResource": false,
                "resourceDeprecationDate": null
            },
            {
                "id": 6,
                "mirId": "MIR:00100565",
                "providerCode": "bptl",
                "compactIdentifierResolvedUrl": "http://purl.bioontology.org/ontology/CHEBI/CHEBI:36927",
                "description": "ChEBI through BioPortal",
                "institution": {
                    "id": 5,
                    "name": "National Center for Biomedical Ontology, Stanford",
                    "homeUrl": "CURATOR_REVIEW",
                    "description": "CURATOR_REVIEW",
                    "location": {
                        "countryCode": "US",
                        "countryName": "United States"
                    }
                },
                "location": {
                    "countryCode": "US",
                    "countryName": "United States"
                },
                "official": false,
                "resourceHomeUrl": "http://bioportal.bioontology.org/ontologies/CHEBI",
                "recommendation": {
                    "recommendationIndex": 40,
                    "recommendationExplanation": "Function based recommendation"
                },
                "namespacePrefix": "chebi",
                "deprecatedNamespace": false,
                "namespaceDeprecationDate": null,
                "deprecatedResource": false,
                "resourceDeprecationDate": null
            }
        ],
        "parsedCompactIdentifier": {
            "providerCode": null,
            "namespace": "chebi",
            "localId": "CHEBI:36927",
            "rawRequest": "CHEBI:36927",
            "namespaceEmbeddedInLui": true,
            "deprecatedNamespace": false,
            "namespaceDeprecationDate": null
        }
    }
}

At the bottom of the response JSON payload, there is a section

{
    "parsedCompactIdentifier": {
        "providerCode": null,
        "namespace": "chebi",
        "localId": "CHEBI:36927",
        "rawRequest": "CHEBI:36927",
        "namespaceEmbeddedInLui": true,
        "deprecatedNamespace": false,
        "namespaceDeprecationDate": null
    }
}

that contains a flag called namespaceEmbeddedInLui. It indicates that, for this namespace, the resolution API takes LUIs rather than compact identifiers.
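As a minimal sketch of how a client might read that flag, the Python below parses a trimmed copy of the response shown above. The function name namespace_embedded_in_lui is hypothetical, the resolvedResources list is emptied here only to keep the sample short, and a real client would fetch the body over HTTP instead of using an embedded string.

```python
import json

# Trimmed copy of the resolver response shown above. "resolvedResources" is
# emptied here for brevity; the real response lists the resolved resources.
SAMPLE_RESPONSE = """
{
    "apiVersion": "1.0",
    "errorMessage": null,
    "payload": {
        "resolvedResources": [],
        "parsedCompactIdentifier": {
            "providerCode": null,
            "namespace": "chebi",
            "localId": "CHEBI:36927",
            "rawRequest": "CHEBI:36927",
            "namespaceEmbeddedInLui": true,
            "deprecatedNamespace": false,
            "namespaceDeprecationDate": null
        }
    }
}
"""


def namespace_embedded_in_lui(response_text: str) -> bool:
    """Return the namespaceEmbeddedInLui flag from a resolver response body.

    Hypothetical helper, not part of any identifiers.org client library.
    """
    body = json.loads(response_text)
    parsed = body["payload"]["parsedCompactIdentifier"]
    return bool(parsed["namespaceEmbeddedInLui"])


if __name__ == "__main__":
    print(namespace_embedded_in_lui(SAMPLE_RESPONSE))
```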

Our registry API also offers this information, e.g. for the CHEBI namespace, which has the MIR ID MIR:00000002

$ curl -i 'https://registry.api.identifiers.org/restApi/namespaces/search/findByMirId?mirId=MIR:00000002'

HTTP/2 200 
last-modified: Tue, 11 Jun 2019 14:15:26 GMT
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
cache-control: no-cache, no-store, max-age=0, must-revalidate
pragma: no-cache
expires: 0
strict-transport-security: max-age=31536000 ; includeSubDomains
x-frame-options: DENY
content-type: application/hal+json;charset=UTF-8
date: Thu, 15 Oct 2020 10:52:48 GMT
via: 1.1 google
alt-svc: clear

{
  "prefix" : "chebi",
  "mirId" : "MIR:00000002",
  "name" : "ChEBI",
  "pattern" : "^CHEBI:\\d+$",
  "description" : "Chemical Entities of Biological Interest (ChEBI) is a freely available dictionary of molecular entities focused on 'small' chemical compounds.",
  "created" : "2019-06-11T14:15:26.925+0000",
  "modified" : "2019-06-11T14:15:26.925+0000",
  "deprecated" : false,
  "deprecationDate" : null,
  "sampleId" : "36927",
  "namespaceEmbeddedInLui" : true,
  "_links" : {
    "self" : {
      "href" : "https://registry.api.identifiers.org/restApi/namespaces/1"
    },
    "namespace" : {
      "href" : "https://registry.api.identifiers.org/restApi/namespaces/1"
    },
    "contactPerson" : {
      "href" : "https://registry.api.identifiers.org/restApi/namespaces/1/contactPerson"
    }
  }
}

Although the resolver is case insensitive for prefixes, i.e. chebi:36927 would be recognised as belonging to the CHEBI namespace, for these special namespaces we validate the LUI, not the compact identifier. Thus chebi:36927 would not pass the validation stage, because chebi does not match the ID pattern (where it is uppercase), and the request would result in an error. Similarly, chebi:CHEBI:36927 would be recognised as belonging to the chebi namespace, but since this namespace is a special case, the whole string, chebi:CHEBI:36927, is treated as the LUI and validated against the registered regular expression, so it fails too.
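The validation behaviour described above can be sketched roughly as follows. This is an illustration only, not the resolver's actual code: the Namespace class, the REGISTRY dictionary and the validate function are hypothetical names, the chebi entry copies the pattern and flag from the registry response above, and the demo entry is a made-up namespace added to show the ordinary (non-embedded) path.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    prefix: str
    pattern: str
    namespace_embedded_in_lui: bool


# "chebi" uses the pattern and flag from the registry response above.
# "demo" is a hypothetical namespace, included only for contrast.
REGISTRY = {
    "chebi": Namespace("chebi", r"^CHEBI:\d+$", True),
    "demo": Namespace("demo", r"^\d+$", False),
}


def validate(raw_request: str) -> bool:
    """Return True if the request would pass LUI validation in this sketch."""
    head, sep, tail = raw_request.partition(":")
    # Prefix lookup is case insensitive.
    namespace = REGISTRY.get(head.lower())
    if namespace is None or not sep:
        return False
    # For "namespace embedded in LUI" cases the whole string is the LUI;
    # otherwise the LUI is whatever follows the prefix.
    lui = raw_request if namespace.namespace_embedded_in_lui else tail
    return re.fullmatch(namespace.pattern, lui) is not None


if __name__ == "__main__":
    for request in ("CHEBI:36927", "chebi:36927", "chebi:CHEBI:36927", "DEMO:123"):
        print(request, validate(request))
```

In this sketch only CHEBI:36927 and the hypothetical DEMO:123 pass; the two lowercase chebi requests fail because the whole string is matched against the case-sensitive pattern.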

When it comes to building a compact identifier, given a prefix and a LUI, we use the information from the registry, especially the prefix, pattern and namespaceEmbeddedInLui attributes:

If you build a sample compact identifier for a given prefix using the sample ID from the registry, you'll notice that the registry holds a partial LUI that lacks the part matching the prefix. This is an internal representation trick: the existing logic was adapted to this special case, which arose later in the life of the resolution services, when the registry service had already been running for a while.

In this case, to generate a sample compact identifier and sample URL for the namespace in the Web UI, identifiers.org takes whatever sits between '^' and ':' in the regular expression that defines the LUI pattern. It is taken verbatim because LUIs are case sensitive, e.g. VariO:0294.
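A rough sketch of that derivation, using the prefix, pattern, sampleId and namespaceEmbeddedInLui attributes from the registry, might look like the code below. The function names are hypothetical and this is not the Web UI's actual code. The CHEBI values come from the registry response above; the VariO pattern in the usage lines is an assumption made for illustration.

```python
import re


def embedded_prefix(pattern: str) -> str:
    """Return the text between '^' and the first ':' of the pattern, verbatim."""
    match = re.match(r"\^([^:]+):", pattern)
    if match is None:
        raise ValueError(f"no embedded prefix found in pattern {pattern!r}")
    return match.group(1)


def sample_compact_identifier(
    prefix: str, pattern: str, sample_id: str, namespace_embedded_in_lui: bool
) -> str:
    """Build a sample compact identifier from registry attributes (sketch)."""
    if namespace_embedded_in_lui:
        # The registry stores a partial LUI, so the case-sensitive prefix
        # part is recovered from the pattern rather than from "prefix".
        return f"{embedded_prefix(pattern)}:{sample_id}"
    return f"{prefix}:{sample_id}"


if __name__ == "__main__":
    # Values from the registry response for MIR:00000002.
    print(sample_compact_identifier("chebi", r"^CHEBI:\d+$", "36927", True))
    # Assumed pattern for VariO, for illustration only.
    print(sample_compact_identifier("vario", r"^VariO:\d+$", "0294", True))
```

Reading the prefix from the pattern instead of upper-casing the registry prefix is what keeps mixed-case identifiers such as VariO intact.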

These sample URLs and compact identifiers are not offered through the REST API, because this API is about navigating through the registry's own data model.

On the other hand, we think it could probably be useful for the community to have a few new endpoints in identifiers.org for:

I hope this information helps clarify what's behind the scenes for the subset of special namespaces in identifiers.org such as CHEBI, VariO, GO and, in general, ontologies.

Please don't hesitate to let us know if we can be of more help or if you have any additional questions.

Kind regards, Manuel