Proposal for enhancement to "id_prefixes"

hsolbrig commented 3 years ago

1) id_prefixes currently says "the identifier of this class or slot must begin with one of the URIs referenced by this prefix".

This sort of implies that a prefix can reference more than one URI. I'm hoping that we are dealing with a model where every prefix maps to exactly one URI (note, however, that the reverse may not necessarily be true... I need to check whether we guarantee uniqueness of URIs per prefix).
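The reverse direction can be checked mechanically. A minimal sketch, assuming the prefix map is a plain Python dict (which already gives one URI per prefix); the function name `uris_with_multiple_prefixes` and the example.org entries are hypothetical and are not part of LinkML:

```python
from collections import defaultdict


def uris_with_multiple_prefixes(prefix_map):
    """Return {uri: sorted prefixes} for every URI bound to more than one prefix."""
    by_uri = defaultdict(list)
    for prefix, uri in prefix_map.items():
        by_uri[uri].append(prefix)
    return {uri: sorted(names) for uri, names in by_uri.items() if len(names) > 1}


# Hypothetical prefix map: a dict cannot hold two URIs for one prefix,
# but nothing stops two prefixes from sharing one URI.
prefix_map = {
    "EX": "http://example.org/ex/",
    "EXAMPLE": "http://example.org/ex/",
    "OTHER": "http://example.org/other/",
}
duplicates = uris_with_multiple_prefixes(prefix_map)
```

An empty result would mean the map is one-to-one in both directions.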

2) when it comes to actually validating data, I would think that the following:

classes:
    HighClass:
        id_prefixes:
            - NCIt
            - SCT

Would assert that a YAML or JSON representation of the id of an instance of HighClass would necessarily start with "NCIt:" or "SCT:", while an RDF instance would start with https://nci.....org/ncit/... or http://snomed.org/id/.
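The list form could be checked roughly as follows. This is only a sketch of the reading described above, not existing loader behaviour; `has_allowed_prefix` is a hypothetical helper, and the prefix map holds only the SCT URI quoted above because the NCIt URI is elided in this thread:

```python
def has_allowed_prefix(identifier, id_prefixes, prefix_map=None):
    """True if a CURIE or a URI starts with one of the allowed prefixes."""
    prefix_map = prefix_map or {}
    for prefix in id_prefixes:
        # YAML / JSON form: the id is a CURIE such as "SCT:22298006"
        if identifier.startswith(prefix + ":"):
            return True
        # RDF form: the id is a URI that starts with the prefix expansion
        uri = prefix_map.get(prefix)
        if uri and identifier.startswith(uri):
            return True
    return False


allowed = ["NCIt", "SCT"]
# Assumption: only the SCT expansion is known here.
prefix_map = {"SCT": "http://snomed.org/id/"}
```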

What I would propose, however, is that we extend the definition of id_prefixes to support the following:

classes:
   HighClass:
       id_prefixes:
          NCIt:
          SCT:

Which would be the same as the above. We would extend the definition slightly, to allow:

classes:
    HighClass:
       id_prefixes:
           NCIt: ^C\d{5,6}$
           SCT: ^\d{6,18}$

Which would assert that the local name of a CURIE or URI must be "C" followed by 5 or 6 digits if it began with NCIt, or a 6 to 18 digit number if it began with SCT.
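A sketch of how the dictionary form might be applied to a CURIE, assuming an empty value (loaded as `None`) means "any local name"; `id_matches` is a hypothetical helper, not something the loaders provide:

```python
import re

# The two patterns proposed above, keyed by prefix.
ID_PREFIXES = {
    "NCIt": r"^C\d{5,6}$",
    "SCT": r"^\d{6,18}$",
}


def id_matches(curie, id_prefixes):
    """Check the prefix of a CURIE and, if a pattern is given, its local name."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in id_prefixes:
        return False
    pattern = id_prefixes[prefix]
    # None (an empty YAML value) places no constraint on the local name.
    return pattern is None or re.search(pattern, local) is not None
```

With these patterns, "NCIt:C12345" and "SCT:22298006" pass, while "NCIt:C1234" (too few digits) and "GO:0007165" (prefix not listed) do not.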

This would be a minimal change to the LinkML model itself and, since the loaders do not yet do anything with ID prefixes, it would require no additions there.

Questions: 1) Do we really need the "^...$" pattern or can we assume them? 2) Would we ever want two or more patterns and, if so, would we want something of the form

    id_prefixes:
        NCIt:
          - C\d{5}
          - M\d{7}

or would "(C\d{5}|M\d{7})" be ok?
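Both questions can be tried out in Python. In this sketch the anchors are implied by using `re.fullmatch`, and a list of patterns is treated as "any one of them"; `local_name_ok` is a hypothetical helper written only for illustration:

```python
import re


def local_name_ok(local, patterns):
    """Accept a single pattern or a list; fullmatch implies the ^...$ anchors."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(re.fullmatch(p, local) for p in patterns)


# The two spellings from question 2.
as_list = [r"C\d{5}", r"M\d{7}"]
as_alternation = r"(C\d{5}|M\d{7})"
```

Under these assumptions the list and the single alternation accept the same local names, and a pattern written with explicit "^...$" still works because the anchors are zero-width.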

3) It should be noted that SNOMED CT, in particular, includes a check digit and other formatting information that isn't expressible as a simple RE. Should we provide a hook for future use that names an algorithm, or just let it slide?

My suggested answers are: 1) Assume them, 2) single RE is fine and 3) nah - not now
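For reference on question 3: the SNOMED CT check digit is a Verhoeff check digit, which is the kind of rule a regular expression cannot express. A minimal sketch of what a named-algorithm hook would have to run; `verhoeff_valid` is a hypothetical name, and this checks only the check digit, not the other SCTID formatting rules:

```python
# Multiplication table of the dihedral group D5.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]

# Position-dependent permutation table (repeats every 8 positions).
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(number):
    """True if a digit string, including its final check digit, passes Verhoeff."""
    if not (number.isascii() and number.isdigit()):
        return False
    c = 0
    # Process digits right to left, starting with the check digit.
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0
```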

cmungall commented 3 years ago

I think this is a bug in the docs. It should just say "must begin with one of the URIs referenced by this prefix". Or we could frame it in terms of CURIEs.
- let's avoid NCIT as an example though, as there are in fact different URI interpretations of the CURIE (OBO and native)
I like the general idea, but there are some drawbacks:
- people who are currently using id_prefixes will have to change their code
- violates DRY. Some prefixes are 1:1 with classes. Others (NCIT, SCTID, KEGG) cover many classes, e.g. in biolink. To complicate matters, some (e.g. KEGG) have a datatype-specific localId pattern; others don't

hsolbrig commented 3 years ago

people who are currently using id_prefixes will have to change their code

Are you talking Python code or YAML? The YAML is backwards compatible, but you are correct -- the Python will change from a list to a dictionary...
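One way Python code could cope with both shapes is to normalise them on load. A minimal sketch, assuming the value has already been parsed from YAML (where an empty mapping value such as "NCIt:" loads as `None`); `normalize_id_prefixes` is a hypothetical helper, not part of the generated LinkML classes:

```python
def normalize_id_prefixes(value):
    """Return {prefix: pattern or None} for either the list or the dict form."""
    if value is None:
        return {}
    if isinstance(value, dict):
        # New form: prefix -> pattern, with None meaning "no pattern".
        return dict(value)
    # Existing form: a plain list of prefixes, none of which has a pattern.
    return {prefix: None for prefix in value}


old_style = normalize_id_prefixes(["NCIt", "SCT"])
new_style = normalize_id_prefixes({"NCIt": None, "SCT": None})
```

Code written against the dictionary shape would then see the same value for the list form and for the dictionary form with empty values.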

hsolbrig commented 3 years ago

violates DRY. Some prefixes are 1:1 with classes. Others (NCIT, SCTID, KEGG) cover many classes, e.g. in biolink. To complicate matters, some (e.g. KEGG) have a datatype-specific localId pattern; others don't

Hmmm... this makes me suspect that we may not be viewing this exactly the same way. What does it mean to "cover many classes"?

cmungall commented 3 years ago

See biolink

  pathway:
    is_a: biological process
    mixins:
      - ontology class
    exact_mappings:
      - PW:0000001
      - WIKIDATA:Q4915012
    narrow_mappings:
      - SIO:010526
      - GO:0007165
    id_prefixes:
      - GO
      - REACT
      - KEGG
      - SMPDB
      - MSigDB
      - PHARMGKB.PATHWAYS
      - WIKIPATHWAYS
      - FB  # FlyBase FBgg*
      - PANTHER.PATHWAY
...

  anatomical entity:
    is_a: organismal entity
    mixins:
      - thing with taxon
      - physical essence
    description: >-
      A subcellular location, cell type or gross anatomical part
    exact_mappings:
      - UBERON:0001062
      - WIKIDATA:Q4936952
      # UMLS Semantic Group "Anatomy"
      - UMLSSG:ANAT
    narrow_mappings:
      # Body System
      - UMLSSC:T022
      - UMLSST:bdsy
      # Body Location or Region
      - UMLSSC:T029
      - UMLSST:blor
      # Body Space or Junction
      - UMLSSC:T030
      - UMLSST:bsoj
      # Body Substance
      - UMLSSC:T031
      - UMLSST:bdsu
    id_prefixes:
      - UBERON
      - GO
      - CL
      - UMLS
      - MESH
      - NCIT
    in_subset:
      - model_organism_database

GO IDs are somewhat heterogeneous. They can be used for 3 biolink classes (process, activity, anatomy/cell component)

NCIT IDs are very heterogeneous
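To see which prefixes "cover many classes", the class-to-prefix lists can be inverted. A sketch using only the two excerpts quoted above (the pathway excerpt is truncated, so its list may be incomplete); `classes_by_prefix` is a hypothetical helper:

```python
from collections import defaultdict

# id_prefixes copied from the two biolink excerpts above.
CLASS_PREFIXES = {
    "pathway": [
        "GO", "REACT", "KEGG", "SMPDB", "MSigDB",
        "PHARMGKB.PATHWAYS", "WIKIPATHWAYS", "FB", "PANTHER.PATHWAY",
    ],
    "anatomical entity": ["UBERON", "GO", "CL", "UMLS", "MESH", "NCIT"],
}


def classes_by_prefix(class_prefixes):
    """Invert {class: [prefixes]} into {prefix: [classes]}."""
    index = defaultdict(list)
    for cls, prefixes in class_prefixes.items():
        for prefix in prefixes:
            index[prefix].append(cls)
    return dict(index)


# Prefixes listed on more than one class; a per-class pattern would be repeated for each.
shared = {p: c for p, c in classes_by_prefix(CLASS_PREFIXES).items() if len(c) > 1}
```

Even with just these two classes, GO shows up under both, which is where the DRY concern comes from.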

cmungall commented 3 years ago

Can we close this and move any outstanding issues to https://github.com/linkml/linkml/issues/194?

linkml / linkml-model

Proposal for enhancement to "id_prefixes" #28