Closed: hsolbrig closed 3 years ago
people who are currently using id_prefixes will have to change their code
Are you talking about Python code or YAML? The YAML is backwards compatible, but you are correct: the Python will change from a list to a dictionary...
violates DRY. Some prefixes are 1:1 with classes. Others (NCIT, SCTID, KEGG) cover many classes, e.g. in biolink. To complicate matters, some (e.g. KEGG) have a datatype-specific localId pattern, while others don't
Hmmm... this makes me suspect that we may not be viewing this exactly the same way. What does it mean to "cover many classes"?
See biolink
pathway:
  is_a: biological process
  mixins:
    - ontology class
  exact_mappings:
    - PW:0000001
    - WIKIDATA:Q4915012
  narrow_mappings:
    - SIO:010526
    - GO:0007165
  id_prefixes:
    - GO
    - REACT
    - KEGG
    - SMPDB
    - MSigDB
    - PHARMGKB.PATHWAYS
    - WIKIPATHWAYS
    - FB  # FlyBase FBgg*
    - PANTHER.PATHWAY
...
anatomical entity:
  is_a: organismal entity
  mixins:
    - thing with taxon
    - physical essence
  description: >-
    A subcellular location, cell type or gross anatomical part
  exact_mappings:
    - UBERON:0001062
    - WIKIDATA:Q4936952
    # UMLS Semantic Group "Anatomy"
    - UMLSSG:ANAT
  narrow_mappings:
    # Body System
    - UMLSSC:T022
    - UMLSST:bdsy
    # Body Location or Region
    - UMLSSC:T029
    - UMLSST:blor
    # Body Space or Junction
    - UMLSSC:T030
    - UMLSST:bsoj
    # Body Substance
    - UMLSSC:T031
    - UMLSST:bdsu
  id_prefixes:
    - UBERON
    - GO
    - CL
    - UMLS
    - MESH
    - NCIT
  in_subset:
    - model_organism_database
GO IDs are somewhat heterogeneous. They can be used for 3 biolink classes (process, activity, anatomy/cell component)
NCIT IDs are very heterogeneous
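To make "cover many classes" concrete, here is a rough sketch that inverts the id_prefixes lists from the two classes quoted above into a prefix-to-classes index. The dictionary and the function name `classes_by_prefix` are made up for illustration; only the prefix lists come from the excerpt (and the pathway list is cut off where the excerpt is).

```python
from collections import defaultdict

# id_prefixes copied from the two biolink classes quoted above
# (only the entries visible in the excerpt).
CLASS_ID_PREFIXES = {
    "pathway": ["GO", "REACT", "KEGG", "SMPDB", "MSigDB",
                "PHARMGKB.PATHWAYS", "WIKIPATHWAYS", "FB", "PANTHER.PATHWAY"],
    "anatomical entity": ["UBERON", "GO", "CL", "UMLS", "MESH", "NCIT"],
}


def classes_by_prefix(class_prefixes):
    """Invert class -> prefixes into prefix -> sorted list of classes."""
    index = defaultdict(set)
    for cls, prefixes in class_prefixes.items():
        for prefix in prefixes:
            index[prefix].add(cls)
    return {prefix: sorted(classes) for prefix, classes in index.items()}


# Prefixes that are listed on more than one class.
shared = {p: c for p, c in classes_by_prefix(CLASS_ID_PREFIXES).items() if len(c) > 1}
print(shared)  # {'GO': ['anatomical entity', 'pathway']}
```

With just these two classes, GO is the only prefix listed on both; running the same inversion over the full model would show how far from 1:1 the prefix/class relationship is.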
Can we close this and move any outstanding issues to https://github.com/linkml/linkml/issues/194?
1) id_prefixes currently says "the identifier of this class or slot must begin with one of the URIs referenced by this prefix".
This sort of implies that a prefix can reference more than one URI. I'm hoping that we are dealing with a model where every prefix maps to exactly one URI (note, however, that the reverse may not necessarily be true... I need to check whether we guarantee uniqueness of URIs per prefix)
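The check mentioned in point 1 could look roughly like the sketch below. It is not LinkML code: `check_prefix_map` is a hypothetical helper, and the example.org URIs and the extra SCTID declaration are placeholders chosen only to trigger the second case.

```python
def check_prefix_map(pairs):
    """Return (prefixes bound to more than one URI, URIs bound to more than one prefix)."""
    uris_for_prefix = {}
    prefixes_for_uri = {}
    for prefix, uri in pairs:
        uris_for_prefix.setdefault(prefix, set()).add(uri)
        prefixes_for_uri.setdefault(uri, set()).add(prefix)
    ambiguous_prefixes = {p: sorted(u) for p, u in uris_for_prefix.items() if len(u) > 1}
    shared_uris = {u: sorted(p) for u, p in prefixes_for_uri.items() if len(p) > 1}
    return ambiguous_prefixes, shared_uris


# Hypothetical declarations; the example.org URIs are placeholders.
pairs = [
    ("NCIt", "http://example.org/ncit/"),
    ("SCT", "http://example.org/sct/"),
    ("SCTID", "http://example.org/sct/"),  # second prefix for the same URI
]
ambiguous, shared = check_prefix_map(pairs)
print(ambiguous)  # {}
print(shared)     # {'http://example.org/sct/': ['SCT', 'SCTID']}
```

Here every prefix maps to exactly one URI, but one URI is reachable through two prefixes, which is the "reverse may not be true" case.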
2) when it comes to actually validating data, I would think that the following:
Would assert that a YAML or JSON representation of the id of an instance of HighClass would necessarily start with "NCIt:" or "SCT:", while an RDF instance would start with https://nci.....org/ncit/... or http://snomed.org/id/.
What I would propose, however, is that we extend the definition of id_prefixes to support the following:
Which would be the same as the above. We would extend the definition slightly, to allow:
Which would assert that the local name of a CURIE or URI must begin with "C" followed by 5 or 6 digits if it began with NCIt, or must be a 6 to 18 digit number if it began with SCT.
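As a rough illustration of what a validator could do with this, the sketch below checks a CURIE against a prefix-to-pattern dictionary. The dictionary shape, the names `ID_PREFIX_PATTERNS` and `id_is_valid`, and the convention that `None` means "any local name" are assumptions for the example, not the proposed LinkML syntax; the two patterns are transcribed from the description above.

```python
import re

# Hypothetical dictionary form of id_prefixes: prefix -> local-name pattern.
# None would mean "any local name is acceptable" (an assumption of this sketch).
ID_PREFIX_PATTERNS = {
    "NCIt": r"C\d{5,6}",   # "C" followed by 5 or 6 digits
    "SCT": r"\d{6,18}",    # a 6 to 18 digit number
}


def id_is_valid(curie, prefix_patterns=ID_PREFIX_PATTERNS):
    """Check that a CURIE uses an allowed prefix and a conforming local name."""
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in prefix_patterns:
        return False
    pattern = prefix_patterns[prefix]
    if pattern is None:
        return True
    # fullmatch anchors the pattern at both ends of the local name.
    return re.fullmatch(pattern, local_name) is not None


print(id_is_valid("NCIt:C12345"))  # True
print(id_is_valid("NCIt:C1234"))   # False
print(id_is_valid("SCT:123456"))   # True
print(id_is_valid("GO:0007165"))   # False
```

Validating an RDF instance would additionally need each URI mapped back to its prefix before the local name is checked; that step is left out here.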
This would be a minimal change to the LinkML model itself and, since the loaders do not yet do anything with ID prefixes, it would require no additions there.
Questions: 1) Do we really need the "^...$" anchors in the pattern or can we assume them? 2) Would we ever want two or more patterns and, if so, would we want something of the form
or would "(C\d{5}|M\d{7})" be ok?
3) It should be noted that SNOMED CT, in particular, includes a check digit and other formatting information that isn't expressible as a simple RE. Should we provide a hook for future use that names an algorithm, or just let it slide?
My suggested answers are: 1) Assume them, 2) single RE is fine and 3) nah - not now
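For answers 1 and 2, a small Python sketch of what "assume the anchors" and "a single RE" would mean in practice. The pattern is the one quoted in question 2; `matches_local_name` is a hypothetical helper, and using `re.fullmatch` is just one way an implementation could supply the implicit "^...$".

```python
import re

# Single pattern with alternation, as quoted in question 2.
PATTERN = r"(C\d{5}|M\d{7})"


def matches_local_name(pattern, local_name):
    """Treat the pattern as implicitly anchored at both ends."""
    return re.fullmatch(pattern, local_name) is not None


print(matches_local_name(PATTERN, "C12345"))       # True
print(matches_local_name(PATTERN, "M1234567"))     # True
print(matches_local_name(PATTERN, "C12345extra"))  # False

# Without the implied end anchor, trailing junk slips through:
print(re.match(PATTERN, "C12345extra") is not None)  # True
```

The last line shows why the anchoring rule would need to be stated somewhere: an unanchored match accepts a local name with trailing characters.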