Open turbomam opened 9 months ago
No. My understanding is that these identifiers are assigned by EMSL, so we can't change them.
OK. We will have to use some prefix other than UUID. Can any of you help me see these EMSL Biosample identifiers in the wild? Like on a web page, an API, or a downloadable file?
There are 134 of them.
The 'UUID:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898' to <urn:uuid:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898> conversion takes place in nmdc_schema/anyuri_strings_to_iris.py, which is called by the local/mongo_as_nmdc_database_cuire_repaired.ttl target in project.Makefile.
That's required because linkml-convert isn't converting xsd:anyURI strings to CURIEs.
nmdc_schema/anyuri_strings_to_iris.py takes one or more --jsonld-context-jsons arguments, which are JSON-LD context files. One is derived from the schema, and another is added to handle prefixes that are defined in upper case in the schema but were used in lower case in the data.
--jsonld-context-jsons project/jsonld/nmdc.context.jsonld \
--jsonld-context-jsons assets/misc/data_prefix_expansions.context.jsonld \
This is all in addition to the temporary CURIE fixing in nmdc_schema/migration_recursion.py.
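To make the conversion described above concrete, here is a minimal Python sketch of prefix expansion with a case-insensitive fallback. It is not the code in nmdc_schema/anyuri_strings_to_iris.py: the function name `expand_curie` and the in-memory `PREFIXES` dict are assumptions for illustration, standing in for whatever the two JSON-LD context files provide. The "UUID" and "emsl" expansions are the ones mentioned in this thread.

```python
from typing import Dict, Optional


def expand_curie(value: str, prefixes: Dict[str, str]) -> Optional[str]:
    """Expand 'prefix:local' to a full IRI, or return None if the prefix is unknown."""
    prefix, sep, local = value.partition(":")
    if not sep:
        return None  # no colon, so this is not CURIE-like
    # Try the prefix exactly as written first.
    expansion = prefixes.get(prefix)
    if expansion is None:
        # Fall back to a case-insensitive match, e.g. "uuid" in data vs "UUID" in schema.
        lowered = {key.lower(): iri for key, iri in prefixes.items()}
        expansion = lowered.get(prefix.lower())
    if expansion is None:
        return None
    return expansion + local


# Hypothetical merged prefix map (an assumption, not read from the real context files).
PREFIXES = {
    "UUID": "urn:uuid:",
    "emsl": "http://example.org/emsl_biosample_in_mongodb/",
}

iri = expand_curie("UUID:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898", PREFIXES)
print(f"<{iri}>")
```

Note that the expansion succeeds mechanically even though the result is not a valid UUID URN, which is the problem this issue is about.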
I have replaced this principled UUID conversion with special case handling
@turbomam I know we made an EMSL study prefix, did we make one for biosamples? I would suggest we make an "nmdc stored EMSL biosample identifier" prefix. That makes it clear that this is an EMSL ID stored by NMDC, and that the prefix itself was not created by EMSL, while still giving these IDs a correct prefix. I can check with the NEXUS team tomorrow about doing this and confirm there are no issues here, like we did with studies. I'll get back to you.
Delayed. EMSL fire drill shorted the NEXUS meeting. I'll send a message, but delayed until I get a chance to talk with EMSL team. ~Oct 5
@turbomam I know we made an EMSL study prefix, did we make one for biosamples?
We don't really have one well-defined prefix or expansion for EMSL Biosamples yet.
Most of the links below are for SPARQL queries. These all run slower than some I've shared in other issues. Up to 30 seconds.
Here are the nmdc-schema prefixes that currently include the string 'emsl':
1. "http://example.org/emsl_biosample_in_mongodb/"
2. "https://bioregistry.io/emsl.project:"
3. "http://example.org/emsl_biosample_uuid_like/"
I arbitrarily created the 1st one to account for any identifiers in any part of MongoDB that used the "emsl" prefix. It looks like I was mistaken in thinking that all of those were Biosamples. There are also 2558 DataObjects and 1236 OmicsProcessings that use the under-specified "emsl" prefix.
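The SPARQL queries linked in this comment filter on an IRI prefix and group the matches by type. For readers without access to the endpoint, the same tally can be sketched in plain Python. Everything here is illustrative: `count_types_by_expansion` is a hypothetical helper, the `nmdc:` type labels are assumed spellings, and the toy records are made up to show the shape of the result, not the real counts.

```python
from collections import Counter
from typing import Iterable, List, Tuple

EMSL_EXPANSION = "http://example.org/emsl_biosample_in_mongodb/"


def count_types_by_expansion(
    typed_iris: Iterable[Tuple[str, str]], expansion: str
) -> List[Tuple[str, int]]:
    """Count rdf types of IRIs that start with `expansion`, most frequent first."""
    counts = Counter(
        rdf_type for iri, rdf_type in typed_iris if iri.startswith(expansion)
    )
    return counts.most_common()


# Toy (IRI, type) pairs, invented for this sketch only.
toy_records = [
    (EMSL_EXPANSION + "a1", "nmdc:DataObject"),
    (EMSL_EXPANSION + "a2", "nmdc:DataObject"),
    (EMSL_EXPANSION + "b1", "nmdc:OmicsProcessing"),
    ("http://example.org/other/c1", "nmdc:Biosample"),
]

print(count_types_by_expansion(toy_records, EMSL_EXPANSION))
```

A tally like this makes it easy to see when a prefix that was named for one class is actually shared by several.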
Would you consider the 2nd prefix and expansion to be for Studys? I don't see it in use anywhere yet.
I arbitrarily created the 3rd one to account for any emsl_biosample_identifiers values that use the "UUID" prefix.
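1 ("emsl") and 3 ("emsl_biosample_uuid_like") are different in the sense that
the "emsl"-prefixed values are being used as primary identifiers for things that are defined in the schema (although they should be using Napa ids)
the "UUID" prefixed values are being used as external identifiers, so it's OK that they don't follow the Napa protocol. They just can't use the UUID prefix because they aren't UUIDs
I did notice the 2558 DataObjects and 1236 OmicsProcessings that use the under-specified "emsl" prefix. It's a bit confusing to look at these as SPARQL query output, since they are of other types. Not a big priority, but if you could update the prefix expansion for emsl to something more generic, it would be clearer. These prefixes will stay on as alternative identifiers after the re-iding.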
I arbitrarily created the 1st one to account for any identifiers in any part of MongoDB that used the "emsl" prefix. It looks like [I was mistaken in thinking that all of those were Biosamples]. There are also 2558 DataObjects and 1236 OmicsProcessings that use the under-specified "emsl" prefix
So, this shouldn't be "emsl_biosample_in_mongodb", because it's also used for omics processing records? If so, we need to fix that by either
Remove this from the omics processing records, make a new one for omics processing records, and assign this to ONLY biosamples
Would you consider the 2nd prefix and expansion to be for Studys? [I don't see it in use anywhere yet.]
Yes
I arbitrarily created the 3rd one to account for [any emsl_biosample_identifiers values that use the "UUID" prefix.]
I don't think this should've been done. And we should stop making random fixes in a vacuum. This is a repeat of the 1st one, depending on the decision we make for managing the omics processing vs biosamples.
the "emsl"-prefixed values are being used as primary identifiers for things that are defined in the schema (although they should be using Napa ids)
Agreed
the "UUID" prefixed values are being used as external identifiers, so it's OK that they don't follow the Napa protocol. They just can't use the UUID prefix because they aren't UUIDs
That makes sense. But the prefix should be whatever we decide on for the "emsl biosample identifier" not a weird "uuid like" prefix.
related to #1130
Partial Biosample metadata:
Converting that to RDF with the make-rdf makefile target gives excerpts like this:
but Jena riot says
Wikipedia says that UUIDs should use the 8-4-4-4-12 format. We can ignore the angle brackets and the urn:uuid: authority assertion. The emsl_biosample_identifiers value of 'CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898' does include an 8-4-4-4-12, but only after an illegal 'CPER-CB-T-' prefix.
Will this be addressed as part of the Napa id squad @mslarae13 @aclum @SamuelPurvine ?
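The 8-4-4-4-12 check described above can be sketched in a few lines of Python. This is only an illustration of the format test, not what Jena riot does internally; the helper names `is_canonical_uuid` and `find_embedded_uuid` are made up for this example.

```python
import re
from typing import Optional

# Hexadecimal groups of 8, 4, 4, 4, and 12 characters separated by hyphens.
UUID_8_4_4_4_12 = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_canonical_uuid(text: str) -> bool:
    """True only if the whole string is in 8-4-4-4-12 form."""
    return UUID_8_4_4_4_12.fullmatch(text) is not None


def find_embedded_uuid(text: str) -> Optional[str]:
    """Return the first 8-4-4-4-12 substring, if any."""
    match = UUID_8_4_4_4_12.search(text)
    return match.group(0) if match else None


value = "CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898"
print(is_canonical_uuid(value))   # False: the 'CPER-CB-T-' prefix breaks the format
print(find_embedded_uuid(value))  # b8027938-7ff4-46c1-8575-e23584f1e898
```

A check like this could flag every emsl_biosample_identifiers value that carries the "UUID" prefix without being a UUID.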