microbiomedata / nmdc-schema

National Microbiome Data Collaborative (NMDC) unified data model
https://microbiomedata.github.io/nmdc-schema/
Creative Commons Zero v1.0 Universal

`emsl_biosample_identifiers` links some `Biosample`s to values that are prefixed with 'UUID:' but aren't legal UUIDs #1117

Open turbomam opened 9 months ago

turbomam commented 9 months ago

Partial Biosample metadata:

{
  "id": "nmdc:bsm-11-znb2tv24",
  "emsl_biosample_identifiers": [
    "UUID:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898"
  ]
}

Converting that to RDF with the make-rdf makefile target gives excerpts like this:

nmdc:bsm-11-znb2tv24 a nmdc:Biosample ;
    nmdc:emsl_biosample_identifiers <urn:uuid:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898>  .

but Jena riot says

Bad IRI: Not a valid UUID string: urn:uuid:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898

Wikipedia says that UUIDs should use the 8-4-4-4-12 format. We can ignore the angle brackets and the urn:uuid: authority assertion. The emsl_biosample_identifiers value of 'CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898' does include an 8-4-4-4-12 pattern, but only after an illegal 'CPER-CB-T-' prefix.
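A minimal sketch of that check, assuming a strict reading of the 8-4-4-4-12 hexadecimal layout. The names `UUID_PATTERN`, `is_canonical_uuid`, and `split_uuid_like` are hypothetical and not part of nmdc-schema, and Jena riot may apply different rules.

```python
import re
from typing import Optional, Tuple

# 8-4-4-4-12 hexadecimal groups, the canonical textual UUID layout.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_canonical_uuid(value: str) -> bool:
    """True only if the whole string is an 8-4-4-4-12 hexadecimal UUID."""
    return UUID_PATTERN.fullmatch(value) is not None


def split_uuid_like(value: str) -> Optional[Tuple[str, str]]:
    """Return (leading text, trailing UUID) if value ends with a UUID, else None."""
    if len(value) < 36:  # a canonical UUID string is 36 characters long
        return None
    head, tail = value[:-36], value[-36:]
    if is_canonical_uuid(tail):
        return head, tail
    return None


# The local part of the identifier from the Biosample excerpt above.
local_part = "CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898"
print(is_canonical_uuid(local_part))
print(split_uuid_like(local_part))
```

The second function only illustrates that the value is a legal UUID with extra leading text. It does not suggest that the leading text should be dropped.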

Will this be addressed as part of the Napa id squad @mslarae13 @aclum @SamuelPurvine ?

aclum commented 9 months ago

No. My understanding is that these identifiers are assigned by EMSL, so we can't change them.

turbomam commented 9 months ago

OK. We will have to use some prefix other than UUID. Can any of you help me see these EMSL Biosample identifiers in the wild? Like on a web page, an API, or a downloadable file?

turbomam commented 9 months ago

There are 134 of them.

turbomam commented 9 months ago

The 'UUID:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898' to <urn:uuid:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898> conversion takes place in nmdc_schema/anyuri_strings_to_iris.py, which is called by the local/mongo_as_nmdc_database_cuire_repaired.ttl target in project.Makefile.

That's required because linkml-convert isn't converting xsd:anyURI strings to CURIEs.

nmdc_schema/anyuri_strings_to_iris.py takes one or more --jsonld-context-jsons arguments, which are JSON-LD context files. One is derived from the schema, and another is added to handle prefixes that are defined as upper case in the schema but were used in lower case in the data.

        --jsonld-context-jsons project/jsonld/nmdc.context.jsonld \
        --jsonld-context-jsons assets/misc/data_prefix_expansions.context.jsonld \

This is all in addition to the temporary CURIE fixing in nmdc_schema/migration_recursion.py.
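As a rough illustration of the expansion step, here is a sketch that merges two prefix maps and expands a prefixed string into an IRI. It is not the code in nmdc_schema/anyuri_strings_to_iris.py. The function names, the in-memory dictionaries standing in for the two context files, the lower-case `uuid` entry, and the `UUID` to `urn:uuid:` mapping (inferred from the Turtle excerpt earlier in this thread) are all assumptions.

```python
from typing import Dict, Optional


def merge_prefix_maps(*maps: Dict[str, str]) -> Dict[str, str]:
    """Combine prefix maps; later maps win when a prefix is repeated."""
    merged: Dict[str, str] = {}
    for prefix_map in maps:
        merged.update(prefix_map)
    return merged


def expand_prefixed_string(value: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Expand 'prefix:local' to an IRI, or return None if the prefix is unknown."""
    prefix, sep, local = value.partition(":")
    if not sep or prefix not in prefix_map:
        return None
    return prefix_map[prefix] + local


# Hypothetical stand-ins for the two JSON-LD context files.
schema_prefixes = {"UUID": "urn:uuid:"}  # assumed: upper case, as in the schema
data_prefixes = {"uuid": "urn:uuid:"}    # assumed: lower-case variant seen in data

prefix_map = merge_prefix_maps(schema_prefixes, data_prefixes)
print(expand_prefixed_string(
    "UUID:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898", prefix_map
))
```

A purely mechanical expansion like this does not validate the local part, which is how an IRI that is not a legal UUID can end up under urn:uuid:.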

turbomam commented 9 months ago

I have replaced this principled UUID conversion with special-case handling.

mslarae13 commented 9 months ago

@turbomam I know we made an EMSL study prefix, did we make one for biosamples? I would suggest we make an "nmdc stored EMSL biosample identifier" prefix. That makes it clear that this is an EMSL ID stored by NMDC, and that the prefix was not created by EMSL, while still providing a correct prefix for these IDs. I can check with the NEXUS team tomorrow about doing this and confirm there are no issues here, like we did with studies. I'll get back to you.

mslarae13 commented 9 months ago

Delayed. An EMSL fire drill cut the NEXUS meeting short. I'll send a message, but this is delayed until I get a chance to talk with the EMSL team. ~Oct 5

turbomam commented 9 months ago

> @turbomam I know we made an EMSL study prefix, did we make one for biosamples?

We don't really have one well-defined prefix or expansion for EMSL Biosamples yet.

Most of the links below are for SPARQL queries. These all run slower than some I've shared in other issues, taking up to 30 seconds.

Here are the nmdc-schema prefixes that currently include the string 'emsl':

  1. "emsl": "http://example.org/emsl_biosample_in_mongodb/",
  2. "emsl.project": "https://bioregistry.io/emsl.project:",
  3. "emsl_biosample_uuid_like": "http://example.org/emsl_biosample_uuid_like/",

I arbitrarily created the 1st one to account for any identifiers in any part of MongoDB that used the "emsl" prefix. It looks like I was mistaken in thinking that all of those were Biosamples. There are also 2558 DataObjects and 1236 OmicsProcessings that use the under-specified "emsl" prefix.

Would you consider the 2nd prefix and expansion to be for Studys? I don't see it in use anywhere yet.

I arbitrarily created the 3rd one to account for any emsl_biosample_identifiers values that use the "UUID" prefix.
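One way to picture special-case handling for the 3rd prefix, sketched under assumptions: values with the "UUID" prefix are checked with Python's `uuid.UUID`, and those that fail are expanded with the "emsl_biosample_uuid_like" base from the list above instead of `urn:uuid:`. The function name `uuid_prefixed_to_iri` is hypothetical, and this is not the handling actually implemented in nmdc-schema.

```python
import uuid

# Expansion for the 3rd prefix in the list above.
UUID_LIKE_BASE = "http://example.org/emsl_biosample_uuid_like/"


def uuid_prefixed_to_iri(value: str) -> str:
    """Expand a 'UUID:'-prefixed value, falling back when it is not a real UUID."""
    prefix, sep, local = value.partition(":")
    if not sep or prefix != "UUID":
        raise ValueError(f"not a UUID-prefixed value: {value!r}")
    try:
        # uuid.UUID is lenient about braces and hyphens, so also require
        # that the canonical form round-trips to the input.
        is_real_uuid = str(uuid.UUID(local)) == local.lower()
    except ValueError:
        is_real_uuid = False
    if is_real_uuid:
        return "urn:uuid:" + local.lower()
    return UUID_LIKE_BASE + local


print(uuid_prefixed_to_iri("UUID:CPER-CB-T-b8027938-7ff4-46c1-8575-e23584f1e898"))
```

Real UUIDs would still land under urn:uuid:, so only the values with extra leading text are routed to the placeholder expansion.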

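1 ("emsl") and 3 ("emsl_biosample_uuid_like") are different in the sense that:

  - the "emsl"-prefixed values are being used as primary identifiers for things that are defined in the schema (although they should be using Napa ids)
  - the "UUID" prefixed values are being used as external identifiers, so it's OK that they don't follow the Napa protocol. They just can't use the UUID prefix because they aren't UUIDs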

aclum commented 9 months ago

I did notice the 2558 DataObjects and 1236 OmicsProcessings that use the under-specified "emsl" prefix. It's a bit confusing to look at these as SPARQL query output since they are of other types. Not a big priority, but if you could update the prefix expansion for emsl to something more generic, it would be clearer. These prefixes will stay on as alternative identifiers after the re-iding.

mslarae13 commented 9 months ago

> I arbitrarily created the 1st one to account for any identifiers in any part of MongoDB that used the "emsl" prefix. It looks like I was mistaken in thinking that all of those were Biosamples. There are also 2558 DataObjects and 1236 OmicsProcessings that use the under-specified "emsl" prefix

So this shouldn't be "emsl_biosample_in_mongodb", because it's also used for omics processing records? If so, we need to fix that by either:

  1. Changing "emsl_biosample_in_mongodb" to something else for omics processing
  2. Removing this from the omics processing records, making a new one for omics processing records, and assigning this to ONLY biosamples

> Would you consider the 2nd prefix and expansion to be for Studys? I don't see it in use anywhere yet.

Yes

> I arbitrarily created the 3rd one to account for any emsl_biosample_identifiers values that use the "UUID" prefix.

I don't think this should've been done. And we should stop making random fixes in a vacuum. This is a repeat of the 1st one, depending on the decision we make for managing the omics processing vs biosamples.

> the "emsl"-prefixed values are being used as primary identifiers for things that are defined in the schema (although they should be using Napa ids)

Agreed

> the "UUID" prefixed values are being used as external identifiers, so it's OK that they don't follow the Napa protocol. They just can't use the UUID prefix because they aren't UUIDs

That makes sense. But the prefix should be whatever we decide on for the "emsl biosample identifier", not a weird "uuid like" prefix.

aclum commented 8 months ago

related to #1130