w3c / hcls-fhir-rdf

Sketching out an RDF representation for FHIR

Lack of concept URIs for CodeableConcepts -- Concept IRIs #94

Closed dbooth-boston closed 11 months ago

dbooth-boston commented 2 years ago

Individual concepts do not necessarily have canonical URIs to identify them. See example. Should we do something about that? Should we concatenate the fhir:Coding.system with the fhir:Coding.code in some way, to produce a canonical URI for the concept?

gaurav commented 2 years ago

Note that fhir:Coding.system must be a URI, while fhir:Coding.code must be a code, which may include spaces (as per the spec). Looking through the examples, it looks like system URIs are not intended to be used as prefixes. For example (from https://www.hl7.org/fhir/datatypes-examples.html#coding, but please point me to other examples to add here):

| Vocabulary | fhir:Coding.system | fhir:Coding.code | Actual prefix | Actual URL |
| --- | --- | --- | --- | --- |
| ICD-10 | http://hl7.org/fhir/sid/icd-10 | G44.1 | http://purl.bioontology.org/ontology/ICD10/ | http://purl.bioontology.org/ontology/ICD10/G44.1 |
| SNOMED CT | http://snomed.info/sct | 128045006:{363698007=56459004} | http://purl.bioontology.org/ontology/SNOMEDCT/ | http://purl.bioontology.org/ontology/SNOMEDCT/128045006 |

(Note that I wasn't able to find an online service that includes the more complex SNOMED code used in the example above.)

Possible outcomes: AFAICT, this means that we can't use the fhir:Coding.system as a prefix, and have to either:

  1. Provide a web service that can be queried with a fhir:Coding.system and fhir:Coding.code combination in order to return a concept IRI, or
  2. Create a definitive mapping of fhir:Coding.system values to IRI prefixes, such that SNOMED CT (with a fhir:Coding.system of http://snomed.info/sct) will always be mapped to http://purl.bioontology.org/ontology/SNOMEDCT/. This could then be concatenated with the fhir:Coding.code (we would need to decide if spaces should be encoded as + or %20) to provide a full concept IRI. This could be stored in GitHub so that changes to it can be tracked, and hopefully could eventually be integrated into the FHIR specification.
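
A minimal sketch of option 2, assuming a hand-maintained mapping. The names `SYSTEM_TO_PREFIX` and `concept_iri` are hypothetical, and the two entries are just the BioPortal prefixes from the table above, not an agreed mapping. It also shows both candidate encodings for spaces (`%20` versus `+`).

```python
from urllib.parse import quote, quote_plus

# Hypothetical mapping from fhir:Coding.system to an IRI prefix (illustrative only).
SYSTEM_TO_PREFIX = {
    "http://hl7.org/fhir/sid/icd-10": "http://purl.bioontology.org/ontology/ICD10/",
    "http://snomed.info/sct": "http://purl.bioontology.org/ontology/SNOMEDCT/",
}


def concept_iri(system, code, plus_for_space=False):
    """Return prefix + encoded code, or None if the system has no known prefix."""
    prefix = SYSTEM_TO_PREFIX.get(system)
    if prefix is None:
        return None
    # safe="" percent-encodes every reserved character, including "/".
    encoded = quote_plus(code, safe="") if plus_for_space else quote(code, safe="")
    return prefix + encoded


print(concept_iri("http://hl7.org/fhir/sid/icd-10", "G44.1"))
print(concept_iri("http://snomed.info/sct", "a b"))                       # space as %20
print(concept_iri("http://snomed.info/sct", "a b", plus_for_space=True))  # space as +
```

Returning None for an unknown system keeps the "no concept IRI available" case explicit instead of guessing a prefix.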
gaurav commented 2 years ago

It might be useful to make a list of every Coding system used in the FHIR examples; however, this list would not be exhaustive.

gaurav commented 2 years ago

Can we put this into http://registry.fhir.org/ somehow? @gaurav to investigate.

gaurav commented 2 years ago

Some healthcare systems also have their own internal coding systems -- how do we handle that?

ericprud commented 2 years ago

Harold and I had decided that we could put them in if we had a mapping for them (assuming the mappings were reasonable to code generically). This means we can map SNOMED-CT, LOINC, etc. using pretty official URLs. Others, we could "host" in an HL7 namespace until the org behind them saw the value and said "gimme!" At that point, you have a bit of a prob 'cause you don't want to maintain utterly enormous tables of OWL:sameIndividualAs links. I suspect the answer there would be writing custom code for the platform stuck with obsolete URLs.

UMLS could provide some basis for a hosted namespace for un-Web-ified vocabs.

gaurav commented 2 years ago

> It might be useful to make a list of every Coding system used in the FHIR examples; however, this list would not be exhaustive.

I haven't had time to extract these yet; however, a list of system URIs that can be used in FHIR Codings is available at https://build.fhir.org/terminologies-systems.html

Some additional code systems are listed on the FHIR Terminology Service at http://tx.fhir.org/r5/ and on the HL7 Terminology Service at https://terminology.hl7.org/codesystems.html

gaurav commented 2 years ago

I have learned a few more things:

So, I think there are a series of potential solutions we can implement:

  1. The ideal solution would be to add prefix as an identifier type to NamingSystem.identifier.type and fill in prefixes for the 255 naming systems currently published to terminology.hl7.org. We can then use the hl7-terminology NPM package to read this information and fill in prefixes when given a system and code pair.
  2. If this is not doable, or would take too much time, we can temporarily include a list of these 255 naming systems in our fhircat tool with mappings to prefixes or other information regarding how to construct a concept IRI for coding systems. We can develop tools to compare our list with the list in the hl7-terminology to check for unmapped naming systems.
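
The comparison step in option 2 could be as small as a set difference. This is only a sketch: `unmapped_systems` is a hypothetical helper, and the inputs are illustrative stand-ins, since how the real lists would be read from hl7-terminology and fhircat is not settled here.

```python
def unmapped_systems(published_systems, our_prefixes):
    """Return the published system URIs that have no entry in our prefix mapping."""
    return sorted(set(published_systems) - set(our_prefixes))


# Illustrative inputs only; real lists would come from hl7-terminology and our tool.
published = [
    "http://snomed.info/sct",
    "http://loinc.org",
    "http://hl7.org/fhir/sid/ndc",
]
ours = {
    "http://snomed.info/sct": "http://snomed.info/id/",
    "http://loinc.org": "https://loinc.org/",
}

print(unmapped_systems(published, ours))
```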

Do you all think this would cover all our needs?

gaurav commented 2 years ago

NamingSystem -> non-authoritative third-party annotation about a code system

CodeSystem -> authoritative annotation by the publisher of a code system

Might want to have the prefix in CodeSystem -- there should only be one authoritative prefix/format for each coding system

CodeSystem URLs are based on hl7.org (e.g. http://hl7.org/fhir/sid/ndc), but the goal is probably to replace these with an authoritative URL when the resource wants to take over.

Gaurav to dig into CodeSystem to figure out where the prefix could go there.

gaurav commented 2 years ago

The prefix could potentially go into the CodeSystem.identifier, which is an Identifier with both an IdentifierType (named type) and IdentifierUse (named use). We might consider prefix as a potential value for use. There is also a generic CodeSystem.property field that we could use, but I think Identifier would be more specific.

So I think the next step is to write all of this up somewhere and then submit it to the FHIR writers to see what they think?

gaurav commented 2 years ago

type might be better to use here, since it is Extensible -- we can make up new types as needed.

gaurav commented 2 years ago

I downloaded and executed the code in https://github.com/HL7/UTG using Java 11. It generated the HTML documentation you see at https://terminology.hl7.org/. In doing so, it appears to use both tx.fhir.org (“Connect to Terminology Server at http://tx.fhir.org”, “-tx: Connect to http://tx.fhir.org/r4”) and hl7-terminology (“Installing hl7.terminology#3.0.0 to the package cache”, although I haven’t figured out where that cache is). I'll open an issue at https://github.com/HL7/UTG to hopefully get to the bottom of this, and am hoping that other FHIRCat team members like @ericprud or @dksharma might know as well.

Once I figure out how to modify those CodeSystem/NamingSystem files, I'm planning to create a (forked?) repository with prefixes added to some of those files, and write a little demonstration tool that uses that information to convert FHIR codings into RDF concept URLs and vice versa.

In the meantime, I'm also writing up a more formal description of this issue and possible solutions. This might be useful later on if we do need to explain what we're doing to people outside our team. I'll set it to be view-only since I'm posting that URL publicly, but please do request editing rights to that document if you would like to help!

gaurav commented 2 years ago

Current strategy:

Note that the fallback plan -- if HL7/FHIR refuse to put this into terminology.fhir.org -- would probably be to maintain this list separately.

Make sure that this works with US Core terminology: http://www.hl7.org/fhir/us/core/terminology.html -- they require specific URLs in that system, so we don't want to overwrite that or mess with it.

gaurav commented 2 years ago

Here are eight candidates for coding system/naming systems mentioned in the FHIR R5 examples that we can provide prefixes for:

| Resource | System URI | Prefix | Example |
| --- | --- | --- | --- |
| SNOMED CT | http://snomed.info/sct | http://snomed.info/id/ | 385221006 |
| LOINC | http://loinc.org | https://loinc.org/ | 10160-0 |
| ISO 3166 | urn:iso:std:iso:3166 | https://www.omg.org/spec/LCC/Countries/ISO3166-1-CountryCodes/ | CA (not resolvable, but RDF file at prefix) |
| DICOM | http://dicom.nema.org/resources/ontology/DCM | http://dicom.nema.org/resources/ontology/DCM/ | 110127 (not resolvable, but see BioPortal) |
| RxNorm | http://www.nlm.nih.gov/research/umls/rxnorm | http://purl.bioontology.org/ontology/RXNORM/ | 1160593 |
| MeSH | https://meshb.nlm.nih.gov/ | https://id.nlm.nih.gov/mesh/ | D000328 |
| PubMed | https://pubmed.ncbi.nlm.nih.gov | https://pubmed.ncbi.nlm.nih.gov/ | 32876694 |
| NCBI Nucleotide | http://www.ncbi.nlm.nih.gov/nuccore | https://www.ncbi.nlm.nih.gov/nuccore/ | NC_000009.11 |

All of these have ten or more mentions in the FHIR R5 examples, so we could further check on resolvability by (for example) looking up all the referenced codes to see if they work as expected.
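
A rough sketch of how a prefix table like the one above might drive conversion in both directions. The names `PREFIXES`, `to_iri` and `from_iri` are hypothetical, and only four of the candidate rows are copied in. The reverse lookup tries the longest prefix first, which is an assumption about how overlapping prefixes should be handled.

```python
from urllib.parse import quote, unquote

# System URI -> IRI prefix, copied from four rows of the candidate table.
PREFIXES = {
    "http://snomed.info/sct": "http://snomed.info/id/",
    "http://loinc.org": "https://loinc.org/",
    "http://www.nlm.nih.gov/research/umls/rxnorm": "http://purl.bioontology.org/ontology/RXNORM/",
    "https://meshb.nlm.nih.gov/": "https://id.nlm.nih.gov/mesh/",
}


def to_iri(system, code):
    """Build a concept IRI from a system/code pair, or None if the system is unmapped."""
    prefix = PREFIXES.get(system)
    return None if prefix is None else prefix + quote(code, safe="")


def from_iri(iri):
    """Recover (system, code) from a concept IRI, or None if no prefix matches."""
    # Longest prefix first, so a more specific prefix wins over a shorter one.
    for system, prefix in sorted(PREFIXES.items(), key=lambda kv: len(kv[1]), reverse=True):
        if iri.startswith(prefix):
            return system, unquote(iri[len(prefix):])
    return None


print(to_iri("http://snomed.info/sct", "385221006"))
print(from_iri("https://loinc.org/10160-0"))
```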

@ericprud @balhoff You both have a lot more experience with RDF prefixes than I do, so if you see something I can do better here, please let me know!

gaurav commented 2 years ago

Weekly update:

Next steps:

Tasks further down the line:

gaurav commented 2 years ago

Re: the SNOMED 128045006:{363698007=56459004} compositional syntax, just URL-encoding it for now seems fine. But note that this is unneeded in FHIR, since you can express this in other ways. Also: it's good to push people towards prefixes rather than trying to do this in a more complicated way.

Do we need to canonicalize blank spaces/pipes/etc in the code value? Probably not -- we can leave them as is and leave it to downstream processing.
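
For reference, a small sketch of what "just URL-encoding it" could look like with Python's standard library. Using `safe=""` (so that every reserved character is escaped) is an assumption, not something we have agreed on.

```python
from urllib.parse import quote, unquote

expression = "128045006:{363698007=56459004}"

# Percent-encode every reserved character in the compositional expression.
encoded = quote(expression, safe="")
print(encoded)

# The encoding is reversible, so the original code value is not lost.
assert unquote(encoded) == expression

# Spaces and pipes in a code value are escaped the same way.
print(quote("a b|c", safe=""))
```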

gaurav commented 2 years ago

I've uploaded to Google Drive the lists of all system codes in R4/R5 (system-codes-r[45].tsv) and the unique system/code pairs (unique-codes-r[45].tsv). I'm trying to figure out some way to validate whether the IRIs being generated are correct -- for now, I'm trying to see whether those IRIs are resolvable (resolved-r[45].tsv). For the FHIR JSON examples for R5, I got 370 unique system values with a total of 1,968 unique system-code pairs, of which I could generate 789 concept IRIs using the five examples described above. Out of 789 IRIs I attempted to resolve, I got 671 successes (HTTP 200), 112 not found (HTTP 404), 3 server errors (HTTP 500) and 2 request timeouts. So it looks like this approach might be worth pursuing? Some of those 404s are IRIs that are not intended to resolve, so we might want to try resolving them against the OLS instead.
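
A sketch of the kind of resolvability check described above, not the actual script used to produce the numbers. `classify`, `fetch_status` and `tally` are hypothetical names, and the choice of HEAD requests and a 10-second timeout are assumptions (some servers may only answer GET).

```python
from collections import Counter
import urllib.error
import urllib.request


def classify(status):
    """Bucket an HTTP status code (or None for no response) into a result label."""
    if status is None:
        return "timeout_or_error"
    if 200 <= status < 300:
        return "success"
    if status == 404:
        return "not_found"
    if status >= 500:
        return "server_error"
    return "other"


def fetch_status(iri, timeout=10):
    """Return the HTTP status for an IRI, or None if no response was received."""
    request = urllib.request.Request(iri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions
    except OSError:
        return None  # covers URLError, timeouts and connection failures


def tally(iris, fetch=fetch_status):
    """Count result labels over a list of IRIs; `fetch` can be swapped out for testing."""
    return Counter(classify(fetch(iri)) for iri in iris)
```

Calling `tally(list_of_iris)` would then give per-category counts; passing a stub `fetch` makes it possible to test the bucketing without touching the network.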

I'm going to pause the software development work here to finish writing up the problem discussion I was working on earlier so we can check to see if there's anything missing here.

gaurav commented 2 years ago

I've updated the files (see Google Drive directory and resolved-r5 sheet) to include the display field from the FHIR Examples.

gaurav commented 2 years ago

I've written up a brief summary of the problem and our proposed solution on Google Docs -- you can only comment on the document with that link, but please do request editor access if you'd like to help make it better and prepare it for submission to the FHIR chat! Before we submit it there, I'd love to link to it from https://github.com/HL7/UTG/issues/7 and ask Chris Mungall to have a look at it, as he might be interested in this as well.

gaurav commented 2 years ago

As per our discussion last Thursday, I've asked chat.fhir.org for suggestions on sources of Coding.system/code pairs that are in use "in the wild": https://chat.fhir.org/#narrow/stream/179202-terminology/topic/Getting.20lists.20of.20CodeSystem.2FNamingSystems.20currently.20in.20use

gaurav commented 2 years ago

Grahame suggested checking system/code pairs from Synthea, which is available as software code (https://github.com/synthetichealth/synthea) or synthetic data sets (https://synthea.mitre.org/fhir-api).

dbooth-boston commented 1 year ago

Putting IRI stems into the HL7 repo would only be adding identifiers to that repo, so it does not need to be R5 balloted. But we do need to change the spec for R5 to say that "if the concept IRI is known, then add it to the RDF".

dbooth-boston commented 1 year ago

On today's call we made two decisions:

gaurav commented 1 year ago

Now that TSMG and the RDF subgroup have both voted on this, I think these are the next steps:

  1. To add IRIs as an identifier system. I thought this might require modifying the "Identifier Registry" page on FHIR (https://build.fhir.org/identifier-registry.html), but as per https://jira.hl7.org/browse/FHIR-17440 it looks like we need to submit a UTG ticket for this.
  2. I like the idea of submitting the change for a single CodeSystem (e.g. SNOMED) so we can make sure we're complying with UTG's change guidelines correctly.
  3. Once that's done, we can either make a single large change with all the IRI stems we can find for current external terminologies on terminology.hl7.org, or make separate changes for each IRI stem. We can use a Google spreadsheet to coordinate this work. Since the RDF subgroup is currently busy with R5 balloting changes, we'll probably start work on this in earnest once we hit the R5 ballot deadline in a few weeks.
dbooth-boston commented 11 months ago

Done, though addition of some more IRI stems continues.