related-sciences / nxontology-data

NXOntology data: making ontologies accessible as simple JSON files

MeSH: should we exclude non-English labels? #12

Closed dhimmel closed 1 year ago

dhimmel commented 1 year ago

From the June 18, 2015 MeSH RDF release notes:

Users now must specify the language tag @en when searching rdfs:label or any other string literal. See the sample queries page (queries 5 and 6) for examples. One preferred MeSH Heading, Central Nervous System which is D002493, has non-English strings as a proof-of-concept example. This sample will remain in the beta version but may not be included in the production MeSH RDF version.

We already filter out non-English matches in our identifiers query:

https://github.com/related-sciences/nxontology-data/blob/e55c903ccc663b785591111f4c52208137ec0233/nxontology_data/mesh/queries/identifiers.rq#L14-L18

But we do not filter them out of our synonyms table.

dhimmel commented 1 year ago

Noting a relevant comment from https://github.com/HHS/meshrdf/issues/165#issuecomment-842660807:

Our string literals also use a language code, which I believe is likely to be the issue in your case. In any case, despite the clean-up effort in 2019, there are definitely terms still in MeSH containing parentheses in their labels - T082417 is one example.

I can query for it by using a string literal, but even though we do not have any alternate languages of MeSH, we still use language-typed string literals to allow for this in future. That is likely the issue you have run into. Here is an example query:

PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>

SELECT ?term
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  ?term a meshv:Term .
  ?term meshv:prefLabel "3-methyl-s-triazolo(3,4-a)phthalazine"@en .
}

Without the @en after the label, it would not match.
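
To illustrate why the tag matters, here is a minimal Python sketch (not part of MeSH RDF or this repository). The `Literal` class is a hypothetical stand-in for an RDF literal, and comparing the tag by exact string equality is a simplification. The point is only that a literal's language tag takes part in the comparison, so a plain string does not equal the same string tagged `@en`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal: lexical value plus language tag."""

    value: str
    lang: str = ""  # empty string means no language tag, mirroring SPARQL lang()


# The label as stored in MeSH RDF, tagged as English.
stored = Literal("3-methyl-s-triazolo(3,4-a)phthalazine", "en")

# The same characters, once without and once with the language tag.
plain = Literal("3-methyl-s-triazolo(3,4-a)phthalazine")
tagged = Literal("3-methyl-s-triazolo(3,4-a)phthalazine", "en")

plain_matches = stored == plain    # False: the language tags differ
tagged_matches = stored == tagged  # True: value and tag both agree
```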

Alternative languages for D002493 appear to have been added by https://github.com/HHS/meshrdf/pull/41 / https://github.com/HHS/meshrdf/commit/d5f8ef931fd0804e217ea9d8686f644c267fb209. See also https://github.com/HHS/meshrdf/commit/e47dbbc12c8cc9bec9ffb1bba7ab2a171646f336 and https://github.com/HHS/meshrdf/commit/33208c1f6a2dd054daae5202ce69a4ee616938bc.

dhimmel commented 1 year ago

Here's a query to find all string literals with a language other than English (literals with no language tag are excluded):

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2023>
WHERE {
  ?mesh_uri ?predicate ?mesh_literal.
  BIND (lang(?mesh_literal) AS ?mesh_literal_lang).
  FILTER (isLiteral(?mesh_literal)).
  # FILTER (datatype(?mesh_literal) = xsd:string)
  # FILTER (!isBlank(?mesh_literal_lang)).
  FILTER (?mesh_literal_lang != "").
  FILTER (!langMatches(?mesh_literal_lang, "EN")).
}

For 2023 MeSH, this returns:

| mesh_uri | predicate | mesh_literal | mesh_literal_lang |
|---|---|---|---|
| mesh2023:D002493 | rdfs:label | Centrala nervsystemets sjukdomar | sv |
| mesh2023:D002493 | rdfs:label | Choroby OUN | pl |
| mesh2023:D002493 | rdfs:label | Doenças do Sistema Nervoso Central | pt |
| mesh2023:D002493 | rdfs:label | Enfermedades del Sistema Nervioso Central | es |
| mesh2023:D002493 | rdfs:label | Keskushermoston sairaudet | fi |
| mesh2023:D002493 | rdfs:label | Maladie du système nerveux central | fr |
| mesh2023:D002493 | rdfs:label | Malattie del sistema nervoso centrale | it |
| mesh2023:D002493 | rdfs:label | SREDIŠNJI ŽIVČANI SUSTAV, BOLESTI | hr |
| mesh2023:D002493 | rdfs:label | Sykdommer i sentralnervesystemet | no |
| mesh2023:D002493 | rdfs:label | Zentralnervensystemkrankheiten | de |
| mesh2023:D002493 | rdfs:label | Ziekte, centraalzenuwstelsel- | nl |
| mesh2023:D002493 | rdfs:label | nemoci centrálního nervového systému | cs |
| mesh2023:D002493 | rdfs:label | НЕРВНОЙ СИСТЕМЫ ЦЕНТРАЛЬНОЙ БОЛЕЗНИ | ru |
| mesh2023:D002493 | rdfs:label | 中枢神経系疾患 | ja |

So it looks like only D002493 has non-English labels. Furthermore, the following query returns no results, so we know that every rdfs:label, meshv:prefLabel, and meshv:altLabel literal has a language tag set:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2023>
WHERE {
  ?mesh_uri rdfs:label|meshv:prefLabel|meshv:altLabel ?mesh_literal.
  BIND (lang(?mesh_literal) AS ?mesh_literal_lang).
  FILTER (?mesh_literal_lang = "").
}

Noting the documentation for the SPARQL lang function:

simple literal LANG (literal ltrl)

Returns the language tag of ltrl, if it has one. It returns "" if ltrl has no language tag. Note that the RDF data model does not include literals with an empty language tag.
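
As a rough illustration of how the two language FILTERs in the first query of this comment interact, here is a small Python sketch. `lang_matches` and `is_non_english` are hypothetical helpers that approximate SPARQL's langMatches with RFC 4647 basic filtering; they are not code from this repository. The `en-US` tag is made up for the example and does not appear in the MeSH results.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Approximate SPARQL langMatches using RFC 4647 basic filtering."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        # The wildcard range matches any literal that has a language tag.
        return tag != ""
    # Match the range exactly, or as a prefix followed by a subtag separator.
    return tag == lang_range or tag.startswith(lang_range + "-")


def is_non_english(tag: str) -> bool:
    """Mirror the two FILTERs: tag is non-empty and does not match "EN"."""
    return tag != "" and not lang_matches(tag, "EN")


# "" stands for a literal without a language tag, as returned by lang().
tags = ["en", "en-US", "sv", "ja", ""]
non_english = [t for t in tags if is_non_english(t)]
```

Because an untagged literal does not match "EN" either, the separate check against "" is what keeps untagged literals out of the results.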

dhimmel commented 1 year ago

Given that only D002493 (Central Nervous System Diseases) has non-English language labels and that this appears to have been a proof of concept with no expanded implementation on the MeSH roadmap, we will filter labels to English only in our SPARQL queries and ignore all non-English labels.

Non-English labels result in a many-labels-to-one-identifier mapping, which can be confusing in output datasets.
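
A minimal Python sketch of that mapping problem follows. The `labels_by_identifier` helper is hypothetical, the non-English rows are a subset of the query results earlier in this thread, and the English row assumes the heading name carries an `en` tag.

```python
from collections import defaultdict

# (identifier, label, language tag) rows for a single MeSH descriptor.
rows = [
    ("D002493", "Central Nervous System Diseases", "en"),  # assumed en tag
    ("D002493", "Zentralnervensystemkrankheiten", "de"),
    ("D002493", "Choroby OUN", "pl"),
    ("D002493", "中枢神経系疾患", "ja"),
]


def labels_by_identifier(rows, english_only):
    """Group labels per identifier, optionally keeping only English ones."""
    grouped = defaultdict(list)
    for identifier, label, lang in rows:
        if english_only and lang != "en":
            continue
        grouped[identifier].append(label)
    return dict(grouped)


unfiltered = labels_by_identifier(rows, english_only=False)  # four labels
filtered = labels_by_identifier(rows, english_only=True)     # one label
```

With the English-only filter, each identifier in this sample maps to a single label, which is the shape the output datasets expect.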