dhimmel closed 1 year ago
Noting a relevant comment from https://github.com/HHS/meshrdf/issues/165#issuecomment-842660807:
Our string literals also use a language code, which I believe is likely to be the issue in your case. In any case, despite the clean-up effort in 2019, there are definitely terms still in MeSH containing parentheses in their labels - T082417 is one example.
I can query for it by using a string literal, but even though we do not have any alternate languages of MeSH, we still use language-typed string literals to allow for this in future. That is likely the issue you have run into. Here is an example query:
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT ?term
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  ?term a meshv:Term .
  ?term meshv:prefLabel "3-methyl-s-triazolo(3,4-a)phthalazine"@en .
}
Without the @en after the label, it would not match.
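To illustrate why the tag matters: in RDF, a language-tagged literal and a plain literal with the same text are different terms, so an exact-match triple pattern only succeeds when both the text and the tag agree. A minimal Python sketch of that comparison follows. `Literal` and `same_term` are hypothetical stand-ins for illustration, not any library's API, and the sketch ignores datatypes and tag case normalization.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal: text plus optional language tag."""
    value: str
    lang: Optional[str] = None


def same_term(a: Literal, b: Literal) -> bool:
    """Exact term match: the text and the language tag must both be equal."""
    return a.value == b.value and a.lang == b.lang


label = "3-methyl-s-triazolo(3,4-a)phthalazine"
stored = Literal(label, "en")  # how the label is stored, per the quoted comment
plain = Literal(label)         # what a query without @en asks for
tagged = Literal(label, "en")  # what a query with @en asks for

print(same_term(stored, plain))   # False
print(same_term(stored, tagged))  # True
```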
Alternative languages for D002493 appear to have been added by https://github.com/HHS/meshrdf/pull/41 / https://github.com/HHS/meshrdf/commit/d5f8ef931fd0804e217ea9d8686f644c267fb209. See also https://github.com/HHS/meshrdf/commit/e47dbbc12c8cc9bec9ffb1bba7ab2a171646f336 and https://github.com/HHS/meshrdf/commit/33208c1f6a2dd054daae5202ce69a4ee616938bc.
Here's a query to find all string literals with a language other than English (excludes null language):
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2023>
WHERE {
  ?mesh_uri ?predicate ?mesh_literal.
  BIND (lang(?mesh_literal) AS ?mesh_literal_lang).
  FILTER (isLiteral(?mesh_literal)).
  # FILTER (datatype(?mesh_literal) = xsd:string)
  # FILTER (!isBlank(?mesh_literal_lang)).
  FILTER (?mesh_literal_lang != "").
  FILTER (!langMatches(?mesh_literal_lang, "EN")).
}
For 2023 MeSH, this returns:
mesh_uri | predicate | mesh_literal | mesh_literal_lang |
---|---|---|---|
mesh2023:D002493 | rdfs:label | Centrala nervsystemets sjukdomar | sv |
mesh2023:D002493 | rdfs:label | Choroby OUN | pl |
mesh2023:D002493 | rdfs:label | Doenças do Sistema Nervoso Central | pt |
mesh2023:D002493 | rdfs:label | Enfermedades del Sistema Nervioso Central | es |
mesh2023:D002493 | rdfs:label | Keskushermoston sairaudet | fi |
mesh2023:D002493 | rdfs:label | Maladie du système nerveux central | fr |
mesh2023:D002493 | rdfs:label | Malattie del sistema nervoso centrale | it |
mesh2023:D002493 | rdfs:label | SREDIŠNJI ŽIVČANI SUSTAV, BOLESTI | hr |
mesh2023:D002493 | rdfs:label | Sykdommer i sentralnervesystemet | no |
mesh2023:D002493 | rdfs:label | Zentralnervensystemkrankheiten | de |
mesh2023:D002493 | rdfs:label | Ziekte, centraalzenuwstelsel- | nl |
mesh2023:D002493 | rdfs:label | nemoci centrálního nervového systému | cs |
mesh2023:D002493 | rdfs:label | НЕРВНОЙ СИСТЕМЫ ЦЕНТРАЛЬНОЙ БОЛЕЗНИ | ru |
mesh2023:D002493 | rdfs:label | 中枢神経系疾患 | ja |
So it looks like only D002493 has non-English labels. Furthermore, the following query returns no results, so we know that all rdfs:label, meshv:prefLabel, and meshv:altLabel values have a language tag set:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2023>
WHERE {
  ?mesh_uri rdfs:label|meshv:prefLabel|meshv:altLabel ?mesh_literal.
  BIND (lang(?mesh_literal) AS ?mesh_literal_lang).
  FILTER (?mesh_literal_lang = "").
}
Noting the documentation for the SPARQL lang function:
simple literal LANG (literal ltrl)
Returns the language tag of ltrl, if it has one. It returns "" if ltrl has no language tag. Note that the RDF data model does not include literals with an empty language tag.
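The filters in the two queries above rely on this behavior of lang, plus langMatches, which compares a tag against a language range case-insensitively and also accepts subtags. Here is a rough Python sketch of those semantics as I understand them (basic filtering in the RFC 4647 sense). The function names and the sample tags, including `en-US`, are hypothetical and only for illustration.

```python
from typing import Optional


def lang(tag: Optional[str]) -> str:
    """Mimic SPARQL lang(): the language tag, or "" when the literal has none."""
    return tag or ""


def lang_matches(tag: str, lang_range: str) -> bool:
    """Mimic SPARQL langMatches(): case-insensitive match on the tag or its prefix."""
    if lang_range == "*":
        return tag != ""  # "*" matches any literal that has a language tag
    tag, lang_range = tag.lower(), lang_range.lower()
    return tag == lang_range or tag.startswith(lang_range + "-")


def kept_by_non_english_query(tag: Optional[str]) -> bool:
    """Combine the two language filters: a tag is present and it is not English."""
    t = lang(tag)
    return t != "" and not lang_matches(t, "EN")


for sample in ["en", "en-US", "sv", "ja", None]:
    print(sample, kept_by_non_english_query(sample))
```

Under these assumptions only the `sv` and `ja` samples are kept, which is consistent with English-tagged and untagged literals being absent from the results table.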
Given that only D002493 (Central Nervous System Diseases) has non-English labels and that this appears to have been a proof of concept with no expanded implementation on the MeSH roadmap, we will filter labels to English only in our SPARQL queries and ignore all non-English labels.
Non-English labels result in a many-labels-to-one-identifier mapping, which can be confusing in output datasets.
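As a small illustration of the mapping problem, here is a Python sketch that groups labels by identifier with and without an English-only filter. The non-English rows are copied from the query results above. The English row assumes the descriptor name carries an `en` tag, and `labels_by_id` is a hypothetical helper, not code from our repository.

```python
from collections import defaultdict

# (identifier, label, language tag) rows; the "en" tag on the first row is an assumption.
rows = [
    ("D002493", "Central Nervous System Diseases", "en"),
    ("D002493", "Choroby OUN", "pl"),
    ("D002493", "Zentralnervensystemkrankheiten", "de"),
    ("D002493", "中枢神経系疾患", "ja"),
]


def is_english(tag: str) -> bool:
    """True for "en" and for subtags such as "en-US" (case-insensitive)."""
    tag = tag.lower()
    return tag == "en" or tag.startswith("en-")


def labels_by_id(rows, english_only: bool) -> dict:
    """Group labels by identifier, optionally keeping only English-tagged ones."""
    grouped = defaultdict(list)
    for mesh_id, label, tag in rows:
        if english_only and not is_english(tag):
            continue
        grouped[mesh_id].append(label)
    return dict(grouped)


print(labels_by_id(rows, english_only=False))  # four labels for one identifier
print(labels_by_id(rows, english_only=True))   # a single English label
```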
From the June 18, 2015 MeSH RDF release notes:
We already filter out non-English matches in our identifiers query:
https://github.com/related-sciences/nxontology-data/blob/e55c903ccc663b785591111f4c52208137ec0233/nxontology_data/mesh/queries/identifiers.rq#L14-L18
However, we do not apply the same filter in our synonyms table.