monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/

Better handling of ORDO labels with `en` language tags #512

Open twhetzel opened 7 months ago

twhetzel commented 7 months ago

While working on the RD subset update based on ORDO information, I noticed there may be some duplicate labels due to the presence of @en on some ORDO labels. See comment https://github.com/monarch-initiative/mondo-ingest/pull/510/files#r1588599159

matentzn commented 6 months ago

I had already reported this to Orphanet a while back: https://github.com/Orphanet/ORDO/issues/33

For now, @joeflack4, just remove all labels with @en on them I think?

joeflack4 commented 6 months ago

Sure. This one is low priority, so I don't mind deferring discussion of this until later. Just trying to think through how I'd do this. Here are my initial thoughts.

I suppose what I'd do is add another SPARQL query to the component/ordo.owl goal, where I delete either all the labels that contain @ (that way it will catch any language-tagged labels that might incidentally slip in), or just the ones with @en.

Note to self: If I do this via a SPARQL query, then I'll probably want to update the script for #510 so that it also runs this query before running the Python script, for DRYness, rather than the way #510 is currently working, where it removes the @en via Python code.

twhetzel commented 6 months ago

The labels with the @en language tag can be removed from ORDO using this SPARQL query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

DELETE {
  ?class rdfs:label ?label .
}
WHERE {
  ?class rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}

and will remove the English language-tagged labels from:

- Disease
- Malformation syndrome
- Biological anomaly
- Morphological anomaly
- Clinical syndrome
- Particular clinical situation in a disease or syndrome

(These are the same classes as in the earlier comment from Nico referencing this open ORDO ticket: https://github.com/Orphanet/ORDO/issues/33)

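
To make the effect of the `FILTER(LANG(?label) = "en")` clause concrete, here is a rough plain-Python model of what the DELETE does. It does not run SPARQL or touch ORDO. The `Label` type, the `drop_lang_labels` helper, and the `EX:` identifiers are all made up for illustration.

```python
from typing import NamedTuple, Optional


class Label(NamedTuple):
    """One rdfs:label triple: subject, lexical form, optional language tag."""
    subject: str
    text: str
    lang: Optional[str] = None  # None models a plain literal with no tag


def drop_lang_labels(labels: list[Label], lang: str = "en") -> list[Label]:
    """Keep every label whose language tag is not exactly `lang`.

    Models FILTER(LANG(?label) = "en"): only literals tagged with that
    exact string match, so untagged labels are left in place.
    """
    return [lab for lab in labels if lab.lang != lang]


# Hypothetical data: one class carrying both a plain and an @en label.
labels = [
    Label("EX:0001", "Disease"),
    Label("EX:0001", "disease", "en"),
    Label("EX:0002", "Some other term"),
]

cleaned = drop_lang_labels(labels)
```

After this, `cleaned` holds only the two untagged labels, so the duplicate on `EX:0001` is gone. If tags like `en-GB` or differently cased tags ever show up, SPARQL's `langMatches(LANG(?label), "en")` would be the broader test. The sketch above sticks to the exact-match behavior of the query as written.
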
joeflack4 commented 6 months ago

Oh cool I didn't even think / know about LANG(?label).

@twhetzel Looks like probably the only work needed then is to create this query and then add a single line to the component goal. It's marked as low priority but maybe this will take like 10 min, lemme know if / when you want me to handle it.