Open dhimmel opened 1 year ago
Here's a visualization by @ravwojdyla on why knowing close/exact (or equivalent/related, green/red in visualization) could help refine mappings to be bijective in certain situations like:
Also noting how an axiom appears in the EFO OWL source:
<owl:Axiom>
<owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000640"/>
<owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
<owl:annotatedTarget>Orphanet:319298</owl:annotatedTarget>
<oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
</owl:Axiom>
@dhimmel I managed to recreate the database cross reference section that appears on the website by using axioms from the .owl file for EFO:0000479 and EFO:0000640. However, I noticed that for EFO:0000640, there are two extra xrefs, MeSH:C538614 and UMLS:C2931899, that are not displayed on the website but are present in the xrefs query.
Do you know any examples for which it's more difficult to retrieve axioms?
I also noticed that sometimes the axiom has multiple oboInOwl:source values and sometimes a single cross reference has multiple axioms. For example, for ICD9:238.71 in EFO:0000479:
<owl:Axiom>
<owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
<owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
<owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
<oboInOwl:source>DOID:2224</oboInOwl:source>
<oboInOwl:source>EFO:0000479</oboInOwl:source>
<oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
<oboInOwl:source>MONDO:i2s</oboInOwl:source>
</owl:Axiom>
<owl:Axiom>
<owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
<owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
<owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
<oboInOwl:source>DOID:2224</oboInOwl:source>
<oboInOwl:source>EFO:0000479</oboInOwl:source>
<oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
<oboInOwl:source>i2s</oboInOwl:source>
</owl:Axiom>
It looks like the website uses the last source value to describe the cross reference. The ordering of these sources seems to be alphabetical, though. I'm not sure what approach we should use if there is more than one source. Do you have any suggestions?
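To make the structure of these axioms concrete, here is a minimal sketch that parses the two ICD9:238.71 axioms with Python's standard `xml.etree.ElementTree` and keeps every oboInOwl:source value instead of picking one. The `rdf:RDF` wrapper and the names `OWL_SNIPPET` and `parse_axioms` are assumptions added for illustration; they are not part of the EFO file or of nxontology-data.

```python
import xml.etree.ElementTree as ET

# Namespaces as they appear in the EFO OWL (RDF/XML) excerpts above.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}
RDF_RESOURCE = f"{{{NS['rdf']}}}resource"

# The two axioms quoted above, wrapped in rdf:RDF (an assumption) so that
# the snippet parses on its own.
OWL_SNIPPET = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
<owl:Axiom>
<owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
<owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
<owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
<oboInOwl:source>DOID:2224</oboInOwl:source>
<oboInOwl:source>EFO:0000479</oboInOwl:source>
<oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
<oboInOwl:source>MONDO:i2s</oboInOwl:source>
</owl:Axiom>
<owl:Axiom>
<owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
<owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
<owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
<oboInOwl:source>DOID:2224</oboInOwl:source>
<oboInOwl:source>EFO:0000479</oboInOwl:source>
<oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
<oboInOwl:source>i2s</oboInOwl:source>
</owl:Axiom>
</rdf:RDF>
"""


def parse_axioms(xml_text):
    """Return one record per owl:Axiom, keeping every oboInOwl:source value."""
    root = ET.fromstring(xml_text)
    records = []
    for axiom in root.findall("owl:Axiom", NS):
        records.append(
            {
                "source_uri": axiom.find("owl:annotatedSource", NS).get(RDF_RESOURCE),
                "property_uri": axiom.find("owl:annotatedProperty", NS).get(RDF_RESOURCE),
                "xref": axiom.find("owl:annotatedTarget", NS).text,
                "axiom_sources": [s.text for s in axiom.findall("oboInOwl:source", NS)],
            }
        )
    return records


axioms = parse_axioms(OWL_SNIPPET)
for record in axioms:
    print(record["xref"], record["axiom_sources"])
```

The two records differ only in their last source (MONDO:i2s versus i2s), which is the duplication described above.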
Here is a query I used to retrieve the axioms from the owl file:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
SELECT ?efo_id ?xref (MAX(?source) AS ?axiom)
WHERE {
?axiom_element rdf:type owl:Axiom ;
owl:annotatedSource ?annotatedSource ;
owl:annotatedProperty ?annotatedProperty ;
owl:annotatedTarget ?xref ;
oboInOwl:source ?source .
FILTER(?annotatedProperty = oboInOwl:hasDbXref)
BIND( REPLACE( STR(?annotatedSource), "^http.+/([^:]+)_(.+)$", "$1:$2" ) AS ?efo_id )
}
GROUP BY ?axiom_element ?efo_id ?annotatedProperty ?xref
And here is also a code snippet that I used in a jupyter notebook to retrieve and compare axioms:
Nice work @bfoltyn.
I think we'll want to preserve all sources provided by axioms rather than taking the max. So the output would be keyed on ?efo_id ?xref ?axiom_source. Could also consider making ?axiom_source optional such that we still match xrefs without axioms.
sometimes a single cross reference has multiple axioms
Hmm, so it appears that an axiom can provide multiple sources for a cross-reference. I am not sure why the (EFO:0000479 subject, ICD9:238.71 object, oboInOwl:hasDbXref predicate) triplet has multiple axioms, which duplicate all but one source across them. Perhaps @zoependlington or @matentzn would know?
I think we'll want to preserve all sources provided by axioms rather than taking the max.
Absolutely, the order has no meaning at all.
sometimes a single cross-reference has multiple axioms
This is due to the fact that the cross references have not been normalised.
:a :hasDbXref :b {source: "X"}
:a :hasDbXref :b {source: "Y"}
Is allowed in the OWL data model (which is good in some cases, think of provenance!).
We have a special method in mondo called "normalisation" that turns this into:
:a :hasDbXref :b {source: "X", source: "Y"}
But this is not at all consistently applied to all ontologies.
TLDR: There is no requirement for normalising axiom annotations, so you have to be able to deal with the unnormalised case!
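As an illustration of dealing with the unnormalised case, here is a minimal sketch that merges axioms sharing the same (subject, predicate, object) into one record holding the union of their sources. This is not Mondo's actual normalisation method; the function name `normalise` and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def normalise(axiom_rows):
    """Merge rows with the same (subject, predicate, object) into one sorted source list."""
    merged = defaultdict(set)
    for subject, predicate, obj, source in axiom_rows:
        merged[(subject, predicate, obj)].add(source)
    return {triple: sorted(sources) for triple, sources in merged.items()}


# The unnormalised example from above: two axioms on the same triple.
rows = [
    (":a", ":hasDbXref", ":b", "X"),
    (":a", ":hasDbXref", ":b", "Y"),
]
normalised = normalise(rows)
print(normalised)
```

Because sources are collected into a set, already-normalised input and input with duplicated sources give the same result.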
TLDR: There is no requirement for normalising axiom annotations, so you have to be able to deal with the unnormalised case!
Thanks @matentzn !
I think we'll want to preserve all sources provided by axioms rather than taking the max. So the output would be keyed on ?efo_id ?xref ?axiom_source. Could also consider making ?axiom_source optional such that we still match xrefs without axioms.
@dhimmel If we want to preserve all sources, I think that we can use the following query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
SELECT ?efo_id ?xref ?axiom_source
WHERE {
?axiom_element rdf:type owl:Axiom ;
owl:annotatedSource ?source ;
owl:annotatedProperty oboInOwl:hasDbXref ;
owl:annotatedTarget ?xref ;
OPTIONAL { ?axiom_element oboInOwl:source ?axiom_source }
BIND( REPLACE( STR(?source), "^http.+/([^:]+)_(.+)$", "$1:$2" ) AS ?efo_id )
}
GROUP BY ?efo_id ?xref ?axiom_source
For efo_id=EFO:0000479 and xref=ICD9:238.71, the results are:
efo_id | xref | axiom_source |
---|---|---|
EFO:0000479 | ICD9:238.71 | DOID:2224 |
EFO:0000479 | ICD9:238.71 | EFO:0000479 |
EFO:0000479 | ICD9:238.71 | MONDO:equivalentTo |
EFO:0000479 | ICD9:238.71 | MONDO:i2s |
EFO:0000479 | ICD9:238.71 | i2s |
About the optional source, there are 46 rows with null axiom_source compared to 117675 where it has a value. Sometimes axioms can have attributes other than oboInOwl:source, like oboInOwl:hasDbXref or skos:closeMatch. Should we keep the details of which predicates are used within axioms? Or is just using oboInOwl:source ok?
@dhimmel For mondo:exactMatch and mondo:closeMatch, a slightly modified version of the query you used in https://github.com/EBISPOT/efo/issues/935 should work:
PREFIX mondo: <http://purl.obolibrary.org/obo/mondo#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT ?efo_id ?efo_uri ?predicate_id ?match ?predicate_uri
WHERE {
VALUES ?predicate_uri {mondo:closeMatch mondo:exactMatch}
?efo_uri ?predicate_uri ?match
BIND( REPLACE( STR(?efo_uri), "^http.+/([^:]+)_(.+)$", "$1:$2" ) AS ?efo_id )
BIND( REPLACE( STR(?predicate_uri), "^http://purl.obolibrary.org/obo/mondo#(.+)$", "$1" ) AS ?predicate_id )
}
I validated the results against the OLS API and they're correct. Here's a snippet that compares mondo:exactMatch and mondo:closeMatch:
The current results return URLs like http://purl.obolibrary.org/obo/Orphanet_98576. Should we keep this format or transform them?
@dhimmel Lastly, what format should the axioms, mondo:exactMatch and mondo:closeMatch have in the output JSON file?
The current results return URLs like http://purl.obolibrary.org/obo/Orphanet_98576. Should we keep this format or transform them?
That URL is the class URI and we often assign it to a variable with a _uri suffix. The corresponding CURIE (compact URI) version is Orphanet:98576 and we often use an _id suffix for this. The SPARQL query can include both the URI and CURIE as separate output fields.
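For reference, the URI-to-CURIE conversion done by the REPLACE call in the queries above can be sketched in Python with the same regular expression. The function name `uri_to_curie` is an assumption for illustration; Python's `re` writes the backreferences as `\1:\2` where SPARQL uses `$1:$2`.

```python
import re

# Same pattern as the SPARQL REPLACE in the queries above.
URI_PATTERN = re.compile(r"^http.+/([^:]+)_(.+)$")


def uri_to_curie(uri):
    """Turn a class URI ending in PREFIX_LOCALID into PREFIX:LOCALID.

    URIs that do not match the pattern are returned unchanged.
    """
    return URI_PATTERN.sub(r"\1:\2", uri)


orphanet_uri = "http://purl.obolibrary.org/obo/Orphanet_98576"
orphanet_id = uri_to_curie(orphanet_uri)
print(orphanet_uri, "->", orphanet_id)
```

Note that URIs without an underscore in the final path segment, such as http://identifiers.org/meddra/10008958, do not match this pattern and pass through untouched, so they would need separate handling.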
What we are after is for each oboInOwl:hasDbXref:
A tabular output from a SPARQL query is the ideal first output here. Not sure if you can fit everything in one query/table or you need multiple. I leave that up to your investigation.
To complicate things further (hehe), we should consider whether the Python oaklib, which can extract mappings to the SSSOM format, is a better approach here than writing our own SPARQL queries. SSSOM stands for Simple Standard for Sharing Ontological Mappings (publication).
Possibly best to transition to PRs at this point to enable easier review of the SPARQL queries. PR can be draft and incomplete.
@dhimmel regarding including xref_sources and mapping_properties in node data, I have a couple of ideas:
Option 1: xref_properties field with a list with the following schema:
xref_id: str
sources: list[str]
mapping_properties: list[str]
Option 2: Separate xref_sources and mapping_properties
xref_sources schema:
xref_id: str
axiom_source: str
mapping_properties schema:
xref_id: str
axiom_source: str
Option 3: Second option inside xref_properties field:
Please let me know your thoughts on these options, or if there are any other ideas you have.
I like option 1. Will there be a slight imprecision where one source has one property and another source has a conflicting property? For example, an xref being classified as both an exactMatch and closeMatch from different resources?
Just FYI: what you are trying to do here is much much harder than you think right now - and not necessary.
EFO is not a good source for mappings, because it mixes old (ancient) with new (harmonised) xrefs, and makes strange distinctions like "mondo:exactMatch" (which is not even a thing in Mondo). What you should do instead is:
Just my two cents as someone driving by :D
I like option 1. Will there be a slight imprecision where one source has one property and another source has a conflicting property? For example, an xref being classified as both an exactMatch and closeMatch from different resources?
@dhimmel There are cases where an xref is classified as both exactMatch and closeMatch. For example, in EFO:0000095 the xref meddra:10008958 has both mondo:closeMatch and skos:exactMatch:
import pandas as pd

# Load the mapping-properties table and keep the rows for EFO:0000095 / meddra:10008958.
(
    pd.read_json(
        "https://github.com/related-sciences/nxontology-data/raw/output/efo/efo_otar_profile_mapping_properties.json.gz"
    ).pipe(
        lambda df: df[
            (df["efo_id"] == "EFO:0000095") & (df["xref_id"] == "meddra:10008958")
        ]
    )
)
efo_id | xref_id | mapping_property_id | efo_uri | xref_uri | mapping_property_uri |
---|---|---|---|---|---|
EFO:0000095 | meddra:10008958 | mondo:closeMatch | http://www.ebi.ac.uk/efo/EFO_0000095 | http://identifiers.org/meddra/10008958 | http://purl.obolibrary.org/obo/mondo#closeMatch |
EFO:0000095 | meddra:10008958 | skos:exactMatch | http://www.ebi.ac.uk/efo/EFO_0000095 | http://identifiers.org/meddra/10008958 | http://www.w3.org/2004/02/skos/core#exactMatch |
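One way such conflicting xrefs could be detected, sketched with the standard library only: group the mapping properties per (efo_id, xref_id) pair and flag pairs that carry both an exactMatch and a closeMatch, regardless of prefix. The function name `find_conflicts` is an assumption, and the third input row is a hypothetical non-conflicting row added only for contrast.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Return {(efo_id, xref_id): sorted properties} for xrefs labelled both exact and close."""
    props = defaultdict(set)
    for efo_id, xref_id, mapping_property_id in rows:
        props[(efo_id, xref_id)].add(mapping_property_id)
    conflicts = {}
    for key, values in props.items():
        # Compare on the local name so mondo:closeMatch and skos:closeMatch count alike.
        kinds = {value.split(":", 1)[-1] for value in values}
        if {"exactMatch", "closeMatch"} <= kinds:
            conflicts[key] = sorted(values)
    return conflicts


rows = [
    ("EFO:0000095", "meddra:10008958", "mondo:closeMatch"),  # from the table above
    ("EFO:0000095", "meddra:10008958", "skos:exactMatch"),  # from the table above
    ("EFO:0000000", "example:1", "skos:exactMatch"),  # hypothetical, no conflict
]
conflicts = find_conflicts(rows)
print(conflicts)
```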
There are 102 cases like this:
what you are trying to do here is much much harder than you think right now - and not necessary
Thanks @matentzn for these insights. I'm looking forward to exploring the SSSOM Mondo mappings combined with semra to convert them to EFO-keyed mappings. For now I think it makes sense to continue our current approach, since we're close to having it complete and being evaluable, at least as a good reference for the SSSOM/Mondo alternative.
There are cases where xref is classified as both exactMatch and closeMatch
@bfoltyn I think we could make exactMatch higher priority than closeMatch as an easy way to label an xref as either exact or close.
@bfoltyn I think we could make exactMatch higher priority than closeMatch as an easy way to label an xref as either exact or close.
@dhimmel What do you mean by higher priority? I thought we would include all mapping properties as a list in the node data, as in option 1 in comment https://github.com/related-sciences/nxontology-data/issues/18#issuecomment-1761612946. Are you suggesting we include only one mapping property value, exactMatch or closeMatch? Should we also include mondo: or skos:?
What do you mean by higher priority?
I think it might be best if we simplify/aggregate the xref metadata that goes into the nxontology node attribute data to something like (written here in YAML for ease):
xrefs:
  - xref_id: meddra:10008958
    xref_uri: http://identifiers.org/meddra/10008958
    relation: skos:exactMatch  # converting mondo:exactMatch to skos:exactMatch if applicable
    sources: [MONDO:equivalentTo, DOID:2224]  # haven't cleaned this up yet
With this design, an xref_id would only appear once per node and all other metadata would be aggregated.
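A minimal sketch of that aggregation, assuming the input is a flat list of (xref_id, xref_uri, source) rows like the output of the sources query. The function name `aggregate_xrefs` and the row layout are assumptions for illustration, and the relation field is left out here to keep the example short.

```python
from collections import defaultdict


def aggregate_xrefs(source_rows):
    """Collapse (xref_id, xref_uri, source) rows into one entry per xref_id."""
    grouped = defaultdict(lambda: {"xref_uri": None, "sources": set()})
    for xref_id, xref_uri, source in source_rows:
        entry = grouped[xref_id]
        entry["xref_uri"] = xref_uri
        entry["sources"].add(source)
    return [
        {
            "xref_id": xref_id,
            "xref_uri": entry["xref_uri"],
            "sources": sorted(entry["sources"]),
        }
        for xref_id, entry in sorted(grouped.items())
    ]


# Rows matching the YAML example above: one xref with two sources.
source_rows = [
    ("meddra:10008958", "http://identifiers.org/meddra/10008958", "MONDO:equivalentTo"),
    ("meddra:10008958", "http://identifiers.org/meddra/10008958", "DOID:2224"),
]
xrefs = aggregate_xrefs(source_rows)
print(xrefs)
```

Sorting the sources makes the output deterministic, which matters little semantically since the order of sources carries no meaning.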
xref metadata that goes into the nxontology node attribute data to something like (written here in YAML for ease)
@dhimmel Currently the xrefs field in the node data is a list of strings. Do we want to replace it with the example you suggested? The reason I suggested introducing a new field with these properties was to avoid a breaking change.
relation: skos:exactMatch # converting mondo:exactMatch to skos:exactMatch if applicable
@dhimmel Should we use the following logic?
If skos:exactMatch is in the mapping properties, we set the value to skos:exactMatch
Otherwise, if mondo:exactMatch is in the mapping properties, we set the value to skos:exactMatch
Otherwise, if skos:closeMatch is in the mapping properties, we set the value to skos:closeMatch
Otherwise, if mondo:closeMatch is in the mapping properties, we set the value to skos:closeMatch
Otherwise, null
Currently xrefs field in the node data is a list of strings. Do we want to replace it with the example you suggested
We could either replace it or create a new field like xref_details. Slightly leaning towards a new field.
Should we use the following logic?
That logic sounds good. If there are other interesting values in the otherwise set, we can support those later.
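The agreed logic could be written roughly as below. The function name `choose_relation` is an assumption for illustration; the priority of exactMatch over closeMatch and the null fallback follow the list above.

```python
def choose_relation(mapping_properties):
    """Pick a single relation for an xref, preferring exactMatch over closeMatch.

    mondo: predicates are converted to their skos: counterparts; anything
    else yields None.
    """
    props = set(mapping_properties)
    if props & {"skos:exactMatch", "mondo:exactMatch"}:
        return "skos:exactMatch"
    if props & {"skos:closeMatch", "mondo:closeMatch"}:
        return "skos:closeMatch"
    return None


# The EFO:0000095 / meddra:10008958 case, which has both properties.
relation = choose_relation(["mondo:closeMatch", "skos:exactMatch"])
print(relation)
```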
We could either replace it or create a new field like xref_details. Slightly leaning towards a new field.
I think we can add a new field. xref_details sounds good. Should this field also include xrefs from the xrefs query or just from mapping_properties and xref_sources?
Should this field also include xrefs from xrefs query or just from mapping_properties and xref_sources
Ideally all of them, such that a user only needs xref_details.
@dhimmel I've noticed that sometimes sources in xref_details contains null. For example, in MONDO:0020507:
"xref_details": [
{
"xref_id": "DOID:0070374",
"relation": "skos:exactMatch",
"sources": [
null
]
},
Should we make axiom_source required in the axiom_sources query? https://github.com/related-sciences/nxontology-data/blob/fb93d9e7bffd3ae5bb773cce51b127f31cb5c14b/nxontology_data/efo/queries/xref_sources.rq#L12
Another way would be to filter out null values after the aggregation in the EfoProcessor.get_xref_details method. https://github.com/related-sciences/nxontology-data/blob/fb93d9e7bffd3ae5bb773cce51b127f31cb5c14b/nxontology_data/efo/efo.py#L256-L279
Should we make axiom_source required in the axiom_sources query?
This is the solution I prefer unless you advocate for a different one. Potentially leave a comment in that query that OPTIONAL will include extra results where axiom_source is missing.
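For comparison, the alternative mentioned earlier (filtering null values after aggregation) might look like the sketch below. The helper name `drop_null_sources` is hypothetical and this is not the actual EfoProcessor.get_xref_details code; it only shows the effect on an entry like the MONDO:0020507 one.

```python
def drop_null_sources(xref_details):
    """Return copies of the xref_details entries with None removed from sources."""
    return [
        {**detail, "sources": [s for s in detail["sources"] if s is not None]}
        for detail in xref_details
    ]


# The entry shown for MONDO:0020507, with JSON null written as Python None.
xref_details = [
    {"xref_id": "DOID:0070374", "relation": "skos:exactMatch", "sources": [None]},
]
cleaned = drop_null_sources(xref_details)
print(cleaned)
```

With this approach an xref without any axiom source is kept with an empty sources list, whereas making axiom_source required in the query leaves such xrefs out of that query's results entirely.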
background in https://github.com/EBISPOT/efo/issues/935
We currently extract database cross-references for EFO using the oboInOwl:hasDbXref predicate. However, MONDO is providing xrefs with greater specificity using the mondo:exactMatch and mondo:closeMatch predicates. Furthermore, there are axioms (with rdf:type owl:Axiom) that annotate oboInOwl:hasDbXref instances with values like MONDO:equivalentTo.
EFO:0000479 is a good example of a class that has all types of xrefs:
1. oboInOwl:hasDbXref without axioms
2. oboInOwl:hasDbXref with axioms
3. mondo:exactMatch and mondo:closeMatch
It would be nice to further understand the relation between 2 and 3.