EFO cross-references: classify as exact/close when possible

dhimmel commented 1 year ago

background in https://github.com/EBISPOT/efo/issues/935

We currently extract database cross-references for EFO using the oboInOwl:hasDbXref predicate. However, MONDO is providing xrefs with greater specificity using the mondo:exactMatch and mondo:closeMatch predicates. Furthermore, there are axioms (with rdf:type owl:Axiom) that annotate oboInOwl:hasDbXref instances with values like MONDO:equivalentTo.

EFO:0000479 is a good example of a class that has all types of xrefs:

1. oboInOwl:hasDbXref without axioms
2. oboInOwl:hasDbXref with axioms
3. mondo:exactMatch and mondo:closeMatch

It would be nice to further understand the relation between 2 and 3.

dhimmel commented 1 year ago

Here's a visualization by @ravwojdyla on why knowing close/exact (or equivalent/related, green/red in visualization) could help refine mappings to be bijective in certain situations like:

Also noting how an axiom appears in the EFO OWL source:

<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000640"/>
    <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
    <owl:annotatedTarget>Orphanet:319298</owl:annotatedTarget>
    <oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
</owl:Axiom>

bfoltyn commented 1 year ago

@dhimmel I managed to recreate the database cross-reference section that appears on the website by using axioms from the .owl file for EFO:0000479 and EFO:0000640. However, I noticed that for EFO:0000640 there are two extra xrefs, MeSH:C538614 and UMLS:C2931899, that are not displayed on the website but are present in the xrefs query.

Do you know any examples for which it's more difficult to retrieve axioms?

I also noticed that sometimes the axiom has multiple oboInOwl:source values and sometimes a single cross reference has multiple axioms. For example, for ICD9:238.71 in EFO:0000479:

<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
    <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
    <owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
    <oboInOwl:source>DOID:2224</oboInOwl:source>
    <oboInOwl:source>EFO:0000479</oboInOwl:source>
    <oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
    <oboInOwl:source>MONDO:i2s</oboInOwl:source>
</owl:Axiom>
<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
    <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
    <owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
    <oboInOwl:source>DOID:2224</oboInOwl:source>
    <oboInOwl:source>EFO:0000479</oboInOwl:source>
    <oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
    <oboInOwl:source>i2s</oboInOwl:source>
</owl:Axiom>

It looks like the website uses the last source value to describe the cross reference. The sources seem to be ordered alphabetically, though. I'm not sure what approach we should use if there is more than one source. Do you have any suggestions?
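For anyone who wants to inspect these axioms outside of SPARQL, here is a minimal sketch that uses only the Python standard library. It assumes the axioms are serialized as RDF/XML exactly as shown above; the `SAMPLE` string is a copy of the two ICD9:238.71 axioms wrapped in an `rdf:RDF` element so that it parses standalone, and `axiom_sources` is a hypothetical helper name, not part of nxontology-data.

```python
import xml.etree.ElementTree as ET

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}
HAS_DB_XREF = NS["oboInOwl"] + "hasDbXref"
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# Copy of the two axioms above, wrapped so the snippet is self-contained.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
    <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
    <owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
    <oboInOwl:source>DOID:2224</oboInOwl:source>
    <oboInOwl:source>EFO:0000479</oboInOwl:source>
    <oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
    <oboInOwl:source>MONDO:i2s</oboInOwl:source>
</owl:Axiom>
<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000479"/>
    <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
    <owl:annotatedTarget>ICD9:238.71</owl:annotatedTarget>
    <oboInOwl:source>DOID:2224</oboInOwl:source>
    <oboInOwl:source>EFO:0000479</oboInOwl:source>
    <oboInOwl:source>MONDO:equivalentTo</oboInOwl:source>
    <oboInOwl:source>i2s</oboInOwl:source>
</owl:Axiom>
</rdf:RDF>"""


def axiom_sources(rdf_xml: str) -> list[tuple[str, str, list[str]]]:
    """Return (subject URI, xref, sources) for each hasDbXref axiom."""
    root = ET.fromstring(rdf_xml)
    rows = []
    for axiom in root.findall("owl:Axiom", NS):
        prop = axiom.find("owl:annotatedProperty", NS)
        # Skip axioms that annotate something other than hasDbXref.
        if prop is None or prop.get(RDF_RESOURCE) != HAS_DB_XREF:
            continue
        subject = axiom.find("owl:annotatedSource", NS).get(RDF_RESOURCE)
        target = axiom.find("owl:annotatedTarget", NS).text
        sources = [s.text for s in axiom.findall("oboInOwl:source", NS)]
        rows.append((subject, target, sources))
    return rows


rows = axiom_sources(SAMPLE)
for subject, target, sources in rows:
    print(subject, target, sources)
```

Each axiom yields its own row, so the same xref appears twice here with source lists that differ only in the last entry.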

Here is a query I used to retrieve the axioms from the owl file:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

SELECT ?efo_id ?xref (MAX(?source) AS ?axiom)
WHERE {
  ?axiom_element rdf:type owl:Axiom ;
         owl:annotatedSource ?annotatedSource ;
         owl:annotatedProperty ?annotatedProperty ;
         owl:annotatedTarget ?xref ;
         oboInOwl:source ?source .

  FILTER(?annotatedProperty = oboInOwl:hasDbXref)

  BIND( REPLACE( STR(?annotatedSource), "^http.+/([^:]+)_(.+)$", "$1:$2" ) AS ?efo_id )
}
GROUP BY ?axiom_element ?efo_id ?annotatedProperty ?xref

And here is also a code snippet that I used in a jupyter notebook to retrieve and compare axioms:

Snippet

```python
# type: ignore
%load_ext autoreload
%autoreload 2
import jupyter_black

jupyter_black.load()

import pandas as pd
from nxontology_data.efo.efo import EfoProcessor

pd.set_option("display.max_colwidth", None)

efo_processor = EfoProcessor(version="v3.57.0", name="efo_otar_profile")
# efo_processor.download_owl()
rdf = efo_processor.load_rdf()

xrefs = efo_processor.run_query("xrefs", cache=False)
xrefs

axioms = efo_processor.run_query("axioms", cache=False)
axioms

cross_reference_efo_0000479 = {
    "MESH:D013920 (Orphanet:3318/e)",
    "Orphanet:3318 (MONDO:equivalentTo)",
    "EFO:0000479 (MONDO:equivalentTo)",
    "Orphanet:71493 (MONDO:relatedTo)",
    "ICDO:9962/3 (NCIT:C3407)",
    "SCTID:109994006 (MONDO:equivalentTo)",
    "UMLS:C0040028 (Orphanet:3318/e)",
    "OMIM:614521",
    "NCIT:C3407 (exact-label-match)",
    "UMLS:C0040028 (Orphanet:3318)",
    "OMIM:601977",
    "MESH:D013920 (Orphanet:3318)",
    "ONCOTREE:ET (MONDO:equivalentTo)",
    "NCIT:C3407 (MONDO:exact-label-match)",
    "MONDO:0005029",
    "ICD10:D47.3 (Orphanet:3318)",
    "GARD:0006594 (MONDO:equivalentTo)",
    "ICD9:238.71 (i2s)",
    "OMIM:187950",
    "ICD9:238.71 (MONDO:i2s)",
    "COHD:438383 (MONDO:equivalentTo)",
    "DOID:2224 (MONDO:equivalentTo)",
    "MedDRA:10015493 (Orphanet:3318)",
    "MedDRA:10015493 (Orphanet:3318/e)",
}

cross_reference_efo_000640 = {
    "UMLS:CN205129 (MONDO:equivalentTo)",
    "GARD:0009575 (shared-umls-xref)",
    "GARD:0009572 (MONDO:equivalentObsolete)",
    "ICD10:C64 (Orphanet:47044)",
    "Orphanet:47044 (OMIM:605074)",
    "Orphanet:319298 (MONDO:equivalentTo)",
    "NCIT:C6975 (MONDO:equivalentTo)",
    "GARD:0009572 (MONDO:equivalentTo)",
    "OMIM:605074 (Orphanet:47044)",
    "GARD:0009575 (MONDO:shared-umls-xref)",
    "UMLS:C1306837 (Orphanet:319298)",
    "DOID:4465 (MONDO:equivalentTo)",
    "ONCOTREE:PRCC (MONDO:equivalentTo)",
    "EFO:0000640 (MONDO:equivalentTo)",
    "SCTID:733608000 (MONDO:equivalentTo)",
    "MONDO:0017884",
    "UMLS:C1336078 (MONDO:equivalentTo)",
    "UMLS:C1306837 (Orphanet:319298/e)",
}

axioms[axioms["efo_id"] == "EFO:0000479"]

xrefs_with_axiom_efo_479 = (
    xrefs[xrefs["efo_id"] == "EFO:0000479"]
    .merge(axioms, on=["efo_id", "xref"], how="left")
    .sort_values("xref")
    .assign(
        desc=lambda df: df.apply(
            lambda row: row["xref"]
            if pd.isnull(row["axiom"])
            else f"{row['xref']} ({row['axiom']})",
            axis=1,
        )
    )
)
display(xrefs_with_axiom_efo_479)
display(set(xrefs_with_axiom_efo_479["desc"]) - cross_reference_efo_0000479)
display(cross_reference_efo_0000479 - set(xrefs_with_axiom_efo_479["desc"]))
display(set(xrefs_with_axiom_efo_479["desc"]) == cross_reference_efo_0000479)

xrefs_with_axiom_efo_640 = (
    xrefs[xrefs["efo_id"] == "EFO:0000640"]
    .merge(axioms, on=["efo_id", "xref"], how="left")
    .sort_values("xref")
    .assign(
        desc=lambda df: df.apply(
            lambda row: row["xref"]
            if pd.isnull(row["axiom"])
            else f"{row['xref']} ({row['axiom']})",
            axis=1,
        )
    )[["efo_id", "xref", "xref_prefix", "xref_accession", "axiom", "desc"]]
)
display(set(xrefs_with_axiom_efo_640["desc"]) == cross_reference_efo_000640)
display(set(xrefs_with_axiom_efo_640["desc"]) - cross_reference_efo_000640)
display(cross_reference_efo_000640 - set(xrefs_with_axiom_efo_640["desc"]))
display(xrefs_with_axiom_efo_640)
```

dhimmel commented 1 year ago

Nice work @bfoltyn.

I think we'll want to preserve all sources provided by axioms rather than taking the max. So the output would be keyed on ?efo_id ?xref ?axiom_source. Could also consider making ?axiom_source optional such that we still match xrefs without axioms.

sometimes a single cross reference has multiple axioms

Hmm, so it appears that an axiom can provide multiple sources for a cross-reference. I am not sure why the (EFO:0000479 subject, ICD9:238.71 object, oboInOwl:hasDbXref predicate) triple has multiple axioms that duplicate all but one of their sources. Perhaps @zoependlington or @matentzn would know?

matentzn commented 1 year ago

I think we'll want to preserve all sources provided by axioms rather than taking the max.

Absolutely, the order has no meaning at all.

sometimes a single cross-reference has multiple axioms

This is due to the fact that the cross references have not been normalised.

:a :hasDbXref :b {source: "X"}
:a :hasDbXref :b {source: "Y"}

Is allowed in the OWL data model (which is good in some cases, think of provenance!).

We have a special method in mondo called "normalisation" that turns this into:

:a :hasDbXref :b {source: "X", source: "Y"}

But this is not at all consistently applied to all ontologies.

TLDR: There is no requirement for normalising axiom annotations, so you have to be able to deal with the unnormalised case!
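As a rough illustration of what handling the unnormalised case can look like on the consumer side, the sketch below merges repeated annotations of the same statement into one entry with a set of sources. It is not Mondo's normalisation method; `normalise` is a hypothetical helper, and the `:a`, `:b`, `X`, `Y` values are the placeholders from the example above.

```python
from collections import defaultdict


def normalise(annotations):
    """Merge repeated (subject, xref) annotations into one set of sources each."""
    merged = defaultdict(set)
    for subject, xref, source in annotations:
        merged[(subject, xref)].add(source)
    return dict(merged)


# The same statement annotated twice, once per source (unnormalised form).
unnormalised = [
    (":a", ":b", "X"),
    (":a", ":b", "Y"),
]

normalised = normalise(unnormalised)
print(normalised)
```

Because sources are collected into a set, the result is the same whether the input arrives as one axiom with two sources or as two axioms with one source each.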

bfoltyn commented 1 year ago

TLDR: There is no requirement for normalising axiom annotations, so you have to be able to deal with the unnormalised case!

Thanks @matentzn !

I think we'll want to preserve all sources provided by axioms rather than taking the max. So the output would be keyed on ?efo_id ?xref ?axiom_source. Could also consider making ?axiom_source optional such that we still match xrefs without axioms.

@dhimmel If we want to preserve all sources, I think that we can use the following query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

SELECT ?efo_id ?xref ?axiom_source
WHERE {
  ?axiom_element rdf:type owl:Axiom ;
                 owl:annotatedSource ?source ;
                 owl:annotatedProperty oboInOwl:hasDbXref ;
                 owl:annotatedTarget ?xref .

  OPTIONAL { ?axiom_element oboInOwl:source ?axiom_source }

  BIND( REPLACE( STR(?source), "^http.+/([^:]+)_(.+)$", "$1:$2" ) AS ?efo_id )
}

GROUP BY ?efo_id ?xref ?axiom_source

For efo_id=EFO:0000479 and xref=ICD9:238.71, the results are:

efo_id	xref	axiom_source
EFO:0000479	ICD9:238.71	DOID:2224
EFO:0000479	ICD9:238.71	EFO:0000479
EFO:0000479	ICD9:238.71	MONDO:equivalentTo
EFO:0000479	ICD9:238.71	MONDO:i2s
EFO:0000479	ICD9:238.71	i2s

About the optional source, there are 46 rows with a null axiom_source compared to 117675 where it has a value. Sometimes axioms have attributes other than oboInOwl:source, such as oboInOwl:hasDbXref or skos:closeMatch.

Examples

- this Axiom has `oboInOwl:hasDbXref` with `MONDO:equivalentTo` value

```owl
OMIMPS:142340 MONDO:equivalentTo
```

- this Axiom has `skos:closeMatch` without value:

```owl
MESH:D065632
```

Should we keep details about which predicates are used within axioms? Or is using just oboInOwl:source OK?

@dhimmel For mondo:exactMatch and mondo:closeMatch, a slightly modified version of the query you used in https://github.com/EBISPOT/efo/issues/935 should work:

PREFIX mondo: <http://purl.obolibrary.org/obo/mondo#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT ?efo_id ?efo_uri ?predicate_id ?match ?predicate_uri
WHERE {
  VALUES ?predicate_uri {mondo:closeMatch mondo:exactMatch}
  ?efo_uri ?predicate_uri ?match

  BIND( REPLACE( STR(?efo_uri), "^http.+/([^:]+)_(.+)$", "$1:$2" ) AS ?efo_id )
  BIND( REPLACE( STR(?predicate_uri), "^http://purl.obolibrary.org/obo/mondo#(.+)$", "$1" ) AS ?predicate_id )
}

I validated the results against the OLS API and they're correct. Here's a snippet that compares mondo:exactMatch and mondo:closeMatch:

Snippet

```python
# type: ignore
%load_ext autoreload
%autoreload 2
import jupyter_black

jupyter_black.load()

import pandas as pd
from nxontology_data.efo.efo import EfoProcessor

efo_processor = EfoProcessor(version="v3.58.0", name="efo_otar_profile")
# efo_processor.download_owl()
rdf = efo_processor.load_rdf()

matches = efo_processor.run_query("matches", cache=False)
matches

import functools
import requests
import urllib.parse


@functools.lru_cache(maxsize=None)
def api_request(efo_uri: str):
    # The term URI is URL-encoded twice before being placed in the path.
    encoded = urllib.parse.quote_plus(urllib.parse.quote_plus(efo_uri))
    return requests.get(
        url=f"https://www.ebi.ac.uk/ols4/api/ontologies/efo/terms/{encoded}"
    ).json()


api_request.cache_clear()


def get_api_matches(efo_uri: str):
    res = api_request(efo_uri)
    return {
        "close_match": set(res["annotation"].get("closeMatch", [])),
        "exact_match": set(res["annotation"].get("exactMatch", [])),
    }


pivot_matches = (
    matches.groupby(["efo_id", "efo_uri", "predicate_id"])["match"]
    .apply(list)
    .reset_index()
    .pivot(index=["efo_id", "efo_uri"], columns="predicate_id", values="match")
    .reset_index()
    .rename(columns={"exactMatch": "exact_match", "closeMatch": "close_match"})
)
pivot_matches

pd.isnull(pivot_matches["close_match"]).value_counts()
pd.isnull(pivot_matches["exact_match"]).value_counts()

sample_matches = pivot_matches.sample(200).fillna("")
sample_matches


def safe_call(x):
    try:
        return get_api_matches(x)
    except Exception as e:
        print(f"Error for {x}: {e}")
        # Use the same keys as get_api_matches so callers can index the result.
        return {"close_match": set(), "exact_match": set()}


compare_df = (
    sample_matches.fillna("")
    .assign(
        api_close_match=lambda df: df["efo_uri"].apply(
            lambda x: safe_call(x)["close_match"]
        ),
        api_exact_match=lambda df: df["efo_uri"].apply(
            lambda x: safe_call(x)["exact_match"]
        ),
        exact_match=lambda df: df["exact_match"].apply(set),
        close_match=lambda df: df["close_match"].apply(set),
    )
    .assign(
        exact_match_equal=lambda df: df.apply(
            lambda row: row["exact_match"] == row["api_exact_match"], axis=1
        ),
        close_match_equal=lambda df: df.apply(
            lambda row: row["close_match"] == row["api_close_match"], axis=1
        ),
        extra_exact_match_in_api=lambda df: df.apply(
            lambda row: row["api_exact_match"] - row["exact_match"], axis=1
        ),
        extra_exact_match_in_efo=lambda df: df.apply(
            lambda row: row["exact_match"] - row["api_exact_match"], axis=1
        ),
        extra_close_match_in_api=lambda df: df.apply(
            lambda row: row["api_close_match"] - row["close_match"], axis=1
        ),
        extra_close_match_in_efo=lambda df: df.apply(
            lambda row: row["close_match"] - row["api_close_match"], axis=1
        ),
    )
)
compare_df

(compare_df.groupby(["exact_match_equal", "close_match_equal"]).size())
```

The current results return URLs like http://purl.obolibrary.org/obo/Orphanet_98576. Should we keep this format or transform them?

@dhimmel Lastly, what format should the axioms, mondo:exactMatch and mondo:closeMatch have in the output json file?

dhimmel commented 1 year ago

The current results return URLs like http://purl.obolibrary.org/obo/Orphanet_98576. Should we keep this format or transform them?

That URL is the class URI and we often assign it to a variable with a _uri suffix. The corresponding CURIE (compact URI) version is Orphanet:98576 and we often use an _id suffix for this. The SPARQL query can include both the URI and CURIES as separate output fields.
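To make the URI/CURIE distinction concrete, here is a small sketch that applies the same regular expression as the `REPLACE(...)` calls in the queries above, but in Python. `uri_to_curie` is a hypothetical helper name; like the SPARQL version, it leaves a URI unchanged when the pattern does not match (for example, identifiers.org URIs that have no underscore in the final path segment).

```python
import re

# Same pattern as in the SPARQL BIND(REPLACE(...)) expressions above.
URI_PATTERN = re.compile(r"^http.+/([^:]+)_(.+)$")


def uri_to_curie(uri: str) -> str:
    """Convert a class URI such as .../Orphanet_98576 to a CURIE such as Orphanet:98576."""
    return URI_PATTERN.sub(r"\1:\2", uri)


for xref_uri in [
    "http://purl.obolibrary.org/obo/Orphanet_98576",
    "http://www.ebi.ac.uk/efo/EFO_0000479",
]:
    xref_id = uri_to_curie(xref_uri)
    print(xref_uri, "->", xref_id)
```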

What we are after is for each oboInOwl:hasDbXref:

what are all the sources providing that xref
what mapping property applies to the xref, e.g. exactMatch or closeMatch

A tabular output from a SPARQL query is the ideal first output here. Not sure if you can fit everything in one query/table or you need multiple. I leave that up to your investigation.

To complicate things further (hehe), we should consider whether the Python library oaklib, which can extract mappings to the SSSOM format, is a better approach here than writing our own SPARQL queries. SSSOM stands for Simple Standard for Sharing Ontological Mappings (publication).

Possibly best to transition to PRs at this point to enable easier review of the SPARQL queries. PR can be draft and incomplete.

bfoltyn commented 1 year ago

@dhimmel regarding including xref_sources and mapping_properties in node data, I have a couple of ideas:

Option 1: xref_properties field with a list with the following schema:

xref_id: str
sources: list[str]
mapping_properties: list[str]

Example

```json
{
  "xref_properties": [
    {
      "xref_id": "orphanet:319298",
      "axiom_sources": ["MONDO:equivalentTo"],
      "mapping_properties": ["mondo:exactMatch"]
    }
  ]
}
```

Option 2: Separate xref_sources and mapping_properties

xref_sources schema:

xref_id: str
axiom_source: str

mapping_properties schema:

xref_id: str
mapping_source: str

Example

```json
{
  "axiom_sources": [
    {
      "xref_id": "orphanet:319298",
      "axiom_source": "MONDO:equivalentTo"
    }
  ],
  "mapping_properties": [
    {
      "xref_id": "orphanet:319298",
      "mapping_source": "mondo:exactMatch"
    }
  ]
}
```

Option 3: Second option inside xref_properties field:

Example

```json
{
  "xref_properties": {
    "axiom_sources": [
      {
        "xref_id": "orphanet:319298",
        "axiom_source": "MONDO:equivalentTo"
      }
    ],
    "mapping_properties": [
      {
        "xref_id": "orphanet:319298",
        "mapping_source": "mondo:exactMatch"
      }
    ]
  }
}
```

Please let me know your thoughts on these options, or if there are any other ideas you have.

dhimmel commented 1 year ago

I like option 1. Will there be a slight imprecision where one source has one property and another source has a conflicting property? For example, an xref being classified as both an exactMatch and a closeMatch by different resources?

matentzn commented 1 year ago

Just FYI: what you are trying to do here is much much harder than you think right now - and not necessary.

EFO is not a good source for mappings, because it mixes old (ancient) with new (harmonised) xrefs, and makes strange distinctions like "mondo:exactMatch" (which is not even a thing in Mondo). What you should do instead is:

ETL the primary SSSOM file for mondo mappings: https://github.com/monarch-initiative/mondo/blob/master/src/ontology/mappings/mondo.sssom.tsv
use semra (see also https://github.com/biopragmatics/semra/blob/main/notebooks/umls-inference-analysis.ipynb), cc @cthoyt to chain the mappings together in a way to get the correct EFO to X mappings
export the mappings to sssom and feed that into your system
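To give a feel for the chaining step, here is a toy sketch using only the standard library. It does not use semra and makes no claim about semra's API. It assumes an SSSOM TSV with `subject_id`, `predicate_id`, and `object_id` columns and `#`-prefixed metadata lines; the rows in `SSSOM_TSV` are illustrative, built from identifiers mentioned earlier in this thread, and are not copied from the real mondo.sssom.tsv. The helper names `read_sssom` and `chain_via_mondo` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only; the real file has more columns and many more rows.
SSSOM_TSV = """\
# curie_map:
#   MONDO: http://purl.obolibrary.org/obo/MONDO_
subject_id\tpredicate_id\tobject_id
MONDO:0017884\tskos:exactMatch\tEFO:0000640
MONDO:0017884\tskos:exactMatch\tOrphanet:319298
MONDO:0017884\tskos:exactMatch\tDOID:4465
"""


def read_sssom(text):
    """Parse SSSOM TSV text, skipping the '#'-prefixed metadata block."""
    lines = [line for line in io.StringIO(text) if not line.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def chain_via_mondo(rows, efo_prefix="EFO:"):
    """Derive EFO-keyed exact matches by chaining through a shared Mondo subject."""
    by_subject = defaultdict(set)
    for row in rows:
        if row["predicate_id"] == "skos:exactMatch":
            by_subject[row["subject_id"]].add(row["object_id"])
    chained = set()
    for objects in by_subject.values():
        efo_ids = {obj for obj in objects if obj.startswith(efo_prefix)}
        for efo_id in efo_ids:
            for other in objects - efo_ids:
                chained.add((efo_id, "skos:exactMatch", other))
    return sorted(chained)


efo_mappings = chain_via_mondo(read_sssom(SSSOM_TSV))
for mapping in efo_mappings:
    print(mapping)
```

Only exact matches are chained here, since skos:exactMatch is transitive; a real pipeline would also need rules for close and broad matches, which is the kind of inference semra is meant to handle.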

Just my two cents as someone driving by :D

bfoltyn commented 11 months ago

I like option 1. Will there be a slight imprecision where one source has one property and another source has a conflicting property? For example, an xref being classified as both an exactMatch and a closeMatch by different resources?

@dhimmel There are cases where an xref is classified as both exactMatch and closeMatch. For example, in EFO:0000095 the xref meddra:10008958 has both mondo:closeMatch and skos:exactMatch:

import pandas as pd

(
    pd.read_json(
        "https://github.com/related-sciences/nxontology-data/raw/output/efo/efo_otar_profile_mapping_properties.json.gz"
    ).pipe(
        lambda df: df[
            (df["efo_id"] == "EFO:0000095") & (df["xref_id"] == "meddra:10008958")
        ]
    )
)

efo_id	xref_id	mapping_property_id	efo_uri	xref_uri	mapping_property_uri
EFO:0000095	meddra:10008958	mondo:closeMatch	http://www.ebi.ac.uk/efo/EFO_0000095	http://identifiers.org/meddra/10008958	http://purl.obolibrary.org/obo/mondo#closeMatch
EFO:0000095	meddra:10008958	skos:exactMatch	http://www.ebi.ac.uk/efo/EFO_0000095	http://identifiers.org/meddra/10008958	http://www.w3.org/2004/02/skos/core#exactMatch

There are 102 cases like this:

All cases
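A count like this can be reproduced by grouping mapping properties per (efo_id, xref_id) pair and keeping pairs that carry both an exact and a close match. The sketch below uses plain Python rather than pandas and is only an illustration: `find_conflicts` is a hypothetical helper, the first two rows come from the table above, and the third row is an assumed non-conflicting example.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Return (efo_id, xref_id) pairs labelled as both exactMatch and closeMatch."""
    seen = defaultdict(set)
    for efo_id, xref_id, mapping_property_id in rows:
        # Compare on the local name so mondo: and skos: variants are treated alike.
        local_name = mapping_property_id.split(":", 1)[1]
        seen[(efo_id, xref_id)].add(local_name)
    return sorted(
        key for key, names in seen.items() if {"exactMatch", "closeMatch"} <= names
    )


rows = [
    ("EFO:0000095", "meddra:10008958", "mondo:closeMatch"),
    ("EFO:0000095", "meddra:10008958", "skos:exactMatch"),
    ("EFO:0000640", "orphanet:319298", "mondo:exactMatch"),  # assumed example row
]

conflicts = find_conflicts(rows)
print(conflicts)
```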

dhimmel commented 11 months ago

what you are trying to do here is much much harder than you think right now - and not necessary

Thanks @matentzn for these insights. I'm looking forward to exploring the SSSOM Mondo mappings combined with semra to convert them to EFO-keyed mappings. For now I think it makes sense to continue our current approach, since we're close to having it complete and ready to evaluate, at least as a good reference for the SSSOM/Mondo alternative.

There are cases where xref is classified as both exactMatch and closeMatch

@bfoltyn I think we could make exactMatch higher priority than closeMatch as an easy way to label an xref as either exact or close.

bfoltyn commented 11 months ago

@bfoltyn I think we could make exactMatch higher priority than closeMatch as an easy way to label an xref as either exact or close.

@dhimmel What do you mean by higher priority? I thought we would include all mapping properties as a list in the node data, as in option 1 in comment https://github.com/related-sciences/nxontology-data/issues/18#issuecomment-1761612946. Are you suggesting we include only one mapping property value, exactMatch or closeMatch? Should we also include the mondo: or skos: prefix?

dhimmel commented 11 months ago

What do you mean by higher priority?

I think it might be best if we simplify/aggregate the xref metadata that goes into the nxontology node attribute data to something like (written here in YAML for ease):

xrefs:
  - xref_id: meddra:10008958
    xref_uri: http://identifiers.org/meddra/10008958
    relation: skos:exactMatch  # converting mondo:exactMatch to skos:exactMatch if applicable
    sources: [MONDO:equivalentTo, DOID:2224]  # haven't cleaned this up yet

With this design, an xref_id would only appear once per node and all other metadata would be aggregated.
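The aggregation could look roughly like the sketch below, which collapses one row per (xref, source) into one entry per xref_id. This is an assumption about the shape of the input rows, not the actual nxontology-data implementation; `aggregate_xrefs` is a hypothetical helper and the `relation` field is left out for brevity.

```python
def aggregate_xrefs(rows):
    """Collapse (xref_id, xref_uri, source) rows into one entry per xref_id."""
    grouped = {}
    for xref_id, xref_uri, source in rows:
        entry = grouped.setdefault(
            xref_id, {"xref_id": xref_id, "xref_uri": xref_uri, "sources": []}
        )
        # Keep first-seen order and avoid duplicate sources.
        if source not in entry["sources"]:
            entry["sources"].append(source)
    return list(grouped.values())


# Rows mirroring the YAML example above, one per source.
rows = [
    ("meddra:10008958", "http://identifiers.org/meddra/10008958", "MONDO:equivalentTo"),
    ("meddra:10008958", "http://identifiers.org/meddra/10008958", "DOID:2224"),
    ("meddra:10008958", "http://identifiers.org/meddra/10008958", "DOID:2224"),
]

xrefs = aggregate_xrefs(rows)
print(xrefs)
```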

bfoltyn commented 11 months ago

xref metadata that goes into the nxontology node attribute data to something like (written here in YAML for ease)

@dhimmel Currently the xrefs field in the node data is a list of strings. Do we want to replace it with the example you suggested? I suggested introducing a new field with these properties to avoid a breaking change.

bfoltyn commented 11 months ago

    relation: skos:exactMatch  # converting mondo:exactMatch to skos:exactMatch if applicable

@dhimmel Should we use the following logic?

If there is skos:exactMatch in mapping properties we set the value to => skos:exactMatch
If there is mondo:exactMatch in mapping properties we set the value to => skos:exactMatch
If there is skos:closeMatch in mapping properties we set the value to => skos:closeMatch
If there is mondo:closeMatch in mapping properties we set the value to => skos:closeMatch
otherwise we set the value to null
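Written out as code, the rules above might look like this minimal sketch. `choose_relation` is a hypothetical function name; exact matches are checked first so that they take priority when an xref has both kinds of property.

```python
from typing import Iterable, Optional

EXACT = {"skos:exactMatch", "mondo:exactMatch"}
CLOSE = {"skos:closeMatch", "mondo:closeMatch"}


def choose_relation(mapping_properties: Iterable[str]) -> Optional[str]:
    """Pick a single relation for an xref, preferring exact over close."""
    props = set(mapping_properties)
    if props & EXACT:
        return "skos:exactMatch"
    if props & CLOSE:
        return "skos:closeMatch"
    return None


# The conflicting case from EFO:0000095 / meddra:10008958.
relation = choose_relation(["mondo:closeMatch", "skos:exactMatch"])
print(relation)
```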

dhimmel commented 11 months ago

Currently xrefs field in the node data is a list of strings. Do we want to replace it with the example you suggested

We could either replace it or create a new field like xref_details. Slightly leaning towards a new field.

Should we use the following logic?

That logic sounds good. If there are other interesting values in the otherwise set, we can support those later.

bfoltyn commented 11 months ago

We could either replace it or create a new field like xref_details. Slightly leaning towards a new field.

I think we can add a new field. xref_details sounds good. Should this field also include xrefs from the xrefs query, or just those from mapping_properties and xref_sources?

dhimmel commented 11 months ago

Should this field also include xrefs from xrefs query or just from mapping_properties and xref_sources

Ideally all of them, such that a user only needs xref_details.

bfoltyn commented 11 months ago

@dhimmel I've noticed that sometimes sources in xref_details contains null. For example, in MONDO:0020507:

"xref_details": [
  {
    "xref_id": "DOID:0070374",
    "relation": "skos:exactMatch",
    "sources": [
      null
    ]
  },

Should we make axiom_source required in the axiom_sources query? https://github.com/related-sciences/nxontology-data/blob/fb93d9e7bffd3ae5bb773cce51b127f31cb5c14b/nxontology_data/efo/queries/xref_sources.rq#L12

Another way would be to filter out null values after the aggregation in EfoProcessor.get_xref_details method. https://github.com/related-sciences/nxontology-data/blob/fb93d9e7bffd3ae5bb773cce51b127f31cb5c14b/nxontology_data/efo/efo.py#L256-L279
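For comparison, the post-aggregation filtering option could be sketched as below. This is not the code in `EfoProcessor.get_xref_details`; `drop_null_sources` is a hypothetical helper that operates on a list of dicts shaped like the `xref_details` example above.

```python
def drop_null_sources(xref_details):
    """Return copies of the xref detail entries with null sources removed."""
    return [
        {**detail, "sources": [s for s in detail["sources"] if s is not None]}
        for detail in xref_details
    ]


xref_details = [
    {"xref_id": "DOID:0070374", "relation": "skos:exactMatch", "sources": [None]},
]

cleaned = drop_null_sources(xref_details)
print(cleaned)
```

With this approach an xref that has no axiom source keeps its entry with an empty sources list, whereas making axiom_source required in the query drops such rows from the xref_sources results entirely.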

dhimmel commented 11 months ago

Should we make axiom_source required in the axiom_sources query?

This is the solution I prefer unless you advocate for a different one. Potentially leave a comment in that query that OPTIONAL will include extra results where axiom_source is missing.

related-sciences / nxontology-data

EFO cross-references: classify as exact/close when possible #18