ncbo / ontologies_linked_data

Models and serializers for ontologies and related artifacts backed by 4store

Duplicate mappings from individual classes to classes in other ontologies #117

Open mdorf opened 3 years ago

mdorf commented 3 years ago

Issue #115 addressed the mapping counts between ontologies being reported higher than the actual counts. An issue still remains: mappings from individual classes in an ontology to classes in other ontologies appear with multiple duplicate entries.

In the BioPortal UI, this behavior is evident when browsing individual class mappings (as opposed to the global “Mappings” tab for each ontology).

The bug affects these cases:

a) mappings from an individual ontology to ALL other ontologies
b) mappings from a class within an ontology to the mapped classes in ALL other ontologies

For example:

https://bioportal.bioontology.org/ontologies/DOID?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0001062#mappings

Click on the “Class Mappings (158)” tab for the “anatomy” class and scroll all the way down. You will see a number of duplicate entries there. For example: “Mapping of Drug Names, ICD-11 and MeSH 2021” or “Intelligence Task Ontology” or four identical mappings to “Mapping of Epilepsy Ontologies”.

mdorf commented 3 years ago

The issue stems from a faulty SPARQL query that returns a paginated list of mappings for a particular ontology (to ALL other ontologies) or a list of mappings from a given class in an ontology to classes in other ontologies:

SELECT DISTINCT ?s1 ?s2 ?g ?source ?o
WHERE {
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://bioportal.bioontology.org/ontologies/umls/cui> ?o .
    }
    GRAPH ?g {
        ?s2 <http://bioportal.bioontology.org/ontologies/umls/cui> ?o .
    }
    BIND ('CUI' AS ?source)
  }
  UNION
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingSameURI> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingSameURI> ?o .
    }
    BIND ('SAME_URI' AS ?source)
  }
  UNION
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    BIND ('LOOM' AS ?source)
  }
  UNION
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingRest> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingRest> ?o .
    }
    BIND ('REST' AS ?source)
  }
  FILTER ((?s1 != ?s2) || (?source = 'SAME_URI'))
  FILTER (!STRSTARTS(str(?g),'http://data.bioontology.org/ontologies/MONDO'))
} 
OFFSET 20 LIMIT 20

The problem with this query is that it doesn’t restrict results to the latest submissions (the submission with the highest id that has the status RDF). Instead, it queries ALL of them, resulting in many duplicate/irrelevant mappings.

The query below yields the IDs of all the LATEST submissions:

SELECT (CONCAT(STR(?ontology), "/submissions/", STR(MAX(?submissionId))) AS ?id)
WHERE { 
  ?submission <http://data.bioontology.org/metadata/ontology> ?ontology .
  ?submission <http://data.bioontology.org/metadata/submissionId> ?submissionId .
  ?submission <http://data.bioontology.org/metadata/submissionStatus> ?submissionStatus .
  ?submissionStatus <http://data.bioontology.org/metadata/code> "RDF" . 
  # exclude views: keep only submissions whose ontology is not a view of another
  OPTIONAL { 
    ?ontology <http://data.bioontology.org/metadata/viewOf> ?viewOf .  
  } 
  FILTER(!BOUND(?viewOf)) 
}
GROUP BY ?ontology
GROUP BY ?ontology

However, combining these two queries isn't trivial.
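On a triple store with full SPARQL 1.1 support, the natural way to combine them would be a subquery that restricts ?g to the latest submissions. A sketch (untested against the BioPortal data, showing only the LOOM branch, and omitting the view exclusion for brevity):

SELECT DISTINCT ?s1 ?s2 ?g ?source ?o
WHERE {
  # Subquery: compute the graph IRI of the latest RDF submission per ontology
  {
    SELECT (IRI(CONCAT(STR(?ontology), "/submissions/", STR(MAX(?submissionId)))) AS ?g)
    WHERE {
      ?sub <http://data.bioontology.org/metadata/ontology> ?ontology ;
           <http://data.bioontology.org/metadata/submissionId> ?submissionId ;
           <http://data.bioontology.org/metadata/submissionStatus> ?status .
      ?status <http://data.bioontology.org/metadata/code> "RDF" .
    }
    GROUP BY ?ontology
  }
  # Outer query: the mapping patterns, now joined against the latest graphs only
  GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
    ?s1 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
  }
  GRAPH ?g {
    ?s2 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
  }
  BIND ('LOOM' AS ?source)
  FILTER (?s1 != ?s2)
  FILTER (!STRSTARTS(STR(?g), 'http://data.bioontology.org/ontologies/MONDO'))
}

4store's SPARQL support predates subqueries, so a form like this would only help on an engine that implements SPARQL 1.1 subselects.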

mdorf commented 3 years ago

Alternate solutions explored:

  1. Running the second query separately in code, and then adding a large FILTER IN (or FILTER (... || ...)) block to the first query:

    FILTER(?g in (<http://data.bioontology.org/ontologies/ICO/submissions/16> , <http://data.bioontology.org/ontologies/DRPSNPTO/submissions/1>, ...))
    OR
    FILTER (?g = <http://data.bioontology.org/ontologies/ICO/submissions/16> || ?g = <http://data.bioontology.org/ontologies/DRPSNPTO/submissions/1> || ?g = ...)

    Both of these do work, but they slow the original query down to a halt. There are over 1200 IDs that are added inside this filter.

  2. Running the original query as is, and then filtering out the mappings from old submissions in code. This performs well but breaks the pagination, which is done in SPARQL itself.

  3. Combining the two SPARQL queries via a subquery, as I would in SQL. 4store returns an error: SubSELECTs are not implemented.
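A variant of alternative 1 worth noting: on engines that implement SPARQL 1.1 VALUES (4store may not), an inline data block is often optimized far better than a long FILTER disjunction, because the engine can join against the list instead of testing every candidate binding. The list of latest-submission graphs would still be computed in code, as in alternative 1:

VALUES ?g {
  <http://data.bioontology.org/ontologies/ICO/submissions/16>
  <http://data.bioontology.org/ontologies/DRPSNPTO/submissions/1>
  # ... remaining latest-submission graph IRIs
}

This is a sketch only; whether it outperforms the FILTER version depends entirely on the engine's join planner.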

mdorf commented 3 years ago

Here is a version of the original query corrected with FILTER clauses, which produces the correct results but is extremely slow:

SELECT DISTINCT ?s1 ?s2 ?g ?source ?o
WHERE {
  {
    GRAPH <http://data.bioontology.org/ontologies/MONDO/submissions/41> {
        ?s1 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    GRAPH ?g {
        ?s2 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
    }
    BIND ('LOOM' AS ?source)
  }
  FILTER ((?s1 != ?s2) || (?source = 'SAME_URI'))
  FILTER (!STRSTARTS(str(?g),'http://data.bioontology.org/ontologies/MONDO'))
  FILTER (?g = <http://data.bioontology.org/ontologies/ICO/submissions/16> || ?g = <http://data.bioontology.org/ontologies/DRPSNPTO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/GEOSPECIES/submissions/2> || ?g = <http://data.bioontology.org/ontologies/TEO/submissions/4> || ?g = <http://data.bioontology.org/ontologies/OMV/submissions/1> || ?g = <http://data.bioontology.org/ontologies/TMO/submissions/13> || ?g = <http://data.bioontology.org/ontologies/OPMI/submissions/16> || ?g = <http://data.bioontology.org/ontologies/OFSMR/submissions/19> || ?g = <http://data.bioontology.org/ontologies/MOOCCUADO/submissions/2> || ?g = <http://data.bioontology.org/ontologies/DISTEST/submissions/2> || ?g = <http://data.bioontology.org/ontologies/LIFO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/CORON/submissions/30> || ?g = <http://data.bioontology.org/ontologies/MATRCOMPOUND/submissions/1> || ?g = <http://data.bioontology.org/ontologies/AGRO/submissions/3> || ?g = <http://data.bioontology.org/ontologies/HEIO/submissions/17> || ?g = <http://data.bioontology.org/ontologies/GAMUTS/submissions/23> || ?g = <http://data.bioontology.org/ontologies/EGO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/CIDIO_V1/submissions/2> || ?g = <http://data.bioontology.org/ontologies/ISO19115ROLES/submissions/6> || ?g = <http://data.bioontology.org/ontologies/IDO/submissions/13> || ?g = <http://data.bioontology.org/ontologies/MARC-RELATORS/submissions/1> || ?g = <http://data.bioontology.org/ontologies/CDPEO/submissions/1> || ?g = <http://data.bioontology.org/ontologies/ICD10-CN/submissions/6> || ?g = <http://data.bioontology.org/ontologies/FB-CV/submissions/29> || ?g = <http://data.bioontology.org/ontologies/ILLNESSINJURY/submissions/1> || ?g = <http://data.bioontology.org/ontologies/NIFDYS/submissions/16> || ?g = <http://data.bioontology.org/ontologies/RCTV2/submissions/1> || ?g = <http://data.bioontology.org/ontologies/EMAPA/submissions/41> || ?g = 
<http://data.bioontology.org/ontologies/ONTOAD/submissions/2> || ?g = <http://data.bioontology.org/ontologies/TMA/submissions/1> || ?g = <http://data.bioontology.org/ontologies/HIVMT/submissions/6> || ?g = <http://data.bioontology.org/ontologies/HIVO004/submissions/27> || ?g = <http://data.bioontology.org/ontologies/ONTOBIOTOPE51/submissions/2> || ?g = <http://data.bioontology.org/ontologies/READMISSIONDIAB/submissions/1> || ?g = <http://data.bioontology.org/ontologies/SIO/submissions/86> || ?g = <http://data.bioontology.org/ontologies/PSIMOD/submissions/22>)   
}

mdorf commented 3 years ago

This is the method that generates the faulty query: https://github.com/ncbo/ontologies_linked_data/blob/master/lib/ontologies_linked_data/mappings/mappings.rb#L131

graybeal commented 3 years ago

OK, here's another approach.

Add an attribute for every graph (probably stored in the metadata graph or similar) that indicates whether it is the most recent submission for that ontology.

Now, when running the main query, you don't have to filter every WHERE evaluation with a FILTER against 1200 graphs. Instead, you just test whether the attribute is true. And you can perform that test before the mapping query is performed (would that be an outer WHERE clause?), so the mapping query only gets performed against the most recent graphs (instead of getting filtered out after running the query).

That should make the main query extremely fast to run on the fly, since it evaluates far fewer mapping patterns.

The submission graph attributes could all be maintained in an entirely separate graph, if we want to avoid adding an attribute to each graph. (It is more like metadata than content, so maybe it needs to be in a metadata graph.) But there has to be one entry for every submission that's in the triple store.
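With such a flag in place, the 1200-graph FILTER in the mapping query would collapse to a single triple pattern. A sketch, where both the metadata graph IRI and the predicate name are hypothetical and would be maintained by a job that updates the flag whenever a new submission is processed:

# Assumed maintained triple:  ?g <.../metadata/isLatestSubmission> true .
GRAPH <http://data.bioontology.org/metadata/graphs> {
  ?g <http://data.bioontology.org/metadata/isLatestSubmission> true .
}
GRAPH ?g {
  ?s2 <http://data.bioontology.org/metadata/def/mappingLoom> ?o .
}

Since this is a plain join rather than a FILTER, the store can restrict ?g before touching the mapping triples, which is the performance win described above.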

mdorf commented 2 years ago

Another side effect of this issue: because mappings from older submissions are returned, if a term was removed between an earlier and a later submission, the mappings to that term are still materialized, resulting in broken links to the term in question. This was reported in a separate ticket: ncbo/ontologies_api#85.

mdorf commented 2 years ago

As a documentation point, the original BioPortal design assumed that only the latest submission graphs would be stored in the triple store, so the original mappings code was not written to filter out multiple submissions. As the size of the data grew, we discovered that deleting previous submissions' graphs from 4store was highly resource-intensive, and the cron job responsible for those deletions was paused.

When this bug was discovered, I made a number of attempts to modify the underlying SPARQL query to filter out the orphan data. Unfortunately, none of my experiments (see above) yielded a performant result. Once we move to AllegroGraph, we hope its scalable backend will allow us to resume deleting the orphan submission graphs, which will automatically resolve this issue.