ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal
BSD 2-Clause "Simplified" License
7 stars, 5 forks

AG: class tree visualization issues with AG backend #264

Open alexskr opened 1 year ago

alexskr commented 1 year ago

A number of ontologies have class tree visualization problems when BioPortal runs with AllegroGraph backend. The preferred name is missing so the class tree has blank entries.

The API shows prefLabel: null for these classes.

alexskr commented 1 year ago

Updated ncbo_cron to the latest codebase in the staging environment and reprocessed the ontologies. The missing prefLabel for purl.obolibrary.org/obo/GO_0008150 in the GO ontology is fixed.

graybeal commented 1 year ago

Labels are still missing in MONDO, e.g., https://stage.bioontology.org/ontologies/MONDO/?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMONDO_0005200 ('viral dilated cardiomyopathy')

graybeal commented 1 year ago

See also https://github.com/ncbo/ontologies_linked_data/issues/137 for an example from production (fixed by re-parsing in that case).

graybeal commented 1 year ago

Note that the MONDO case does not have prefLabels in the original ontology; it only has regular labels throughout the XML/RDF file. I believe the rdfs:label is displayed as the Preferred Label annotation when there is no prefLabel.
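The fallback described above can be sketched as follows. This is only an illustration of the believed behavior, not BioPortal's actual code; the function name `display_label` and the dictionary layout are hypothetical.

```python
# Assumption: annotations are held as {predicate IRI: [values]}.
# This layout is hypothetical and chosen only for the example.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def display_label(annotations):
    """Return skos:prefLabel if present, else rdfs:label, else None."""
    for predicate in (SKOS_PREF_LABEL, RDFS_LABEL):
        values = annotations.get(predicate)
        if values:
            return values[0]
    return None


# A MONDO-like class that carries only a regular label.
mondo_like = {RDFS_LABEL: ["viral dilated cardiomyopathy"]}
print(display_label(mondo_like))
```

If neither predicate is present, the sketch returns None, which would correspond to the blank tree entries reported in this issue.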

graybeal commented 1 year ago

In case this gives any clue: if I click on the missing label (highlighted under herpes zoster on the right side of the screenshot), the page may not resolve (big WAITING box); but if I hit reload in the browser, I get this display.

image

After further testing: if I click on the empty label location, I get the LOADING CLASS spinner, nothing happens, and the network trace reads as in the first gif below (a 500 error on that class).

If I then hit Reload, the page refreshes to display the item and the network trace shows a normal 200 response. The only visible difference in the call is that there is no callback=load in the second case.

image

graybeal commented 1 year ago

I think we found the responsible code for this problem. Well, we have a good theory, anyway. It's all in Slack for now, I'll let Misha decide what is worth summarizing in this thread.

mdorf commented 1 year ago

I was able to identify the cause of this issue. AllegroGraph does not impose a default ordering of records for paginated results, which results in duplicate values being included when iterating over the entire record set:

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/VTO/submissions/14> WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 0 LIMIT 2500

While no single run of this query produces duplicates, the TOTAL run over the entire graph does. Each duplicate takes the place of a row that is then never returned, so many of the legitimate classes are omitted and are left without a label.
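To make the mechanism concrete, here is a small simulation. It is a deliberately exaggerated model, not a description of how AllegroGraph orders rows internally: every page request reshuffles the whole result set. The IDs, page size, and function names are made up for illustration.

```python
import random


def fetch_page(all_ids, offset, limit, rng):
    """Return one page from a store that does not guarantee row order.

    Each call reshuffles the rows before slicing, mimicking a backend
    whose ordering can change from one query to the next.
    """
    rows = list(all_ids)
    rng.shuffle(rows)
    return rows[offset:offset + limit]


def paginate(all_ids, limit, rng):
    """Collect every page, as a client iterating with OFFSET/LIMIT would."""
    collected = []
    for offset in range(0, len(all_ids), limit):
        collected.extend(fetch_page(all_ids, offset, limit, rng))
    return collected


# Hypothetical IDs; only the count matters for the demonstration.
all_ids = [f"VTO_{n:07d}" for n in range(10_000)]
rng = random.Random(42)  # fixed seed so the run is repeatable
rows = paginate(all_ids, limit=2500, rng=rng)

duplicates = len(rows) - len(set(rows))
missing = len(set(all_ids) - set(rows))
print(f"rows fetched: {len(rows)}, duplicates: {duplicates}, missing: {missing}")
```

Because the number of rows fetched equals the number of classes, every duplicated row corresponds to exactly one class that was never returned, so the two counts printed at the end are always equal.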

The attached file contains a good illustration of the issue. It includes both the queries run as well as the results of each run right below it: vto_id_queries_with_results_run1.txt

If you grep for the term VTO_0009953, you will see that it’s returned by two of the queries from the set:

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/VTO/submissions/14> WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 2500 LIMIT 2500

and

SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/VTO/submissions/14> WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 102500 LIMIT 2500

4store returns rows in a consistent internal order, so we never encountered this issue before AG. Because the internal ordering of records in AG is not deterministic, you end up getting random labels missing from one run to the next.

Here is another run to compare to the first one with different duplicates and different missing terms: vto_id_queries_with_results_run2.txt

Per the selected answer in this StackOverflow thread (https://stackoverflow.com/questions/55146844/offset-in-sparql):

[In a triple store] Rows may be delivered in any order, and this ordering may change from query-to-query, if you don't include an ORDER BY. This can mean that multiple queries with different OFFSET may not get you all rows, and may deliver duplicate rows, when all the partial result sets are combined. So -- anytime you're using OFFSET and/or LIMIT, it's best practice to also use an ORDER BY.

Based on this, a possible solution is to add an ORDER BY clause to the query:

ORDER BY ?id LIMIT 10000 OFFSET 120000

This, however, may come at a performance cost.
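As a rough sketch of what the ordered, paginated queries could look like, the snippet below builds the query string for each page. It is not the code BioPortal uses to generate its queries; the function names and the `total=5000` figure are assumptions made for the example.

```python
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def class_id_page_query(graph_uri, offset, limit):
    """Build one page of the class-ID query with a stable sort order."""
    return (
        f"SELECT DISTINCT ?id FROM <{graph_uri}> "
        f"WHERE {{ ?id a <{OWL_CLASS}> . }} "
        f"ORDER BY ?id LIMIT {int(limit)} OFFSET {int(offset)}"
    )


def class_id_page_queries(graph_uri, total, limit):
    """Yield the query for every page needed to cover `total` rows."""
    for offset in range(0, total, limit):
        yield class_id_page_query(graph_uri, offset, limit)


graph = "http://data.bioontology.org/ontologies/VTO/submissions/14"
# total=5000 is a made-up row count, used only to produce two pages.
for query in class_id_page_queries(graph, total=5000, limit=2500):
    print(query)
```

With ORDER BY ?id in every page, consecutive OFFSET windows slice the same sorted sequence, so the pages cannot overlap as long as the underlying data does not change between requests.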

mdorf commented 1 year ago

I am working with the Franz developers on improving the performance of the ORDER BY clause in AllegroGraph. As of now, the performance degradation experienced as a result of adding ORDER BY is unacceptable.

mdorf commented 1 year ago

  1. It appears that ALL the duplicates come from the first query (OFFSET 0)
  2. ALL results from the first query (OFFSET 0) are duplicated in the subsequent queries (2500 duplicates)
  3. There are NO other duplicates (consequence of 1 & 2)
  4. There are no “triplicates” or “fourplicates” or any other multiples; just duplicates (consequence of 1 & 2)
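Observations like the four above can be checked mechanically from the per-page results. The sketch below assumes the pages have already been parsed into a dict keyed by OFFSET; the toy data and the name `duplicate_report` are invented for illustration and are not taken from the attached files.

```python
from collections import defaultdict


def duplicate_report(pages):
    """Map each ID seen on more than one page to the offsets it appeared at.

    `pages` is a dict of {offset: list of IDs returned for that offset}.
    """
    seen_at = defaultdict(list)
    for offset in sorted(pages):
        for class_id in pages[offset]:
            seen_at[class_id].append(offset)
    return {cid: offsets for cid, offsets in seen_at.items() if len(offsets) > 1}


# Toy data (made up): the rows of page 0 reappear inside later pages.
pages = {
    0: ["a", "b"],
    2: ["c", "a"],
    4: ["b", "d"],
}
report = duplicate_report(pages)

# Points 1 and 3: every duplicate involves the first page.
all_involve_first_page = all(offsets[0] == 0 for offsets in report.values())
# Point 2: every row of the first page shows up again later.
first_page_fully_duplicated = set(pages[0]) <= set(report)
# Point 4: no ID appears more than twice.
max_copies = max((len(offsets) for offsets in report.values()), default=0)

print(report, all_involve_first_page, first_page_fully_duplicated, max_copies)
```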
alexskr commented 7 months ago

Resolved, but this needs to be confirmed with AllegroGraph v7.4 when it comes up.