Open alexskr opened 1 year ago
Updated ncbo_cron to the latest codebase in the staging environment and reprocessed the ontologies. The missing prefLabel for purl.obolibrary.org/obo/GO_0008150 in the GO ontology is fixed.
Labels are still missing in MONDO, e.g., https://stage.bioontology.org/ontologies/MONDO/?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMONDO_0005200 ('viral dilated cardiomyopathy')
See also https://github.com/ncbo/ontologies_linked_data/issues/137 for an example from production (fixed by re-parsing in that case).
Note that the MONDO case does not have prefLabels in the original ontology; it only has regular labels throughout the XML/RDF file. I believe the rdfs:label is displayed as the Preferred Label annotation when there is no prefLabel.
In case this gives any clue: If I click on the missing label (highlighted under herpes zoster in right side of screen shot), the page may not resolve (big WAITING box); but if I hit reload in the browser I get this display.
After further testing: If I click on the empty label location, I get LOADING CLASS spinner, nothing happens, and the network trace reads per the first gif below (a 500 error on that class).
If I then hit Reload, the page refreshes to display the item, the network trace shows a normal 200 response. The only visible difference in the call is there is no callback=load in the second case.
I think we found the responsible code for this problem. Well, we have a good theory, anyway. It's all in Slack for now, I'll let Misha decide what is worth summarizing in this thread.
I was able to identify the cause of this issue. AllegroGraph does not impose a default ordering of records for paginated results, which causes duplicate values to be included when iterating over the entire record set:
SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/VTO/submissions/14> WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 0 LIMIT 2500
While no single run of this query produces duplicates, the combined results across all pages of the graph do. Because the duplicates take the place of other rows, many legitimate classes are never returned and are left without a label.
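To make the failure mode concrete, here is a small simulation in Python. It is only a sketch: it assumes the store returns rows in a fresh random order on every query, which exaggerates what AllegroGraph actually does, and the `fetch_page` and `paginate` helpers and the generated IDs are made up for illustration.

```python
import random


def fetch_page(ids, offset, limit, rng=None):
    """Return one page of ids.

    With rng set, the row order is reshuffled on every call to mimic a store
    with no stable default ordering. Without it, rows are sorted, which
    mimics adding ORDER BY ?id to the query.
    """
    rows = list(ids)
    if rng is not None:
        rng.shuffle(rows)
    else:
        rows.sort()
    return rows[offset:offset + limit]


def paginate(ids, limit, rng=None):
    """Collect every page of the record set, the way the processing loop would."""
    collected = []
    for offset in range(0, len(ids), limit):
        collected.extend(fetch_page(ids, offset, limit, rng))
    return collected


# Hypothetical class ids, not taken from the real VTO graph.
ids = [f"CLASS_{n:07d}" for n in range(1000)]

unordered = paginate(ids, 100, random.Random(0))
ordered = paginate(ids, 100)

duplicates_unordered = len(unordered) - len(set(unordered))
missing_unordered = set(ids) - set(unordered)
missing_ordered = set(ids) - set(ordered)

print(f"unordered: {duplicates_unordered} duplicates, {len(missing_unordered)} missing")
print(f"ordered:   {len(ordered) - len(set(ordered))} duplicates, {len(missing_ordered)} missing")
```

Since the total number of rows fetched is fixed, every duplicate in the unordered run corresponds to exactly one class that was never returned.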
The attached file contains a good illustration of the issue. It includes both the queries run as well as the results of each run right below it: vto_id_queries_with_results_run1.txt
If you grep for the term VTO_0009953, you will see that it’s returned by two of the queries from the set:
SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/VTO/submissions/14> WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 2500 LIMIT 2500
and
SELECT DISTINCT ?id FROM <http://data.bioontology.org/ontologies/VTO/submissions/14> WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } OFFSET 102500 LIMIT 2500
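The same check can be scripted instead of grepping by hand. The sketch below assumes the results have already been parsed into a mapping of OFFSET to the list of ids returned for that page. The `find_cross_page_duplicates` helper is hypothetical, and every id other than VTO_0009953 is a placeholder.

```python
from collections import defaultdict


def find_cross_page_duplicates(pages):
    """Return {id: [offsets]} for every id that shows up on more than one page.

    pages maps the OFFSET of each query to the list of ids it returned.
    """
    seen = defaultdict(list)
    for offset, ids in pages.items():
        # set() so an id repeated within one page is only counted once
        for class_id in set(ids):
            seen[class_id].append(offset)
    return {cid: sorted(offsets) for cid, offsets in seen.items() if len(offsets) > 1}


# Placeholder data: only VTO_0009953 and the two offsets come from this thread.
pages = {
    0: ["PLACEHOLDER_A", "PLACEHOLDER_B"],
    2500: ["VTO_0009953", "PLACEHOLDER_C"],
    102500: ["PLACEHOLDER_D", "VTO_0009953"],
}

duplicates = find_cross_page_duplicates(pages)
print(duplicates)
```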
4store orders records consistently internally, so we never encountered this issue before AG. Because the internal ordering of records in AG is not deterministic, random labels go missing from one run to the next.
Here is another run to compare to the first one with different duplicates and different missing terms: vto_id_queries_with_results_run2.txt
Per the selected answer in this StackOverflow thread: https://stackoverflow.com/questions/55146844/offset-in-sparql,
[In a triple store] Rows may be delivered in any order, and this ordering may change from query-to-query, if you don't include an ORDER BY. This can mean that multiple queries with different OFFSET may not get you all rows, and may deliver duplicate rows, when all the partial result sets are combined. So -- anytime you're using OFFSET and/or LIMIT, it's best practice to also use an ORDER BY.
Based on this, a possible solution is to add the ORDER BY clause to the query:
ORDER BY ?id LIMIT 10000 OFFSET 120000
This, however, may come at a performance cost.
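For illustration, the paginated query with the ordering applied could be generated as follows. This is a sketch in Python, not the actual ncbo code: `build_page_query` and `iter_page_queries` are hypothetical helpers, the total count is assumed to be known up front, and the query text follows the ones quoted earlier in this thread.

```python
def build_page_query(graph_uri, offset, limit):
    """Build one page of the class-id query with a stable ORDER BY ?id."""
    return (
        f"SELECT DISTINCT ?id FROM <{graph_uri}> "
        "WHERE { ?id a <http://www.w3.org/2002/07/owl#Class> . } "
        f"ORDER BY ?id LIMIT {limit} OFFSET {offset}"
    )


def iter_page_queries(graph_uri, total, limit):
    """Yield one query per page until `total` rows are covered."""
    for offset in range(0, total, limit):
        yield build_page_query(graph_uri, offset, limit)


graph = "http://data.bioontology.org/ontologies/VTO/submissions/14"
# total=7500 is an arbitrary example value, not the real class count.
queries = list(iter_page_queries(graph, total=7500, limit=2500))
for query in queries:
    print(query)
```

With the same ordering on every page, consecutive OFFSET windows no longer overlap, provided the graph itself does not change between queries.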
I am working with the Franz developers on improving the performance of the ORDER BY clause in AllegroGraph. As of now, the performance degradation experienced as a result of adding ORDER BY is unacceptable.
Resolved, but this needs to be confirmed with AllegroGraph v7.4 when it comes up.
A number of ontologies have class tree visualization problems when BioPortal runs with AllegroGraph backend. The preferred name is missing so the class tree has blank entries.
API shows: ![image](https://user-images.githubusercontent.com/1591816/204669429-8cb7a528-4c75-438a-a3b2-3fd2a64c347e.png)
prefLabel: null