ncbo / ontologies_linked_data

Models and serializers for ontologies and related artifacts backed by 4store

max_depth metric calculation fails with AllegroGraph backend for large UMLS ontologies #181

Closed · alexskr closed this 9 months ago

alexskr commented 10 months ago

Ontology metrics calculation fails for large UMLS ontologies such as SNOMEDCT and NCBITAXON with the AllegroGraph 7.3.1 backend (with patches):

I, [2024-01-20T21:43:41.700507 #19673]  INFO -- : ["metrics_for_submission start"]
I, [2024-01-20T21:43:41.701395 #19673]  INFO -- : ["Unable to find metrics providing max_depth in file for submission http://data.bioontology.org/ontologies/SNOMEDCT/submissions/28.  Using ruby calculation of max_depth."]
E, [2024-01-21T01:41:55.762358 #19673] ERROR -- : ["too many connection resets (due to Net::ReadTimeout with #<TCPSocket:(closed)> - Net::ReadTimeout) after 7702 requests on 9800, last used 10000.043899019 seconds ago"]
E, [2024-01-21T01:41:55.763227 #19673] ERROR -- : [#<Net::HTTP::Persistent::Error: too many connection resets (due to Net::ReadTimeout with #<TCPSocket:(closed)> - Net::ReadTimeout) after 7702 requests on 9800, last used 10000.043899019 seconds ago>]
E, [2024-01-21T01:41:55.764110 #19673] ERROR -- : ["NoMethodError: undefined method `id=' for nil:NilClass\n/srv/ncbo/ncbo_cron/vendor/bundle/ruby/2.7.0/bundler/gems/ontologies_linked_data-e716a6d41088/lib/ontologies_linked_data/models/ontology_submission.rb:1186:in `process_metrics'\n\t/srv/ncbo/ncbo_cron/vendor/bundle/ruby/2.7.0/bundler/gems/ontologies_linked_data-e716a6d41088/lib/ontologies_linked_data/models/ontology_submission.rb:1118:in `process_submission'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/ontology_submission_parser.rb:171:in `process_submission'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/ontology_submission_parser.rb:45:in `block in process_queue_submissions'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/ontology_submission_parser.rb:25:in `each'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/ontology_submission_parser.rb:25:in `process_queue_submissions'\n\t/srv/ncbo/ncbo_cron/bin/ncbo_cron:252:in `block (3 levels) in <main>'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/scheduler.rb:65:in `block (3 levels) in scheduled_locking_job'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/scheduler.rb:51:in `fork'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/scheduler.rb:51:in `block (2 levels) in scheduled_locking_job'\n\t/srv/ncbo/ncbo_cron/vendor/bundle/ruby/2.7.0/gems/mlanett-redis-lock-0.2.7/lib/redis-lock.rb:43:in `lock'\n\t/srv/ncbo/ncbo_cron/vendor/bundle/ruby/2.7.0/gems/mlanett-redis-lock-0.2.7/lib/redis-lock.rb:234:in `lock'\n\t/srv/ncbo/ncbo_cron/lib/ncbo_cron/scheduler.rb:50:in `block in scheduled_locking_job'\n\t/srv/ncbo/ncbo_cron/vendor/bundle/ruby/2.7.0/gems/rufus-scheduler-2.0.24/lib/rufus/sc/jobs.rb:230:in `trigger_block'\n\t/srv/ncbo/ncbo_cron/vendor/bundle/ruby/2.7.0/gems/rufus-scheduler-2.0.24/lib/rufus/sc/jobs.rb:204:in `block in trigger'\n\t/srv/ncbo/ncbo_cron/vendor/bundle/ruby/2.7.0/gems/rufus-scheduler-2.0.24/lib/rufus/sc/scheduler.rb:430:in `block in trigger_job'"]
alexskr commented 10 months ago

Metrics are calculated by owlapi_wrapper for all ontologies except UMLS ones. The parsing process falls back to the Ruby/SPARQL code for calculating metrics, which doesn't work well with AllegroGraph:
https://github.com/ncbo/ontologies_linked_data/blob/ee0013f0ee23876076bff9d9258b46371ec3b248/lib/ontologies_linked_data/models/ontology_submission.rb#L453-L458

The logs state that the ontology parsing process for UMLS ontologies skips the OWLAPI parse, but the repository directory contains an owlapi.xrdf file, which indicates that the owlapi_wrapper was invoked:

I, [2024-01-20T17:52:00.002129 #19673]  INFO -- : ["Starting to process http://data.bioontology.org/ontologies/SNOMEDCT/submissions/28"]
I, [2024-01-20T17:52:00.004475 #19673]  INFO -- : ["Starting to process SNOMEDCT/submissions/28"]
I, [2024-01-20T17:52:00.230010 #19673]  INFO -- : ["Using UMLS turtle file found, skipping OWLAPI parse"]

owlapi_wrapper is invoked when new UMLS ontology submissions are created, so we should use those metrics instead of the ones generated by https://github.com/ncbo/ontologies_linked_data/blob/ee0013f0ee23876076bff9d9258b46371ec3b248/lib/ontologies_linked_data/metrics/metrics.rb#L51
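
For context, the Ruby/SPARQL fallback walks the class hierarchy level by level against the triple store. The sketch below is illustrative only (not the actual metrics.rb code; the endpoint handling, query shape, and names are assumptions), but it shows why this approach issues thousands of requests on an ontology the size of SNOMEDCT:

require "sparql/client"

# Illustrative breadth-first traversal; NOT the actual metrics.rb code.
# Every level of the hierarchy is another round trip to the triple
# store, which is how a large ontology racks up thousands of HTTP
# requests and eventually hits Net::ReadTimeout against AllegroGraph.
def sparql_max_depth(endpoint_url, graph_iri)
  client = SPARQL::Client.new(endpoint_url)
  frontier = ["http://www.w3.org/2002/07/owl#Thing"]
  depth = 0
  until frontier.empty?
    values = frontier.map { |iri| "<#{iri}>" }.join(" ")
    solutions = client.query(<<~SPARQL)
      SELECT DISTINCT ?child FROM <#{graph_iri}> WHERE {
        VALUES ?parent { #{values} }
        ?child <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?parent .
      }
    SPARQL
    frontier = solutions.map { |s| s[:child].to_s }
    depth += 1 unless frontier.empty?
  end
  depth # real code would also need cycle protection
end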

jvendetti commented 10 months ago

I wrote a couple of simple unit tests in the owlapi_wrapper project in my local dev environment to test metrics generation, e.g.:

@Test
public void parse_OntologySNOMEDCT() throws Exception {
    // Point the wrapper at a local input repo containing the SNOMEDCT
    // TTL master file and parse it into the output repo.
    ParserInvocation pi = new ParserInvocation("./src/test/resources/repo/input/snomedct",
        "./src/test/resources/repo/output/snomedct", "SNOMEDCT.ttl", true);
    OntologyParser parser = new OntologyParser(pi);
    assertTrue(parser.parse());
}

The max depth metric is successfully calculated for both the SNOMEDCT and NCBITAXON TTL files, in 5 and 8 seconds respectively:

[main] DEBUG o.s.n.owlapi.wrapper.metrics.Graph - depth for owl:Thing is 30
[main] INFO  o.s.n.o.w.metrics.OntologyMetrics - Finished metrics calculation for SNOMEDCT.ttl in 5047 milliseconds
[main] INFO  o.s.n.o.w.metrics.OntologyMetrics - Generated metrics CSV file for SNOMEDCT.ttl
[main] DEBUG o.s.n.owlapi.wrapper.metrics.Graph - depth for owl:Thing is 37
[main] INFO  o.s.n.o.w.metrics.OntologyMetrics - Finished metrics calculation for NCBITAXON.ttl in 7583 milliseconds
[main] INFO  o.s.n.o.w.metrics.OntologyMetrics - Generated metrics CSV file for NCBITAXON.ttl

It should be relatively straightforward to modify the REST API to first check for the max depth in metrics.csv files. We're already doing this for classes, properties, etc.:

https://github.com/ncbo/ontologies_linked_data/blob/ee0013f0ee23876076bff9d9258b46371ec3b248/lib/ontologies_linked_data/metrics/metrics.rb#L176-L182
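
A minimal sketch of such a check, assuming the owlapi_wrapper-generated metrics.csv carries a max-depth column (the column name below is an assumption to verify against the actual CSV header):

require "csv"

# Hypothetical helper mirroring the existing pattern of reading class
# and property counts from the metrics.csv that owlapi_wrapper writes
# into the submission's repository directory.
def max_depth_from_csv(submission_repo_dir)
  path = File.join(submission_repo_dir, "metrics.csv")
  return nil unless File.exist?(path)
  row = CSV.read(path, headers: true).first
  value = row && row["maxDepth"] # assumed column name
  value.nil? ? nil : Integer(value)
end

Returning nil would let the caller keep the current Ruby/SPARQL fallback for older submissions whose repository directory has no usable CSV.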

alexskr commented 10 months ago
The max depth calculated by owlapi_wrapper is off by one compared to the max depth calculated by Ruby/SPARQL:

Ontology     Ruby   owlapi_wrapper
STY             7                8
SNOMEDCT       29               30
NCBITAXON      36               37

This needs to be looked into.

jvendetti commented 10 months ago

The max depth calculated by the owlapi_wrapper starts from owl:Thing, which serves as the root class for all other classes in the ontology; counting from owl:Thing adds one level compared to a count that starts from the ontology's own top-level classes, which would explain the consistent off-by-one. The calculation happens during the initial step of our ontology ingestion process, where the ontology is loaded into memory by the OWL API, regardless of the format. The STY ontology is sufficiently small that I was able to verify this manually. I suppose you could debate which methodology is correct.
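
To make the difference concrete, here is a toy sketch (not code from either project) that counts depth both ways:

# Toy hierarchy: owl:Thing -> A -> B.
HIERARCHY = {
  "owl:Thing" => ["A"],
  "A"         => ["B"],
  "B"         => []
}.freeze

def depth(node)
  children = HIERARCHY.fetch(node, [])
  return 0 if children.empty?
  1 + children.map { |child| depth(child) }.max
end

puts depth("owl:Thing") # => 2, counting from owl:Thing (owlapi_wrapper-style)
puts depth("A")         # => 1, counting from the ontology's own root (Ruby/SPARQL-style)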