ncbo / goo

Graph Oriented Objects (GOO) for Ruby. A RDF/SPARQL based ORM.
http://ncbo.github.io/goo/

TTL file load errors due to chunked data loading feature #155

Closed alexskr closed 4 months ago

alexskr commented 4 months ago

We encountered an error while parsing the UMLS (TTL) ontology:

I, [2024-05-08T22:01:02.600753 #1470563]  INFO -- : ["Starting to process http://data.bioontology.org/ontologies/MDRGER/submissions/8"]
I, [2024-05-08T22:01:02.606373 #1470563]  INFO -- : ["Starting to process MDRGER/submissions/8"]
I, [2024-05-08T22:01:02.801761 #1470563]  INFO -- : ["Using UMLS turtle file found, skipping OWLAPI parse"]
E, [2024-05-08T22:01:11.151685 #1470563] ERROR -- : ["Error sending data to triple store - 400 RestClient::BadRequest: MALFORMED DATA: Turtle parser error while parsing an input stream on or around line 500000: Expected mandatory token '.', got 'eof'"]

This problem is related to PR #122, which introduced chunked data loading. The feature fails on TTL files longer than 500,000 lines when used with the AllegroGraph triplestore, because of its strict Turtle parser. AllegroGraph expects every upload to consist of complete Turtle statements, each ending with a period (.), but the chunked data loading feature cuts the file by line count and can split a Turtle statement before it reaches its end. We have not tested this with 4store, so a similar issue might exist there.

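UMLS ontologies are processed differently from the other ontology types: the .ttl file is loaded into the triplestore instead of the owlapi.xrdf file. In the excerpt below, line 500000 falls in the middle of a statement that only ends on line 500004: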

 499984 <http://purl.bioontology.org/ontology/MDRGER/10071099> a owl:Class ;
 499985         skos:prefLabel """H5N1-Influenza-Impfung"""@de ;
 499986         skos:notation """10071099"""^^xsd:string ;
 499987         <http://purl.bioontology.org/ontology/MDRGER/classified_as> <http://purl.bioontology.org/ontology/MDRGER/10059429> ;
 499988         umls:cui """C3160880"""^^xsd:string ;
 499989         umls:tui """T061"""^^xsd:string ;
 499990         umls:hasSTY <http://purl.bioontology.org/ontology/STY/T061> ;
 499991  .
 499992
 499993 <http://purl.bioontology.org/ontology/MDRGER/10064980> a owl:Class ;
 499994         skos:prefLabel """Neutralisierende Antikoerper positiv"""@de ;
 499995         skos:notation """10064980"""^^xsd:string ;
 499996         rdfs:subClassOf <http://purl.bioontology.org/ontology/MDRGER/10021504> ;
 499997         <http://purl.bioontology.org/ontology/MDRGER/classifies> <http://purl.bioontology.org/ontology/MDRGER/10064983> ;
 499998         <http://purl.bioontology.org/ontology/MDRGER/member_of> <http://purl.bioontology.org/ontology/MDRGER/20000214> ;
 499999         <http://purl.bioontology.org/ontology/MDRGER/SMQ_TERM_LEVEL> """4"""^^xsd:string ;
 500000         <http://purl.bioontology.org/ontology/MDRGER/MPS> """10022891"""^^xsd:string ;
 500001         umls:cui """C1609515"""^^xsd:string ;
 500002         umls:tui """T034"""^^xsd:string ;
 500003         umls:hasSTY <http://purl.bioontology.org/ontology/STY/T034> ;
 500004  .
 500005
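
To make the failure mode concrete, here is a minimal sketch of what cutting by line count does to a Turtle document. It is not the goo code: `chunk_by_lines` and `ends_on_statement_boundary?` are hypothetical helpers, the sample data is made up, and the boundary check is only a rough heuristic (last non-blank line ends with a period).

```ruby
# Hypothetical helper (not from goo): cut text into chunks of at most
# `chunk_lines` lines, with no awareness of Turtle syntax.
def chunk_by_lines(text, chunk_lines)
  text.lines.each_slice(chunk_lines).map(&:join)
end

# Rough heuristic: treat a chunk as complete when its last non-blank
# line ends with the Turtle statement terminator ".".
def ends_on_statement_boundary?(chunk)
  last = chunk.lines.map(&:strip).reject(&:empty?).last
  !last.nil? && last.end_with?(".")
end

# Made-up sample in the same layout as the UMLS excerpt.
turtle = <<~TTL
  <http://example.org/A> a owl:Class ;
      skos:notation "1" ;
  .

  <http://example.org/B> a owl:Class ;
      skos:notation "2" ;
      umls:cui "C1" ;
  .
TTL

chunk_by_lines(turtle, 6).each_with_index do |chunk, i|
  puts "chunk #{i}: ends on a statement boundary? #{ends_on_statement_boundary?(chunk)}"
end
```

With a limit of 6 lines the first chunk stops after `skos:notation "2" ;`, so a strict parser reaches the end of input while still expecting the closing `.`, which matches the error reported above.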
alexskr commented 4 months ago

A temporary workaround is to raise chunk_lines from 500,000 to a number larger than the line count of the file, which effectively disables this feature without a full rollback: https://github.com/ncbo/goo/blob/ef2d816df2d263c905bd034efd449a964fa4890f/lib/goo/sparql/client.rb#L92

syphax-bouazzouni commented 4 months ago

Hello @alexskr,

The chunked load only works for the N-Triples format, where every line is a complete statement. It does not work for TTL.

The fix here is either to skip the chunked load for TTL, or to chunk TTL differently: not by number of lines, but by number of Turtle blocks.
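
As a rough illustration of the second option, the sketch below still targets a chunk size in lines but only cuts after a line that closes a Turtle statement. `each_turtle_chunk` is a hypothetical helper, not an existing goo method. It assumes a layout like the UMLS excerpt, where every statement ends with a line whose last non-blank character is `.`, and it would be fooled by comments or multi-line literals that happen to end a line with a period. It also assumes prefixes are declared with `@prefix` / `@base` lines and repeats them at the top of every chunk so each chunk parses on its own.

```ruby
require "stringio"

# Hypothetical helper (not part of goo): yields chunks of roughly
# `chunk_lines` lines, cutting only after a line that ends a statement.
def each_turtle_chunk(io, chunk_lines)
  prefixes = [] # @prefix / @base directives, repeated in every chunk
  body = []     # statement lines collected for the current chunk

  io.each_line do |line|
    stripped = line.strip
    if stripped.start_with?("@prefix", "@base")
      prefixes << line
      next
    end

    body << line
    # Assumption: a trailing "." on a line closes the current statement.
    if body.size >= chunk_lines && stripped.end_with?(".")
      yield((prefixes + body).join)
      body = []
    end
  end

  # Flush whatever is left, unless it is only blank lines.
  yield((prefixes + body).join) unless body.all? { |l| l.strip.empty? }
end

# Usage with made-up data and a tiny chunk size.
ttl = <<~TTL
  @prefix ex: <http://example.org/> .

  ex:A a ex:Class ;
      ex:notation "1" ;
  .

  ex:B a ex:Class ;
      ex:notation "2" ;
      ex:cui "C1" ;
  .
TTL

each_turtle_chunk(StringIO.new(ttl), 2) do |chunk|
  puts "--- chunk of #{chunk.lines.size} lines ---"
  puts chunk
end
```

Chunks can run past `chunk_lines` by the length of one statement, which seems an acceptable trade for never sending a partial statement to the triplestore. A real fix would more likely rely on a proper Turtle parser than on this line heuristic.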

We didn't run into this bug at AgroPortal, as we don't have UMLS or any other TTL ontologies.