syphax-bouazzouni closed this 3 months ago
CC: @alexskr, with whom I discussed this problem (loading huge files into 4store) last April. The proposed solution seems like a good practice for groups (like us) running the Appliance and hosting 4store on the same machine.
The current non-chunked RDF upload approach appropriately handles situations where the triple store fails to upload generated RDF due to malformed data errors, such as mismatched types that OWLAPI doesn't catch (see https://github.com/ncbo/bioportal-project/issues/253). The whole operation fails, so the triple store doesn't end up with a partially loaded graph. With chunked uploads, however, one of the chunks could fail in the middle and leave an incomplete graph in the triple store. Do you have any mitigation mechanisms in place for this kind of problem?
Hi @alexskr,

I think if one of the chunks fails, a `RestClient::BadRequest` will be raised and stop the process (like https://github.com/ncbo/bioportal-project/issues/253). And when we reprocess the submission, it will delete the remaining (partial) graph and create a new empty one, then append the chunks again from the start.
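A minimal sketch of that recovery flow, for clarity; `delete_graph` and `append_chunk` are hypothetical helpers standing in for the real `ontologies_linked_data` methods, not the actual code:

```ruby
require 'rest-client'

# Hypothetical sketch: reprocessing always starts by dropping the graph,
# so a chunk that failed in a previous run cannot leave stale data behind.
def reload_graph(graph_uri, chunks)
  delete_graph(graph_uri) # drop any partial graph left by a failed run
  chunks.each do |chunk|
    # 4store answers a malformed chunk with HTTP 400, which rest-client
    # raises as RestClient::BadRequest, aborting the whole load here.
    append_chunk(graph_uri, chunk)
  end
end
```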
This is an optimization PR. Currently, in the parsing process, after the RDF generation step, we do a "delete and append" to the triple store. In the append-triples step, we transform the XRDF to Turtle in a temporary file, then we do a single POST request to the triple store with the Turtle file as the request body.
The issue is that when we have a big file (>= 1 GB), like in our use case here https://github.com/ontoportal-lirmm/ontologies_linked_data/issues/15, submitting the whole file content in a single HTTP request is not efficient (see the sketch below).
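A sketch of the pre-PR behavior, under assumptions: the endpoint URL and content type below are illustrative, not 4store's documented API.

```ruby
require 'rest-client'

# Pre-PR behavior (sketch): the whole Turtle file becomes one request
# body, so memory usage tracks the file size (>= 1 GB in this use case).
def append_triples_single_request(turtle_path, graph_uri)
  body = File.read(turtle_path) # loads the entire file into memory
  RestClient.post("http://localhost:9000/data/#{graph_uri}", body,
                  content_type: 'application/x-turtle')
end
```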
The PR changes the function `append_triples_no_bnodes` to do the append in chunks of 500,000 lines (triples) per request, as sketched below.
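A sketch of the chunked approach, assuming one triple per line (no blank nodes or multi-line constructs, which is what `append_triples_no_bnodes` implies) and the same illustrative endpoint as above. `File.foreach` streams lines lazily, so memory is bounded by the chunk size rather than the file size:

```ruby
require 'rest-client'

CHUNK_SIZE = 500_000 # lines, i.e. triples per request

def append_triples_chunked(turtle_path, graph_uri)
  # each_slice over the lazy line enumerator groups lines into chunks
  # without ever holding more than one chunk in memory.
  File.foreach(turtle_path).each_slice(CHUNK_SIZE) do |lines|
    RestClient.post("http://localhost:9000/data/#{graph_uri}", lines.join,
                    content_type: 'application/x-turtle')
  end
end
```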
With the use case of TAXREF-LD: before the change, the load was failing, with memory usage matching the size of the whole appended Turtle file.
After the change, it worked, and we have the following benchmark:

- Objects Freed: 572,924,847
- Time: 734.6 seconds
- Memory usage: 618.36 MB

(Before, the memory usage depended on, and was equal to, the size of the appended Turtle file; now it will never exceed 700 MB.)
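The PR doesn't show the benchmark harness; one plausible way to collect similar numbers in plain Ruby (GC statistics plus wall-clock time; the file name and graph URI are hypothetical):

```ruby
require 'benchmark'

GC.start
freed_before = GC.stat(:total_freed_objects)
elapsed = Benchmark.realtime do
  append_triples_chunked('taxref-ld.ttl', 'urn:example:taxref')
end
GC.start
puts "Objects Freed: #{GC.stat(:total_freed_objects) - freed_before}"
puts "Time: #{elapsed.round(1)} seconds"
# Resident set size of the current process, via ps (KB -> MB)
puts "Memory usage: #{`ps -o rss= -p #{Process.pid}`.to_i / 1024.0} MB"
```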
Reference: https://tjay.dev/howto-working-efficiently-with-large-files-in-ruby/