ncbo / goo

Graph Oriented Objects (GOO) for Ruby. An RDF/SPARQL-based ORM.
http://ncbo.github.io/goo/

Feature: Append new submission data by chunks to the triple store #122

Closed syphax-bouazzouni closed 3 months ago

syphax-bouazzouni commented 2 years ago

This is an optimization PR. Currently, in the parsing process, after the RDF generation step, we do a "delete and append to the triple store":

      def delete_and_append(triples_file_path, logger, mime_type = nil)
        Goo.sparql_data_client.delete_graph(self.id)
        Goo.sparql_data_client.put_triples(self.id, triples_file_path, mime_type)
        logger.info("Triples #{triples_file_path} appended in #{self.id.to_ntriples}")
        logger.flush
      end

In the append-triples step, we transform the XRDF into Turtle in a temporary file. Then we do a single HTTP request to the triple store, with the Turtle file as the request body.

The issue is that, with a big file (>= 1 GB, as in our use case here: https://github.com/ontoportal-lirmm/ontologies_linked_data/issues/15), submitting the whole file content in a single HTTP request is not efficient.

The PR changes the function append_triples_no_bnodes to do the append in chunks of 500,000 lines (triples) per request.
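For illustration only, here is a minimal sketch of that chunked append (not the actual diff of this PR). It streams the triples file line by line, flushes every 500,000 lines into a temporary file, and sends that file to the store. The `append_chunk` helper and the `Goo.sparql_data_client.append_triples` call are assumptions standing in for whatever the real code path in `append_triples_no_bnodes` uses.

    require 'tempfile'

    CHUNK_SIZE = 500_000 # lines (triples) sent per HTTP request

    # Stream the generated triples file and append it to the submission graph
    # in bounded chunks, so memory usage depends on the chunk size rather than
    # on the size of the whole file.
    def append_in_chunks(graph_id, triples_file_path, mime_type = nil)
      chunk = []
      File.foreach(triples_file_path) do |line|
        chunk << line
        next if chunk.size < CHUNK_SIZE
        append_chunk(graph_id, chunk, mime_type)
        chunk.clear
      end
      append_chunk(graph_id, chunk, mime_type) unless chunk.empty?
    end

    # Write one chunk to a temporary file and send it to the triple store.
    # The `append_triples` client call is assumed here; the real logic lives
    # in append_triples_no_bnodes.
    def append_chunk(graph_id, lines, mime_type)
      Tempfile.create(['chunk', '.nt']) do |file|
        file.write(lines.join)
        file.flush
        Goo.sparql_data_client.append_triples(graph_id, file.path, mime_type)
      end
    end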

With the TAXREF-LD use case, before the change we had:

    Mar  1 12:25:38 agroportal 4store[28359]: httpd.c:598 starting add to http://data.bioontology.org/ontologies/TAXREF-LD/submissions/2 (2179291409 bytes)
    Mar  1 12:25:38 agroportal 4s-httpd: 4store[28359]: httpd.c:598 starting add to http://data.bioontology.org/ontologies/TAXREF-LD/submissions/2 (2179291409 bytes)
    Mar  1 12:25:43 agroportal 4store[28359]: import.c:167 Fatal error: out of dynamic memory in turtle_lexer__scan_bytes() at 1
    Mar  1 12:25:43 agroportal 4s-httpd: 4store[28359]: import.c:167 Fatal error: out of dynamic memory in turtle_lexer__scan_bytes() at 1
    Mar  1 12:25:43 agroportal 4store[12682]: httpd.c:1979 child 28359 terminated by signal 11
    Mar  1 12:25:43 agroportal 4s-httpd: 4store[12682]: httpd.c:1979 child 28359 terminated by signal 11

After the change, it worked and we have the following benchmark:

    Objects Freed: 572,924,847
    Time:          734.6 seconds
    Memory usage:  618.36 MB

(Before, the memory usage depended on and was equal to the size of the Turtle version of the file being appended; now it will never exceed 700 MB.)

Reference: https://tjay.dev/howto-working-efficiently-with-large-files-in-ruby/
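For context, the core idea from that article, in a minimal illustrative form (the file name and slice size below are placeholders): read the file lazily instead of loading it all at once, so only one slice of lines lives in memory at any time.

    # Whole-file read: memory grows with the file size (the ~2 GB Turtle file
    # above would be held in RAM all at once).
    body = File.read('submission.ttl')

    # Streaming read: memory stays bounded by the slice size.
    File.foreach('submission.ttl').each_slice(500_000) do |lines|
      chunk = lines.join # one request body per slice (see the sketch above)
    end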

jonquet commented 2 years ago

CC: @alexskr, with whom I discussed the problem (loading huge files into 4store) last April. The proposed solution seems like a good practice for groups (like us) running the Appliance and therefore hosting 4store on the same machine.

alexskr commented 1 year ago

The current non-chunked RDF upload approach appropriately handles situations where the triple store fails to upload the generated RDF due to malformed data errors, such as mismatched types which owlapi doesn't catch (see https://github.com/ncbo/bioportal-project/issues/253). The whole operation fails, so the triple store doesn't end up with a partially loaded graph. With chunked data uploads, however, one of the chunk uploads could fail in the middle and result in an incomplete graph stored in the triple store. Do you have any mitigation mechanisms in place for this kind of problem?

syphax-bouazzouni commented 1 year ago

Hi @alexskr,

I think that if one of the chunks fails, a RestClient::BadRequest will be raised and will stop the process (as in https://github.com/ncbo/bioportal-project/issues/253).

And when we reprocess the submission, it will delete the remaining (partial) graph and create a new empty one, then append the chunks again from the start.
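For illustration, a rough sketch of that recovery path (the method name is an assumption, not the actual code): reprocessing always drops the existing graph first, and any RestClient error raised by a chunk aborts the whole append.

    # Hypothetical sketch of the behaviour described above; append_in_chunks
    # is the chunked append sketched earlier in this thread.
    def delete_and_append_in_chunks(graph_id, triples_file_path, mime_type = nil)
      # Reprocessing always starts from an empty graph, so a partially loaded
      # graph left by a failed run cannot survive the next attempt.
      Goo.sparql_data_client.delete_graph(graph_id)

      # If any single chunk upload fails, RestClient raises (e.g.
      # RestClient::BadRequest); the append stops there, and the partial graph
      # is discarded by delete_graph on the next run.
      append_in_chunks(graph_id, triples_file_path, mime_type)
    end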