neo4j-labs / neosemantics

Graph+Semantics: Import/Export RDF from Neo4j. SHACL Validation, Model mapping and more.... If you like it, please ★ ⇧
https://neo4j.com/labs/neosemantics/
Apache License 2.0

Memory issues when uploading inline data (n10s.rdf.import.inline) #217

Closed: fanavarro closed this issue 3 years ago

fanavarro commented 3 years ago

Hi everyone, I am developing an application for uploading RDF data into a Neo4j server. First, I tried the n10s.rdf.import.fetch procedure with local files; however, those files have to be stored on the server itself for the call to work.
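
(For reference, a rough sketch of that first attempt; the method name and the example path are illustrative rather than my actual code, and the point is that the URL given to n10s.rdf.import.fetch is resolved by the server, not by the client:)

public void importInstancesFetch(Driver driver, String fileUrl, String format) {
    try (Session session = driver.session()) {
        // the URL is resolved by the Neo4j server, not by this client, so a
        // local path such as "file:///imports/data.ttl" only works if that file
        // also exists on the server (e.g. mounted into its Docker container)
        Map<String, Object> params = new HashMap<>();
        params.put("url", fileUrl);
        params.put("format", format); // e.g. "Turtle"
        session.run("CALL n10s.rdf.import.fetch($url, $format)", params);
    }
}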

For this reason, I moved to the n10s.rdf.import.inline procedure, which receives the RDF data directly as a string. My strategy is to read the RDF files locally through an internal buffer of statements, and to call n10s.rdf.import.inline to upload the buffered statements to Neo4j whenever the buffer is full. The buffer holds 10,000 statements; that is to say, the system reads the RDF file until it has accumulated 10,000 statements, these statements are then serialized to Turtle and uploaded to the Neo4j server by calling n10s.rdf.import.inline, and the buffer is cleared so it can be filled with the next statements from the file.
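
(To make the strategy concrete, here is a rough, self-contained sketch of the buffering, assuming RDF4J is used for the parsing; the class name, the buffer constant and the file handling are illustrative, and the upload in flush() deliberately mirrors the String.format-based importInstancesInline helper shown in the EDIT below:)

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFParser;
import org.eclipse.rdf4j.rio.Rio;
import org.eclipse.rdf4j.rio.helpers.AbstractRDFHandler;
import org.neo4j.driver.Driver;
import org.neo4j.driver.Session;

public class BufferedRdfImporter extends AbstractRDFHandler {
    private static final int BUFFER_SIZE = 10_000;
    private final List<Statement> buffer = new ArrayList<>();
    private final Driver driver;

    public BufferedRdfImporter(Driver driver) {
        this.driver = driver;
    }

    @Override
    public void handleStatement(Statement st) {
        buffer.add(st);
        if (buffer.size() >= BUFFER_SIZE) {
            flush(); // buffer full: push this batch to Neo4j and start a new one
        }
    }

    @Override
    public void endRDF() {
        if (!buffer.isEmpty()) {
            flush(); // push the last, partially filled batch
        }
    }

    private void flush() {
        StringWriter turtle = new StringWriter();
        Rio.write(buffer, turtle, RDFFormat.TURTLE); // serialize the current batch as Turtle
        // upload the batch; the payload is embedded in the Cypher text,
        // exactly like importInstancesInline in the EDIT below
        try (Session session = driver.session()) {
            session.run(String.format("CALL n10s.rdf.import.inline('%s', '%s');",
                    turtle.toString(), "Turtle"));
        }
        buffer.clear();
    }

    // Usage: stream a local RDF file into Neo4j in 10,000-statement batches.
    public static void importFile(Driver driver, String path) throws Exception {
        RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
        parser.setRDFHandler(new BufferedRdfImporter(driver));
        try (InputStream in = new FileInputStream(path)) {
            parser.parse(in, "");
        }
    }
}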

This works fine at the beginning; however, I can see the memory usage increase until Neo4j returns an out-of-memory error. I thought each call to n10s.rdf.import.inline was independent (if one of them works, then all of them should work, because the size of the inline data is the same every time).

This memory usage is measured on the Docker container that runs my Neo4j instance, which means the problem is in the database, not in the application that makes the calls.

I have no idea what happens inside Neo4j when the n10s.rdf.import.inline procedure is used. Is it possible that some kind of processing is still going on after my application has received the response from the database? If so, calling the function several times from my app could end up overwhelming the server.

So, in summary, my question is: why does the memory usage on the Neo4j server keep increasing when I call the n10s.rdf.import.inline function several times?

EDIT: I am using the Java Neo4j driver. At first, the function I used for this was the following:

public void importInstancesInline(Driver driver, String data, String format) {
    try (Session session = driver.session()) {
        // the RDF payload and the serialization format are spliced directly into the Cypher text
        String command = String.format("CALL n10s.rdf.import.inline('%s', '%s');", data, format);
        session.run(command);
    }
}

After reading a little about transactions, I replaced this function with the following one:

public void importInstancesInline(Driver driver, String data, String format) {
    try (Session session = driver.session()) {
        // same query string as before (payload still embedded via String.format),
        // but now run inside an explicit write transaction
        List<Record> result = session.writeTransaction(new TransactionWork<List<Record>>() {

            @Override
            public List<Record> execute(Transaction tx) {
                String command = String.format("CALL n10s.rdf.import.inline('%s', '%s');", data, format);
                Result result = tx.run(command);
                return result.list();
            }
        });
    }
}

However, I'm still having the memory issue.
jbarrasa commented 3 years ago

Hi @fanavarro, and thanks for sharing the details of your experiment. So that we can try to reproduce it, could you please share a bit more about your config settings? How much memory is available on the server for Neo4j to run? Also, at what point do you see the degradation in performance or the out-of-memory problem happening? After how many 10k batches?

Thanks,

JB.

fanavarro commented 3 years ago

Hi @jbarrasa, thanks for your quick response.

Currently I am running Neo4j in a Docker container on my personal laptop, which has 12GB of RAM. I had been giving Neo4j 6GB, but for this last experiment I let the system choose the memory limit (I think it was 3GB, according to the figure I've attached below). My docker-compose file is the following:

version: '3'
services:
  graphdb:
    image: neo4j:4.2.3
    ports:
      - 7474:7474
      - 7687:7687
      - 9010:9010
    environment:
      NEO4JLABS_PLUGINS: '["apoc", "n10s"]'
      NEO4J_AUTH: 'neo4j/pass'
      #NEO4J_dbms_memory_heap_max__size: '6g'
      NEO4J_dbms_jvm_additional: '-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=localhost -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.rmi.port=9010'

In my app, I first load some ontologies with n10s.onto.import.inline; this completes after 201 batches of 10,000 triples, although not all triples end up in the graph because of the simplifications that procedure applies; that is to say, 10,000 triples are read per batch, but only a few hundred of them are actually processed. After the ontology loading, I upload RDF data that conforms to those ontologies using n10s.rdf.import.inline. Here, 122 batches of 10,000 triples are processed, and in this case every read triple is processed in each batch. After those 122 batches, the system started to throw the following exception:

org.neo4j.driver.exceptions.TransientException: There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.

I have monitored the Neo4j database through VisualVM until my script started to fail, in order to check the memory usage: [VisualVM memory usage screenshot]

What I would expect is a memory peak per batch command, with usage returning to the previous level afterwards; in this case, however, the initial memory is not fully recovered after each batch operation.

jbarrasa commented 3 years ago

Thanks for the detailed explanation @fanavarro. Let us try to reproduce it and we'll post results asap.

watch this space!

jbarrasa commented 3 years ago

Hi, the problem has to do with the client code, not with n10s. Try passing the payload as a query parameter instead of concatenating it into the procedure call as one long string; when the data is part of the query text itself, every batch becomes a brand-new query that the server has to parse, plan and cache, which is most likely what is eating your heap. Something like this:

public static void importInstancesInline(Driver driver, String data, String format) {
    try (Session session = driver.session()) {
        List<Record> result = session.writeTransaction(new TransactionWork<List<Record>>() {

            @Override
            public List<Record> execute(Transaction tx) {
                // the query text stays constant; the payload travels as a parameter
                String command = "CALL n10s.rdf.import.inline($payload, $format)";
                Map<String, Object> params = new HashMap<>();
                params.put("payload", data);
                params.put("format", format);
                Result result = tx.run(command, params);
                return result.list();
            }
        });
    }
}

You should get the expected memory behavior. I'll let you retry and close the issue if it solves your problem.

Cheers,

JB.

fanavarro commented 3 years ago

Hi @jbarrasa, thanks for your response, it worked perfectly: [memory usage screenshot]

I'm closing this issue, thanks again!