neo4j / neo4j-javascript-driver

Neo4j Bolt driver for JavaScript
https://neo4j.com/docs/javascript-manual/current/
Apache License 2.0

Bulk Insert data #227

Closed syz3r closed 7 years ago

syz3r commented 7 years ago

Hi,

I couldn't find any useful resource on how to bulk insert data with neo4j-driver and Node.js. Could you please help me with that?

lutovich commented 7 years ago

Hi @syz3r,

There are a couple of options for how you could bulk import data:

Hope this helps!

lutovich commented 7 years ago

Hi @syz3r,

I'll close this issue now. Please feel free to update it if you have any problems.

NeelSarwal commented 6 years ago

@lutovich Are there any better ways to go about this now?

lutovich commented 6 years ago

@NeelSarwal the options for bulk loading are still the same right now. Does none of them work for you? It would be valuable to know your use case and which options you have tried.

Heziode commented 2 weeks ago

6 years later, I do not think the state has changed. I have a program that generates .cypher files containing the queries to add all nodes and relationships.

Before generating the .cypher files, I generate a bunch of .json files, each containing at most 100,000 entities to add, and I parse the JSON to send it via a session.run.

My json files have the following structure:

{
  "statements": [
    {
      "statement": "UNWIND $rows AS row\nCREATE(n: `UNIQUE IMPORT LABEL`{node_id: row._id}) SET n += row.properties SET n:Element:AN_ASSOCIATION:A_DISCRIMINANT_ASSOCIATION;",
      "parameters": {
        "rows": [
          {
            "_id": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-strunb.ads:731:49:A_DISCRIMINANT_ASSOCIATION",
            "properties": {
              "node_id": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-strunb.ads:731:49:A_DISCRIMINANT_ASSOCIATION",
              "kinds": [
                "Element",
                "AN_ASSOCIATION",
                "A_DISCRIMINANT_ASSOCIATION"
              ],
              "filename": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-strunb.ads",
              "line": 731,
              "column": 49,
              "content": "0",
              "element_kind": "AN_ASSOCIATION",
              "enclosing_unit": "Ada.Strings.Unbounded",
              "is_part_of_implicit": false,
              "is_part_of_inherited": false,
              "is_part_of_instance": false,
              "special_case": "NOT_A_SPECIAL_CASE"
            }
          },
          {
            "_id": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/s-rannum.ads:151:36:A_DISCRIMINANT_ASSOCIATION",
            "properties": {
              "node_id": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/s-rannum.ads:151:36:A_DISCRIMINANT_ASSOCIATION",
              "kinds": [
                "Element",
                "AN_ASSOCIATION",
                "A_DISCRIMINANT_ASSOCIATION"
              ],
              "filename": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/s-rannum.ads",
              "line": 151,
              "column": 36,
              "content": "Generator'Access",
              "element_kind": "AN_ASSOCIATION",
              "enclosing_unit": "System.Random_Numbers",
              "is_part_of_implicit": false,
              "is_part_of_inherited": false,
              "is_part_of_instance": false,
              "special_case": "NOT_A_SPECIAL_CASE"
            }
          },
          {
            "_id": "/workspaces/bench-source/src/aaa/src/aaa-strings.adb:509:33:A_DISCRIMINANT_ASSOCIATION",
            "properties": {
              "node_id": "/workspaces/bench-source/src/aaa/src/aaa-strings.adb:509:33:A_DISCRIMINANT_ASSOCIATION",
              "kinds": [
                "Element",
                "AN_ASSOCIATION",
                "A_DISCRIMINANT_ASSOCIATION"
              ],
              "filename": "/workspaces/bench-source/src/aaa/src/aaa-strings.adb",
              "line": 509,
              "column": 33,
              "content": "Ada.Containers.Count_Type (Max)",
              "element_kind": "AN_ASSOCIATION",
              "enclosing_unit": "AAA.Strings",
              "is_part_of_implicit": false,
              "is_part_of_inherited": false,
              "is_part_of_instance": false,
              "special_case": "NOT_A_SPECIAL_CASE"
            }
          },
          {
            "_id": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-stzunb.ads:455:69:A_DISCRIMINANT_ASSOCIATION",
            "properties": {
              "node_id": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-stzunb.ads:455:69:A_DISCRIMINANT_ASSOCIATION",
              "kinds": [
                "Element",
                "AN_ASSOCIATION",
                "A_DISCRIMINANT_ASSOCIATION"
              ],
              "filename": "/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-stzunb.ads",
              "line": 455,
              "column": 69,
              "content": "0",
              "element_kind": "AN_ASSOCIATION",
              "enclosing_unit": "Ada.Strings.Wide_Wide_Unbounded",
              "is_part_of_implicit": false,
              "is_part_of_inherited": false,
              "is_part_of_instance": false,
              "special_case": "NOT_A_SPECIAL_CASE"
            }
          }
        ]
      }
    }
  ]
}

It is then easy to use it like:

const queries = JSON.parse(Deno.readTextFileSync(pathPath)).statements;

for (const query of queries) {
    const result = await session.run(query.statement, query.parameters);
}

But on a large project, with 6M+ nodes and 10M+ relationships across 442 files (~11 GB of data), loading it via the driver made the database crash. So I switched to a single, large .cypher file, but then I have to use cypher-shell, which adds an extra dependency to the project.

The .cypher file has the following structure:

// A lot of queries…
:begin
:param rows => [{_id:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-strunb.ads:731:49:A_DISCRIMINANT_ASSOCIATION",properties:{node_id:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-strunb.ads:731:49:A_DISCRIMINANT_ASSOCIATION",kinds:["Element","AN_ASSOCIATION","A_DISCRIMINANT_ASSOCIATION"],filename:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-strunb.ads",line:731,column:49,content:"0",element_kind:"AN_ASSOCIATION",enclosing_unit:"Ada.Strings.Unbounded",is_part_of_implicit:false,is_part_of_inherited:false,is_part_of_instance:false,special_case:"NOT_A_SPECIAL_CASE"}},{_id:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/s-rannum.ads:151:36:A_DISCRIMINANT_ASSOCIATION",properties:{node_id:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/s-rannum.ads:151:36:A_DISCRIMINANT_ASSOCIATION",kinds:["Element","AN_ASSOCIATION","A_DISCRIMINANT_ASSOCIATION"],filename:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/s-rannum.ads",line:151,column:36,content:"Generator'Access",element_kind:"AN_ASSOCIATION",enclosing_unit:"System.Random_Numbers",is_part_of_implicit:false,is_part_of_inherited:false,is_part_of_instance:false,special_case:"NOT_A_SPECIAL_CASE"}},{_id:"/workspaces/bench-source/src/aaa/src/aaa-strings.adb:509:33:A_DISCRIMINANT_ASSOCIATION",properties:{node_id:"/workspaces/bench-source/src/aaa/src/aaa-strings.adb:509:33:A_DISCRIMINANT_ASSOCIATION",kinds:["Element","AN_ASSOCIATION","A_DISCRIMINANT_ASSOCIATION"],filename:"/workspaces/bench-source/src/aaa/src/aaa-strings.adb",line:509,column:33,content:"Ada.Containers.Count_Type (Max)",element_kind:"AN_ASSOCIATION",enclosing_unit:"AAA.Strings",is_part_of_implicit:false,is_part_of_inherited:false,is_part_of_instance:false,special_case:"NOT_A_SPECIAL_CASE"}},{_id:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-stzunb.ads:455:69:A_DISCRIMINANT_ASSOCIATION",properties:{node_id:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-stzunb.ads:455:69:A_DISCRIMINANT_ASSOCIATION",kinds:["Element","AN_ASSOCIATION","A_DISCRIMINANT_ASSOCIATION"],filename:"/usr/gnat/libexec/asis-gnsa/lib/gcc/x86_64-pc-linux-gnu/8.3.1/adainclude/a-stzunb.ads",line:455,column:69,content:"0",element_kind:"AN_ASSOCIATION",enclosing_unit:"Ada.Strings.Wide_Wide_Unbounded",is_part_of_implicit:false,is_part_of_inherited:false,is_part_of_instance:false,special_case:"NOT_A_SPECIAL_CASE"}}]
UNWIND $rows AS row
CREATE(n: `UNIQUE IMPORT LABEL`{node_id: row._id}) SET n += row.properties SET n:Element:AN_ASSOCIATION:A_DISCRIMINANT_ASSOCIATION;;
:commit
// A lot of queries…

It would be nice if the driver could import bulk data from a CSV or .cypher file…

MaxAake commented 3 days ago

Hi @Heziode , apologies for the delay in getting back to you!

The driver indeed does not support importing directly from CSV or .cypher files, but your JSON-parsing for-loop should be a functional solution with a few small alterations.

Firstly, using a managed transaction with session.executeWrite() rather than session.run() would give you more stable transactions. executeWrite makes the driver perform automatic retries, which reduces the risk of a network failure leaving some of the data out of the database.

Secondly, to prevent the database crash: my best guess is that one of the queries sent in the loop became too large for some piece of the database infrastructure because its rows parameter was too long. If possible, I would investigate which specific query is being sent when the crash happens, and try splitting it into several smaller queries.

The full solution could look something like this:

const BATCH_SIZE = 10;
for (const query of queries) {
    for (let i = 0; i < query.parameters.rows.length; i += BATCH_SIZE) {
        const rows = query.parameters.rows.slice(i, i + BATCH_SIZE);
        const result = await session.executeWrite((tx) => tx.run(query.statement, { rows }));
    }
}

This should provide a stable way to load the data into your database. BATCH_SIZE could certainly be larger than 10 (unless your parameter lists are absolutely enormous), but this provides a framework.
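As a side note, the slicing logic above can be factored into a small standalone helper if you prefer to keep the loop flat. The `chunk` name here is just illustrative, not part of the driver API:

```javascript
// Split an array into consecutive batches of at most `size` elements.
// The last batch may be shorter than `size`.
function chunk(array, size) {
    const batches = [];
    for (let i = 0; i < array.length; i += size) {
        batches.push(array.slice(i, i + size));
    }
    return batches;
}

// Each batch can then be sent in its own managed transaction:
// for (const rows of chunk(query.parameters.rows, BATCH_SIZE)) {
//     await session.executeWrite((tx) => tx.run(query.statement, { rows }));
// }
```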

I understand that a more dedicated way of loading bulk data would be useful, but streaming large amounts of data is a very different use case from what the driver was designed for.

Feel free to give this solution a try if it seems more useful than using cypher-shell, and please reach out again if you continue having issues!