neo4j-contrib / py2neo

EOL! Py2neo is a comprehensive Neo4j driver library and toolkit for Python.
https://py2neo.org
Apache License 2.0

Create relationships in batches in the neo4j database using py2neo bulk API (with multiple relation types) #911

Closed: psomesh94 closed this issue 3 years ago

psomesh94 commented 3 years ago

Hi all, I am trying to create relationships in batches in the Neo4j database using the py2neo bulk API function create_relationships(). I have multiple relationship types (like KNOWS, FOLLOWS, etc.) in the list, and I also want to create a large number of relationships (more than 30 million) in the graph database. Currently I am using Neo4j Desktop 1.4.7 and Python 3.9.

I have created nodes in batches using the create_nodes() function of py2neo.bulk. Below is sample code for the same.

```python
from py2neo import Graph
from py2neo.bulk import create_nodes, create_relationships
from py2neo.data import Node

graph = Graph(auth=('neo4j', 'abc123'))

one_node = Node("Person", name="Alice", nid="01")
two_node = Node("Person", name="ACME", nid="02")
t_node = Node("Person", name="Mahi", nid="03")
f_node = Node("Person", name="Ali", nid="04")
data = [one_node, two_node, t_node, f_node]

try:
    create_nodes(graph.auto(), data, labels={"Person"})
except Exception as error:
    print(error)

try:
    graph.run("CREATE INDEX ON :Person(nid)")
except Exception as error:
    print(error)
```
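
As an aside, `CREATE INDEX ON :Person(nid)` is the legacy Neo4j 3.x index syntax, which 4.x deprecates in favour of the named `CREATE INDEX ... FOR ... ON` form. A minimal sketch, assuming Neo4j 4.1+ (the index name `person_nid` is an arbitrary choice for this example):

```python
# Hedged sketch: Neo4j 4.x replacement for "CREATE INDEX ON :Person(nid)".
# "person_nid" is an arbitrary index name chosen for illustration.
graph.run("CREATE INDEX person_nid IF NOT EXISTS "
          "FOR (p:Person) ON (p.nid)")
```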

I have created relationships using the create_relationships() function of py2neo.bulk. Below is sample code for the same.

```python
rel_data = [
    ("01", {}, "02"),
    ("03", {}, "04"),
]

try:
    create_relationships(graph.auto(), rel_data, "WORKS_FOR",
                         start_node_key=("Person", "nid"),
                         end_node_key=("Person", "nid"))
except Exception as error:
    print(error)
```

I am facing the following problems:

  1. In the above code, if I assign a single relationship type to all 30 million relationships, the data gets inserted in only 74 seconds. But the data I want to insert into Neo4j has multiple relationship types (the file I am referring to has 79 relationship types, each with its own dataset), so I have called create_relationships() 79 times, each time with its own data, and it took 9 minutes to complete. The conclusion is that a single relationship type gives better performance. Can we pass multiple relationship types in one go to the create_relationships() API?

  2. In the above code, if I don't provide start_node_key and end_node_key to create_relationships(), creating edges is fast but some edges go missing; if I do provide them (as shown above), performance gets slow. Is there any better way to create a large number of relationships? Just FYI, in our case we don't want to use a 'LOAD CSV' query, so please suggest some other solution if possible.

  3. While investigating I found some samples of WriteBatch from the py2neo library for creating my own batches, but I was not able to import the neo4j module of py2neo which has the WriteBatch() API, as I am using py2neo version 2021.1. Is there any way to fix this, or is there an alternative way to create my own batches using py2neo?

Thank you in advance.

technige commented 3 years ago
  1. The create_relationships function is only capable of creating one type of relationship at a time, as this is a restriction of the underlying Cypher language: relationship types in Cypher cannot be passed as parameters. It might be possible to extend the function itself to accept multiple relationship types, but that would still require (in your case) 79 underlying Cypher queries, so performance would not improve. (A sketch of the per-type loop is shown after this list.)
  2. Without the start and end keys, no matching of the endpoint nodes is possible, which is why some of your edges go missing. Providing the keys is simply slower because the endpoints have to be matched before each relationship can be created. Note that it might be worth investigating Neo4j indexes here, which could help speed up that match.
  3. WriteBatch comes from a very old version of py2neo, tied to older versions of Neo4j. It relied on a "batch" endpoint in the HTTP API, which no longer exists in more recent server versions; one of the reasons for its removal was that it was quite unreliable and buggy. Today, Cypher is generally the best way to work.
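
For point 1, a minimal sketch of the per-type loop, assuming your relationship data is grouped in a dict keyed by type (the variable names and grouping here are illustrative, not part of py2neo's API):

```python
from py2neo import Graph
from py2neo.bulk import create_relationships

graph = Graph(auth=('neo4j', 'abc123'))

# Illustrative grouping: each relationship type maps to its own batch of
# (start_nid, properties, end_nid) triples.
rel_data_by_type = {
    "KNOWS":   [("01", {}, "02")],
    "FOLLOWS": [("03", {}, "04")],
}

for rel_type, batch in rel_data_by_type.items():
    # One underlying Cypher query per type; 79 types means 79 queries.
    create_relationships(graph.auto(), batch, rel_type,
                         start_node_key=("Person", "nid"),
                         end_node_key=("Person", "nid"))
```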

Overall, you may get better performance by writing your own custom Cypher query. Have a look at the implementation of create_relationships for inspiration, and look into indexing.
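
For example, a hand-rolled loader along the lines of what create_relationships does internally might look like the sketch below. This is a hedged sketch only: the query text, the chunking helper, and the names are my own assumptions, not py2neo API. Note that the relationship type has to be interpolated into the query text (it cannot be a Cypher parameter), so it must come from a trusted source.

```python
from itertools import islice

from py2neo import Graph

graph = Graph(auth=('neo4j', 'abc123'))

# One parameterised query per relationship type. Each row is a
# (start_nid, properties, end_nid) triple, unpacked by index in Cypher.
QUERY = """
UNWIND $batch AS row
MATCH (a:Person {nid: row[0]})
MATCH (b:Person {nid: row[2]})
CREATE (a)-[r:%s]->(b)
SET r = row[1]
"""

def load_relationships(graph, rel_type, triples, batch_size=1000):
    """Send (start_nid, properties, end_nid) triples in fixed-size chunks,
    one auto-commit transaction per chunk."""
    it = iter(triples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        graph.run(QUERY % rel_type, batch=batch)

load_relationships(graph, "WORKS_FOR", [("01", {}, "02"), ("03", {}, "04")])
```

With an index on :Person(nid) in place, the two MATCH clauses become index lookups rather than label scans, which is where most of the time goes.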

psomesh94 commented 3 years ago

Hello @technige, thank you for the quick response; it was very useful.

I have a few queries regarding node and relationship insertion:

While using the create_nodes API of py2neo I have created nodes using a batch size of 10000, because the py2neo documentation mentions that "there is no universal batch_size that performs optimally for all use cases; it is recommended to experiment with this value to discover what size works best". A batch size of 10000 works well for me. If I don't use a batch size while inserting millions of nodes into the database, I get a 'Java heap space' exception; with a batch size, the exception does not occur.
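
For reference, a minimal sketch of this kind of chunked loading, assuming the caller does the slicing (py2neo's bulk functions load whatever list they are given in one transaction; the helper below is illustrative, not part of py2neo):

```python
from itertools import islice

from py2neo import Graph
from py2neo.bulk import create_nodes

graph = Graph(auth=('neo4j', 'abc123'))

def create_nodes_in_batches(graph, records, batch_size=10000):
    """Feed property dicts to create_nodes in fixed-size chunks, keeping
    each transaction small enough to avoid exhausting the Java heap."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        create_nodes(graph.auto(), batch, labels={"Person"})

# A generator keeps the client-side memory footprint small too.
create_nodes_in_batches(
    graph,
    ({"name": "Person %d" % i, "nid": str(i)} for i in range(1000000)),
)
```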

I have used the create_relationships API to create around 3 million edges in the Neo4j database.

  1. With a batch size of 10000 while creating relationships, 669740 relationships took 9 minutes to insert.
  2. With a batch size of 100 while creating relationships, the same 669740 relationships took only 59 seconds to insert.

Could you please let me know why the create_relationships API takes less time when I use a batch size of 100 instead of 10000? And also, why does the first batch take more time to insert while the remaining batches get inserted very quickly? Thank you in advance.

technige commented 3 years ago

The performance characteristics you are noticing relate to the internals of Neo4j itself. Each batch translates to an individual transaction, and any update queries carried out inside that transaction (as these insertions are) build up state. On commit, the pending change in state gets applied to the main data store, and the amount of work to do here corresponds to the amount of time it takes.

Therefore, choosing a batch size is a trade-off between the number of commits, and the amount of state that each commit has to apply. For your workload, 100 performs better than 10000. But 1000 might (or might not) perform better still.

Trying to calculate a batch size without experimentation is almost impossible, as the internals are complex and depend on a huge number of variables. Likewise, there is no single batch size that performs well universally for all types of workload.
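
In practice that means timing a few candidate sizes against your own workload, something like the sketch below. This assumes a chunked loader such as the load_relationships helper sketched earlier in this thread, and a rel_data list of triples; for a fair comparison, each run should start from the same (empty) database state.

```python
import time

from py2neo import Graph

graph = Graph(auth=('neo4j', 'abc123'))
rel_data = [...]  # your (start_nid, properties, end_nid) triples

for batch_size in (100, 1000, 10000):
    # Reset the database between runs for a fair comparison (not shown).
    started = time.perf_counter()
    load_relationships(graph, "WORKS_FOR", rel_data, batch_size=batch_size)
    elapsed = time.perf_counter() - started
    print("batch_size=%d -> %.1f s" % (batch_size, elapsed))
```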

Finally, if you want to get more deeply into the internals of Neo4j, this issues list isn't the right forum for that. It is better discussed in one of the dedicated chat rooms run by Neo4j itself, such as Discord.

psomesh94 commented 3 years ago

Thank you again, @technige, for the quick response. As you suggested, I will ask this query on the Neo4j Discord.