neo4j / apoc

Possible bug in Neo4J 4.0 version, during GRAPHML import #98

Open neo-technology-build-agent opened 2 years ago

neo-technology-build-agent commented 2 years ago

Issue by karrtikiyer Wednesday Mar 18, 2020 at 10:06 GMT Originally opened as https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/1451


Expected Behavior (Mandatory)

Able to successfully import the graphML file to a graph DB in Neo4j

Actual Behavior (Mandatory)

The import fails with the exception below. The same import works in the 3.x series.

Failed to invoke procedure `apoc.import.graphml`: Caused by: org.neo4j.kernel.impl.util.collection.MemoryAllocationLimitException: Can't allocate 524288 bytes due to exceeding memory limit; used=2147028992, max=2147483648
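For readers checking the numbers in that exception, here is a small arithmetic sketch. The three values are copied from the message above; the class and method names are made up for illustration, and the thread does not say which Neo4j setting produces the 2 GiB ceiling.

```java
// Hypothetical helper; only the three numbers come from the exception message.
class MemoryLimitCheck {

    /** Bytes by which a request would exceed the limit (positive means it fails). */
    public static long shortfall(long used, long max, long requested) {
        return used + requested - max;
    }

    public static void main(String[] args) {
        long used = 2_147_028_992L;      // "used" from the exception
        long max = 2_147_483_648L;       // "max" from the exception, exactly 2^31 bytes (2 GiB)
        long requested = 524_288L;       // 512 KiB requested

        System.out.println("free=" + (max - used));                   // free=454656
        System.out.println("over=" + shortfall(used, max, requested)); // over=69632
    }
}
```

So the limit is exactly 2 GiB, about 444 KiB of it was still free, and the 512 KiB request overshot it by 69632 bytes.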

How to Reproduce the Problem

CALL apoc.import.graphml("my_graph.graphml", {})

Steps (Mandatory)

  1. Install Neo4j Desktop
  2. Modify the config settings to allow apoc import
  3. Create a new graph DB
  4. Run the statement below to import the GraphML file
  5. CALL apoc.import.graphml("my_graph.graphml", {})

Specifications (Mandatory)

Currently used versions

Versions

neo-technology-build-agent commented 2 years ago

Comment by jexp Thursday Mar 19, 2020 at 00:24 GMT


can you provide more details about the graphml file? perhaps even the file itself to reproduce.

neo-technology-build-agent commented 2 years ago

Comment by karrtikiyer Thursday Mar 19, 2020 at 02:56 GMT


The file is around 1.5 GB and contains private, confidential information, so it can’t be shared. Are there any logs or other details I can share instead? As I said in my original post, this works in the 3.x series of Neo4j.

-- Thanks and regards, Karrtik

neo-technology-build-agent commented 2 years ago

Comment by jexp Thursday Mar 19, 2020 at 09:04 GMT


Can you increase (e.g. double) your heap memory until it works?

Afaik this implementation does not stream the XML, because it's older. apoc.load.xml does stream, so you could use it to construct the graph manually.

Hope this helps as a workaround, Michael

neo-technology-build-agent commented 2 years ago

Comment by karrtikiyer Thursday Mar 19, 2020 at 09:29 GMT


I tried the settings below, but it still does not work. Do you want me to increase the heap even more?

dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=12G

Thanks and regards, Karrtik

neo-technology-build-agent commented 2 years ago

Comment by jexp Thursday Mar 19, 2020 at 10:14 GMT


If you have more, then yes. Or do you have the means to split up the file?

Changing the implementation to streaming is more involved and would probably not be done any time soon.

You can try loading the file with apoc.load.xml, using an xpath for the nodes first and then one for the rels, and see if that at least works (just return count(*) to see the number of elements for each). Then use that to create the nodes / rels.

Sorry for the inconvenience; all our test data consisted of much smaller files.
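As an illustration of what streaming means here, the same kind of count can be done outside Neo4j with the JDK's StAX parser, which reads the file element by element instead of building the whole document in memory. This is a minimal sketch, not the apoc implementation and not apoc.load.xml. The class name, the method name, and the tiny sample document are assumptions made for the example; only the GraphML element names `node` and `edge` come from the format itself.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of apoc.
class GraphMLStreamCount {

    /** Returns {nodeCount, edgeCount}, reading the input one event at a time. */
    public static long[] countNodesAndEdges(Reader input) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTDs and external entities, since the input may be untrusted.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = factory.createXMLStreamReader(input);
        long nodes = 0;
        long edges = 0;
        try {
            while (reader.hasNext()) {
                // Only start tags matter for counting; everything else is skipped.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("node".equals(name)) {
                        nodes++;
                    } else if ("edge".equals(name)) {
                        edges++;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return new long[] {nodes, edges};
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample; for a real file, pass a java.io.FileReader instead.
        String sample =
            "<graphml xmlns=\"http://graphml.graphdrawing.org/xmlns\">"
            + "<graph edgedefault=\"directed\">"
            + "<node id=\"n0\"/><node id=\"n1\"/>"
            + "<edge source=\"n0\" target=\"n1\"/>"
            + "</graph></graphml>";
        long[] counts = countNodesAndEdges(new StringReader(sample));
        System.out.println("nodes=" + counts[0] + " edges=" + counts[1]); // nodes=2 edges=1
    }
}
```

Memory use stays roughly constant regardless of file size because no element is retained after it has been counted. An importer would still need its own map from GraphML ids to created nodes in order to resolve edge endpoints.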

neo-technology-build-agent commented 2 years ago

Comment by conker84 Wednesday Mar 25, 2020 at 14:23 GMT


@karmakaze any update on this? Is it an option to anonymize the data and share it with us?

neo-technology-build-agent commented 2 years ago

Comment by AstrorEnales Thursday Apr 09, 2020 at 09:07 GMT


We just ran into the same problem. The file is imported using "call apoc.import.graphml('test.graphml',{readLabels: true, useTypes: true})"

The hint from @karrtikiyer that 3.5 worked was a good one. I checked the diff of XmlGraphMLReader.java between apoc 4.x and 3.5. The only difference is the addition of the Transaction parameter in the constructor, which 4.x uses for "createNode" and "getNodeById". In 3.5 this was done using the GraphDatabaseService.

Since XmlGraphMLReader uses a BatchTransaction, I wondered whether using the provided transaction instead of the BatchTransaction's transaction would cause this issue, so I changed the usage. This did the trick for me: the 1.8 GB file now loads in Neo4j 4.x without issues. (Yes, it still uses a lot of memory, and the XML parser should be changed to a streaming one that does not cache nodes internally as the Java XML parser does, but at least with enough memory we can load large GraphML files again.)

Changed lines:

Node node = this.tx.createNode();
-->
Node node = tx.getTransaction().createNode();

Node from = this.tx.getNodeById(cache.get(source));
Node to = this.tx.getNodeById(cache.get(target));
-->
Node from = tx.getTransaction().getNodeById(cache.get(source));
Node to = tx.getTransaction().getNodeById(cache.get(target));

I just checked out the code for the first time, so I have no broader overview of the apoc architecture or of any potential side effects. I would be happy if an apoc dev could check this and either fix it or ping me, in which case I could provide a pull request.