neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

gds.alpha.shortestPath.stream: java.lang.OutOfMemoryError: Java heap space #55

Closed Selevaniuk closed 4 years ago

Selevaniuk commented 4 years ago

Hi!

I have a graph with 3.3 million nodes and 26 million relationships, and I run gds.alpha.shortestPath.stream on it. From node id "0" to node id "5" (or "100" or "1000"), everything works quickly. From node id "0" to node id "3,000,000" (or even "50,000"), I get an error: "Failed to invoke procedure gds.alpha.shortestPath.stream: Caused by: java.lang.OutOfMemoryError: Java heap space".

My conf settings:

dbms.memory.heap.initial_size=12g
dbms.memory.heap.max_size=12g
dbms.tx_state.max_off_heap_memory=8g
dbms.memory.pagecache.size=4g

There are indexes (on id). The database size is 2G.

FROM MEMREC:

/neo4j-community-4.0.4$ bin/neo4j-admin memrec
Memory settings recommendation from neo4j-admin memrec:

Assuming the system is dedicated to running Neo4j and has 31.26GiB of memory,
we recommend a heap size of around 11900m, and a page cache of around 4g,
and that about 8000m is left for the operating system, and the native memory
needed by Lucene and Netty.
Tip: If the indexing storage use is high, e.g. there are many indexes or most
data indexed, then it might advantageous to leave more memory for the
operating system.
Tip: Depending on the workload type you may want to increase the amount
of off-heap memory available for storing transaction state.
For instance, in case of large write-intensive transactions
increasing it can lower GC overhead and thus improve performance.
On the other hand, if vast majority of transactions are small or read-only
then you can decrease it and increase page cache instead.
Tip: The more concurrent transactions your workload has and the more updates
they do, the more heap memory you will need. However, don't allocate more
than 31g of heap, since this will disable pointer compression, also known as
"compressed oops", in the JVM and make less effective use of the heap.
Tip: Setting the initial and the max heap size to the same value means the
JVM will never need to change the heap size. Changing the heap size otherwise
involves a full GC, which is desirable to avoid.
Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=11900m
dbms.memory.heap.max_size=11900m
dbms.memory.pagecache.size=4g
dbms.tx_state.max_off_heap_memory=8000m

It is also recommended turning out-of-memory errors into full crashes,
instead of allowing a partially crashed database to continue running:
#dbms.jvm.additional=-XX:+ExitOnOutOfMemoryError

The numbers below have been derived based on your current databases located at: '/neo4j-community-4.0.4/data/databases'.
They can be used as an input into more detailed memory analysis.
Total size of lucene indexes in all databases: 0k
Total size of data and native indexes in all databases: 1200m

shortestPath in plain Cypher (not from GDS) doesn't produce errors on this database and runs in a few seconds (from any node to any node).

Is this a problem in the implementation of the algorithm (gds.alpha.shortestPath.stream)?

log:

2020-06-15 08:29:32.772+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=122, gcTime=144, gcCount=1}
2020-06-15 08:29:34.399+0000 INFO [o.n.k.a.p.GlobalProcedures] Relationship Store Scan: Imported 53,176,328 records and 53,176,328 properties from 862 MiB (904,005,600 bytes); took 7.062 s, 7,529,704.79 Relationships/s, 122 MiB/s (128,006,117 bytes/s) (per thread: 1,882,426.20 Relationships/s, 30 MiB/s (32,001,529 bytes/s))
2020-06-15 08:29:34.400+0000 INFO [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-4 [bolt] [/193.17.42.129:15049] ] LOADING
2020-06-15 08:29:38.465+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=262, gcTime=298, gcCount=2}
2020-06-15 08:29:39.388+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=220, gcTime=238, gcCount=3}
2020-06-15 08:29:40.007+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=318, gcTime=358, gcCount=4}
2020-06-15 08:29:40.009+0000 WARN [o.n.k.a.p.GlobalProcedures] Computation failed Not enough memory to allocate new buffers: 144,998,989 -> 217,498,485
com.carrotsearch.hppc.BufferAllocationException: Not enough memory to allocate new buffers: 144,998,989 -> 217,498,485
    at com.carrotsearch.hppc.DoubleArrayDeque.ensureBufferSpace(DoubleArrayDeque.java:494)
    at com.carrotsearch.hppc.DoubleArrayDeque.addFirst(DoubleArrayDeque.java:98)
    at org.neo4j.graphalgo.impl.shortestpaths.ShortestPathDijkstra.compute(ShortestPathDijkstra.java:107)
    at org.neo4j.graphalgo.impl.shortestpaths.ShortestPathDijkstra.compute(ShortestPathDijkstra.java:89)
    at org.neo4j.graphalgo.impl.shortestpaths.ShortestPathDijkstra.compute(ShortestPathDijkstra.java:47)
    at org.neo4j.graphalgo.AlgoBaseProc.lambda$compute$3(AlgoBaseProc.java:394)
    at org.neo4j.graphalgo.BaseProc.runWithExceptionLogging(BaseProc.java:92)
    at org.neo4j.graphalgo.AlgoBaseProc.compute(AlgoBaseProc.java:390)
    at org.neo4j.graphalgo.AlgoBaseProc.compute(AlgoBaseProc.java:350)
    at org.neo4j.graphalgo.shortestpaths.DijkstraProc.dijkstraStream(DijkstraProc.java:67)
    at org.neo4j.kernel.impl.proc.GeneratedProcedure_stream1132852398314563.apply(Unknown Source)
    at org.neo4j.procedure.impl.ProcedureRegistry.callProcedure(ProcedureRegistry.java:208)
    at org.neo4j.procedure.impl.GlobalProceduresRegistry.callProcedure(GlobalProceduresRegistry.java:323)
    at org.neo4j.kernel.impl.newapi.AllStoreHolder.callProcedure(AllStoreHolder.java:941)
    at org.neo4j.kernel.impl.newapi.AllStoreHolder.procedureCallRead(AllStoreHolder.java:844)
    at org.neo4j.cypher.internal.runtime.interpreted.CallSupport$.$anonfun$callReadOnlyProcedure$2(CallSupport.scala:50)
    at org.neo4j.cypher.internal.runtime.interpreted.CallSupport$.callProcedure(CallSupport.scala:97)
    at org.neo4j.cypher.internal.runtime.interpreted.CallSupport$.callReadOnlyProcedure(CallSupport.scala:52)
    at org.neo4j.cypher.internal.runtime.interpreted.TransactionBoundQueryContext.callReadOnlyProcedure(TransactionBoundQueryContext.scala:823)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.$anonfun$callReadOnlyProcedure$1(ExceptionTranslatingQueryContext.scala:180)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateException(ExceptionTranslationSupport.scala:33)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateException$(ExceptionTranslationSupport.scala:32)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.translateException(ExceptionTranslatingQueryContext.scala:40)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateIterator(ExceptionTranslationSupport.scala:48)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateIterator$(ExceptionTranslationSupport.scala:47)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.translateIterator(ExceptionTranslatingQueryContext.scala:40)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.callReadOnlyProcedure(ExceptionTranslatingQueryContext.scala:180)
    at org.neo4j.cypher.internal.runtime.LazyReadOnlyCallMode.callProcedure(ProcedureCallMode.scala:48)
    at org.neo4j.cypher.internal.runtime.interpreted.pipes.ProcedureCallPipe.call(ProcedureCallPipe.scala:87)
    at org.neo4j.cypher.internal.runtime.interpreted.pipes.ProcedureCallPipe.$anonfun$internalCreateResultsByAppending$1(ProcedureCallPipe.scala:73)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:480)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:486)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:454)
    at org.neo4j.cypher.internal.runtime.interpreted.PipeExecutionResult.serveResults(PipeExecutionResult.scala:75)
    at org.neo4j.cypher.internal.runtime.interpreted.PipeExecutionResult.request(PipeExecutionResult.scala:63)
    at org.neo4j.cypher.internal.result.StandardInternalExecutionResult.request(StandardInternalExecutionResult.scala:88)
    at org.neo4j.cypher.internal.result.ClosingExecutionResult.request(ClosingExecutionResult.scala:135)
    at org.neo4j.bolt.runtime.AbstractCypherAdapterStream.handleRecords(AbstractCypherAdapterStream.java:105)
    at org.neo4j.bolt.v3.messaging.ResultHandler.onPullRecords(ResultHandler.java:41)
    at org.neo4j.bolt.v4.messaging.PullResultConsumer.consume(PullResultConsumer.java:42)
    at org.neo4j.bolt.runtime.statemachine.impl.TransactionStateMachine$State.consumeResult(TransactionStateMachine.java:511)
    at org.neo4j.bolt.runtime.statemachine.impl.TransactionStateMachine$State$2.streamResult(TransactionStateMachine.java:355)
    at org.neo4j.bolt.runtime.statemachine.impl.TransactionStateMachine.streamResult(TransactionStateMachine.java:92)
    at org.neo4j.bolt.v4.runtime.InTransactionState.processStreamResultMessage(InTransactionState.java:73)
    at org.neo4j.bolt.v4.runtime.AbstractStreamingState.processUnsafe(AbstractStreamingState.java:49)
    at org.neo4j.bolt.v4.runtime.InTransactionState.processUnsafe(InTransactionState.java:60)
    at org.neo4j.bolt.v3.runtime.FailSafeBoltStateMachineState.process(FailSafeBoltStateMachineState.java:48)
    at org.neo4j.bolt.runtime.statemachine.impl.AbstractBoltStateMachine.nextState(AbstractBoltStateMachine.java:143)
    at org.neo4j.bolt.runtime.statemachine.impl.AbstractBoltStateMachine.process(AbstractBoltStateMachine.java:91)
    at org.neo4j.bolt.messaging.BoltRequestMessageReader.lambda$doRead$1(BoltRequestMessageReader.java:90)
    at org.neo4j.bolt.runtime.DefaultBoltConnection.lambda$enqueue$0(DefaultBoltConnection.java:151)
    at org.neo4j.bolt.runtime.DefaultBoltConnection.processNextBatchInternal(DefaultBoltConnection.java:240)
    at org.neo4j.bolt.runtime.DefaultBoltConnection.processNextBatch(DefaultBoltConnection.java:175)
    at org.neo4j.bolt.runtime.DefaultBoltConnection.processNextBatch(DefaultBoltConnection.java:165)
    at org.neo4j.bolt.runtime.scheduling.ExecutorBoltScheduler.executeBatch(ExecutorBoltScheduler.java:212)
    at org.neo4j.bolt.runtime.scheduling.ExecutorBoltScheduler.lambda$scheduleBatchOrHandleError$2(ExecutorBoltScheduler.java:195)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.OutOfMemoryError: Java heap space
    at com.carrotsearch.hppc.DoubleArrayDeque.ensureBufferSpace(DoubleArrayDeque.java:486)
    at com.carrotsearch.hppc.DoubleArrayDeque.addFirst(DoubleArrayDeque.java:98)
    at org.neo4j.graphalgo.impl.shortestpaths.ShortestPathDijkstra.compute(ShortestPathDijkstra.java:107)
    at org.neo4j.graphalgo.impl.shortestpaths.ShortestPathDijkstra.compute(ShortestPathDijkstra.java:89)
    at org.neo4j.graphalgo.impl.shortestpaths.ShortestPathDijkstra.compute(ShortestPathDijkstra.java:47)
    at org.neo4j.graphalgo.AlgoBaseProc.lambda$compute$3(AlgoBaseProc.java:394)
    at org.neo4j.graphalgo.AlgoBaseProc$$Lambda$3286/0x000000080138ac40.get(Unknown Source)
    at org.neo4j.graphalgo.BaseProc.runWithExceptionLogging(BaseProc.java:92)
    at org.neo4j.graphalgo.AlgoBaseProc.compute(AlgoBaseProc.java:390)
    at org.neo4j.graphalgo.AlgoBaseProc.compute(AlgoBaseProc.java:350)
    at org.neo4j.graphalgo.shortestpaths.DijkstraProc.dijkstraStream(DijkstraProc.java:67)
    at org.neo4j.kernel.impl.proc.GeneratedProcedure_stream1132852398314563.apply(Unknown Source)
    at org.neo4j.procedure.impl.ProcedureRegistry.callProcedure(ProcedureRegistry.java:208)
    at org.neo4j.procedure.impl.GlobalProceduresRegistry.callProcedure(GlobalProceduresRegistry.java:323)
    at org.neo4j.kernel.impl.newapi.AllStoreHolder.callProcedure(AllStoreHolder.java:941)
    at org.neo4j.kernel.impl.newapi.AllStoreHolder.procedureCallRead(AllStoreHolder.java:844)
    at org.neo4j.cypher.internal.runtime.interpreted.CallSupport$.$anonfun$callReadOnlyProcedure$2(CallSupport.scala:50)
    at org.neo4j.cypher.internal.runtime.interpreted.CallSupport$$$Lambda$2700/0x0000000801163040.apply(Unknown Source)
    at org.neo4j.cypher.internal.runtime.interpreted.CallSupport$.callProcedure(CallSupport.scala:97)
    at org.neo4j.cypher.internal.runtime.interpreted.CallSupport$.callReadOnlyProcedure(CallSupport.scala:52)
    at org.neo4j.cypher.internal.runtime.interpreted.TransactionBoundQueryContext.callReadOnlyProcedure(TransactionBoundQueryContext.scala:823)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.$anonfun$callReadOnlyProcedure$1(ExceptionTranslatingQueryContext.scala:180)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext$$Lambda$2699/0x0000000801164440.apply(Unknown Source)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateException(ExceptionTranslationSupport.scala:33)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateException$(ExceptionTranslationSupport.scala:32)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.translateException(ExceptionTranslatingQueryContext.scala:40)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateIterator(ExceptionTranslationSupport.scala:48)
    at org.neo4j.cypher.internal.planning.ExceptionTranslationSupport.translateIterator$(ExceptionTranslationSupport.scala:47)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.translateIterator(ExceptionTranslatingQueryContext.scala:40)
    at org.neo4j.cypher.internal.planning.ExceptionTranslatingQueryContext.callReadOnlyProcedure(ExceptionTranslatingQueryContext.scala:180)
    at org.neo4j.cypher.internal.runtime.LazyReadOnlyCallMode.callProcedure(ProcedureCallMode.scala:48)
    at org.neo4j.cypher.internal.runtime.interpreted.pipes.ProcedureCallPipe.call(ProcedureCallPipe.scala:87)

Thanks

Mats-SX commented 4 years ago

Hello @Selevaniuk and thanks for reporting this. This is most likely an issue with the alpha algorithm.
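
For a sense of scale, the failed allocation reported in the log ("Not enough memory to allocate new buffers: 144,998,989 -> 217,498,485", raised from DoubleArrayDeque) can be converted to bytes. The short Python sketch below does only that arithmetic. It assumes the two numbers are element counts of 8-byte doubles and that the old and new arrays are both on the heap while the contents are copied; neither assumption is confirmed in this thread.

```python
# Sizes taken from the log line
# "Not enough memory to allocate new buffers: 144,998,989 -> 217,498,485".
BYTES_PER_DOUBLE = 8  # a Java double occupies 8 bytes

old_len = 144_998_989  # current buffer length, in elements (assumed)
new_len = 217_498_485  # requested buffer length, in elements (assumed)


def gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


old_bytes = old_len * BYTES_PER_DOUBLE
new_bytes = new_len * BYTES_PER_DOUBLE
# Assumption: the old array stays reachable until its contents are copied,
# so both arrays occupy heap at the same time during the resize.
peak_bytes = old_bytes + new_bytes

print(f"old buffer:    {gib(old_bytes):.2f} GiB")
print(f"new buffer:    {gib(new_bytes):.2f} GiB")
print(f"both live:     {gib(peak_bytes):.2f} GiB")
print(f"growth factor: {new_len / old_len:.2f}")
```

Under those assumptions this single deque needs roughly 2.7 GiB at the moment of the resize. The buffer also already held about 145 million slots, far more than the 3.3 million nodes in the graph, although the log alone does not show why.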

Unfortunately, GDS does not integrate with Neo4j's default memrec feature, but we have our own .estimate procedures which estimate algorithm and graph catalog memory requirements based on the configured workload. These modes exist for all procedures in the production-ready tier, but are missing for most of the alpha procedures. These help to understand how much heap is necessary to guarantee your workload does not run into the above problem.

The general advice that I can give is: allocate more heap. If necessary, you can take away some gigs from the page cache (it doesn't help GDS much) as well as from dbms.tx_state.max_off_heap_memory, as GDS does not accumulate much transaction state. This could have the adverse effect of reducing standard Neo4j performance, however.
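
As a rough illustration of that advice, the Python sketch below moves memory from the page cache and the off-heap transaction state to the heap and prints the resulting settings. The helper `rebalance` is hypothetical (it is not part of Neo4j or GDS), the amounts moved are arbitrary examples rather than recommendations, and the 31g ceiling is taken from the memrec tip about compressed oops quoted earlier in this thread.

```python
# Hypothetical helper (not part of Neo4j or GDS): shift memory from the page
# cache and the off-heap transaction state to the JVM heap, in whole GiB.
HEAP_CAP_G = 31  # memrec's tip: more than 31g of heap disables compressed oops


def rebalance(heap_g, pagecache_g, tx_offheap_g, from_pagecache_g, from_tx_g):
    """Return new neo4j.conf values after moving memory to the heap."""
    if not 0 <= from_pagecache_g < pagecache_g:
        raise ValueError("page cache must keep some memory")
    if not 0 <= from_tx_g < tx_offheap_g:
        raise ValueError("tx state must keep some memory")
    new_heap_g = heap_g + from_pagecache_g + from_tx_g
    if new_heap_g > HEAP_CAP_G:
        raise ValueError(f"heap of {new_heap_g}g exceeds the {HEAP_CAP_G}g cap")
    return {
        "dbms.memory.heap.initial_size": f"{new_heap_g}g",
        "dbms.memory.heap.max_size": f"{new_heap_g}g",
        "dbms.memory.pagecache.size": f"{pagecache_g - from_pagecache_g}g",
        "dbms.tx_state.max_off_heap_memory": f"{tx_offheap_g - from_tx_g}g",
    }


# Current settings from the report: heap 12g, page cache 4g, tx state 8g.
# The amounts moved (2g and 6g) are illustrative assumptions only.
settings = rebalance(12, 4, 8, from_pagecache_g=2, from_tx_g=6)
for key, value in settings.items():
    print(f"{key}={value}")
```

The total across the three settings stays at 24g, so the share left for the operating system does not change. Whether a larger heap is enough for this particular query is a separate question that the sketch does not answer.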

Mats-SX commented 4 years ago

I forgot to include a link to the manual where memory settings are discussed: https://neo4j.com/docs/graph-data-science/current/installation/#System-requirements

In general, we unfortunately cannot guarantee that you will never hit an OutOfMemoryError. We have a memory guard feature enabled for all algorithms that have .estimate procedures, and those procedures are only guaranteed to exist for the production-ready tier of the library.

More on memory guard: https://neo4j.com/docs/graph-data-science/current/common-usage/memory-estimation/#estimate-heap-control
More on GDS tiers: https://neo4j.com/docs/graph-data-science/current/algorithms/

I will close this issue now. Please feel free to reach out again if you have further questions.