Closed szarnyasg closed 2 years ago
Tried optimizing this in PR #56 but failed - the proposed change made no difference.
I tried a couple of optimizations. For the experiments, I was running the query on the SF1 data set on an r6id.2xlarge
(64GiB RAM) EC2 instance.
Rewrote the query to only use two clauses: https://github.com/ldbc/ldbc_snb_bi/commit/4b392f55af6a1a2a85e6ef4df8caa59b12c7fb0b
// case 1
OPTIONAL MATCH
(pA)<-[:HAS_CREATOR]-(c:Comment)-[r:REPLY_OF]->(m:Message)-[:HAS_CREATOR]->(pB),
(m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
WITH knows, pA, pB, 0 + count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
// case 2
OPTIONAL MATCH
(pA)<-[:HAS_CREATOR]-(m:Message)<-[r:REPLY_OF]-(c:Comment)-[:HAS_CREATOR]->(pB),
(m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
WITH knows, pA, pB, w + count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
When running on SF1, the CSR creation + cheapest path algorithm (Dijkstra) returned the following after ~35 minutes:
Failed to invoke procedure gds.graph.create.cypher: Caused by: java.lang.IllegalArgumentException: Expected positive long value, got -8589934576
I'm pretty sure my query never produced a negative number, so I dug into the log (in scratch/logs/debug.log). The failure seems to come from heap allocation tracking, i.e. it looks like an out-of-memory problem:
2022-09-13 07:09:34.860+0000 INFO [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes 98%
2022-09-13 07:09:34.861+0000 INFO [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes 99%
2022-09-13 07:09:34.861+0000 INFO [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes 100%
2022-09-13 07:09:34.861+0000 INFO [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes :: Finished
2022-09-13 07:09:34.868+0000 INFO [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Relationships :: Start
2022-09-13 07:09:58.164+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=517, gcTime=560, gcCount=1}
2022-09-13 07:13:48.202+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=636, gcTime=688, gcCount=1}
2022-09-13 07:19:43.113+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=480, gcTime=0, gcCount=0}
2022-09-13 07:27:52.384+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=419, gcTime=0, gcCount=0}
2022-09-13 07:47:23.696+0000 WARN [o.n.k.a.p.GlobalProcedures] Graph creation failed
java.lang.IllegalArgumentException: Expected positive long value, got -8589934576
at org.neo4j.util.Preconditions.requirePositive(Preconditions.java:48) ~[neo4j-common-4.4.2.jar:4.4.2]
at org.neo4j.memory.LocalMemoryTracker.allocateHeap(LocalMemoryTracker.java:145) ~[neo4j-common-4.4.2.jar:4.4.2]
at org.neo4j.cypher.internal.runtime.SingleThreadedResourcePool.ensureCapacity(ResourceManager.scala:223) ~[neo4j-cypher-runtime-util-4.4.2.jar:4.4.2]
at org.neo4j.cypher.internal.runtime.SingleThreadedResourcePool.add(ResourceManager.scala:144) ~[neo4j-cypher-runtime-util-4.4.2.jar:4.4.2]
at org.neo4j.cypher.internal.runtime.ResourceManager.trace(ResourceManager.scala:44) ~[neo4j-cypher-runtime-util-4.4.2.jar:4.4.2]
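A side note on the number itself: -8589934576 equals 16 - 2^33. The sketch below only checks that arithmetic. It assumes (a guess, not verified against the Neo4j source) that a 32-bit capacity was doubled past its range, wrapped to -2^31, and then fed into a size estimate of a 16-byte header plus 4 bytes per slot. The header size, the slot size, and the helper name wrap_int32 are all hypothetical.

```python
def wrap_int32(value: int) -> int:
    """Emulate 32-bit signed integer wraparound (as in a Java int)."""
    return (value + 2**31) % 2**32 - 2**31


# Hypothetical scenario: a capacity of 2^30 slots is doubled in int arithmetic.
capacity = wrap_int32(2**30 * 2)

# Hypothetical size estimate: 16-byte header + 4 bytes per slot (assumed values).
estimate = 16 + capacity * 4

print(capacity)  # -2147483648
print(estimate)  # -8589934576
```

If this reading is right, the negative value would be a symptom of a very large resource pool rather than of the query producing negative numbers, which is consistent with more heap making the problem go away.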
I increased the pagecache and heap memory settings to 20G:
--env NEO4J_dbms_memory_pagecache_size=20G --env NEO4J_dbms_memory_heap_max__size=20G
This fixed the problem.
I realized Forums don't have an index on their creationDate. Having one would be beneficial, so I added an index with this commit: https://github.com/ldbc/ldbc_snb_bi/commit/06e8c7c8d8b88730616119440089ba79b68e990b
I rewrote the query to use subqueries:
CALL {
WITH pA, pB
OPTIONAL MATCH
(pA)<-[:HAS_CREATOR]-(c:Comment)-[r:REPLY_OF]->(m:Message)-[:HAS_CREATOR]->(pB),
(m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
UNION ALL
WITH pA, pB
OPTIONAL MATCH
(pA)<-[:HAS_CREATOR]-(m:Message)<-[r:REPLY_OF]-(c:Comment)-[:HAS_CREATOR]->(pB),
(m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
}
I removed the OPTIONAL MATCHes from the subquery calls (this was not possible using the linear approach, WITH ... WITH ...):
CALL {
WITH pA, pB
RETURN 0.0 AS w
UNION ALL
WITH pA, pB
MATCH
(pA)<-[:HAS_CREATOR]-(c:Comment)-[r:REPLY_OF]->(m:Message)-[:HAS_CREATOR]->(pB),
(m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
UNION ALL
WITH pA, pB
MATCH
(pA)<-[:HAS_CREATOR]-(m:Message)<-[r:REPLY_OF]-(c:Comment)-[:HAS_CREATOR]->(pB),
(m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
}
I changed to Neo4j Enterprise.
I tried using the parallel runtime with CYPHER runtime=parallel MATCH ... CALL ... but it is not supported:
Parallel does not yet support the plans including ProcedureCall, use another runtime.
Using the slotted and pipelined runtimes worked but did not produce better performance.
Started bumping the Neo4j version incrementally. Note that using 4.4.4+ does not work on my laptop (Apple Silicon / M1), as reported in the Interactive issues: https://github.com/ldbc/ldbc_snb_interactive_impls/issues/298
Another strange thing is that the GDS library changes its major version midway when upgrading the patch version of the DBMS:
export NEO4J_VERSION=4.4.3 -> RETURN gds.version() results in 1.8.3
export NEO4J_VERSION=4.4.4 -> RETURN gds.version() results in 2.0.0
The new GDS version also results in new method signatures, including the one for Dijkstra's algorithm.
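Both projection procedure names show up in this thread: gds.graph.create.cypher in the error message from the older setup and gds.graph.project.cypher in the query used after the upgrade. A driver script could pick the name from the output of gds.version(). This is a minimal sketch; the helper name is hypothetical and it assumes the rename coincides with the 2.0 major version bump.

```python
def cypher_projection_procedure(gds_version: str) -> str:
    """Pick the Cypher projection procedure name for a GDS version string.

    Assumption: GDS 1.x uses gds.graph.create.cypher and
    GDS 2.x and later use gds.graph.project.cypher.
    """
    major = int(gds_version.split(".")[0])
    if major < 2:
        return "gds.graph.create.cypher"
    return "gds.graph.project.cypher"


print(cypher_projection_procedure("1.8.3"))
print(cypher_projection_procedure("2.0.0"))
```

The other signature changes (for example the one for Dijkstra's algorithm) would need similar handling, which is not covered here.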
Bumped Neo4j's version to the latest, 4.4.11:
export SF=1
. scripts/use-datagen-data-set.sh
export NEO4J_VERSION=4.4.11-enterprise
export NEO4J_ENV_VARS="--env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes"
export NEO4J_ENV_VARS="${NEO4J_ENV_VARS-} --env NEO4J_dbms_memory_pagecache_size=20G --env NEO4J_dbms_memory_heap_max__size=20G"
scripts/load-in-one-step.sh
Tried compacting the weight computation into a single query:
CALL gds.graph.project.cypher(
'q15',
'MATCH (p:Person) RETURN id(p) AS id',
'MATCH (pA:Person)-[knows:KNOWS]-(pB:Person)
OPTIONAL MATCH (pA)<-[:HAS_CREATOR]-(m1:Message)-[r:REPLY_OF]-(m2:Message)-[:HAS_CREATOR]->(pB)
OPTIONAL MATCH (m1)-[:REPLY_OF*0..]->(p1:Post)<-[:CONTAINER_OF]-(forum1:Forum)
WHERE forum1.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum1.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
OPTIONAL MATCH (m2)-[:REPLY_OF*0..]->(p2:Post)<-[:CONTAINER_OF]-(forum2:Forum)
WHERE forum2.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum2.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
WITH pA, pB, 0.0
+ sum(CASE forum1 IS NOT NULL WHEN true THEN 0.5 ELSE 0.0 END)
+ sum(CASE forum2 IS NOT NULL WHEN true THEN 0.5 ELSE 0.0 END)
+ sum(CASE m1 = p1 WHEN true THEN 0.5 ELSE 0.0 END)
+ sum(CASE m2 = p2 WHEN true THEN 0.5 ELSE 0.0 END) AS w
RETURN
id(pA) AS source,
id(pB) AS target,
1/(w+1) AS weight'
)
This made the computation much faster, around 15 seconds (down from 30+ minutes).
However, validating against Umbra reveals a lot of mismatches. The two listings below cover the same 11 parameter bindings but most of the results differ:
15|15a|<10137;17592186053238;2012-11-05;2012-11-12>|[<4.0>]
15|15a|<13194139540988;32985348838424;2012-11-06;2012-11-09>|[]
15|15a|<19791209304199;4398046519404;2012-11-02;2012-11-12>|[<3.8>]
15|15b|<17592186051639;28587302330559;2010-04-13;2010-05-05>|[<1.1917211328976034>]
15|15b|<8796093031752;24189255815107;2010-04-19;2010-05-11>|[<1.220699850169947>]
15|15b|<2199023257011;30786325587660;2010-04-19;2010-05-11>|[<1.6312030075187969>]
15|15b|<2199023258884;26388279067193;2010-04-20;2010-05-12>|[<1.1588235294117646>]
15|15b|<10149;15393162796338;2010-04-19;2010-05-11>|[<1.263101604278075>]
15|15b|<17592186044545;35184372098116;2010-04-20;2010-05-12>|[<2.0>]
15|15b|<2199023265630;24189255813332;2010-04-16;2010-05-08>|[<1.1440993788819875>]
15|15b|<6597069769862;28587302325015;2010-04-21;2010-05-13>|[<2.0>]
15|15a|<10137;17592186053238;2012-11-05;2012-11-12>|[<4.0>]
15|15a|<13194139540988;32985348838424;2012-11-06;2012-11-09>|[<-1.0>]
15|15a|<19791209304199;4398046519404;2012-11-02;2012-11-12>|[<4.0>]
15|15b|<17592186051639;28587302330559;2010-04-13;2010-05-05>|[<1.27777777777778>]
15|15b|<8796093031752;24189255815107;2010-04-19;2010-05-11>|[<1.32123015873016>]
15|15b|<2199023257011;30786325587660;2010-04-19;2010-05-11>|[<1.82955465587045>]
15|15b|<2199023258884;26388279067193;2010-04-20;2010-05-12>|[<1.22981366459627>]
15|15b|<10149;15393162796338;2010-04-19;2010-05-11>|[<1.37619047619048>]
15|15b|<17592186044545;35184372098116;2010-04-20;2010-05-12>|[<2.0>]
15|15b|<2199023265630;24189255813332;2010-04-16;2010-05-08>|[<1.20833333333333>]
15|15b|<6597069769862;28587302325015;2010-04-21;2010-05-13>|[<2.0>]
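To spot the differing rows quickly, the two listings can be diffed by parameter binding. The sketch below is a hypothetical helper, not part of the benchmark's validation scripts; it uses the three 15a rows from each listing above as sample input. It treats an empty result and -1.0 as different, since the thread does not say whether they should count as equivalent.

```python
import math


def parse_rows(text):
    """Map (variant, params) to the result value (None for an empty result)."""
    results = {}
    for line in text.strip().splitlines():
        _query, variant, params, result = line.strip().split("|")
        value = result.strip("[]<>")
        results[(variant, params)] = float(value) if value else None
    return results


def mismatches(first, second, rel_tol=1e-9):
    """Return the keys whose results differ between the two listings."""
    diffs = []
    for key, a in first.items():
        b = second.get(key)
        if a is None or b is None:
            if a != b:
                diffs.append(key)
        elif not math.isclose(a, b, rel_tol=rel_tol):
            diffs.append(key)
    return diffs


first_listing = """
15|15a|<10137;17592186053238;2012-11-05;2012-11-12>|[<4.0>]
15|15a|<13194139540988;32985348838424;2012-11-06;2012-11-09>|[]
15|15a|<19791209304199;4398046519404;2012-11-02;2012-11-12>|[<3.8>]
"""
second_listing = """
15|15a|<10137;17592186053238;2012-11-05;2012-11-12>|[<4.0>]
15|15a|<13194139540988;32985348838424;2012-11-06;2012-11-09>|[<-1.0>]
15|15a|<19791209304199;4398046519404;2012-11-02;2012-11-12>|[<4.0>]
"""

diffs = mismatches(parse_rows(first_listing), parse_rows(second_listing))
for variant, params in diffs:
    print(variant, params)
```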
Simplified the weight computation query to the following:
MATCH (pA:Person)-[knows:KNOWS]-(pB:Person)
OPTIONAL MATCH (pA)<-[:HAS_CREATOR]-(m1:Message)-[r:REPLY_OF]-(m2:Message)-[:HAS_CREATOR]->(pB)
OPTIONAL MATCH (m1)-[:REPLY_OF*0..]->(:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis + '})
WITH pA, pB,
sum(CASE forum IS NOT NULL
WHEN true THEN
CASE (m1:Post OR m2:Post) WHEN true THEN 1.0
ELSE 0.5 END
ELSE 0.0 END
) AS w
RETURN
id(pA) AS source,
id(pB) AS target,
1/(w+1) AS weight
The key idea is that both Messages from (m1)-[r:REPLY_OF]-(m2) can be used to find the root Post along a sequence of :REPLY_OF*0.. edges. Note that the latter needs to be in its own OPTIONAL MATCH clause as repeated edges are not allowed in the same MATCH clause and these edges can potentially include r.
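The scoring in the simplified query can be mirrored in a few lines of Python to sanity-check individual pairs by hand. This is a minimal sketch of the CASE logic and the 1/(w+1) transformation only; the function names and the sample rows are made up for illustration and do not come from the data set.

```python
def interaction_score(rows):
    """Sum the per-row score for one (pA, pB) pair.

    Each row is (forum_in_range, m1_is_post, m2_is_post), mirroring one
    result row of the OPTIONAL MATCH clauses in the query above.
    """
    w = 0.0
    for forum_in_range, m1_is_post, m2_is_post in rows:
        if not forum_in_range:
            continue  # corresponds to the ELSE 0.0 branch
        w += 1.0 if (m1_is_post or m2_is_post) else 0.5
    return w


def edge_weight(w):
    """Turn the interaction score into the edge weight used for Dijkstra."""
    return 1 / (w + 1)


# Hypothetical rows: a reply to a Post, a reply to a Comment,
# and an interaction whose Forum is outside the date range.
sample_rows = [(True, True, False), (True, False, False), (False, False, False)]
w = interaction_score(sample_rows)
print(w, edge_weight(w))
```

A pair with no qualifying interactions gets w = 0 and therefore weight 1.0, so more interaction always means a cheaper edge.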
The new query fixes incorrect results and is still fast (it takes approx. 20 seconds).
Validation now passes against Umbra on SF1.
Commit b7b90631f718e19a1f5876aaa7ae8a7a483fd3aa made the Q15 Neo4j/Cypher implementation slow - it should be rewritten to become faster.