Optimize Q15 Cypher query

szarnyasg commented 2 years ago

Commit b7b90631f718e19a1f5876aaa7ae8a7a483fd3aa made the Q15 Neo4j/Cypher implementation slow - it should be rewritten to become faster.

szarnyasg commented 2 years ago

Tried optimizing this in PR #56 but failed - the proposed change made no difference.

szarnyasg commented 2 years ago

I tried a couple of optimizations. For the experiments, I was running the query on the SF1 data set on an r6id.2xlarge (64GiB RAM) EC2 instance.

Rewrote the query to only use two clauses: https://github.com/ldbc/ldbc_snb_bi/commit/4b392f55af6a1a2a85e6ef4df8caa59b12c7fb0b

// case 1
OPTIONAL MATCH
  (pA)<-[:HAS_CREATOR]-(c:Comment)-[r:REPLY_OF]->(m:Message)-[:HAS_CREATOR]->(pB),
  (m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
  AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
WITH knows, pA, pB, 0 + count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w

// case 2
OPTIONAL MATCH
  (pA)<-[:HAS_CREATOR]-(m:Message)<-[r:REPLY_OF]-(c:Comment)-[:HAS_CREATOR]->(pB),
  (m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
  AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
WITH knows, pA, pB, w + count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w

When running on SF1, the CSR creation + cheapest path algorithm (Dijkstra) returned the following after ~35 minutes:

Failed to invoke procedure gds.graph.create.cypher: Caused by: java.lang.IllegalArgumentException: Expected positive long value, got -8589934576

I'm pretty sure my query never produced a negative number, so I dug into the log (in scratch/logs/debug.log). It seems some sort of a heap allocation, i.e. out-of-memory problem:

2022-09-13 07:09:34.860+0000 INFO  [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes 98%
2022-09-13 07:09:34.861+0000 INFO  [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes 99%
2022-09-13 07:09:34.861+0000 INFO  [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes 100%
2022-09-13 07:09:34.861+0000 INFO  [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Nodes :: Finished
2022-09-13 07:09:34.868+0000 INFO  [o.n.k.a.p.GlobalProcedures] [neo4j.BoltWorker-2 [bolt] [/86.59.196.44:50688] ] Loading :: Relationships :: Start
2022-09-13 07:09:58.164+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=517, gcTime=560, gcCount=1}
2022-09-13 07:13:48.202+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=636, gcTime=688, gcCount=1}
2022-09-13 07:19:43.113+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=480, gcTime=0, gcCount=0}
2022-09-13 07:27:52.384+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=419, gcTime=0, gcCount=0}
2022-09-13 07:47:23.696+0000 WARN  [o.n.k.a.p.GlobalProcedures] Graph creation failed
java.lang.IllegalArgumentException: Expected positive long value, got -8589934576
        at org.neo4j.util.Preconditions.requirePositive(Preconditions.java:48) ~[neo4j-common-4.4.2.jar:4.4.2]
        at org.neo4j.memory.LocalMemoryTracker.allocateHeap(LocalMemoryTracker.java:145) ~[neo4j-common-4.4.2.jar:4.4.2]
        at org.neo4j.cypher.internal.runtime.SingleThreadedResourcePool.ensureCapacity(ResourceManager.scala:223) ~[neo4j-cypher-runtime-util-4.4.2.jar:4.4.2]
        at org.neo4j.cypher.internal.runtime.SingleThreadedResourcePool.add(ResourceManager.scala:144) ~[neo4j-cypher-runtime-util-4.4.2.jar:4.4.2]
        at org.neo4j.cypher.internal.runtime.ResourceManager.trace(ResourceManager.scala:44) ~[neo4j-cypher-runtime-util-4.4.2.jar:4.4.2]

I increased the pagecache and heap memory settings to 20G:

--env NEO4J_dbms_memory_pagecache_size=20G --env NEO4J_dbms_memory_heap_max__size=20G

This fixed the problem.

I realized Forum's don't have an index on their creationDate. Having these would be beneficial, so I added an index with this commit: https://github.com/ldbc/ldbc_snb_bi/commit/06e8c7c8d8b88730616119440089ba79b68e990b

I rewrote the query to use subqueries:

CALL {
  WITH pA, pB
  OPTIONAL MATCH
    (pA)<-[:HAS_CREATOR]-(c:Comment)-[r:REPLY_OF]->(m:Message)-[:HAS_CREATOR]->(pB),
    (m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
  WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
    AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
  RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
UNION ALL
  WITH pA, pB
  OPTIONAL MATCH
    (pA)<-[:HAS_CREATOR]-(m:Message)<-[r:REPLY_OF]-(c:Comment)-[:HAS_CREATOR]->(pB),
    (m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
  WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
    AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
  RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
}

I removed the OPTIONAL MATCHes from the subquery calls (this was not possible using the linear approach (WITH ... WITH ...):

CALL {
  WITH pA, pB
  RETURN 0.0 AS w
UNION ALL
  WITH pA, pB
  MATCH
    (pA)<-[:HAS_CREATOR]-(c:Comment)-[r:REPLY_OF]->(m:Message)-[:HAS_CREATOR]->(pB),
    (m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
  WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
    AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
  RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
UNION ALL
  WITH pA, pB
  MATCH
    (pA)<-[:HAS_CREATOR]-(m:Message)<-[r:REPLY_OF]-(c:Comment)-[:HAS_CREATOR]->(pB),
    (m)-[:REPLY_OF*0..]->(p:Post)<-[:CONTAINER_OF]-(forum:Forum)
  WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
    AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
  RETURN count(r) * 0.5+sum(CASE m=p WHEN true THEN 0.5 ELSE 0 END) AS w
}

I changed to Neo4j Enterprise.

I tried using the parallel runtime with CYPHER runtime=parallel MATCH ... CALL ... but it is not supported:

Parallel does not yet support the plans including ProcedureCall, use another runtime.

Using the slotted and pipelined runtimes worked but did not produce better performance.
Started bumping the Neo4j version incrementally. Note that using 4.4.4+ does not work on my laptop (Apple Silicon / M1), as reported in the Interactive issues: https://github.com/ldbc/ldbc_snb_interactive_impls/issues/298

Another strange thing is that the GDS library changes its major version mid-way when upgrading patch version of the DBMS:
- export NEO4J_VERSION=4.4.3 -> RETURN gds.version() results in 1.8.3
- export NEO4J_VERSION=4.4.4 -> RETURN gds.version() results in 2.0.0
The new GDS version also results in new method signatures, including the one for Dijkstra's algorithm.

Bumped Neo4j's version to the latest, 4.4.11:

export SF=1
. scripts/use-datagen-data-set.sh
export NEO4J_VERSION=4.4.11-enterprise
export NEO4J_ENV_VARS="--env=NEO4J_ACCEPT_LICENSE_AGREEMENT=yes"
export NEO4J_ENV_VARS="${NEO4J_ENV_VARS-} --env NEO4J_dbms_memory_pagecache_size=20G --env NEO4J_dbms_memory_heap_max__size=20G"
scripts/load-in-one-step.sh

Tried compacting the weight computation into a single query:

CALL gds.graph.project.cypher(
    'q15',
    'MATCH (p:Person) RETURN id(p) AS id',
    'MATCH (pA:Person)-[knows:KNOWS]-(pB:Person)

    OPTIONAL MATCH (pA)<-[:HAS_CREATOR]-(m1:Message)-[r:REPLY_OF]-(m2:Message)-[:HAS_CREATOR]->(pB)
    OPTIONAL MATCH (m1)-[:REPLY_OF*0..]->(p1:Post)<-[:CONTAINER_OF]-(forum1:Forum)
             WHERE forum1.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
               AND forum1.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
    OPTIONAL MATCH (m2)-[:REPLY_OF*0..]->(p2:Post)<-[:CONTAINER_OF]-(forum2:Forum)
             WHERE forum2.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
               AND forum2.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
    WITH pA, pB, 0.0
      + sum(CASE forum1 IS NOT NULL WHEN true THEN 0.5 ELSE 0.0 END)
      + sum(CASE forum2 IS NOT NULL WHEN true THEN 0.5 ELSE 0.0 END)
      + sum(CASE m1 = p1            WHEN true THEN 0.5 ELSE 0.0 END)
      + sum(CASE m2 = p2            WHEN true THEN 0.5 ELSE 0.0 END) AS w

    RETURN
       id(pA) AS source,
       id(pB) AS target,
       1/(w+1) AS weight'
)

This made the computation much faster, around 15 seconds (down from 30+ minutes).

szarnyasg commented 2 years ago

There are a whole lot of errors when validating against Umbra...:

15|15a|<10137;17592186053238;2012-11-05;2012-11-12>|[<4.0>]
15|15a|<13194139540988;32985348838424;2012-11-06;2012-11-09>|[]
15|15a|<19791209304199;4398046519404;2012-11-02;2012-11-12>|[<3.8>]
15|15b|<17592186051639;28587302330559;2010-04-13;2010-05-05>|[<1.1917211328976034>]
15|15b|<8796093031752;24189255815107;2010-04-19;2010-05-11>|[<1.220699850169947>]
15|15b|<2199023257011;30786325587660;2010-04-19;2010-05-11>|[<1.6312030075187969>]
15|15b|<2199023258884;26388279067193;2010-04-20;2010-05-12>|[<1.1588235294117646>]
15|15b|<10149;15393162796338;2010-04-19;2010-05-11>|[<1.263101604278075>]
15|15b|<17592186044545;35184372098116;2010-04-20;2010-05-12>|[<2.0>]
15|15b|<2199023265630;24189255813332;2010-04-16;2010-05-08>|[<1.1440993788819875>]
15|15b|<6597069769862;28587302325015;2010-04-21;2010-05-13>|[<2.0>]

15|15a|<10137;17592186053238;2012-11-05;2012-11-12>|[<4.0>]
15|15a|<13194139540988;32985348838424;2012-11-06;2012-11-09>|[<-1.0>]
15|15a|<19791209304199;4398046519404;2012-11-02;2012-11-12>|[<4.0>]
15|15b|<17592186051639;28587302330559;2010-04-13;2010-05-05>|[<1.27777777777778>]
15|15b|<8796093031752;24189255815107;2010-04-19;2010-05-11>|[<1.32123015873016>]
15|15b|<2199023257011;30786325587660;2010-04-19;2010-05-11>|[<1.82955465587045>]
15|15b|<2199023258884;26388279067193;2010-04-20;2010-05-12>|[<1.22981366459627>]
15|15b|<10149;15393162796338;2010-04-19;2010-05-11>|[<1.37619047619048>]
15|15b|<17592186044545;35184372098116;2010-04-20;2010-05-12>|[<2.0>]
15|15b|<2199023265630;24189255813332;2010-04-16;2010-05-08>|[<1.20833333333333>]
15|15b|<6597069769862;28587302325015;2010-04-21;2010-05-13>|[<2.0>]

szarnyasg commented 2 years ago

Simplified the weight computation query to the following:

MATCH (pA:Person)-[knows:KNOWS]-(pB:Person)
OPTIONAL MATCH (pA)<-[:HAS_CREATOR]-(m1:Message)-[r:REPLY_OF]-(m2:Message)-[:HAS_CREATOR]->(pB)
OPTIONAL MATCH (m1)-[:REPLY_OF*0..]->(:Post)<-[:CONTAINER_OF]-(forum:Forum)
        WHERE forum.creationDate >= datetime({epochmillis: ' + $startDate.epochMillis + '})
          AND forum.creationDate <= datetime({epochmillis: ' + $endDate.epochMillis   + '})
WITH pA, pB,
    sum(CASE forum IS NOT NULL 
        WHEN true THEN
            CASE (m1:Post OR m2:Post) WHEN true THEN 1.0
            ELSE 0.5 END
        ELSE 0.0 END
    ) AS w
RETURN
   id(pA) AS source,
   id(pB) AS target,
   1/(w+1) AS weight

The key idea is that both Messages from (m1)-[r:REPLY_OF]-(m2) can be used to find the root Post along a sequence of :REPLY_OF*0 edges. Note that the latter needs to be in its own OPTIONAL MATCH clause as repeated edges are not allowed in the same MATCH clause and these edges can potentially include r.

The new query fixes incorrect results and is still fast (it takes approx. 20 seconds).

szarnyasg commented 2 years ago

Validation now passes against Umbra on SF1.

ldbc / ldbc_snb_bi

Optimize Q15 Cypher query #53