Jaccard Similarity doesn't work with concurrency

JorenVdV commented 5 years ago

Problem When running the Jaccard similarity algorithm over a list of node and category entries, all similarities are 0 unless concurrency is explicitly set to 1.

Environment Docker image running Neo4j 3.5.3 and graph algorithms 3.5.3.3. Memory is limited to 16G; CPUs are unbound (192 CPUs in the machine, shared with other processes).

Setup

MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)

MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)

MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)

MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)

MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)

Queries

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data, {concurrency:1, similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╕
│"nodes"│"similarityPairs"│"min"              │"max"             │"mean"             │"p25"             │"p50"             │"p75"              │"p90"             │"p95"             │
╞═══════╪═════════════════╪═══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╡
│5      │7                │0.19999980926513672│0.6666669845581055│0.37380967821393696│0.2500009536743164│0.2500009536743164│0.33333301544189453│0.6666669845581055│0.6666669845581055│
└───────┴─────────────────┴───────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┘

removing the concurrency limit

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data,  {similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═════╤═════╤══════╤═════╤═════╤═════╤═════╤═════╕
│"nodes"│"similarityPairs"│"min"│"max"│"mean"│"p25"│"p50"│"p75"│"p90"│"p95"│
╞═══════╪═════════════════╪═════╪═════╪══════╪═════╪═════╪═════╪═════╪═════╡
│5      │0                │0.0  │0.0  │0.0   │0.0  │0.0  │0.0  │0.0  │0.0  │
└───────┴─────────────────┴─────┴─────┴──────┴─────┴─────┴─────┴─────┴─────┘

Setting concurrency to any value other than 1 produces the latter result. The same behaviour occurs in our Jaccard computation over 300k nodes.
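
As a sanity check on which of the two outputs is correct, the expected statistics for the sample graph can be recomputed outside Neo4j. The following is a minimal Python sketch, not the library's implementation: the `likes` dictionary is transcribed by hand from the Setup statements, the helper names `jaccard` and `similarity_pairs` are made up for illustration, and it assumes the cutoff keeps pairs whose score is greater than or equal to `similarityCutoff` (no pair in this data scores exactly 0.1, so the choice does not affect the result).

```python
from itertools import combinations
from statistics import mean

# Sample data transcribed from the Setup section (person -> liked cuisines).
likes = {
    "Praveena": {"Indian", "Portuguese"},
    "Zhen": {"French", "Indian"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_pairs(data, cutoff):
    """Return {(name1, name2): score} for unordered pairs with score >= cutoff.

    Assumption: the cutoff is inclusive; this is a guess, not confirmed
    behaviour of algo.similarity.jaccard.
    """
    result = {}
    for p, q in combinations(data, 2):
        score = jaccard(data[p], data[q])
        if score >= cutoff:
            result[(p, q)] = score
    return result


pairs = similarity_pairs(likes, cutoff=0.1)
scores = sorted(pairs.values())
print("pairs:", len(pairs))
print("min:", scores[0], "max:", scores[-1], "mean:", mean(scores))
```

This yields 7 pairs with a minimum of 0.2, a maximum of 2/3 and a mean of about 0.374, which agrees with the `concurrency:1` output up to floating-point precision. The all-zero result is therefore the incorrect one.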

mneedham commented 5 years ago

Hey,

I'll take a look at it. I've seen this happen sporadically, but haven't been able to figure out exactly why, since, annoyingly, it doesn't happen every time.

For example, I just tested this on a Docker image and got the same results with concurrency 1 and concurrency > 1.

Cheers, Mark

d-kilc commented 4 years ago

Any resolution on this? Still not able to use > 1 core with algo.similarity.jaccard. I'm running 3.5.8 EE.

tomasonjo commented 4 years ago

Please check https://github.com/neo4j/graph-data-science: it has improved graph algorithms and is the successor to the graph algorithms library.

neo4j-contrib / neo4j-graph-algorithms, issue #894