neo4j-contrib / neo4j-graph-algorithms

Efficient Graph Algorithms for Neo4j
https://github.com/neo4j/graph-data-science/
GNU General Public License v3.0

Pearson similarity results out of range #774

Closed: tomasonjo closed this issue 5 years ago

tomasonjo commented 5 years ago

When running algo.similarity.pearson.stream, I get similarity values outside the valid range of -1 to 1. I am running Neo4j 3.5.

Create sample graph:

MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (praveena)-[:LIKES {score: 9}]->(indian)
MERGE (praveena)-[:LIKES {score: 7}]->(portuguese)

MERGE (zhen)-[:LIKES {score: 10}]->(french)
MERGE (zhen)-[:LIKES {score: 6}]->(indian)

MERGE (michael)-[:LIKES {score: 8}]->(french)
MERGE (michael)-[:LIKES {score: 7}]->(italian)
MERGE (michael)-[:LIKES {score: 9}]->(indian)

MERGE (arya)-[:LIKES {score: 10}]->(lebanese)
MERGE (arya)-[:LIKES {score: 10}]->(italian)
MERGE (arya)-[:LIKES {score: 7}]->(portuguese)

MERGE (karin)-[:LIKES {score: 9}]->(lebanese)
MERGE (karin)-[:LIKES {score: 7}]->(italian)

Run the algorithm:

MATCH (p:Person), (c:Cuisine)
OPTIONAL MATCH (p)-[likes:LIKES]->(c)
WITH {item:id(p), weights: collect(coalesce(likes.score, 0))} as userData
WITH collect(userData) as data
CALL algo.similarity.pearson.stream(data)
YIELD item1, item2, count1, count2, similarity
RETURN algo.getNodeById(item1).name AS from, algo.getNodeById(item2).name AS to, similarity
ORDER BY similarity DESC
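For readers without a Neo4j instance at hand, here is a rough Python sketch of what the first two lines of the query are meant to build: one weight vector per person, with a 0 for every cuisine that person has not rated. The cuisine order (French, Italian, Indian, Lebanese, Portuguese) is an assumption taken from the creation order above; the query itself has no ORDER BY, so Neo4j is not guaranteed to return that order. The names `likes`, `cuisines` and `weight_vectors` are made up for this illustration.

```python
# Ratings copied from the sample graph above: person -> {cuisine: score}.
likes = {
    "Zhen": {"French": 10, "Indian": 6},
    "Praveena": {"Indian": 9, "Portuguese": 7},
    "Michael": {"French": 8, "Italian": 7, "Indian": 9},
    "Arya": {"Lebanese": 10, "Italian": 10, "Portuguese": 7},
    "Karin": {"Lebanese": 9, "Italian": 7},
}

# Assumed cuisine order (creation order in the sample graph).
cuisines = ["French", "Italian", "Indian", "Lebanese", "Portuguese"]


def weight_vectors(likes, cuisines):
    """Mimic collect(coalesce(likes.score, 0)): missing ratings become 0."""
    return {
        person: [scores.get(cuisine, 0) for cuisine in cuisines]
        for person, scores in likes.items()
    }


vectors = weight_vectors(likes, cuisines)
print(vectors["Zhen"])  # [10, 0, 6, 0, 0]
print(vectors["Arya"])  # [0, 10, 0, 10, 7]
```

Under that assumed order, the vectors for Zhen and Arya match the two lists passed to algo.similarity.pearson further down.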

Returns results:

from        to          similarity
"Zhen"      "Arya"      95.31377377744579
"Michael"   "Arya"      68.20606579452823
"Arya"      "Karin"     64.31334147339001
"Zhen"      "Michael"   55.139295305548224
"Praveena"  "Karin"     51.19999999999999
"Zhen"      "Karin"     49.35545314063058
"Praveena"  "Arya"      42.800450683143545
"Michael"   "Karin"     27.800000000000008
"Praveena"  "Michael"   4.199999999999998
"Zhen"      "Praveena"  2.699126343628236

The similarity for Zhen and Arya should be -0.9235830792388159, as calculated by:

RETURN algo.similarity.pearson([10, 0, 6, 0, 0],[0, 10, 0, 10, 7])
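As an independent cross-check that does not involve the library at all, the same two vectors can be run through the textbook Pearson formula in plain Python. This is only a minimal sketch of the standard definition (covariance divided by the product of the standard deviations), not the library's implementation; the `pearson` helper is written for this illustration.

```python
import math


def pearson(xs, ys):
    """Textbook Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


zhen = [10, 0, 6, 0, 0]
arya = [0, 10, 0, 10, 7]

r = pearson(zhen, arya)
print(r)  # approximately -0.92358

# A correct Pearson coefficient always lies in [-1, 1].
assert -1.0 <= r <= 1.0
```

The result agrees with the value returned by the function call above, and it stays within -1 to 1, unlike the values returned by the stream procedure.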

mneedham commented 5 years ago

This looks similar to something I noticed while testing, and I think it'll be fixed when this PR is merged - https://github.com/neo4j-contrib/neo4j-graph-algorithms/pull/773/