How to calculate jaccard similarity score when multi-relationships

neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.

https://neo4j.com/docs/graph-data-science/current/

How to calculate jaccard similarity score when multi-relationships #284

Closed lin17182210 closed 9 months ago

lin17182210 commented 9 months ago

In the GDS docs:

This is my GDS graph projection. I then call nodeSimilarity and get this:

call gds.nodeSimilarity.stream('test', {topK:5})
yield node1, node2, similarity
return node1, node2, similarity
order by similarity, node1, node2;

The following two pictures are the query results for 12271 and 29419:

They share only the content in the red box, so why is the final similarity 0.2? I also tested with only one relationship used when creating the projection, and in that case the result is consistent with the formula in the GDS documentation. Thank you in advance for any help.

IoannisPanagiotas commented 9 months ago

Hi @lin17182210,

Based on the data you are sharing, the results seem normal to me. The first node has a degree of 4, so |A| = 4. The second node has a degree of 2, so |B| = 2, and |A ∩ B| = 1 (the node in the red box).

So Jaccard(A, B) = 1 / (4 + 2 - 1) = 1/5 = 0.2, as shown in the result.
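The arithmetic above can be reproduced with a minimal Python sketch. The neighbor ids below are hypothetical placeholders; only the set sizes (4 and 2) and the single shared neighbor come from the thread. This is plain set arithmetic for illustration, not the GDS implementation.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|, with |A ∪ B| = |A| + |B| - |A ∩ B|."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical neighbor ids: the first node has 4 neighbors, the second has 2,
# and exactly one neighbor ("shared") is common to both.
neighbors_a = {"n1", "n2", "n3", "shared"}
neighbors_b = {"n4", "shared"}

score = jaccard(neighbors_a, neighbors_b)
print(score)  # 1 / (4 + 2 - 1) = 0.2
```

Because the sets hold neighbor nodes only, the relationship type that connects a node to each neighbor plays no role in this calculation.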

Note that Node Similarity only looks at the neighbors of a node, not at the relationship types.

Can you please let us know how you were expecting the algorithm to behave?

Best regards, Ioannis.

lin17182210 commented 9 months ago

Hi @IoannisPanagiotas, thank you very much, I understand now. I had misunderstood the calculation, which is why the result did not match what I expected. I am going to use node similarity to supplement and expand our search result set. Until now, our expansion method relied entirely on query statements. Since I have not worked with Neo4j for very long, do you have any other suggestions for search recommendations? Thanks again.

IoannisPanagiotas commented 9 months ago

Hello again @lin17182210,

Nice, happy to see nothing is wrong after all :)

If you want exact similarity search based on common neighborhoods between nodes, I cannot think of anything besides gds.nodeSimilarity.

If you can settle for non-exact results, one potential idea is generating embeddings and using these embeddings as the input node property for kNN.

But this approach does not guarantee that the top-k pairs you find are indeed the closest in similarity, or that the computed similarities are 100% accurate.

As a compromise, though, it might be faster than nodeSimilarity and, quality-wise, hopefully not that bad.
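To make the embedding idea concrete, here is a toy Python sketch of comparing nodes by vector properties instead of by neighbor sets. The node names and 2-dimensional vectors are made up for illustration. In practice the vectors would come from an embedding algorithm, and the helper functions `cosine` and `top_k` are hypothetical, not part of any GDS API. The sketch does a brute-force comparison over three nodes, so it only shows the input and output shape of such a search, not how GDS kNN works internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_k(embeddings, node, k):
    """Return the k nodes most similar to `node`, as (name, similarity) pairs."""
    scores = [
        (other, cosine(embeddings[node], vec))
        for other, vec in embeddings.items()
        if other != node
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up embeddings: "a" and "b" point in nearly the same direction, "c" does not.
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}

nearest = top_k(embeddings, "a", k=1)
print(nearest)
```

The similarity here reflects how close the vectors are, which only approximates neighborhood overlap to the extent that the embeddings capture it.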

Best regards and good luck in your project, Ioannis.