neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/
Other
596 stars 157 forks source link

Incorrect triangle counts for gds.triangleCount.stream (counting non-existing triangles) #281

Closed joyemang33 closed 10 months ago

joyemang33 commented 10 months ago

Neo4j Version: 5.10 Operating System: Ubuntu 20.04 API: Docker

Hello! I apologize for reaching out again. We're currently working on an automated tool for testing graph databases and libraries, and we might have a number of bug reports coming in these days. I'll do my best to provide clear explanations and minimize the scale of test cases that trigger bugs.

Steps to reproduce

Run the following Python code:

import pandas as pd

if __name__ == "__main__":
    nodes = pd.read_csv('G1_node.csv')
    edges = pd.read_csv('G1_edge.csv')
    G = gds.graph.construct(
        graph_name="test",      
        nodes=nodes,           
        relationships=edges,
        undirected_relationship_types= ["T0", "T1", "T2", "T3", "T4"]  
    )
    print(gds.triangleCount.stream(G, concurrency = 1))

Result 1:

       nodeId     triangleCount
0        0              0
1        1              0
2        2              0
3        3              0
4        5              0
5        6              0
6        7              0
7        8              0
8        9              0
9       10              1
10      11              0
11      12              0
12      13              1
13      14              0
14      15              0
15      17              0
16      18              0
17      19              0
18      20              0
19      21              0
20      22              0
21      23              0
22      24              0
23      25              1
24      26              0
25      28              0
26      29              0

The result contains three nodes with non-zero triangle counts (i.e., node 10, node 13, node 25), which indicate that the three nodes form a triangle in the graph. However, in the graph data, there are only edges (10, 13), (10, 25), but not (13, 25), which is contradicted by the result. I also tried to set the different concurrency to verify whether is an issue with concurrency, but they return the invariant results. Therefore, I believe that it may be related to a potential bug in the basic implementation of the gds.triangleCount.stream. Could you further investigate and confirm it? It would be highly appreciated!

Here is the graph data: G1_node.csv G1_edge.csv

Best regards, Joye

IoannisPanagiotas commented 10 months ago

Hi @joyemang33,

Thank you for bringing this issue to our attention. We are going to look into the problem here as well as for #280 and come back when we know more about what is causing these issues.

Best regards, Ioannis.

joyemang33 commented 10 months ago

Hi @joyemang33,

Thank you for bringing this issue to our attention. We are going to look into the problem here as well as for #280 and come back when we know more about what is causing these issues.

Best regards, Ioannis.

Thanks for your so quick responses. Please feel free to tell me if you need any additional related to these issues.

IoannisPanagiotas commented 10 months ago

Hello again @joyemang33,

After some examination, we believe that we have identified the main cause of the problem.

We are currently working on a fix that should be in version 2.4.5. I do not yet know when it will be scheduled for release, but will try to get it out as soon as possible!

We also believe the same underlying problem is responsible for the behaviour seen in #280 and it isn't due to parallelism issues: now we seem to get stable performance all the times and find 10 triangles in total.

We hope that this will fix your issues :)

I will let you know when 2.4.5 is out

In the mean-time, do not hesitate to contact us again in case you have any further questions and thank you very much for using our library as well as providing these detailed small examples.

Best, Ioannis.

joyemang33 commented 10 months ago

Hi @IoannisPanagiotas! Many thanks for your kind help in investigating these issues! It's great to know that you've identified the underlying causes.

Another piece of good news is that we have also applied the same strategy to test k-core decomposition and SCC. Fortunately, we didn't come across any bug-triggering test cases for these algorithms in GDS. Cheers!

BTW: Do you think the bug reports are valuable and should we keep reporting them?

Best regards, Joye

IoannisPanagiotas commented 10 months ago

@joyemang33

Thank you very much, that is great to hear.

Of course, if you do find other issues with gds, feel free to contact us again in the future.

Best regards, Ioannis

joyemang33 commented 10 months ago

Hi @IoannisPanagiotas! Looks like you have fixed the issue now, Cheers!

Would it be possible for you to share something about the root cause of the issue?

Best regards, Joye

IoannisPanagiotas commented 10 months ago

Hello again @joyemang33,

Yes, we have merged a fix. It is actually possible to build the repository and use the fix immediately (see instructions here). Otherwise, it will be available with 2.4.5 as said.

Sure, I'll try my best !

The standard algorithms for Triangle Count find unique triangle triplets (a,b,c) such that a<b<c and require that adjacency lists are sorted by neighbour id to speed-up these computations.

Our library by default sorts per neighborhood id, but only for each distinct relationship type (e.g., T0, T1 etc). This will work fine when triangle counting is run with a single rel. type.

Now, when multiple rel. types appear (like in your example), our triangle counting implementation relies on a priority queue to coordinate all the sorted adjacency lists (e.g., pick the minimum neighbour from T0 and T1 etc ) to get a single sorted adjacency list.

Now this priority queue was passed from one vertex to another to avoid recreating objects all the time, but it was not cleared properly, meaning that it could contain irrelevant information.

I hope that this makes things more clear, do not hesitate to contact us if you have any more questions :)

Best regards, Ioannis.

joyemang33 commented 10 months ago

Hi @IoannisPanagiotas! Many thanks for your clear explanation and your solution looks good!

Best regards, Joye

IoannisPanagiotas commented 10 months ago

Hi again @joyemang33,

We've finally released 2.4.5 in case you would like to test again :)

Best regards, Ioannis

joyemang33 commented 10 months ago

Hi again @joyemang33,

We've finally released 2.4.5 in case you would like to test again :)

Best regards, Ioannis

Thank you! We will test again ASAP :)

joyemang33 commented 10 months ago

Hi @IoannisPanagiotas. I only see gds.version() 2.4.4 in the latest neo4j docker image. Could you tell me how to install the latest GDS as a plugin of Neo4j Docker? It would be very helpful for our testing.

Thanks for your help!

Best regards, Joye

vnickolov commented 10 months ago

Hi @joyemang33,

Assuming you are using --env NEO4J_PLUGINS=["graph-data-science"] to load the GDS plugin and Neo4j enterprise Docker image, can you give Neo4j community image a try?

joyemang33 commented 10 months ago

Hi @joyemang33,

Assuming you are using --env NEO4J_PLUGINS=["graph-data-science"] to load the GDS plugin and Neo4j enterprise Docker image, can you give Neo4j community image a try?

Many thanks for your kind help! I will try it at community image.

Best regards, Joye

joyemang33 commented 10 months ago

The new version passes all test cases! Cheers!