neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/
Other
596 stars 157 forks source link

Unstable triangle listings returning non-existing triangles #280

Closed joyemang33 closed 10 months ago

joyemang33 commented 10 months ago

Neo4j Version: 5.10 Operating System: Ubuntu 20.04 API: Docker

Hello! When I use Graph Data Science to test the algorithm gds.alpha.triangles, I found that it returns unstable results for the same graph.

Steps to reproduce

Run the following Python code:

import pandas as pd

if __name__ == "__main__":
    nodes = pd.read_csv('G_node.csv')
    edges = pd.read_csv('G_edge.csv')
    G = gds.graph.construct(
        graph_name="test",      
        nodes=nodes,           
        relationships=edges,
        undirected_relationship_types= ["T0", "T1", "T2", "T3", "T4"]  
    )
    print(gds.alpha.triangles(G))

Interestingly, running the code multiple times can yield two inconsistent results:

Result1:

    nodeA  nodeB  nodeC
0       0      1      2
1       1      2      3
2       0      2      7
3       0      2      9
4       0      4      7
5       0      4      8
6       0      4      9
7       4      7      8
8       5      6      7
9       0      7      8
10      5      6      9
11      5      7      8

Result2:

   nodeA  nodeB  nodeC
0      0      1      2
1      1      2      3
2      0      2      7
3      0      2      9
4      0      4      7
5      0      4      8
6      0      4      9
7      4      7      8
8      0      7      8
9      5      7      8

After my investigation, the result1 is obviously wrong since there is no edge between the node 6 and node 7 in the data but (5, 6, 7) is listed in line 8 of the result. I am not sure what the root cause of the strange behavior is, but it may indicate a potential bug in Neo4j's triangle algorithm. Could you further investigate and confirm it? It would be highly appreciated!

Here is the graph data: G_edge.csv G_node.csv

Best regards, Joye

FlorentinD commented 10 months ago

Hello again, thanks for submitting another very well-written bug report!

On a first glance, the different result on multiple runs, hints towards a concurrency bug. So you could try to fix concurrency=1 as a temporary workaround.

joyemang33 commented 10 months ago

Many thanks for your so quick response again! I also agree with you that the root cause may be in the implementation of the concurrency. When I set concurrency = 1, I obtain a new and stable result 3 different from the previous two results.

Results3

    nodeA  nodeB  nodeC
0       0      1      2
1       0      2      7
2       0      2      9
3       0      4      7
4       0      4      8
5       0      4      9
6       0      7      8
7       1      2      3
8       2      3      7
9       2      3      9
10      4      7      8
11      5      7      8

It doesn't contain the non-existing triangle (5, 6, 7) but has 11 triangles, which indicates that Result1 and Result2 may both be incorrect.

joyemang33 commented 10 months ago

Many thanks for your so quick response again! I also agree with you that the root cause may be in the implementation of the concurrency. When I set concurrency = 1, I obtain a new and stable result 3 different from the previous two results.

Results3

    nodeA  nodeB  nodeC
0       0      1      2
1       0      2      7
2       0      2      9
3       0      4      7
4       0      4      8
5       0      4      9
6       0      7      8
7       1      2      3
8       2      3      7
9       2      3      9
10      4      7      8
11      5      7      8

It doesn't contain the non-existing triangle (5, 6, 7) but has 11 triangles, which indicates that Result1 and Result2 may both be incorrect.

PS: When I set concurrency = 1 and run my testing tools again, I found another bug-triggering test case for gds.alpha.triangles. I believe that it is different from the potential concurrency bug in this issue, and I will report it once this issue has been confirmed.

joyemang33 commented 10 months ago

Hi @FlorentinD! Here is another issue even though I set concurrency = 1.

Run the following Python code:

import pandas as pd

if __name__ == "__main__":
    nodes = pd.read_csv('G2_node.csv')
    edges = pd.read_csv('G2_edge.csv')
    G = gds.graph.construct(
        graph_name="test",      
        nodes=nodes,           
        relationships=edges,
        undirected_relationship_types= ["T0", "T1", "T2", "T3", "T4"]  
    )
    print(gds.alpha.triangles(G, concurrency=1))

Result:

   nodeA  nodeB  nodeC
0      0      1      8
1      0      1      9
2      0      3      8
3      0      3      9
4      1      4      7
5      2      6      8
6      3      4      7

However, the result contains a non-existing triangle (0, 3, 9) since the graph does not contain the edge between node 3 and node 9. I believe that it may indicate a unique bug in this algorithm since the results of one concurrency is invariant. Could you further investigate and confirm it?

The graph data is here: G2_edge.csv G2_node.csv