Open Cyber-Ware opened 1 year ago
@Cyber-Ware Thanks for your very comprehensive bug report! We will look into this and report back here.
Best regards Louise, Neo4j
Hello, I would like to know if there is any update regarding this issue I submitted 2 months ago. Kind regards, Cyberware
@Cyber-Ware We currently have a limited engineering capacity for APOC within Neo4j, where we focus on customer support and security issues so we have a bit of a backlog of bugs to work through unfortunately. It is still in our list of things to look into.
Expected Behavior
When creating the fingerprints of two graphs, I expect them to be the same if graphs are the same.
Actual Behavior
Nodes with the same label(s) and same (or no) properties have identical hashes.
I observed that whenever there are nodes with identical hashes present in the graph, different fingerprints of the graph can be created even though the graphs are exactly the same.
How to Reproduce the Problem
Cypher 1: identical nodes inserted in the order they are used when creating relationships
Cypher 2: identical nodes inserted in a random order
Steps
RETURN apoc.hashing.fingerprintGraph()
returns fingerprint:E235B8C8CBFDA8FA5DD97F019CF2CFF0
RETURN apoc.hashing.fingerprintGraph()
returns fingerprint:DBB9CD233F4EF6F200D1D0A550CEC62F
(even though the graph is exactly the same)Origin of the bug
I looked into the source code of the
fingerprintGraph()
function and found that:TreeMap
is used for theidToNodeHash
variable which will automatically sort its items in natural order (aka sort by node ID). When building the inversenodeHashToId
map, theidToNodeHash
indexes are used which results in sortedList
s as values of the inverse map.Solution for the bug
TL;DR: it's complicated so here are a few thoughts of my own
The most obvious solution in my opinion is to loop over the identical nodes in the list, hash every node with their relationships and end nodes, then sort the hashes and only then add them to the actual fingerprint. This trick would solve the issue partially because there is still another edge case: only the directly related nodes are accounted for, not the position of a node in the bigger/entire graph structure. This means that in a graph with a chain of identical nodes (e.g. a chemical molecule), another node could theoretically be almost anywhere in the chain and the algorithm will return the same fingerprint.
The only actual solution I can see is to use a non-recursive depth-first algorithm (example code here in chapter 12.3.2) which traverses through the graph (more complicated when they are disconnected) and uses the hash of the "parent" node as seed for the hash of the current node or relationship. If a node were to have multiple relationships with nodes that have the same hash, multiple fingerprints for the entire graph are generated, sorted and combined to make up the actual fingerprint of the graph.
Specifications
Even though I'm using one of the latest Neo4J APOC releases, I found that this bug has always been present by checking the version history of the
core/src/main/java/apoc/hashing/Fingerprint.java
file.Versions