apoc.hashing.fingerprintGraph returns different fingersprints for identical graphs

Cyber-Ware commented 1 year ago

Expected Behavior

When creating the fingerprints of two graphs, I expect them to be the same if graphs are the same.

Actual Behavior

Nodes with the same label(s) and same (or no) properties have identical hashes.

I observed that whenever there are nodes with identical hashes present in the graph, different fingerprints of the graph can be created even though the graphs are exactly the same.

How to Reproduce the Problem

Cypher 1: identical nodes inserted in the order they are used when creating relationships

CREATE
  (t1:Thing),
  (t2:Thing),
  (t3:Thing),
  (t4:Thing),
  (t1)-[:IS_RELATED_TO]->(t2),
  (t2)-[:IS_RELATED_TO]->(t3),
  (t3)-[:IS_RELATED_TO]->(t4)

Cypher 2: identical nodes inserted in a random order

CREATE
  (t4:Thing),
  (t1:Thing),
  (t3:Thing),
  (t2:Thing),
  (t1)-[:IS_RELATED_TO]->(t2),
  (t2)-[:IS_RELATED_TO]->(t3),
  (t3)-[:IS_RELATED_TO]->(t4)

Steps

Empty database
Execute cypher 1
Execute RETURN apoc.hashing.fingerprintGraph() returns fingerprint: E235B8C8CBFDA8FA5DD97F019CF2CFF0
Empty database
Execute cypher 2
Execute RETURN apoc.hashing.fingerprintGraph() returns fingerprint: DBB9CD233F4EF6F200D1D0A550CEC62F (even though the graph is exactly the same)

Origin of the bug

I looked into the source code of the fingerprintGraph() function and found that:

A TreeMap is used for the idToNodeHash variable which will automatically sort its items in natural order (aka sort by node ID). When building the inverse nodeHashToId map, the idToNodeHash indexes are used which results in sorted Lists as values of the inverse map.
The order of processing nodes with identical hashes affects the fingerprint. Because of (1) the list is sorted by node ID, but the fingerprint should be always the same no matter which order the nodes with identical hashes were processed in.

Solution for the bug

TL;DR: it's complicated so here are a few thoughts of my own

The most obvious solution in my opinion is to loop over the identical nodes in the list, hash every node with their relationships and end nodes, then sort the hashes and only then add them to the actual fingerprint. This trick would solve the issue partially because there is still another edge case: only the directly related nodes are accounted for, not the position of a node in the bigger/entire graph structure. This means that in a graph with a chain of identical nodes (e.g. a chemical molecule), another node could theoretically be almost anywhere in the chain and the algorithm will return the same fingerprint.

The only actual solution I can see is to use a non-recursive depth-first algorithm (example code here in chapter 12.3.2) which traverses through the graph (more complicated when they are disconnected) and uses the hash of the "parent" node as seed for the hash of the current node or relationship. If a node were to have multiple relationships with nodes that have the same hash, multiple fingerprints for the entire graph are generated, sorted and combined to make up the actual fingerprint of the graph.

Specifications

Even though I'm using one of the latest Neo4J APOC releases, I found that this bug has always been present by checking the version history of the core/src/main/java/apoc/hashing/Fingerprint.java file.

Versions

OS: Debian 11 (Bullseye) with Docker
Neo4j: 5.9.0-community
Neo4j-Apoc: 5.9.0-core

Lojjs commented 1 year ago

@Cyber-Ware Thanks for your very comprehensive bug report! We will look into this and report back here.

Best regards Louise, Neo4j

Cyber-Ware commented 1 year ago

Hello, I would like to know if there is any update regarding this issue I submitted 2 months ago. Kind regards, Cyberware

Lojjs commented 1 year ago

@Cyber-Ware We currently have a limited engineering capacity for APOC within Neo4j, where we focus on customer support and security issues so we have a bit of a backlog of bugs to work through unfortunately. It is still in our list of things to look into.

neo4j / apoc