neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Add Hamming Distance KNN similarity metric for long property #193

Open htmlboss opened 2 years ago

htmlboss commented 2 years ago

This is a draft PR so I can leverage the GitHub Actions workflow to validate my tests, since the Gradle setup for this project does not work for me on Apple silicon (M1). I'm not a Java expert, so I can't quickly figure out a fix 🤷 Feel free to take a look, but I will mark this as ready for review when my changes are complete!

My end goal is to enable neo4j to build a KNN graph for finding the top K similar images by leveraging perceptual hashes (pHash). A pHash is simply a 64-bit value, so it fits in a Java long (long is signed, but only the bit pattern matters here). I will not be adding any pHash-specific code to this PR. Any two pHashes can be compared for similarity by calculating the Hamming Distance between them, i.e. the number of bit positions in which they differ.
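To make the comparison concrete, here is a minimal sketch of the Hamming Distance between two long values. The class and method names are hypothetical and are not taken from this PR or from the GDS library; the only library call used is the JDK's Long.bitCount.

```java
// Hypothetical helper for illustration only; not part of the GDS library.
final class HammingSketch {
    private HammingSketch() {}

    /** Returns the number of bit positions (0 to 64) in which a and b differ. */
    static int hammingDistance(long a, long b) {
        // XOR leaves a 1 bit wherever the inputs differ;
        // Long.bitCount counts those 1 bits.
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // 1011 and 0010 differ in the highest and lowest of the four bits.
        System.out.println(hammingDistance(0b1011L, 0b0010L)); // prints 2
        // The sign bit is treated like any other bit.
        System.out.println(hammingDistance(-1L, 0L)); // prints 64
    }
}
```

Because only the bit pattern is compared, it does not matter that Java interprets the stored hash as a signed number.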

Computing the Hamming Distance isn't specific to pHashes. It is also useful in genetic computing, error detection/correction, and text analysis, wherever both inputs are of equal length, and these are areas where graph structures are common. In my humble (and noob) opinion, it makes sense to offer this feature to the wider community.

From reading your docs, I believe the most natural area to extend with this metric is the LongPropertySimilarityComputer (https://neo4j.com/docs/graph-data-science/current/algorithms/knn/#_scalar_numbers). I intend for the existing default behaviour to remain unchanged, and to allow a new configuration of the KNN algorithm such as:

nodeProperties: {
    pHashLongProperty: 'HAMMING_DISTANCE'
}
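KNN ranks neighbours by a similarity score rather than a distance, so a metric like this needs some mapping from Hamming Distance to similarity. The PR description does not say which mapping is intended; the sketch below assumes one simple normalisation, 1 - distance/64, so that identical values score 1.0 and values differing in every bit score 0.0. The class and method names are hypothetical and do not reflect the library's actual similarity computer interface.

```java
// Hypothetical sketch; the normalisation is an assumption, not the PR's implementation.
final class HammingSimilaritySketch {
    private HammingSimilaritySketch() {}

    /** Maps the Hamming Distance between a and b to a similarity in [0.0, 1.0]. */
    static double similarity(long a, long b) {
        // Count the differing bits, as in a plain Hamming Distance.
        int distance = Long.bitCount(a ^ b);
        // Long.SIZE is 64, the number of bits in a long.
        return 1.0 - distance / (double) Long.SIZE;
    }
}
```

A different mapping, such as 1 / (1 + distance), would also keep scores within (0, 1]; which one suits the library is a design question for the reviewers.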

TODO: expand summary of changes + questions. Changes summary:

htmlboss commented 2 years ago

Hmm, looks like CI runs need approval for first-time contributors. I'll boot up my Linux box later to work on this!

jjaderberg commented 2 years ago

> the gradle setup for this project does not work on Apple silicon (M1)

@htmlboss, what problems are you having? Several people working on this project use M1s. There were some problems with double precision, but I think those were fixed. I'll ask. Feel free to open an issue for whatever problems you're seeing.

Mats-SX commented 2 years ago

For the record, I'm working with an M1 Mac and I'm using the Zulu distribution of the JDK, currently version 11.0.14 (11.0.14-zulu). This works with the default setup. There is a part of the project that is geared towards JDK 17, but this is for the next version of Neo4j (5.0), which is still in development. You should be fine with JDK 11, which we recommend.

However, it is possible to use JDK 17 as well, and we routinely test against it too. For that I would recommend the Temurin distribution, 17.0.3-tem.

htmlboss commented 2 years ago

Thanks for the suggestions! I guess my IntelliJ IDEA configuration isn't set up properly 😝 @Mats-SX I'll give your versions a shot when I get a chance!

FlorentinD commented 1 year ago

Hey @htmlboss, I was just wondering if you found the time to try out Mats' suggestion.

Would love to know if there was any further blocker, if we can help you or if you just ended up with a different solution :)