rust-cv / hnsw

HNSW ANN from the paper "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs"
MIT License
217 stars 14 forks source link

ndarray example? #25

Closed mooreniemi closed 2 years ago

mooreniemi commented 2 years ago

I wanted to implement distance as ndarray's dot but I'm finding the type errors inscrutable so far. Are there any examples anywhere of using this crate and ndarray together?

I implemented:

struct Embedding;
impl Metric<ArrayView1<'_, f32>> for Embedding {
    type Unit = u64;

    fn distance(&self, a: &ArrayView1<f32>, b: &ArrayView1<f32>) -> Self::Unit {
        a.dot(b) as u64
    }
}

But when I try to insert like so:

    let mut searcher: Searcher<_> = Searcher::default();
    let mut index = Hnsw::new(Embedding);
    for i in 0..n {
        let row: ArrayView1<f32> = v.row(i);
        index.insert(row, &mut searcher);
    }

I hit:

error[E0599]: the method `insert` exists for struct `Hnsw<Embedding, _, _, {_: usize}, {_: usize}>`, but its trait bounds were not satisfied
  --> src/hnsw.rs:64:15
   |
13 | struct Embedding;
   | ----------------- doesn't satisfy `Embedding: space::Metric<_>`
...
64 |         index.insert(row, &mut searcher);
   |               ^^^^^^ method cannot be called on `Hnsw<Embedding, _, _, {_: usize}, {_: usize}>` due to unsatisfied trait bounds
   |
   = note: the following trait bounds were not satisfied:
           `Embedding: space::Metric<_>`
note: the following trait must be implemented
  --> /home/alex/.cargo/registry/src/github.com-1ecc6299db9ec823/space-0.17.0/src/lib.rs:50:1
   |
50 | / pub trait Metric<P> {
51 | |     type Unit: Unsigned + Ord + Copy;
52 | |
53 | |     fn distance(&self, a: &P, b: &P) -> Self::Unit;
54 | | }
   | |_^

It's not jumping out to me what's missing given I see no bounds on P and the bounds of the Self::Unit are met.

mooreniemi commented 2 years ago

I'm actually hitting issues even if I convert to &[f32] which is what I see in the examples, so I think it's not ndarray per se but me misunderstanding how to write the types.

vadixidav commented 2 years ago

I added an example to hgg that covers the usage of ndarray here: https://github.com/rust-cv/hgg/blob/main/examples/ndarray.rs

You should be able to use the same process to create a metric for hnsw as well.

Keep in mind that cosine similarity is not a metric, and even cosine distance is not a metric (see https://towardsdatascience.com/what-is-metric-74b0bf6e862). You may encounter issues when using nearest neighbor search graph algorithms with distances that do not fulfill the triangle inequality (and thus aren't true metrics). If you truly want to perform a search in angular space, it is recommended to use angular distance: https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity. You may find that cosine distance works fine if you insert enough points into the data structure, since the relationship between angle and cosine distance become roughly proportional at small angles, but these data structures may have trouble being built out initially before eventually settling out. Additionally, since hnsw and hgg are hierarchical, the higher layers may have some broken triangle inequalities which would lead to bad search behavior. I highly recommend switching to using a proper distance metric before proceeding.

Feel free to reach out on Discord if you have any additional questions. I will respond when available.

Let me know if this resolves your issue.

vadixidav commented 2 years ago

Also, just in case, make sure you are using the same version of space mentioned in the dependencies: https://crates.io/crates/hnsw/0.11.0/dependencies. For 0.11.0, that version is space 0.17.0.

vadixidav commented 2 years ago

Also, make sure to use to_bits() rather than as here:

a.dot(b) as u64

Here is another version using euclidean metric and to_bits. to_bits allows the positive non-NaN floats to be compared as integers for higher performance during search. It also ensures full ordering, which is not guaranteed by floats. Using as may cause a loss of precision and dynamic range, and it might not work for all data sets.

let distance = (a - b).mapv(|v| v*v).sum().sqrt();
debug_assert!(!distance.is_nan());
distance.to_bits()
mooreniemi commented 2 years ago

Thanks for this! I realized last night but didn't comment that I was accidentally on space 18 rather than 17 which is why the trait wouldn't satisfy even for &[f32].

After switching, I was able to do what you showed above to insert into hnsw. I appreciate the rest of your advice, as well!