opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
157 stars 123 forks source link

[FEATURE]Binary vector support with Lucene #1857

Open heemin32 opened 4 months ago

heemin32 commented 4 months ago

Similar to https://github.com/opensearch-project/k-NN/issues/1764, I like to see binary vector support with Lucene engine

jed326 commented 2 weeks ago

I will pick this one up, thanks!

jed326 commented 4 days ago

Issue

The main issue here is that the Lucene VectorSimilarityFunction [1] does not define the HAMMING distance calculation and [2] is implemented as a java enum so it's not possible to extend with our own compare functions.

It looks like this issue has been discussed a few places in Lucene:

And ultimately what Lucene has done, at least for now, is rather than introduce HAMMING to the VectorSimilarityFunction they have made it easier to extend the FlatVectorsFormat class to provide a custom scorer, and with that scorer we can do whatever score computations we would like.

Changes Needed

Mapper related changes:

VectorDataType changes:

PerFieldKnnVectorsFormat changes:

@heemin32 @navneet1v @jmazanec15 do these changes sound reasonable to you? Let me know if you think I missed any big parts too.

jed326 commented 4 days ago

I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:

  1. The existing VectorSimilarityFunction, which is actually encoded into the segment file itself
  2. Via a customer scorer through a custom KnnVectorsFormat.

This seems confusing because this means the format can completely ignore what is encoded into it's own file, so at the very least I think we can probably start a discussion on how the future of VectorSimilarityFunction looks and how Lucene plans on either deprecating it or enhancing it.

navneet1v commented 4 days ago

@jed326 Thanks for all the details. From the changes standpoint more less I am aligned that we need all those changes. I just have question on this:

Lucene has some sample implementations of this in HnswBitVectorsFormat and FlatBitVectorsScorer, but since these are not in the backwards compatible packages we probably shouldn't use them yet and instead implement our own classes.

If a codec is not present in backwards compatible codec then it doesn't mean that it shouldn't be use. Actually its the other way round.

Nevertheless, as per this comment: https://github.com/apache/lucene/pull/13288#issuecomment-2149954395 Lucene doesn't officially support the BitVectors so we have to create our own KNNVectorsFormat and Scorer. But we can take some reference from BitVectorsFormat from Lucene.

Another reason for the implementation is, even if we make changes in Lucene, Lucene is already moved to 10.x version and Opensearch currently have no plans to upgrade Lucene to 10. So, in the light of that I think we should implement out own format.

I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:

  1. The existing VectorSimilarityFunction, which is actually encoded into the segment file itself
  2. Via a customer scorer through a custom KnnVectorsFormat.

This seems confusing because this means the format can completely ignore what is encoded into it's own file, so at the very least I think we can probably start a discussion on how the future of VectorSimilarityFunction looks and how Lucene plans on either deprecating it or enhancing it.

On this I completely agree that we should start a discussion on Lucene and understand what is the long term plan/direction.

jed326 commented 3 days ago

Thanks @navneet1v. I opened a discussion on Lucene for this: https://github.com/apache/lucene/issues/14025, please take a look when you get a chance!