Open heemin32 opened 4 months ago
I will pick this one up, thanks!
The main issue here is that the Lucene VectorSimilarityFunction
[1] does not define the HAMMING
distance calculation and [2] is implemented as a java enum
so it's not possible to extend with our own compare
functions.
It looks like this issue has been discussed a few places in Lucene:
And ultimately what Lucene has done, at least for now, is rather than introduce HAMMING
to the VectorSimilarityFunction
they have made it easier to extend the FlatVectorsFormat
class to provide a custom scorer, and with that scorer we can do whatever score computations we would like.
Mapper related changes:
LuceneHNSWMethod
to support binary data type and hamming spaceVectorSimilarityFunction
we need to double check the validations for not supporting anything other than hamming.VectorDataType
changes:
BINARY
data type we can use the KnnByteVectorField
and adjust the dimension by 8 accordingly, so basically we store an 8 dimension binary vector as a 1 dimensional byte vector. Conceptually, we are already doing this in other places.PerFieldKnnVectorsFormat
changes:
FlatVectorsScorer
, which means a custom KnnVectorsFormat
is also needed. So it would look something like BinaryVectorsFormat → Lucene99FlatVectorsFormat → BinaryFlatVectorsScorer
HnswBitVectorsFormat
and FlatBitVectorsScorer
, but since these are not in the backwards compatible packages we probably shouldn't use them yet and instead implement our own classes.@heemin32 @navneet1v @jmazanec15 do these changes sound reasonable to you? Let me know if you think I missed any big parts too.
I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:
VectorSimilarityFunction
, which is actually encoded into the segment file itselfKnnVectorsFormat
.This seems confusing because this means the format can completely ignore what is encoded into it's own file, so at the very least I think we can probably start a discussion on how the future of VectorSimilarityFunction
looks and how Lucene plans on either deprecating it or enhancing it.
@jed326 Thanks for all the details. From the changes standpoint more less I am aligned that we need all those changes. I just have question on this:
Lucene has some sample implementations of this in HnswBitVectorsFormat and FlatBitVectorsScorer, but since these are not in the backwards compatible packages we probably shouldn't use them yet and instead implement our own classes.
If a codec is not present in backwards compatible codec then it doesn't mean that it shouldn't be use. Actually its the other way round.
Nevertheless, as per this comment: https://github.com/apache/lucene/pull/13288#issuecomment-2149954395 Lucene doesn't officially support the BitVectors so we have to create our own KNNVectorsFormat and Scorer. But we can take some reference from BitVectorsFormat from Lucene.
Another reason for the implementation is, even if we make changes in Lucene, Lucene is already moved to 10.x version and Opensearch currently have no plans to upgrade Lucene to 10. So, in the light of that I think we should implement out own format.
I think there's potentially some more discussion to be had with Lucene as well. It seems that now there are 2 ways to provide distance calculations:
- The existing
VectorSimilarityFunction
, which is actually encoded into the segment file itself- Via a customer scorer through a custom
KnnVectorsFormat
.This seems confusing because this means the format can completely ignore what is encoded into it's own file, so at the very least I think we can probably start a discussion on how the future of
VectorSimilarityFunction
looks and how Lucene plans on either deprecating it or enhancing it.
On this I completely agree that we should start a discussion on Lucene and understand what is the long term plan/direction.
Thanks @navneet1v. I opened a discussion on Lucene for this: https://github.com/apache/lucene/issues/14025, please take a look when you get a chance!
Similar to https://github.com/opensearch-project/k-NN/issues/1764, I like to see binary vector support with Lucene engine