opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
156 stars 123 forks source link

Implement cosine for faiss engine #2242

Open jmazanec15 opened 3 weeks ago

jmazanec15 commented 3 weeks ago

Description

Cosine similarity is one of the more popular space types. faiss does not support it directly. Instead, they prefer to have data be normalized and then use the inner product (which is equivalent to cosine --> <u,v>/||u||||v|| = cos(theta) (https://github.com/facebookresearch/faiss/blob/2c961cc308ade8a85b3aa10a550728ce3387f625/README.md?plain=1#L11). We should figure out how to add cosine for faiss now that it is default (#2163)

We have a couple different options:

  1. We could normalize data before serializing to flat vector values
  2. We could just normalize when passing to faiss and then for exact scoring lazily compute norms and normalize
  3. We could compute the norm and then use it to divide the dot product when scoring is needed or when vectors need to be normalized.

I prefer option 3 mainly because it will allow us to use use knn vector values as synthetic source (see #1571), but also let us be efficient on search.

We would need to investigate how best to do this. One simple way to do it would be to add one extra dimension and store the normalized value there.

luyuncheng commented 3 weeks ago

we added an ingest pipeline to normalize the data for faiss engine who want to Cosine similarity

heemin32 commented 3 weeks ago

I prefer option 2. It is more simple as we keep everything as is and only pass normalized data to faiss engine. Could you tell why we need exact scoring lazily compute norms and normalize for option 2?

jmazanec15 commented 3 weeks ago

For option 2, for re-scoring (exact search), because the vectors are not normalized, we would need to compute norms, as we do now.

vamshin commented 3 weeks ago

we added an ingest pipeline to normalize the data for faiss engine who want to Cosine similarity

@luyuncheng interesting. What are your thoughts on having this support in k-NN plugin VS Ingest pipeline approach?

heemin32 commented 3 weeks ago

For option 2, for re-scoring (exact search), because the vectors are not normalized, we would need to compute norms, as we do now.

For re-scoring we can just use cosine distance calculation instead of (normalization + innerproduct)?

jmazanec15 commented 3 weeks ago

cosine distance calculation is implemented as normalization + innerproduct I believe

luyuncheng commented 2 weeks ago

we added an ingest pipeline to normalize the data for faiss engine who want to Cosine similarity

@luyuncheng interesting. What are your thoughts on having this support in k-NN plugin VS Ingest pipeline approach?

@vamshin @jmazanec15 @heemin32 as @jmazanec15 says

cosine distance calculation is implemented as normalization + innerproduct I believe

cosine distance calculation is normalization + innerproduct also says in faiss#wiki

before i read this issues, and i do need cosine distance in faiss.

firstly i introduced normalization at faiss_wrapper#CreateIndex and do normalize before faiss::write_index like @jmazanec15 's option 1. then i found, every merge would do normalize again. it increased the refresh time.

so in the end, i introduced a new pipeline doing normalize,

PROS:

  1. it can do normalize at ingestNode, save cpu usage at datanode to build Graph
  2. not change jni code.
  3. not increase refresh time.
  4. compare to option2 option3, not increase search time

CONS:

  1. changed raw vector data
  2. need more ingestNode computation