[FEA] Filtered CAGRA to automatically use pre-filtered brute-force when the filter ratio reaches a particular threshold.

rapidsai / cuvs

cuVS - a library for vector search and clustering on the GPU

https://rapids.ai

Apache License 2.0

[FEA] Filtered CAGRA to automatically use pre-filtered brute-force when the filter ratio reaches a particular threshold. #252

Open cjnolet opened 1 month ago

cjnolet commented 1 month ago

While CAGRA offers the capability to specify a pre-filter, it is limited in that heavy filters can end up reducing recall significantly after a particular point. For this reason, we recommend using pre-filtered brute-force in cases where 90%+ of the vectors are being filtered out, since it'll only compute the distances for the vectors that are not being filtered out.
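To make the "only compute distances for the vectors that are not filtered out" point concrete, here is a minimal pure-Python sketch of pre-filtered brute-force search. It is an illustration only, not the cuVS implementation: the function name `prefiltered_brute_force`, the boolean `keep_mask` representation, and the Euclidean metric are all assumptions made for this example.

```python
import heapq
import math


def prefiltered_brute_force(dataset, query, keep_mask, k):
    """Return the k nearest kept vectors as (distance, index) pairs.

    Hypothetical helper: distances are computed only for rows whose
    keep_mask entry is True, so filtered-out rows cost nothing.
    """
    candidates = (
        (math.dist(row, query), i)
        for i, (row, keep) in enumerate(zip(dataset, keep_mask))
        if keep
    )
    return heapq.nsmallest(k, candidates)


dataset = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
keep_mask = [False, True, True, False]  # True = vector passes the filter
result = prefiltered_brute_force(dataset, [0.0, 0.0], keep_mask, k=2)
```

Row 0 is the exact match for the query but is excluded by the filter, so it never enters the candidate set. Because every surviving vector is compared, the result is exact no matter how selective the filter is.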

Many users are getting unpleasant surprises when they attempt to filter out 99% of the vectors from a CAGRA index and find that the recall is close to 0. We also have CAGRA integrated into several different places at the moment, and instead of having each integrator manually switch over to pre-filtered brute-force in these cases, we should do the switch in CAGRA itself so everyone benefits from this feature automatically.
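The proposed switch could look roughly like the sketch below. Everything here is hypothetical: `filtered_out_ratio`, `choose_search_path`, and the returned labels are illustrative names rather than cuVS APIs, and the default threshold of 0.9 is simply the "90%+" rule of thumb mentioned above, not a tuned value.

```python
def filtered_out_ratio(keep_mask):
    """Fraction of vectors the filter removes (0.0 for an empty mask)."""
    if not keep_mask:
        return 0.0
    filtered_out = len(keep_mask) - sum(keep_mask)
    return filtered_out / len(keep_mask)


def choose_search_path(keep_mask, threshold=0.9):
    """Pick a search path from the filter ratio.

    Assumption: at or above the threshold, fall back to pre-filtered
    brute-force; otherwise stay on the filtered CAGRA graph search.
    """
    if filtered_out_ratio(keep_mask) >= threshold:
        return "brute_force"
    return "cagra"


heavy_filter = [True] * 1 + [False] * 99   # 99% filtered out
light_filter = [True] * 50 + [False] * 50  # 50% filtered out
heavy_path = choose_search_path(heavy_filter)
light_path = choose_search_path(light_filter)
```

Putting the decision inside the search call means integrators would not need their own threshold logic. Whether the threshold should be fixed or user-configurable is left open here.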

rhdong commented 1 month ago

Does the issue described here possibly depend on the specific dataset?

cjnolet commented 1 month ago

No specific dataset, but the core use-case is hybrid search, where Lucene or Milvus might be accepting a filter that has been constructed from a prior structured search. For example, a user searches for the nearest vectors within a specific geographic region that have certain attributes (e.g., above a certain age, or working at a certain store). The set of results returned from the structured search is often very small compared to the total number of elements in the index, so the resulting filter only includes maybe 1% of the vectors (sometimes slightly more).
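As a rough illustration of that hybrid-search flow, the sketch below turns the row IDs returned by a structured search into a keep-mask and measures how selective it is. The helper `mask_from_ids` and the sample IDs are made up for this example. A real integration would hand the filter to the vector index in whatever format that index expects.

```python
def mask_from_ids(allowed_ids, index_size):
    """Build a boolean keep-mask from the IDs a structured search returned."""
    mask = [False] * index_size
    for i in allowed_ids:
        mask[i] = True
    return mask


index_size = 1000
# Hypothetical structured-search result: 10 matching rows out of 1000.
allowed_ids = range(0, index_size, 100)
mask = mask_from_ids(allowed_ids, index_size)
kept_fraction = sum(mask) / index_size  # share of vectors still searchable
```

Here only 1% of the index survives the filter, which is the regime where graph-based search tends to lose recall.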

rhdong commented 1 month ago

Got it! Thank you for the clarification!