unum-cloud / usearch

Fast Open-Source Search & Clustering engine × for Vectors & 🔜 Strings × in C++, C, Python, JavaScript, Rust, Java, Objective-C, Swift, C#, GoLang, and Wolfram 🔍
https://unum-cloud.github.io/usearch/
Apache License 2.0
2.26k stars 141 forks

Feature: Filtering support? #348

Open plurch opened 9 months ago

plurch commented 9 months ago

Describe what you are looking for

Very nice library with fast performance in my limited testing so far 👍

I am curious to know if there are any plans to support filtering when searching the index? It is useful in some use-cases to exclude specific ids and still get the expected top n results returned.

Faiss does support this but it has some performance impacts.

Is your feature request specific to a certain interface?

It applies to everything

ashvardanian commented 9 months ago

Hi @plurch! That's available in the C++ layer already. I assume you are using a binding. Is it written in Python? Are you going to check those IDs against some Python data-structure like a set or dictionary?

plurch commented 9 months ago

Hi @ashvardanian, I am using the Python bindings to build the index and then the JavaScript bindings to do the search from a web app. So I really would be using the search filtering through JS in my case, but Python support would probably be useful eventually also.

I think some type of interface similar to faiss that accepts a set of int ids to exclude/include might work. This isn't something that is actively blocking me, just curious about feasibility at this point while evaluating some ANN libraries. Thanks!
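An include/exclude interface of this kind can be approximated on top of any unfiltered top-k search by over-fetching and then dropping the unwanted ids. The sketch below is only an illustration and not part of USearch: `search_excluding`, `search_fn`, and `brute_force_search` are hypothetical names, and `search_fn(query, count)` is assumed to return `(key, distance)` pairs sorted by ascending distance.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Match = Tuple[int, float]  # (key, distance)


def search_excluding(
    search_fn: Callable[[Sequence[float], int], List[Match]],
    query: Sequence[float],
    k: int,
    excluded: Set[int],
) -> List[Match]:
    """Return up to k nearest matches whose keys are not in `excluded`.

    Hypothetical helper: it asks the underlying search for
    k + len(excluded) results, so that even if every excluded key
    ranks near the top, k other candidates are still fetched.
    """
    candidates = search_fn(query, k + len(excluded))
    kept = [(key, dist) for key, dist in candidates if key not in excluded]
    return kept[:k]


def brute_force_search(vectors: Dict[int, Sequence[float]]):
    """Toy exact search over a {key: vector} dict, standing in for a real index."""

    def search(query: Sequence[float], count: int) -> List[Match]:
        scored = [
            (key, sum((a - b) ** 2 for a, b in zip(vec, query)))
            for key, vec in vectors.items()
        ]
        scored.sort(key=lambda match: match[1])  # ascending squared L2 distance
        return scored[:count]

    return search


vectors = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [2.0, 0.0], 4: [3.0, 0.0]}
search = brute_force_search(vectors)
result = search_excluding(search, [0.0, 0.0], k=2, excluded={1, 3})
print(result)
```

The over-fetch grows with the size of the exclusion set, so this workaround gets expensive when many ids are excluded, and with an approximate index the fetched candidates are themselves approximate.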

ashvardanian commented 9 months ago

Gotcha, @plurch, thanks for the feedback! I will keep it in mind for future releases 🤗

plurch commented 9 months ago

Sounds good 👍

bennimmo commented 8 months ago

I would be able to utilise this library with this functionality. Love the work though! I would be using the python lib.

raulcarlomagno commented 7 months ago

Metadata filtering would be a game-changing feature.

Are you thinking about adding metadata storage besides vector storage? I mean, for the filtering support. That would avoid the Faiss way, in which you have to filter the ids to compare in advance, and sometimes those ids can number in the millions.

ashvardanian commented 7 months ago

@raulcarlomagno, in our case, we use predicate functions instead of an ID list. Passing them from C and Rust isn't hard to add, C++ already supports that, but in Python and JavaScript, I am not sure about how we can make it fast...
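To illustrate the difference between an ID list and a predicate, here is a minimal pure-Python sketch. It is a brute-force scan, not USearch's index traversal or its C++ API, and `top_k_filtered` with its parameters is a hypothetical name chosen for this example. The point is that the predicate is consulted once per candidate during the search, so an exclusion set is just one possible predicate among many.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_filtered(
    vectors: Dict[int, Sequence[float]],
    query: Sequence[float],
    k: int,
    predicate: Callable[[int], bool],
) -> List[Tuple[int, float]]:
    """Return the k nearest (key, distance) pairs among keys accepted by `predicate`."""
    # The predicate runs before the distance computation, so rejected
    # keys never contribute a distance evaluation.
    candidates = (
        (key, squared_l2(vec, query))
        for key, vec in vectors.items()
        if predicate(key)
    )
    return heapq.nsmallest(k, candidates, key=lambda match: match[1])


vectors = {10: [0.0, 0.0], 11: [1.0, 0.0], 12: [0.0, 2.0], 13: [3.0, 3.0]}
query = [0.0, 0.0]

# An exclusion set expressed as a predicate.
excluded = {10}
by_set = top_k_filtered(vectors, query, 2, lambda key: key not in excluded)

# An arbitrary rule that needs no precomputed ID list at all.
by_rule = top_k_filtered(vectors, query, 2, lambda key: key % 2 == 0)
```

In a binding, a callback like this would presumably have to be invoked from native code once per visited candidate, which is a plausible reason for the speed concern in Python and JavaScript.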

raulcarlomagno commented 7 months ago

What about adding optional storage for metadata, like RocksDB? You keep the current vector index plus another index for the metadata, and the predicate function is evaluated inside C, not Python. The heavy work is done internally in C, transparent to the Python API wrapper.

Or maybe you don't want to get into storing metadata... ☺️

bennimmo commented 7 months ago

> Hi @plurch! That's available in the C++ layer already. I assume you are using a binding. Is it written in Python? Are you going to check those IDs against some Python data-structure like a set or dictionary?

Would this also apply to the clustering, as this would be a real game changer?

lukebuehler commented 6 months ago

I just built a PoC with usearch. It's amazing! However, metadata filtering is blocking me from using it in our product. In our case, we were first using the Java bindings, but are now using Python. A predicate solution, like you have in C++, would work. However, setting some metadata fields and then being able to filter on them would be best, basically how it works in Qdrant.

I'm aware that this is a big ask and will require extending your store with some other, non-vector index, but it would make it one of the most attractive in-process vector stores out there.
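The metadata-field idea can be layered on top of a predicate interface. The following is a rough sketch under stated assumptions, not an existing USearch or Qdrant API: `MetadataStore`, `set`, and `where` are hypothetical names, the store is a plain in-memory dict keyed by vector key, and a brute-force scan stands in for the real index.

```python
from typing import Any, Callable, Dict, Sequence


class MetadataStore:
    """Hypothetical side-car store mapping a vector key to its metadata fields."""

    def __init__(self) -> None:
        self._fields: Dict[int, Dict[str, Any]] = {}

    def set(self, key: int, **fields: Any) -> None:
        """Attach or update metadata fields for one vector key."""
        self._fields.setdefault(key, {}).update(fields)

    def where(self, **conditions: Any) -> Callable[[int], bool]:
        """Build a predicate accepting keys whose fields equal every condition."""

        def predicate(key: int) -> bool:
            fields = self._fields.get(key, {})
            return all(
                name in fields and fields[name] == value
                for name, value in conditions.items()
            )

        return predicate


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


vectors = {1: [0.0, 0.0], 2: [0.1, 0.0], 3: [0.2, 0.0], 4: [5.0, 5.0]}

store = MetadataStore()
store.set(1, lang="en", year=2023)
store.set(2, lang="de", year=2023)
store.set(3, lang="en", year=2024)
store.set(4, lang="en", year=2023)

# Filter on metadata first, then rank the surviving keys by distance.
is_match = store.where(lang="en", year=2023)
query = [0.0, 0.0]
nearest = sorted(
    (key for key in vectors if is_match(key)),
    key=lambda key: squared_l2(vectors[key], query),
)[:2]
```

A production version would need persistence and a secondary index on the metadata fields, which is the "other, non-vector index" this request refers to.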