creating ann index with membership bits

prateekpatelsc commented 2 years ago

@yurymalkov : I have a use case where i want to build ann index and also support filtering operation based on some tags . The tags do not influence the distance or graph structure (during building index) , but during search operation we want to filter elements based on membership or filtering by tags associated. Two approaches i am thinking of : 1) It seems like search operation can be similar as it is used to ignore deleted items. 2) Add items with some added bit vector that indicates tags
Build index with a custom distance function , where during indexing the usual distance metric is used (ignoring the bit vector), but while searching the distance function uses the added bit vector to decide eligibility , if eligibility == true return the original distance otherwise return some large distance (max float) .

Do you observe any issues i.e any scenario in eligible items may get returned as topK ? Any other approaches to support such filtering ?

TechnikEmpire commented 2 years ago

I'm a little confused by this because the template parameters specifically enable you to associate a type with each entry so you have full control to implement what you're wanting here. If you're talking about modifying the vectors somehow post insert I don't see how that's possible.

Take your unique index and build out a secondary structure where you do whatever custom logic you want there, no? Maybe I'm completely missing it here, just my $0.02.

prateekpatelsc commented 2 years ago

I am not talking about modifying any vector . I want to build an index that also supports filtering i.e give me topK neighbors which have tags in [.. tag ids]

There are 2 ways to implement this

similar to deleted items , if a topk neighbour is not satisfying a tag membership , it gets ingnored during graph traversal
second approach is add membership bit vector along with embedding , create index based on normal distance metric , but during search if membership is false return a large distance else return original distance , this way the topk neightbours will be most probably the correct filtered items

TechnikEmpire commented 2 years ago

Yeah so what I'm saying is why you don't simply have a secondary element, where your index is std::uint64_t, something like:

std::unordered_map<std::uint64_t, MyFilterData>

And simply fetch KNN, use the returned indices to check your extra filter data and pop the elements that don't make the cut. This is precisely what you're asking for, you just want them to implement it and build it into the library. It is un-possible for this to be implemented any other way while satisfying the requirement that the vectors and indexing system be unaltered.

As a user of this library the last thing I want is to incur performance penalties for no reason, but that's my $0.02. Also just a disclaimer here I'm just a guy who starred the repo and uses this library to great effect.

prateekpatelsc commented 2 years ago

may be i am not phrasing it properly. If you get knn and then pop , this is just post filtering . Are you suggesting this ? because this would not work when the filtering criterias can be restrictive and can filter out all and do not fill in the required amount of results needed

I was thinking of building in this filtering support during the fetching process itself i.e during the graph traversal We can either have it by storing this extra map serialized as a part of index as well , but that is a more intrusive change and changes the index structure as well. Hence the approach i was talking about was plugging in the filtering data as a part of the embedding and instead have a custom distance function that does the trick , instead of changing or adding another map and serialization code. enabling with just using a distance function seems would be a neat way ?

Slasher111 commented 2 years ago

may be i am not phrasing it properly. If you get knn and then pop , this is just post filtering . Are you suggesting this ? because this would not work when the filtering criterias can be restrictive and can filter out all and do not fill in the required amount of results needed

I have the same problem. I solve it the following way: let's say you need 10 nearest neighbors matching some other criterias besides lowest distance. You find 1000 nearest neighbors, then filter them, and check, if there are at least 10 neighbors left. If there is more than 10 left, you just take the first 10. If there is less than 10 left, you should just select 5000 (or 10000 etc.) instead of 1000 and repeat the procedure.

yurymalkov commented 2 years ago

@prateekpatelsc Yeah. This is the way how services deal with filtering in hnsw (e.g. https://www.semi.technology/developers/weaviate/current/architecture/prefiltering.html )

prateekpatelsc commented 2 years ago

@yurymalkov : one issue is inverted index could result in large (millions of ids, index could be hundred million) to create large allowlist . I am wondering if its reasonable to create a custom distance function that during query search returns a large distance if an item is not a part of allowlist (allowlist is determined by some boolean operation on the embedding stored).

Trying to understand if just returning a large distance during search is sufficient enough to achieve the desired filtering ? Is there a scenario where we can have all returned topK have large distance since the search operation stops when it cant find something smaller among neighboring nodes it is searching ?

yurymalkov commented 2 years ago

It is reasonable to have, but it creates additional decisions (e.g. how big is the distance) but probably would work.

Is there a scenario where we can have all returned topK have large distance since the search operation stops when it cant find something smaller among neighboring nodes it is searching ? I am not sure I understand that

prateekpatelsc commented 2 years ago

Trying to understand the difference between the 2 approach : 1) treat items not in allowlist as deleted 2) Compute a large distance for items not in allowlist

In the first approach , we guarantee the elements not in allowlist are not returned in topK since they are not even added to our result queue IIUC , in the second approach , is there a possibility that all or a few returned results are items not in allowlist (we stop exploring graph if we cannot find any neighbor that improves our current solution set , so this could result in a scenario that all my current result set are large distance (items not in allowlist). and no other node in neighbour is also improving , basically being stuck in a very bad local minima ?

yurymalkov commented 2 years ago

@prateekpatelsc I think you are right, this potentially might happen.

nmslib / hnswlib

creating ann index with membership bits #366