Implement brute force search in hnsw

milvus-io / knowhere

Knowhere is an open-source vector search engine, integrating FAISS, HNSW, etc.

Apache License 2.0

202 stars 81 forks source link

Implement brute force search in hnsw #256

Open czs007 opened 2 years ago

czs007 commented 2 years ago

Milvus introduces the concept of a bitmask to the filtering mechanism to keep similar vectors with a bitmask of 1 based on satisfying specific attribute filters.

Currently knowhere will accept a parameter bitmask when doing hnsw search. When the proportion of valid data marked in the bitmask is small, the performance of hnsw continues to degrade, which is slower than brute force search.

In the above figure, the X-axis represents the number of valid data in 1 million rows of data

The hnsw index structure contains raw vector data, so it is feasible to add a brute force search process to hnsw.

Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.

landyliu commented 2 years ago

From the picture, when 1 million data is filtered out more than 960000, bruteforce search will be faster than HNSW search. I have two considerations:

Whether the HNSW processing can be optimized in this case? In this case, the processing logic of HNSW vs. bruteforce is the random access vs. continuous access in RAM?
The situation tha filtering out more than 95% is an extreme situation or occurs frequently; if it actually happens rarely, for example, only appears once every 1 million runs, which feels negligible. So here we keep an eye on this issue, will follow up to inverstigate on how to solve it.

foxspy commented 2 years ago

test config: M : 8 efConstruction : 20 dim : 128 data : random ef : 10 topK: 10

cydrain commented 2 years ago

Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.

Hi @czs007 , I don't think it's a good idea to add brute force search logic into HNSW. I believe each index algorithm has its applicable scenario, user should analyze their scenario then decide to use which index type. For this case, if HNSW run slower than brute force when bitset filtering out 95% vectors, segcore can handle this judgment and use IDMAP instead of HNSW. This logic should not be added into HNSW algorith.