Open czs007 opened 2 years ago
From the picture, when 1 million data is filtered out more than 960000, bruteforce search will be faster than HNSW search. I have two considerations:
test config: M : 8 efConstruction : 20 dim : 128 data : random ef : 10 topK: 10
Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.
Hi @czs007 , I don't think it's a good idea to add brute force search logic into HNSW. I believe each index algorithm has its applicable scenario, user should analyze their scenario then decide to use which index type. For this case, if HNSW run slower than brute force when bitset filtering out 95% vectors, segcore can handle this judgment and use IDMAP instead of HNSW. This logic should not be added into HNSW algorith.
Milvus introduces the concept of a bitmask to the filtering mechanism to keep similar vectors with a bitmask of 1 based on satisfying specific attribute filters.
Currently knowhere will accept a parameter bitmask when doing hnsw search. When the proportion of valid data marked in the bitmask is small, the performance of hnsw continues to degrade, which is slower than brute force search.
In the above figure, the X-axis represents the number of valid data in 1 million rows of data
The hnsw index structure contains raw vector data, so it is feasible to add a brute force search process to hnsw.
Suggest: On the milvus side, according to a simple strategy (such as the percentage of valid data), an additional bool parameter is passed to knohwere, indicating whether it needs to go the brute forece search path.