[Bug]: the closest vector was not returned in search with hnsw index top150

yanliang567 commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: 2.2.2
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):

Current Behavior

The closest vector (IP distance about 0.91) was not returned by the HNSW index with top150.

Expected Behavior

The closest vector is returned as top1.

Steps To Reproduce

No response

Milvus Log

12/22/2022 11:04:32 AM - INFO - index param: {'index_type': 'HNSW', 'metric_type': 'IP', 'params': {'M': 32, 'efConstruction': 256}}
12/22/2022 11:04:32 AM - INFO - search_param: {'metric_type': 'IP', 'params': {'ef': 350}}
12/22/2022 11:04:32 AM - INFO - assert kedaxunfei_400d flushed num_entities 2785388: 0.001
12/22/2022 11:04:32 AM - INFO - {'total_rows': 2785388, 'indexed_rows': 2785388}
12/22/2022 11:04:32 AM - INFO - assert load kedaxunfei_400d: 0.004
12/22/2022 11:04:32 AM - INFO - search start: nq1_ef350_top150_threads1
12/22/2022 11:04:32 AM - INFO - nq results:
12/22/2022 11:04:32 AM - INFO - search result_0: 597aedc871763b8b5208c1aeecf61af4, 0.8745715022087097
12/22/2022 11:04:32 AM - INFO - search result_1: 1fc11513d9778ded72140c4d526849be, 0.8607193827629089

12/22/2022 11:04:12 AM - INFO - index param: {'index_type': 'FLAT', 'metric_type': 'IP', 'params': {}}
12/22/2022 11:04:12 AM - INFO - search_param: {'metric_type': 'IP', 'params': {'nprobe': 16}}
12/22/2022 11:04:12 AM - INFO - assert keda_flat flushed num_entities 2785388: 0.001
12/22/2022 11:04:12 AM - INFO - {'total_rows': 2785388, 'indexed_rows': 2785388}
12/22/2022 11:04:12 AM - INFO - assert load keda_flat: 0.005
12/22/2022 11:04:12 AM - INFO - search start: nq1_ef350_top150_threads1
12/22/2022 11:04:12 AM - INFO - nq results:
12/22/2022 11:04:12 AM - INFO - search result_0: 9ec635492eca172ebb9b492413629d84, 0.9133248329162598
12/22/2022 11:04:12 AM - INFO - search result_1: 597aedc871763b8b5208c1aeecf61af4, 0.8745715022087097
12/22/2022 11:04:12 AM - INFO - search result_2: 1fc11513d9778ded72140c4d526849be, 0.8607193827629089
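A minimal sketch for comparing the two runs above: it parses the "search result_N" lines and lists the IDs that the FLAT run returned but the HNSW run did not. The regular expression and helper names are illustrative assumptions, and only the excerpted log lines are used, not the full top150.

```python
import re

# Matches log lines such as "... - INFO - search result_0: <id>, <score>".
RESULT_RE = re.compile(r"search result_(\d+): ([0-9a-f]+), ([0-9.]+)")


def parse_results(log_text):
    """Return (id, score) tuples in the order they appear in the log."""
    hits = []
    for line in log_text.splitlines():
        m = RESULT_RE.search(line)
        if m:
            hits.append((m.group(2), float(m.group(3))))
    return hits


def missing_from(candidate, reference):
    """Hits present in the reference results but absent from the candidate."""
    seen = {pk for pk, _ in candidate}
    return [(pk, score) for pk, score in reference if pk not in seen]


# Excerpts copied from the HNSW and FLAT logs in this issue.
hnsw_log = """
12/22/2022 11:04:32 AM - INFO - search result_0: 597aedc871763b8b5208c1aeecf61af4, 0.8745715022087097
12/22/2022 11:04:32 AM - INFO - search result_1: 1fc11513d9778ded72140c4d526849be, 0.8607193827629089
"""
flat_log = """
12/22/2022 11:04:12 AM - INFO - search result_0: 9ec635492eca172ebb9b492413629d84, 0.9133248329162598
12/22/2022 11:04:12 AM - INFO - search result_1: 597aedc871763b8b5208c1aeecf61af4, 0.8745715022087097
12/22/2022 11:04:12 AM - INFO - search result_2: 1fc11513d9778ded72140c4d526849be, 0.8607193827629089
"""

missing = missing_from(parse_results(hnsw_log), parse_results(flat_log))
for pk, score in missing:
    print(pk, score)
```

On these excerpts the only ID reported as missing is 9ec635492eca172ebb9b492413629d84, the FLAT top1.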

Anything else?

输入向量.txt (input vectors) gt.txt

yanliang567 commented 1 year ago

/assign @cydrain

cydrain commented 1 year ago

scripts: create_insert_kedaxunfei.py.txt search_keda.py.txt

dzqoo commented 1 year ago

Could you share the scripts used to compute recall?
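As a rough illustration only (this is not the script used in this issue), recall against a FLAT ground truth could be computed along these lines. The function name and the choice of k are assumptions; the IDs are the ones from the logs in the issue description.

```python
def recall_at_k(result_ids, ground_truth_ids, k):
    """Fraction of the true top-k IDs that appear in the returned top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(result_ids[:k])
    return len(found) / len(truth)


# Ground truth from the FLAT run, candidate from the HNSW run (excerpts only).
flat_ids = [
    "9ec635492eca172ebb9b492413629d84",
    "597aedc871763b8b5208c1aeecf61af4",
    "1fc11513d9778ded72140c4d526849be",
]
hnsw_ids = [
    "597aedc871763b8b5208c1aeecf61af4",
    "1fc11513d9778ded72140c4d526849be",
]

print(recall_at_k(hnsw_ids, flat_ids, 3))
```

Two of the three ground-truth IDs are present, so the recall on this excerpt is 2/3.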

cydrain commented 1 year ago

Correct run 1:

12/23/2022 19:27:15 PM - INFO - switch_alias: False
12/23/2022 19:27:15 PM - INFO - switcher: False
12/23/2022 19:27:15 PM - INFO - index param: {'index_type': 'HNSW', 'metric_type': 'IP', 'params': {'M': 32, 'efConstruction': 256}}
12/23/2022 19:27:15 PM - INFO - search_param: {'metric_type': 'IP', 'params': {'ef': 150}}
12/23/2022 19:27:15 PM - INFO - assert keda_test flushed num_entities 2785388: 0.001
12/23/2022 19:27:15 PM - INFO - {'total_rows': 2785388, 'indexed_rows': 2785388}
12/23/2022 19:27:15 PM - INFO - assert load keda_test: 0.004
12/23/2022 19:27:15 PM - INFO - search start: nq1_top150_threads1
12/23/2022 19:27:15 PM - INFO - nq results:
12/23/2022 19:27:15 PM - INFO - search result_0: 9ec635492eca172ebb9b492413629d84, 0.9133248925209045
12/23/2022 19:27:15 PM - INFO - search result_1: 3c85d2293625e0a8ddc6728861a17c85, 0.8254495859146118
12/23/2022 19:27:15 PM - INFO - search result_2: 5904d812339036bab1c8683441452a70, 0.8224592804908752
12/23/2022 19:27:15 PM - INFO - search result_3: 3bb523af5ecf193616543bd642f1af52, 0.8184521198272705
12/23/2022 19:27:15 PM - INFO - search result_4: 99221918b9a6a4f54b12949fee15f8fe, 0.8147043585777283
12/23/2022 19:27:15 PM - INFO - search result_5: 5fab558c9b5a19a8beea4e8e199be42b, 0.8146071434020996
12/23/2022 19:27:15 PM - INFO - search result_6: 848a85419130f599b5e145ffaeaf6dfc, 0.8110066652297974
12/23/2022 19:27:15 PM - INFO - search result_7: e87353ef56fc4fc7b5fc1377ed1bf41b, 0.81040358543396
12/23/2022 19:27:15 PM - INFO - search result_8: 1bf2d956ca04734a99de9bbea665c8c8, 0.8093280792236328
12/23/2022 19:27:15 PM - INFO - search result_9: db39b8444c348da50135fd104e7c9b8c, 0.8066975474357605
12/23/2022 19:27:15 PM - INFO - collection keda_test search 1 times single thread: cost 0.0062, qps 161.2903, avg 0.0062, p99 0.0062 
12/23/2022 19:27:15 PM - INFO - search completed

Correct run 2: Screenshot from 2022-12-23 22-46-42

Correct run 3: Screenshot from 2022-12-24 19-55-12

cydrain commented 1 year ago

reproduce run 1: Screenshot from 2022-12-23 20-16-10

reproduce run 2: Screenshot from 2022-12-24 16-23-46

reproduce run 3: Screenshot from 2022-12-24 19-24-54

cydrain commented 1 year ago

With an IVF_FLAT index, the search returns the correct result: Screenshot from 2022-12-23 21-12-17

Screenshot from 2022-12-24 15-53-06

cydrain commented 1 year ago

Loading only 140~159.pkl also reproduces this issue.

run 1: Screenshot from 2022-12-24 20-23-52

run 2: Screenshot from 2022-12-24 20-36-03

cydrain commented 1 year ago

Milvus version	Knowhere version	Nearest vector returned
2.1.4	1.2.0	yes
2.2.0	1.3.2	yes
2.2.1	1.3.6	no
2.2.2	1.3.6	no

xiaofan-luan commented 1 year ago

@cydrain Does that mean Knowhere 1.3.6 has an accuracy issue? If it's only one vector, it might just be a false negative?

xiaofan-luan commented 1 year ago

Check the largest distance returned from each segment. We first need to determine whether this is a Knowhere issue or a Milvus issue.

cydrain commented 1 year ago

I tried Milvus 2.2.2 with Knowhere 1.3.6 and all HNSW-related changes rolled back, and the issue still exists. So it seems the issue comes from a Milvus change between 2.2.0 and 2.2.1. I will figure it out ASAP.

cydrain commented 1 year ago

This issue starts with PR #21011.

Before #21011, the search always returns the correct top1 result: Screenshot from 2022-12-26 16-47-55

After #21011, running the script returns a wrong top1 result about 50% of the time: Screenshot from 2022-12-26 16-01-32

cydrain commented 1 year ago

With "sealProportion: 0.23", HNSW does not return the correct top1 result (per segment: seg id, row count, then the top hits as (id, score)):

CYD - seg id 438317009491922314, 69593
  (1679, 0.755558)
  (11540, 0.753473)
  (10459, 0.752947)
CYD - seg id 438317009491922252, 69595
  (63989, 0.758025)
  (29080, 0.753873)
  (1881, 0.751062)
CYD - seg id 438317009491922315, 69679
  (10906, 0.757257)
  (25173, 0.754956)
  (20616, 0.749624)
CYD - seg id 438317009491922253, 69686
  (64632, 0.874571)
  (38060, 0.806698)
  (30248, 0.78817)

With "sealProportion: 0.25", HNSW returns the correct top1 result:

CYD - seg id 438316741167353238, 62648
  (55478, 0.913325)
  (54384, 0.784778)
  (38048, 0.767808)
CYD - seg id 438316741167353165, 76717
  (64632, 0.874571)
  (30248, 0.78817)
  (6744, 0.774419)
CYD - seg id 438316741167353164, 76491
  (63989, 0.758025)
  (71274, 0.755558)
  (29080, 0.753873)
CYD - seg id 438316741167353239, 62697
  (4644, 0.753473)
  (3563, 0.752947)
  (35322, 0.748726)
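The per-segment dumps above can be merged the way a top-k reduce would merge them. This is only a simplified sketch (not Milvus's actual reduce code); the function name is an assumption and the dictionaries copy the numbers printed above.

```python
import heapq


def reduce_topk(per_segment_hits, k):
    """Merge per-segment (id, score) hits into a global top-k by IP score."""
    merged = (
        (score, seg_id, row_id)
        for seg_id, hits in per_segment_hits.items()
        for row_id, score in hits
    )
    # A larger inner product means a closer vector, so keep the k largest.
    return heapq.nlargest(k, merged)


seal_023 = {
    438317009491922314: [(1679, 0.755558), (11540, 0.753473), (10459, 0.752947)],
    438317009491922252: [(63989, 0.758025), (29080, 0.753873), (1881, 0.751062)],
    438317009491922315: [(10906, 0.757257), (25173, 0.754956), (20616, 0.749624)],
    438317009491922253: [(64632, 0.874571), (38060, 0.806698), (30248, 0.78817)],
}
seal_025 = {
    438316741167353238: [(55478, 0.913325), (54384, 0.784778), (38048, 0.767808)],
    438316741167353165: [(64632, 0.874571), (30248, 0.78817), (6744, 0.774419)],
    438316741167353164: [(63989, 0.758025), (71274, 0.755558), (29080, 0.753873)],
    438316741167353239: [(4644, 0.753473), (3563, 0.752947), (35322, 0.748726)],
}

print(reduce_topk(seal_023, 1))
print(reduce_topk(seal_025, 1))
```

With these inputs the global top1 under 0.23 is the 0.874571 hit, and under 0.25 it is the 0.913325 hit. No segment under 0.23 reports a score near 0.913, so the merge step has nothing better to pick from, which suggests the miss already happens inside a per-segment search.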

xiaofan-luan commented 1 year ago

Might be related to compaction?

cydrain commented 1 year ago

I believe this issue is not caused by a bug in Milvus or Knowhere. It simply hits a corner case in which HNSW cannot find the top1 result.

This issue comes from PR #21011, which only changes the Milvus parameter "sealProportion" from 0.25 to 0.23. In Milvus 2.2.0, "sealProportion" defaults to 0.25, and the top1 vector is inserted into segment_A with row count 62648 (id 55478). In Milvus 2.2.1 or later, "sealProportion" defaults to 0.23, and the top1 vector is inserted into segment_B with row count 69679 (id 62509). I dumped the raw data of segment_A to a file named 62648_ok.fbin and the raw data of segment_B to a file named 69679_err.fbin, then loaded both directly into Knowhere, built an index, and ran the search.

For file 62648_ok.fbin, Knowhere always returns the top1 vector, whether it builds an IVF_FLAT or an HNSW index.

For file 69679_err.fbin, an IVF_FLAT index returns the top1 vector, but an HNSW index does NOT. I also tried Knowhere with third-party hnswlib (v1.3.6 / v1.3.2 / v1.2.0 / v1.0.0); none of them returns the top1 vector.

So this seems to be a corner case that hnswlib cannot handle.
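To double-check which row is the true top1 in a dumped segment, an exhaustive inner-product scan can serve as ground truth. The sketch below swaps in a plain brute-force scan for a Knowhere index, and it ASSUMES the dump uses the common ".fbin" layout (uint32 row count, uint32 dimension, then float32 values); the actual dump format in this issue is not specified. It runs on a tiny self-made file so it can be executed anywhere.

```python
import os
import struct
import sys
import tempfile
from array import array


def read_fbin(path):
    """Read vectors laid out as: uint32 n, uint32 dim, n*dim little-endian float32.

    ASSUMPTION: this is the usual ".fbin" convention, not a confirmed format.
    """
    with open(path, "rb") as f:
        n, dim = struct.unpack("<II", f.read(8))
        data = array("f")
        data.fromfile(f, n * dim)
    if sys.byteorder == "big":
        data.byteswap()  # the file is assumed little-endian
    return [data[i * dim:(i + 1) * dim] for i in range(n)]


def brute_force_top1(vectors, query):
    """Return (row id, score) of the largest inner product with the query."""
    best_id, best_score = -1, float("-inf")
    for i, vec in enumerate(vectors):
        score = sum(a * b for a, b in zip(vec, query))
        if score > best_score:
            best_id, best_score = i, score
    return best_id, best_score


# Tiny demo file (3 vectors, dim 2); replace with a real dump path to use it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fbin")
with open(demo_path, "wb") as f:
    f.write(struct.pack("<II", 3, 2))
    f.write(struct.pack("<6f", 1.0, 0.0, 0.0, 1.0, 0.5, 0.5))

vectors = read_fbin(demo_path)
best_id, best_score = brute_force_top1(vectors, [0.2, 0.9])
print(best_id, best_score)
```

If the row id from such a scan is absent from the HNSW top-k for the same data, the miss is in the graph search rather than in the data itself.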

cydrain commented 1 year ago

I used Feder to record the HNSW visit info.

Vector 55478 appears in the visit record json_62648_ok.txt.

Vector 62509 does not appear in the visit record json_69679_err.txt.

So in some cases HNSW cannot find the top1 vector because that vector is left in an isolated area of the HNSW graph.
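One way to picture "isolated" is a node that no other node links to, so a traversal from the entry point never visits it. The sketch below checks that on a toy directed graph; the graph, the function names, and the single-layer view are illustrative assumptions, and extracting real adjacency lists from hnswlib or Feder is not shown.

```python
from collections import deque


def unreachable_nodes(neighbors, entry_point):
    """Return node IDs that a BFS from entry_point never visits.

    neighbors maps node ID -> list of out-neighbor IDs for one graph layer.
    """
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(set(neighbors) - seen)


def in_degree(neighbors, node):
    """Number of nodes that list `node` among their out-neighbors."""
    return sum(node in outs for outs in neighbors.values())


# Toy graph: node 3 links to node 0, but nothing links back to node 3.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [0]}

print(unreachable_nodes(toy_graph, entry_point=0))
print(in_degree(toy_graph, 3))
```

This only covers the strongest form of isolation. A vector can be reachable in principle and still be missed when the candidate list (ef) is too small to expand the path that leads to it.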

cydrain commented 1 year ago

I found that with "ef = 1500", the search always returns the top1 vector.
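A sweep over ef values can show where the expected vector starts to appear. This is a sketch under assumptions: search_fn is a hypothetical callable that would wrap a pymilvus Collection.search call (passing the dict as param and the top-k as limit) and return a list of IDs. The fake_search stand-in below hard-codes the threshold to mirror the observation in this comment; it is not a measurement.

```python
def make_search_param(ef, metric_type="IP"):
    """Build a search-parameter dict in the same shape as the logs in this issue."""
    return {"metric_type": metric_type, "params": {"ef": ef}}


def smallest_ef_that_finds(search_fn, expected_id, limit, ef_values):
    """Return the first ef in ef_values whose results contain expected_id, else None."""
    for ef in ef_values:
        if ef < limit:
            continue  # ef below top-k cannot yield k results, so skip it
        if expected_id in search_fn(make_search_param(ef), limit):
            return ef
    return None


TOP1_ID = "9ec635492eca172ebb9b492413629d84"


def fake_search(param, limit):
    """Stand-in for a real search: finds TOP1_ID only once ef reaches 1500."""
    base = ["597aedc871763b8b5208c1aeecf61af4", "1fc11513d9778ded72140c4d526849be"]
    if param["params"]["ef"] >= 1500:
        return [TOP1_ID] + base
    return base


print(smallest_ef_that_finds(fake_search, TOP1_ID, 150, [150, 350, 700, 1500]))
```

Against a real collection, the same loop could be repeated over several queries, since a larger ef trades search latency for recall.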

xiaofan-luan commented 1 year ago

Very impressive test result. Can we improve the search result by tuning index build parameters?

cydrain commented 1 year ago

No, the build parameter "efc" (efConstruction) affects the recall rate only a little.

cydrain commented 1 year ago

Some conclusions:

- This issue is not a real bug in Milvus or Knowhere; it is caused by the HNSW algorithm.
- This issue has been triggered since Milvus 2.2.1 because PR #21011 changed the configuration "sealProportion" from 0.25 to 0.23. That changes how vectors are distributed across segments, and the top1 vector can no longer be found with "ef = 150".
- There are 2 workarounds: a) set sealProportion back to 0.25 (not so good; it may work for this issue but not for other cases); b) set ef larger, for example 1500 (better; it reduces the probability of this kind of issue).

yanliang567 commented 1 year ago

This is a by-design limitation of HNSW. Closing for now.

milvus-io / milvus