Closed yanliang567 closed 1 year ago
/assign @cydrain
Could you share the script used to compute recall?
Correct run 1:
12/23/2022 19:27:15 PM - INFO - switch_alias: False
12/23/2022 19:27:15 PM - INFO - switcher: False
12/23/2022 19:27:15 PM - INFO - index param: {'index_type': 'HNSW', 'metric_type': 'IP', 'params': {'M': 32, 'efConstruction': 256}}
12/23/2022 19:27:15 PM - INFO - search_param: {'metric_type': 'IP', 'params': {'ef': 150}}
12/23/2022 19:27:15 PM - INFO - assert keda_test flushed num_entities 2785388: 0.001
12/23/2022 19:27:15 PM - INFO - {'total_rows': 2785388, 'indexed_rows': 2785388}
12/23/2022 19:27:15 PM - INFO - assert load keda_test: 0.004
12/23/2022 19:27:15 PM - INFO - search start: nq1_top150_threads1
12/23/2022 19:27:15 PM - INFO - nq results:
12/23/2022 19:27:15 PM - INFO - search result_0: 9ec635492eca172ebb9b492413629d84, 0.9133248925209045
12/23/2022 19:27:15 PM - INFO - search result_1: 3c85d2293625e0a8ddc6728861a17c85, 0.8254495859146118
12/23/2022 19:27:15 PM - INFO - search result_2: 5904d812339036bab1c8683441452a70, 0.8224592804908752
12/23/2022 19:27:15 PM - INFO - search result_3: 3bb523af5ecf193616543bd642f1af52, 0.8184521198272705
12/23/2022 19:27:15 PM - INFO - search result_4: 99221918b9a6a4f54b12949fee15f8fe, 0.8147043585777283
12/23/2022 19:27:15 PM - INFO - search result_5: 5fab558c9b5a19a8beea4e8e199be42b, 0.8146071434020996
12/23/2022 19:27:15 PM - INFO - search result_6: 848a85419130f599b5e145ffaeaf6dfc, 0.8110066652297974
12/23/2022 19:27:15 PM - INFO - search result_7: e87353ef56fc4fc7b5fc1377ed1bf41b, 0.81040358543396
12/23/2022 19:27:15 PM - INFO - search result_8: 1bf2d956ca04734a99de9bbea665c8c8, 0.8093280792236328
12/23/2022 19:27:15 PM - INFO - search result_9: db39b8444c348da50135fd104e7c9b8c, 0.8066975474357605
12/23/2022 19:27:15 PM - INFO - collection keda_test search 1 times single thread: cost 0.0062, qps 161.2903, avg 0.0062, p99 0.0062
12/23/2022 19:27:15 PM - INFO - search completed
Correct run 2:
Correct run 3:
reproduce run 1:
reproduce run 2:
reproduce run 3:
With an IVF_FLAT index, the search returns the correct result:
Loading 140~159.pkl also reproduces this issue:
run 1:
run 2:
Milvus version | Knowhere version | search out nearest vector |
---|---|---|
2.1.4 | 1.2.0 | yes |
2.2.0 | 1.3.2 | yes |
2.2.1 | 1.3.6 | no |
2.2.2 | 1.3.6 | no |
@cydrain Does that mean Knowhere 1.3.6 has an accuracy issue? If it's only one vector, it might be just a negative?
Check the largest distance returned from each segment. We first need to determine whether this is a Knowhere issue or a Milvus issue.
I tried Milvus 2.2.2 with Knowhere 1.3.6 and all HNSW-related changes rolled back, and the issue still exists. So it seems the issue comes from a Milvus change between 2.2.0 and 2.2.1. I will figure it out ASAP.
This issue starts from PR #21011
Before #21011, the search always returns the correct top-1 result:
After #21011, running the script returns a wrong top-1 result about 50% of the time:
With "sealProportion: 0.23", HNSW cannot return the correct top-1 result:
CYD - seg id 438317009491922314, 69593
(1679, 0.755558)
(11540, 0.753473)
(10459, 0.752947)
CYD - seg id 438317009491922252, 69595
(63989, 0.758025)
(29080, 0.753873)
(1881, 0.751062)
CYD - seg id 438317009491922315, 69679
(10906, 0.757257)
(25173, 0.754956)
(20616, 0.749624)
CYD - seg id 438317009491922253, 69686
(64632, 0.874571)
(38060, 0.806698)
(30248, 0.78817)
With "sealProportion: 0.25", HNSW returns the correct top-1 result:
CYD - seg id 438316741167353238, 62648
(55478, 0.913325)
(54384, 0.784778)
(38048, 0.767808)
CYD - seg id 438316741167353165, 76717
(64632, 0.874571)
(30248, 0.78817)
(6744, 0.774419)
CYD - seg id 438316741167353164, 76491
(63989, 0.758025)
(71274, 0.755558)
(29080, 0.753873)
CYD - seg id 438316741167353239, 62697
(4644, 0.753473)
(3563, 0.752947)
(35322, 0.748726)
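To make the two dumps easier to compare, here is a minimal Python sketch that merges the per-segment top-3 lists into a global top-1. The numbers are copied from the debug output above; `merge_topk` and the dictionary names are hypothetical helpers for illustration, not the Milvus reduce implementation.

```python
import heapq

# Per-segment top-3 hits as (row offset, IP score), copied from the dumps above.
SEGMENTS_023 = {
    438317009491922314: [(1679, 0.755558), (11540, 0.753473), (10459, 0.752947)],
    438317009491922252: [(63989, 0.758025), (29080, 0.753873), (1881, 0.751062)],
    438317009491922315: [(10906, 0.757257), (25173, 0.754956), (20616, 0.749624)],
    438317009491922253: [(64632, 0.874571), (38060, 0.806698), (30248, 0.78817)],
}
SEGMENTS_025 = {
    438316741167353238: [(55478, 0.913325), (54384, 0.784778), (38048, 0.767808)],
    438316741167353165: [(64632, 0.874571), (30248, 0.78817), (6744, 0.774419)],
    438316741167353164: [(63989, 0.758025), (71274, 0.755558), (29080, 0.753873)],
    438316741167353239: [(4644, 0.753473), (3563, 0.752947), (35322, 0.748726)],
}


def merge_topk(per_segment, k):
    """Merge per-segment hits into a global top-k (larger IP score is better)."""
    hits = (
        (score, seg_id, offset)
        for seg_id, results in per_segment.items()
        for offset, score in results
    )
    return heapq.nlargest(k, hits)


best_023 = merge_topk(SEGMENTS_023, 1)[0]
best_025 = merge_topk(SEGMENTS_025, 1)[0]
print("sealProportion 0.23:", best_023)
print("sealProportion 0.25:", best_025)
```

With 0.23 the best merged score is 0.874571, while with 0.25 it is 0.913325, so the 0.913325 vector is already missing from the per-segment results before any merge happens.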
Might be related to compaction?
I believe this issue is not caused by a bug in Milvus or Knowhere. It hits a corner case in which HNSW cannot find the top-1 result.
This issue comes from PR #21011, which only changes the Milvus parameter "sealProportion" from 0.25 to 0.23. In Milvus 2.2.0, "sealProportion" defaults to 0.25 and the top-1 vector is inserted into segment_A with row count 62648 (id 55478); in Milvus 2.2.1 and later, "sealProportion" defaults to 0.23 and the top-1 vector is inserted into segment_B with row count 69679 (id 62509). I dumped the raw data of segment_A to a file named 62648_ok.fbin and the raw data of segment_B to a file named 69679_err.fbin, then loaded these files into Knowhere directly, built an index, and ran the search.
For 62648_ok.fbin, whether Knowhere builds an IVF_FLAT or an HNSW index, the search always returns the top-1 vector.
For 69679_err.fbin, an IVF_FLAT index returns the top-1 vector, but an HNSW index CANNOT. I also tried Knowhere with thirdparty hnswlib (v1.3.6 / v1.3.2 / v1.2.0 / v1.0.0); none of them returns the top-1 vector.
So this seems to be a corner case that hnswlib cannot handle.
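The dump-and-check experiment above can be approximated with a brute-force baseline. The following is a minimal sketch, assuming the .fbin files use the common layout of two little-endian uint32 values (row count, dimension) followed by float32 data; the actual layout of the dumps is not stated in this thread. `read_fbin`, `brute_force_top1`, and `hit_top1` are hypothetical helper names.

```python
import struct


def read_fbin(path):
    """Read vectors from a .fbin file.

    Assumed layout (not confirmed by the thread): uint32 n, uint32 dim,
    then n * dim little-endian float32 values.
    """
    with open(path, "rb") as f:
        n, dim = struct.unpack("<II", f.read(8))
        flat = struct.unpack(f"<{n * dim}f", f.read(4 * n * dim))
    return [flat[i * dim:(i + 1) * dim] for i in range(n)]


def brute_force_top1(vectors, query):
    """Exact top-1 by inner product, returned as (row offset, score)."""
    best_id, best_score = -1, float("-inf")
    for i, vec in enumerate(vectors):
        score = sum(a * b for a, b in zip(vec, query))
        if score > best_score:
            best_id, best_score = i, score
    return best_id, best_score


def hit_top1(ann_ids, vectors, query):
    """True if the exact nearest vector is among the ids an index returned."""
    exact_id, _ = brute_force_top1(vectors, query)
    return exact_id in ann_ids
```

A possible use is `vectors = read_fbin("69679_err.fbin")` followed by `hit_top1(ids_from_hnsw, vectors, query)`, where `ids_from_hnsw` and `query` stand for the ids returned by the index under test and the query vector, neither of which is included here.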
I used Feder to record the HNSW visit info:
55478 can be found in json_62648_ok.txt
62509 cannot be found in json_69679_err.txt
So in some cases HNSW cannot find the top-1 vector because that vector is left in an isolated area of the HNSW graph.
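As an illustration of what an "isolated area" can mean, here is a small Python sketch on a toy directed graph. The adjacency list is invented for the example and is not the Feder output format. It shows the extreme case, a node that has outgoing links but no incoming ones, so a search that follows links from the entry point can never visit it. In this issue the vector was probably only poorly connected rather than fully cut off, since a much larger "ef" does find it.

```python
from collections import deque


def in_degrees(out_edges):
    """Count incoming links per node; a low count means few ways to reach it."""
    counts = {node: 0 for node in out_edges}
    for neighbors in out_edges.values():
        for nb in neighbors:
            counts[nb] = counts.get(nb, 0) + 1
    return counts


def unreachable_nodes(out_edges, entry_point):
    """Nodes that a link-following search from entry_point can never visit."""
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        node = queue.popleft()
        for nb in out_edges.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return sorted(set(out_edges) - seen)


# Toy layer-0 graph: node 4 links out to 2 and 3, but nothing links back to it.
TOY_GRAPH = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2], 4: [2, 3]}

print(in_degrees(TOY_GRAPH))
print(unreachable_nodes(TOY_GRAPH, entry_point=0))
```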
I found that with "ef = 1500", the search always returns the top-1 vector.
Very impressive test result. Can we improve the search result by tuning the index build parameters?
No, the build parameter "efc" (efConstruction) affects the recall rate only a little.
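Why a larger "ef" helps can be seen on a toy example. The sketch below is a simplified single-layer best-first search in the style of HNSW, written for illustration only; it is not the hnswlib code, and the graph and scores are invented. The best node (5) is reachable only through a low-scoring node (4), so the search finds it only when the candidate list is large enough to keep node 4 around.

```python
import heapq

# Invented inner-product scores of each node against the query (higher is better).
SCORES = {0: 0.50, 1: 0.80, 2: 0.85, 3: 0.87, 4: 0.30, 5: 0.91}
# Invented links: node 5 (the true top-1) hangs off the low-scoring node 4.
GRAPH = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0, 5], 5: [4]}


def search_layer(graph, scores, entry, ef):
    """Best-first search that keeps at most ef results, best first."""
    visited = {entry}
    candidates = [(-scores[entry], entry)]  # max-heap on score
    results = [(scores[entry], entry)]      # min-heap on score, size <= ef
    while candidates:
        neg_score, node = heapq.heappop(candidates)
        # Stop once the best open candidate is worse than the worst kept result.
        if len(results) >= ef and -neg_score < results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            if len(results) < ef or scores[nb] > results[0][0]:
                heapq.heappush(candidates, (-scores[nb], nb))
                heapq.heappush(results, (scores[nb], nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted(results, reverse=True)


def top1(ef):
    """Node id of the best result found with the given ef."""
    return search_layer(GRAPH, SCORES, entry=0, ef=ef)[0][1]


for ef in range(1, 6):
    print(f"ef={ef}: top1 node = {top1(ef)}")
```

On this toy graph, ef values 1 through 4 return node 3 (score 0.87), and only ef = 5 returns node 5 (score 0.91). This mirrors the behavior reported above, at a much smaller scale.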
Some conclusions:
This is a by-design limitation of HNSW. Closing for now.
Is there an existing issue for this?
Environment
Current Behavior
The closest vector (IP distance: 0.92) was not returned by the HNSW index with top 150.
Expected Behavior
The closest vector is returned as top 1.
Steps To Reproduce
No response
Milvus Log
Anything else?
输入向量.txt (input vectors) gt.txt