Open zhuwenxing opened 2 weeks ago
How about changing drop_ratio_build to 0? A non-zero drop_ratio_build inevitably leads to information loss and lower accuracy.
Such information loss is even worse with randomly generated data.
If supported, try to generate the corpus with a non-uniform distribution (some words occur more or less frequently than others) instead of an even one (all words have the same chance to occur).
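As a rough illustration of the suggestion above, here is a minimal sketch of generating a test corpus with skewed word frequencies. It assumes a Zipf-like weighting, which is only one possible choice of non-uniform distribution; `make_corpus`, the `zipf_s` parameter, and the synthetic vocabulary are hypothetical names and not part of the test suite.

```python
import random

def make_corpus(vocab, num_docs, doc_len, zipf_s=1.0, seed=0):
    """Build documents by sampling words with Zipf-like weights.

    The word at rank r (1-based) gets weight 1 / r**zipf_s, so
    zipf_s=0.0 reproduces the even case where every word is equally likely.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** zipf_s) for rank in range(1, len(vocab) + 1)]
    return [
        " ".join(rng.choices(vocab, weights=weights, k=doc_len))
        for _ in range(num_docs)
    ]

# Hypothetical vocabulary, for illustration only.
vocab = [f"w{i}" for i in range(1000)]
skewed = make_corpus(vocab, num_docs=200, doc_len=50, zipf_s=1.0)
even = make_corpus(vocab, num_docs=200, doc_len=50, zipf_s=0.0)
```

With skewed frequencies, BM25 term weights vary much more between common and rare words, which is closer to what a real text corpus looks like.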
Maybe we should set drop_ratio_build to 0 by default?
How much will performance drop if drop_ratio_build is set to 0?
@xiaofan-luan The default drop_ratio_build is 0. Wenxing manually set drop_ratio_build to 0.3 in this case.
With drop_ratio_build=0, 1000 computations are needed to get the complete IP score.
With drop_ratio_build=0.3 (if the value distribution is even across all doc vectors), only ~700 computations are needed. The scores of the smallest 300 are dropped.
There is also drop_ratio_search, where we drop values in the query instead.
/assign @zhuwenxing
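The trade-off described for drop_ratio_build can be sketched with a toy example. This is a simplification and not how Milvus implements index building: it prunes a single document vector with 1000 non-zero values, and the helper names `drop_smallest` and `inner_product` are hypothetical.

```python
import random

def drop_smallest(values, drop_ratio):
    """Zero out the smallest `drop_ratio` fraction of values by magnitude."""
    n_drop = round(len(values) * drop_ratio)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else v for i, v in enumerate(values)]

def inner_product(a, b):
    """Inner product that skips zero entries, as a sparse index would."""
    return sum(x * y for x, y in zip(a, b) if x != 0.0 and y != 0.0)

rng = random.Random(0)
doc = [rng.random() for _ in range(1000)]   # toy doc vector, 1000 non-zero values
query = [1.0] * 1000                        # toy query touching every dimension

full_score = inner_product(query, doc)           # drop ratio 0: all 1000 terms
pruned_doc = drop_smallest(doc, 0.3)             # drop ratio 0.3
approx_score = inner_product(query, pruned_doc)  # only the surviving terms
n_terms = sum(1 for v in pruned_doc if v != 0.0)

print(f"terms used: {n_terms}, full: {full_score:.3f}, pruned: {approx_score:.3f}")
```

In this toy setup 700 of the 1000 terms survive, so fewer multiplications are needed, but the pruned score is strictly lower than the complete IP score. That missing contribution is the information loss discussed in this thread.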
When drop_ratio_build is not set in 'param', the default parameter values are used. I tested with the BeIR benchmark and compared the results with Lucene's. For every metric, Milvus' results fall consistently below Lucene's as top-k gets larger.
dataset: nfcorpus
[2024-10-11 19:39:04 - INFO - ci_test]: milvus full text search result ({'NDCG@1': 0.42724, 'NDCG@10': 0.30977, 'NDCG@100': 0.26237, 'NDCG@1000': 0.29521}, {'MAP@1': 0.05857, 'MAP@10': 0.12013, 'MAP@100': 0.14085, 'MAP@1000': 0.14576}, {'Recall@1': 0.05857, 'Recall@10': 0.15202, 'Recall@100': 0.23542, 'Recall@1000': 0.36877}, {'P@1': 0.44272, 'P@10': 0.21734, 'P@100': 0.05864, 'P@1000': 0.01062}) (test_full_text_search.py:785)
[2024-10-11 19:39:04 - INFO - ci_test]: lucene full text search result ({'NDCG@1': 0.42724, 'NDCG@10': 0.30641, 'NDCG@100': 0.26821, 'NDCG@1000': 0.33108}, {'MAP@1': 0.05686, 'MAP@10': 0.11683, 'MAP@100': 0.13967, 'MAP@1000': 0.15019}, {'Recall@1': 0.05686, 'Recall@10': 0.14505, 'Recall@100': 0.25117, 'Recall@1000': 0.45723}, {'P@1': 0.44582, 'P@10': 0.21641, 'P@100': 0.06406, 'P@1000': 0.01728}) (test_full_text_search.py:786)
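To make the gap easier to read, the two log lines above can be compared metric by metric. The snippet below is only a convenience sketch: the numbers are copied from the logs, and `relative_gap` is a hypothetical helper.

```python
milvus = {
    "NDCG@1": 0.42724, "NDCG@10": 0.30977, "NDCG@100": 0.26237, "NDCG@1000": 0.29521,
    "MAP@1": 0.05857, "MAP@10": 0.12013, "MAP@100": 0.14085, "MAP@1000": 0.14576,
    "Recall@1": 0.05857, "Recall@10": 0.15202, "Recall@100": 0.23542, "Recall@1000": 0.36877,
    "P@1": 0.44272, "P@10": 0.21734, "P@100": 0.05864, "P@1000": 0.01062,
}
lucene = {
    "NDCG@1": 0.42724, "NDCG@10": 0.30641, "NDCG@100": 0.26821, "NDCG@1000": 0.33108,
    "MAP@1": 0.05686, "MAP@10": 0.11683, "MAP@100": 0.13967, "MAP@1000": 0.15019,
    "Recall@1": 0.05686, "Recall@10": 0.14505, "Recall@100": 0.25117, "Recall@1000": 0.45723,
    "P@1": 0.44582, "P@10": 0.21641, "P@100": 0.06406, "P@1000": 0.01728,
}

def relative_gap(ours, baseline):
    """Return (ours - baseline) / baseline for every metric in `ours`."""
    return {name: (ours[name] - baseline[name]) / baseline[name] for name in ours}

gaps = relative_gap(milvus, lucene)
for name, gap in gaps.items():
    print(f"{name:12s} milvus={milvus[name]:.5f} lucene={lucene[name]:.5f} gap={gap:+.1%}")
```

On these numbers Milvus is at or near Lucene's level for k=1 and k=10, and every metric at k=1000 is lower for Milvus, with Recall@1000 showing a gap of roughly 19%. MAP@100 is the one large-k metric where Milvus is marginally higher.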
Is there an existing issue for this?
Environment
Current Behavior
0 0 whom stuff he like what oil. business order memory future. recent would mou... yourself speech husband. region blue herself h... [0.6023883783276387, 0.8479405633974526, 0.395...
1 1 truth local garden state poor with. skin mind up economy involve. accept much prep... oil research politics police spring. choice ec... [0.8383437112215729, 0.06055977783095379, 0.73...
2 2 join note pm sit myself debate likely. ahead difficult hope toward even. training mee... pass town whatever condition eat owner. road s... [0.36961477489457273, 0.12855484140848816, 0.1...
3 3 task begin consider technology kind choose care foo... four than realize worker. physical the letter ... body kind approach often talk seek great. mont... [0.011019903782853002, 0.4413668100371435, 0.0...
4 4 wife how political standard. law base know mrs window yeah. option bit citi... student where effect plant. rise discover migh... [0.7389940504402016, 0.41201826565585764, 0.18...
... ... ... ... ... ... ...
4995 4995 fill note stand effect daughter pm rock newspaper. forward cut tough professor writer fund. coach... trip opportunity read fire rule.\ntough others... [0.6961687444442494, 0.7178086302497875, 0.118...
4996 4996 democrat similar beautiful personal drop customer. shake rich figure someone doctor manager somet... thus forget see. or section bring camera would... [0.8663225967105492, 0.316593111025503, 0.6386...
4997 4997 set cut apply nor sell. debate strong consider though field risk struc... need remain another employee just. interesting... [0.053935738989942195, 0.6391806368953511, 0.7...
4998 4998 minute already event spring since. lead the media responsibility. manage though d... resource activity lawyer leg friend. big end o... [0.8087548926914436, 0.17760461770648828, 0.28...
4999 4999 town american dark help glass house. term military party day its. education to more... by some energy think other beat. always fear t... [0.6603326843401671, 0.7072375978621074, 0.748...
[5000 rows x 6 columns] (test_search.py:13243)
BM25S Retrieve: 0%| | 0/10 [00:00<?, ?it/s]
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.94 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.77 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.97 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.81 (test_search.py:13292)
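For reference, a per-query recall like the ones logged above could be computed as in the sketch below. This assumes recall is measured as the overlap between the IDs returned by Milvus and the IDs returned by the BM25S baseline; the actual test code is not shown here, and `recall` and the example IDs are hypothetical.

```python
def recall(result_ids, truth_ids):
    """Fraction of ground-truth IDs that also appear in the search result."""
    truth = set(truth_ids)
    if not truth:
        return 0.0
    return len(truth & set(result_ids)) / len(truth)

# Hypothetical IDs for a single query, for illustration only.
print(recall(result_ids=[1, 2, 3, 5], truth_ids=[1, 2, 3, 4]))

# Mean of the ten per-query recall values from the log.
logged = [0.95, 0.96, 0.96, 0.94, 0.77, 0.83, 0.95, 0.97, 0.83, 0.81]
mean_recall = sum(logged) / len(logged)
print(f"mean recall over {len(logged)} queries: {mean_recall:.3f}")
```

The ten logged values average about 0.897, with individual queries as low as 0.77.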
Expected Behavior
Milvus should achieve results basically consistent with Lucene BM25.
Steps To Reproduce
No response
Milvus Log
test code
Anything else?
No response