zhuwenxing opened this issue 1 month ago
How about changing drop_ratio_build to 0? A non-zero drop_ratio_build inevitably leads to information loss and lower accuracy.
Such information loss is even worse with randomly generated data:
If supported, try to generate the corpus with a skewed distribution (some words occur more or less frequently than others) instead of an even one (all words have the same chance to occur); see the sketch below.
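A minimal sketch of such a generator, assuming a synthetic word0…wordN vocabulary (all names here are illustrative, not part of the test suite):

```python
import random
import numpy as np

vocab = [f"word{i}" for i in range(10_000)]  # hypothetical vocabulary

def even_doc(length=50):
    # "even" distribution: every word has the same chance to occur
    return " ".join(random.choices(vocab, k=length))

def skewed_doc(length=50, a=1.3):
    # skewed (Zipf-like) distribution: a few words occur very often,
    # most words occur rarely -- closer to natural-language corpora
    ranks = np.random.zipf(a, size=length)
    ranks = np.minimum(ranks, len(vocab)) - 1  # clamp to the vocabulary
    return " ".join(vocab[r] for r in ranks)

corpus = [skewed_doc() for _ in range(5000)]
```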
Maybe we should set drop_ratio_build to 0 by default?
How much will performance drop if drop_ratio_build is set to 0?
@xiaofan-luan the default drop_ratio_build is 0. Wenxing manually set drop_ratio_build to 0.3 in this case.
With drop_ratio_build=0, 1000 computations are needed to get the complete IP score.
With drop_ratio_build=0.3 (if the value distribution is even across all doc vectors), only ~700 computations are needed; the contributions of the smallest 300 values are dropped.
There is also drop_ratio_search, where we drop values in the query instead.
/assign @zhuwenxing
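To make the arithmetic above concrete, here is a hedged numpy sketch of the drop-by-ratio idea (an illustration of the described behaviour, not knowhere's actual code):

```python
import numpy as np

def drop_smallest_ratio(values, drop_ratio):
    """Zero out the smallest drop_ratio fraction of entries (ties may
    cause slightly more to be dropped); mimics drop_ratio_build."""
    n_drop = int(len(values) * drop_ratio)
    if n_drop == 0:
        return values
    cutoff = np.partition(values, n_drop - 1)[n_drop - 1]
    return np.where(values > cutoff, values, 0.0)

doc = np.random.rand(1000)    # 1000 non-zero values in a doc vector
query = np.random.rand(1000)

full_ip = query @ doc                               # all 1000 terms computed
approx_ip = query @ drop_smallest_ratio(doc, 0.3)   # only ~700 terms survive
print(full_ip, approx_ip)  # the pruned score underestimates the true IP
```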
When drop_ratio_build is not set in the param, the default parameter values are used, and the BeIR benchmark is employed for testing. The results are compared with those of Lucene: for every metric, at larger top-k values Milvus' results are consistently lower than Lucene's.
dataset: nfcorpus
[2024-10-11 19:39:04 - INFO - ci_test]: milvus full text search result ({'NDCG@1': 0.42724, 'NDCG@10': 0.30977, 'NDCG@100': 0.26237, 'NDCG@1000': 0.29521}, {'MAP@1': 0.05857, 'MAP@10': 0.12013, 'MAP@100': 0.14085, 'MAP@1000': 0.14576}, {'Recall@1': 0.05857, 'Recall@10': 0.15202, 'Recall@100': 0.23542, 'Recall@1000': 0.36877}, {'P@1': 0.44272, 'P@10': 0.21734, 'P@100': 0.05864, 'P@1000': 0.01062}) (test_full_text_search.py:785)
[2024-10-11 19:39:04 - INFO - ci_test]: lucene full text search result ({'NDCG@1': 0.42724, 'NDCG@10': 0.30641, 'NDCG@100': 0.26821, 'NDCG@1000': 0.33108}, {'MAP@1': 0.05686, 'MAP@10': 0.11683, 'MAP@100': 0.13967, 'MAP@1000': 0.15019}, {'Recall@1': 0.05686, 'Recall@10': 0.14505, 'Recall@100': 0.25117, 'Recall@1000': 0.45723}, {'P@1': 0.44582, 'P@10': 0.21641, 'P@100': 0.06406, 'P@1000': 0.01728}) (test_full_text_search.py:786)
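For reference, the numbers above look like the output of the beir package's evaluator; a rough sketch of such a run (milvus_search is a hypothetical stand-in for the retrieval call in test_full_text_search.py):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# download and load the nfcorpus BeIR dataset
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# results: {query_id: {doc_id: score}} from the system under test
results = milvus_search(corpus, queries)  # hypothetical helper

# NDCG/MAP/Recall/P at the same cut-offs as the logs above
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(
    qrels, results, k_values=[1, 10, 100, 1000])
```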
This is a good catch.
The drop ratio has a big impact at top-1000, which makes sense (for many long-tail results we ignored some critical information here).
What is the default drop ratio? Instead of simply ignoring some percentage of the entries, we should probably ignore entries with weights under a certain threshold.
@liliu-z @zhengbuqian it also shows that the drop ratio has its major drawback on very similar documents
This is on the roadmap. We need to drop based on the data distribution instead of a simple ratio; the purpose is to drop outlier data rather than a fixed percentage of the data. @zhengbuqian
/assign @hhy3 Plz take a look
As my previous experiments showed, dropping by threshold is much better than dropping by percentage. If knowhere's sparse index is implemented as drop-by-percentage, it should be changed to drop-by-threshold IMO. With this approach a significant performance improvement can be gained with nearly no recall loss if the results are refined; a toy comparison follows.
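An illustration of the distinction (assumed semantics, not the knowhere implementation): unlike the fixed-ratio drop sketched earlier, which removes a fixed share of entries even when they carry large weights, drop-by-threshold removes only entries whose individual contribution is small, so the amount dropped adapts to each vector:

```python
import numpy as np

def drop_by_threshold(values, tau=0.05):
    # zeroes only entries below an absolute cut tau; a vector whose
    # values are all large loses nothing
    return np.where(values >= tau, values, 0.0)

dense_doc = np.full(100, 0.9)                     # every entry matters
print((drop_by_threshold(dense_doc) > 0).sum())   # 100 kept: no loss
tail_doc = np.concatenate([np.full(70, 0.9), np.full(30, 0.01)])
print((drop_by_threshold(tail_doc) > 0).sum())    # 70 kept: only the tail dropped
```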
I think drop might lose critical information. How about quantization?
Drop and quantization are two different aspects, and they can be combined. In this case drop gets much more improvement than quantization. And with the drop-by-threshold method there is a theoretical upper bound on the error, so the loss of information is controllable.
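One way to read the claimed bound (my own derivation, not a statement from the knowhere docs): for non-negative BM25-style weights, dropping every document value below a threshold $\tau$ changes the inner product by at most $\tau$ times the $\ell_1$ mass of the query:

$$\bigl|\langle q, d\rangle - \langle q, \tilde d\rangle\bigr| \;=\; \sum_{i:\; d_i < \tau} q_i\, d_i \;\le\; \tau \sum_i q_i \;=\; \tau\,\lVert q\rVert_1 .$$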
Is there an existing issue for this?
Environment
Current Behavior
(truncated pandas dump of the randomly generated test corpus: 5000 rows x 6 columns — integer ids, several random-text fields, and a float vector column)
[5000 rows x 6 columns] (test_search.py:13243)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.94 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.77 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.97 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
[2024-10-10 16:40:26 - INFO - ci_test]: recall 0.81 (test_search.py:13292)
Expected Behavior
Results should be basically consistent with Lucene BM25.
Steps To Reproduce
No response
Milvus Log
test code
Anything else?
No response