milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.93k stars 2.95k forks

[Bug]: Full-text search results, compared against those computed with Lucene BM25, show a recall rate ranging from 0.7 to 0.9 #36739

Open zhuwenxing opened 1 month ago

zhuwenxing commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version:zhengbuqian-doc-in-restful-d174d05-20241010
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

            id  word      sentence                                         paragraph                                           text                                                emb
    0        0  whom      stuff he like what oil.                          business order memory future. recent would mou...  yourself speech husband. region blue herself h...  [0.6023883783276387, 0.8479405633974526, 0.395...
    1        1  truth     local garden state poor with.                    skin mind up economy involve. accept much prep...  oil research politics police spring. choice ec...  [0.8383437112215729, 0.06055977783095379, 0.73...
    2        2  join      note pm sit myself debate likely.                ahead difficult hope toward even. training mee...  pass town whatever condition eat owner. road s...  [0.36961477489457273, 0.12855484140848816, 0.1...
    3        3  task      begin consider technology kind choose care foo...  four than realize worker. physical the letter ...  body kind approach often talk seek great. mont...  [0.011019903782853002, 0.4413668100371435, 0.0...
    4        4  wife      how political standard.                          law base know mrs window yeah. option bit citi...  student where effect plant. rise discover migh...  [0.7389940504402016, 0.41201826565585764, 0.18...
    ...    ...  ...       ...                                              ...                                                 ...                                                 ...
    4995  4995  fill      note stand effect daughter pm rock newspaper.    forward cut tough professor writer fund. coach...  trip opportunity read fire rule.\ntough others...  [0.6961687444442494, 0.7178086302497875, 0.118...
    4996  4996  democrat  similar beautiful personal drop customer.        shake rich figure someone doctor manager somet...  thus forget see. or section bring camera would...  [0.8663225967105492, 0.316593111025503, 0.6386...
    4997  4997  set       cut apply nor sell.                              debate strong consider though field risk struc...  need remain another employee just. interesting...  [0.053935738989942195, 0.6391806368953511, 0.7...
    4998  4998  minute    already event spring since.                      lead the media responsibility. manage though d...  resource activity lawyer leg friend. big end o...  [0.8087548926914436, 0.17760461770648828, 0.28...
    4999  4999  town      american dark help glass house.                  term military party day its. education to more...  by some energy think other beat. always fear t...  [0.6603326843401671, 0.7072375978621074, 0.748...

    [5000 rows x 6 columns] (test_search.py:13243)
    BM25S Retrieve:   0%|          | 0/10 [00:00<?, ?it/s]
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.94 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.77 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.97 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.81 (test_search.py:13292)

Expected Behavior

Results should be basically consistent with Lucene BM25.

Steps To Reproduce

No response

Milvus Log

test code

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("enable_partition_key", [True, False])
    @pytest.mark.parametrize("enable_inverted_index", [True, False])
    @pytest.mark.parametrize("tokenizer", ["default"])
    def test_search_with_full_text_search(
            self, tokenizer, enable_inverted_index, enable_partition_key
    ):
        """
        target: test full text search
        method: 1. enable full text search and insert data with varchar
                2. search with text
                3. verify the result
        expected: full text search successfully and result is correct
        """
        tokenizer_params = {
            "tokenizer": tokenizer,
        }
        dim = 128
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
            FieldSchema(
                name="word",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
                is_partition_key=enable_partition_key,
            ),
            FieldSchema(
                name="sentence",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="paragraph",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="text",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="text_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
        ]
        schema = CollectionSchema(fields=fields, description="test collection")
        bm25_function = Function(
            name="text_bm25_emb",
            function_type=FunctionType.BM25,
            input_field_names=["text"],
            output_field_names=["text_sparse_emb"],
            params={},
        )
        schema.add_function(bm25_function)
        data_size = 5000
        collection_w = self.init_collection_wrap(
            name=cf.gen_unique_str(prefix), schema=schema
        )
        fake = fake_en
        if tokenizer == "jieba":
            language = "zh"
            fake = fake_zh
        else:
            language = "en"

        data = [
            {
                "id": i,
                "word": fake.word().lower(),
                "sentence": fake.sentence().lower(),
                "paragraph": fake.paragraph().lower(),
                "text": fake.text().lower(),
                "emb": [random.random() for _ in range(dim)],
            }
            for i in range(data_size)
        ]
        df = pd.DataFrame(data)
        corpus = df["text"].to_list()
        log.info(f"dataframe\n{df}")
        batch_size = 5000
        for i in range(0, len(df), batch_size):
            collection_w.insert(
                data[i : i + batch_size]
                if i + batch_size < len(df)
                else data[i : len(df)]
            )
            collection_w.flush()
        collection_w.create_index(
            "emb",
            {"index_type": "HNSW", "metric_type": "L2", "params": {"M": 16, "efConstruction": 500}},
        )
        collection_w.create_index(
            "text_sparse_emb",
            {
                "index_type": "SPARSE_INVERTED_INDEX",
                "metric_type": "BM25",
                "params": {
                    "drop_ratio_build": 0.3,
                    "bm25_k1": 1.5,  # if k1 and b are left unspecified at index params,
                    "bm25_b": 0.75,  # values set in function params will be used.
                }
            }
        )
        if enable_inverted_index:
            collection_w.create_index("text", {"index_type": "INVERTED"})
        collection_w.load()
        nq = 10
        limit = 100
        search_data = [fake.text().lower() for _ in range(nq)]
        res_list, _ = collection_w.search(
                        data=search_data,
                        anns_field="text_sparse_emb",
                        param={},
                        limit=limit,
                        output_fields=["id", "text", "text_sparse_emb"])

        results, scores = cf.get_bm25_ground_truth(corpus, search_data, top_k=limit)
        for i in range(len(res_list)):
            res = res_list[i]
            log.info(f"res len {len(res)} res {res}")
            assert len(res) == limit
            text_get = [r.entity.text for r in res]
            text_expected = [results[i][j] for j in range(limit)]
            log.info(f"text_get {text_get}")
            log.info(f"text_expected {text_expected}")
            # get recall
            recall = len(set(text_get).intersection(set(text_expected))) / len(set(text_expected))
            log.info(f"recall {recall}")
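
`cf.get_bm25_ground_truth` above is a helper from the test framework whose implementation isn't shown here. As a rough, hypothetical illustration of what such a ground truth does, here is a minimal pure-Python BM25 ranker (using the same k1=1.5, b=0.75 as the index params in the test) together with the set-overlap recall the test computes:

```python
import math
from collections import Counter

def bm25_ground_truth(corpus, queries, top_k=100, k1=1.5, b=0.75):
    """Rank corpus documents for each query with plain BM25.
    Returns, per query, the texts of the top_k documents."""
    docs = [doc.split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    # document frequency per term
    df = Counter()
    for d in docs:
        df.update(set(d))
    tfs = [Counter(d) for d in docs]

    def idf(term):
        # Lucene-style idf, always positive
        return math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))

    results = []
    for q in queries:
        scored = []
        for i, tf in enumerate(tfs):
            dl = len(docs[i])
            s = 0.0
            for t in set(q.split()):
                if tf[t] == 0:
                    continue
                s += idf(t) * tf[t] * (k1 + 1) / (
                    tf[t] + k1 * (1 - b + b * dl / avgdl))
            scored.append((s, i))
        scored.sort(key=lambda x: x[0], reverse=True)
        results.append([corpus[i] for _, i in scored[:top_k]])
    return results

def recall(retrieved, expected):
    """Set-overlap recall, as in the test above."""
    return len(set(retrieved) & set(expected)) / len(set(expected))
```

Note that set-overlap recall ignores rank order: two result lists with the same members at different positions score 1.0, which is why the thread below also looks at rank-sensitive metrics like NDCG.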

Anything else?

No response

zhengbuqian commented 1 month ago

How about changing drop_ratio_build to 0? A non-zero drop_ratio_build inevitably leads to information loss and lower accuracy.

Such info loss is even worse with randomly generated data:

If supported, try generating the corpus with a non-uniform distribution (some words occur more or less frequently) instead of an even one (all words have the same chance of occurring).
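
The suggestion above can be sketched as follows. This is an illustrative snippet, not part of the test suite: `uniform_corpus` samples every vocabulary word with equal probability, while `zipf_corpus` uses Zipf-like weights so a few words dominate and most are rare, which is closer to natural text and makes BM25's idf informative:

```python
import random

VOCAB = [f"word{i}" for i in range(1000)]

def uniform_corpus(n_docs, doc_len=50, seed=0):
    """Every word equally likely: term frequencies carry little signal."""
    rng = random.Random(seed)
    return [" ".join(rng.choices(VOCAB, k=doc_len)) for _ in range(n_docs)]

def zipf_corpus(n_docs, doc_len=50, s=1.1, seed=0):
    """Zipf-like weights: frequent head words and a long rare tail."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** s for rank in range(len(VOCAB))]
    return [" ".join(rng.choices(VOCAB, weights=weights, k=doc_len))
            for _ in range(n_docs)]
```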

xiaofan-luan commented 1 month ago

Maybe we should set drop_ratio_build to 0 by default?

How much will performance drop if drop_ratio_build is set to 0?

zhengbuqian commented 1 month ago

> Maybe we should set drop_ratio_build to 0 by default?
>
> How much will performance drop if drop_ratio_build is set to 0?

@xiaofan-luan the default drop_ratio_build is already 0; Wenxing manually set it to 0.3 in this case.

zhengbuqian commented 1 month ago

/assign @zhuwenxing

zhuwenxing commented 1 month ago

When drop_ratio_build is not set in param, the default value is used. Testing with the BeIR benchmark and comparing against Lucene: for every metric, the larger the top-k, the more consistently Milvus's results fall below Lucene's.

dataset: nfcorpus

[2024-10-11 19:39:04 - INFO - ci_test]: milvus full text search result ({'NDCG@1': 0.42724, 'NDCG@10': 0.30977, 'NDCG@100': 0.26237, 'NDCG@1000': 0.29521}, {'MAP@1': 0.05857, 'MAP@10': 0.12013, 'MAP@100': 0.14085, 'MAP@1000': 0.14576}, {'Recall@1': 0.05857, 'Recall@10': 0.15202, 'Recall@100': 0.23542, 'Recall@1000': 0.36877}, {'P@1': 0.44272, 'P@10': 0.21734, 'P@100': 0.05864, 'P@1000': 0.01062}) (test_full_text_search.py:785)
[2024-10-11 19:39:04 - INFO - ci_test]: lucene full text search result ({'NDCG@1': 0.42724, 'NDCG@10': 0.30641, 'NDCG@100': 0.26821, 'NDCG@1000': 0.33108}, {'MAP@1': 0.05686, 'MAP@10': 0.11683, 'MAP@100': 0.13967, 'MAP@1000': 0.15019}, {'Recall@1': 0.05686, 'Recall@10': 0.14505, 'Recall@100': 0.25117, 'Recall@1000': 0.45723}, {'P@1': 0.44582, 'P@10': 0.21641, 'P@100': 0.06406, 'P@1000': 0.01728}) (test_full_text_search.py:786)
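
For reference, the metrics in these logs can be sketched with binary relevance as below. This is a simplified illustration, not the BeIR evaluation code; it makes the cut-off k explicit: at larger k, more of the ranking's long tail counts toward the score, which is exactly where pruned low-weight postings hurt (hence Recall@1000 and NDCG@1000 diverging the most).

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG@k: hits discounted by log2(rank + 1)."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```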
xiaofan-luan commented 2 weeks ago

This is a good catch.

The drop ratio has a big impact on top 1000, which makes sense (for many long-tail results we ignored some critical information).

What is the default drop ratio? Instead of simply ignoring some percentage of entries, we should probably ignore entries with weights under a certain threshold.

xiaofan-luan commented 2 weeks ago

@liliu-z @zhengbuqian it also shows that the drop ratio has its major drawback with very similar documents

liliu-z commented 2 weeks ago

> This is a good catch.
>
> The drop ratio has a big impact on top 1000, which makes sense (for many long-tail results we ignored some critical information).
>
> What is the default drop ratio? Instead of simply ignoring some percentage of entries, we should probably ignore entries with weights under a certain threshold.

This is on the roadmap. We need to drop based on the data distribution instead of a simple ratio: the goal is to drop outlier data rather than a fixed percentage of it @zhengbuqian

liliu-z commented 1 week ago

/assign @hhy3 Plz take a look

hhy3 commented 1 week ago

> /assign @hhy3 Plz take a look

Based on my previous experiments, drop-by-threshold is much better than drop-by-percentage. If knowhere sparse is implemented as drop-by-percentage, it should be changed to drop-by-threshold, IMO. With that approach a significant performance improvement can be gained with nearly no recall loss, provided results are refined.
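
The two strategies can be contrasted with a small hypothetical sketch; this is not Knowhere's actual implementation, just the idea on a sparse vector held as a term-to-weight dict. Drop-by-percentage removes the smallest fraction of entries regardless of their values, so it can discard substantial weights; drop-by-threshold removes only entries below an absolute weight, which bounds the per-term score error by the threshold:

```python
def drop_by_ratio(weights, ratio):
    """Drop the smallest `ratio` fraction of entries regardless of value."""
    if not weights or ratio <= 0:
        return dict(weights)
    n_drop = int(len(weights) * ratio)
    kept = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(kept[:len(weights) - n_drop])

def drop_by_threshold(weights, threshold):
    """Drop only entries below an absolute weight threshold; the score
    contribution lost per query term is bounded by the threshold."""
    return {term: w for term, w in weights.items() if w >= threshold}
```

On a vector like `{"a": 5.0, "b": 0.01, "c": 3.0, "d": 0.02}` the two agree, but on `{"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}` a 0.5 ratio throws away the still-significant weights 3.0 and 2.0, while a small threshold keeps them all.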

xiaofan-luan commented 1 week ago

I think dropping might lose critical information. How about:

  1. try quantization instead of dropping
  2. only drop if the quantized value is small and close to 0

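
As a hypothetical illustration of this idea (not an actual Milvus/Knowhere API), the sketch below uniformly quantizes a sparse vector's weights relative to its maximum and drops only the terms whose quantized value rounds to zero, combining points 1 and 2:

```python
def quantize_and_drop(weights, bits=8):
    """Uniformly quantize weights to `bits`-bit levels relative to the
    max weight; drop only terms whose quantized value rounds to 0."""
    if not weights:
        return {}
    levels = (1 << bits) - 1
    w_max = max(weights.values())
    if w_max <= 0:
        return {}
    scale = w_max / levels
    out = {}
    for term, w in weights.items():
        q = round(w / scale)
        if q > 0:  # keep everything except near-zero values
            out[term] = q * scale  # dequantized approximation
    return out
```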
hhy3 commented 1 week ago

Drop and quantization are two different aspects, and they can be combined. In this case, dropping can bring much more improvement than quantization. And with the drop-by-threshold method there is a theoretical upper bound on the error, so the loss of information is controllable.