milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Full-text search results compared against Lucene BM25 ground truth show recall rates ranging from 0.7 to 0.9 #36739

Open zhuwenxing opened 2 weeks ago

zhuwenxing commented 2 weeks ago

Is there an existing issue for this?

Environment

- Milvus version:zhengbuqian-doc-in-restful-d174d05-20241010
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

    id    word      sentence                                            paragraph                                           text                                                emb
    0     whom      stuff he like what oil.                             business order memory future. recent would mou...  yourself speech husband. region blue herself h...  [0.6023883783276387, 0.8479405633974526, 0.395...
    1     truth     local garden state poor with.                       skin mind up economy involve. accept much prep...  oil research politics police spring. choice ec...  [0.8383437112215729, 0.06055977783095379, 0.73...
    2     join      note pm sit myself debate likely.                   ahead difficult hope toward even. training mee...  pass town whatever condition eat owner. road s...  [0.36961477489457273, 0.12855484140848816, 0.1...
    3     task      begin consider technology kind choose care foo...   four than realize worker. physical the letter ...  body kind approach often talk seek great. mont...  [0.011019903782853002, 0.4413668100371435, 0.0...
    4     wife      how political standard.                             law base know mrs window yeah. option bit citi...  student where effect plant. rise discover migh...  [0.7389940504402016, 0.41201826565585764, 0.18...
    ...   ...       ...                                                 ...                                                 ...                                                 ...
    4995  fill      note stand effect daughter pm rock newspaper.       forward cut tough professor writer fund. coach...  trip opportunity read fire rule.\ntough others...  [0.6961687444442494, 0.7178086302497875, 0.118...
    4996  democrat  similar beautiful personal drop customer.           shake rich figure someone doctor manager somet...  thus forget see. or section bring camera would...  [0.8663225967105492, 0.316593111025503, 0.6386...
    4997  set       cut apply nor sell.                                 debate strong consider though field risk struc...  need remain another employee just. interesting...  [0.053935738989942195, 0.6391806368953511, 0.7...
    4998  minute    already event spring since.                         lead the media responsibility. manage though d...  resource activity lawyer leg friend. big end o...  [0.8087548926914436, 0.17760461770648828, 0.28...
    4999  town      american dark help glass house.                     term military party day its. education to more...  by some energy think other beat. always fear t...  [0.6603326843401671, 0.7072375978621074, 0.748...

    [5000 rows x 6 columns] (test_search.py:13243)
    BM25S Retrieve:   0%| | 0/10 [00:00<?, ?it/s]
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.96 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.94 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.77 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.95 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.97 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.83 (test_search.py:13292)
    [2024-10-10 16:40:26 - INFO - ci_test]: recall 0.81 (test_search.py:13292)
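The recall figures above are computed as the set overlap between Milvus' top-k texts and the BM25 ground-truth top-k; a minimal sketch of that computation:

```python
def topk_recall(retrieved, expected):
    """Fraction of the expected top-k results that appear in the retrieved top-k."""
    retrieved, expected = set(retrieved), set(expected)
    return len(retrieved & expected) / len(expected)
```

With `limit=100`, a recall of 0.77 means 77 of the 100 ground-truth texts were present in Milvus' 100 results for that query.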

Expected Behavior

Full-text search should achieve results essentially consistent with Lucene BM25.

Steps To Reproduce

No response

Milvus Log

Test code:

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("enable_partition_key", [True, False])
    @pytest.mark.parametrize("enable_inverted_index", [True, False])
    @pytest.mark.parametrize("tokenizer", ["default"])
    def test_search_with_full_text_search(
            self, tokenizer, enable_inverted_index, enable_partition_key
    ):
        """
        target: test full text search
        method: 1. enable full text search and insert data with varchar
                2. search with text
                3. verify the result
        expected: full text search successfully and result is correct
        """
        tokenizer_params = {
            "tokenizer": tokenizer,
        }
        dim = 128
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
            FieldSchema(
                name="word",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
                is_partition_key=enable_partition_key,
            ),
            FieldSchema(
                name="sentence",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="paragraph",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="text",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="text_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
        ]
        schema = CollectionSchema(fields=fields, description="test collection")
        bm25_function = Function(
            name="text_bm25_emb",
            function_type=FunctionType.BM25,
            input_field_names=["text"],
            output_field_names=["text_sparse_emb"],
            params={},
        )
        schema.add_function(bm25_function)
        data_size = 5000
        collection_w = self.init_collection_wrap(
            name=cf.gen_unique_str(prefix), schema=schema
        )
        fake = fake_en
        if tokenizer == "jieba":
            language = "zh"
            fake = fake_zh
        else:
            language = "en"

        data = [
            {
                "id": i,
                "word": fake.word().lower(),
                "sentence": fake.sentence().lower(),
                "paragraph": fake.paragraph().lower(),
                "text": fake.text().lower(),
                "emb": [random.random() for _ in range(dim)],
            }
            for i in range(data_size)
        ]
        df = pd.DataFrame(data)
        corpus = df["text"].to_list()
        log.info(f"dataframe\n{df}")
        batch_size = 5000
        for i in range(0, len(df), batch_size):
            # Python slicing clamps past the end of the list, so no boundary check is needed
            collection_w.insert(data[i : i + batch_size])
            collection_w.flush()
        collection_w.create_index(
            "emb",
            {"index_type": "HNSW", "metric_type": "L2", "params": {"M": 16, "efConstruction": 500}},
        )
        collection_w.create_index(
            "text_sparse_emb",
            {
                "index_type": "SPARSE_INVERTED_INDEX",
                "metric_type": "BM25",
                "params": {
                    "drop_ratio_build": 0.3,
                    "bm25_k1": 1.5,  # if k1 and b are left unspecified at index params,
                    "bm25_b": 0.75,  # values set in function params will be used.
                }
            }
        )
        if enable_inverted_index:
            collection_w.create_index("text", {"index_type": "INVERTED"})
        collection_w.load()
        nq = 10
        limit = 100
        search_data = [fake.text().lower() for _ in range(nq)]
        res_list, _ = collection_w.search(
                        data=search_data,
                        anns_field="text_sparse_emb",
                        param={},
                        limit=limit,
                        output_fields=["id", "text", "text_sparse_emb"])

        results, scores = cf.get_bm25_ground_truth(corpus, search_data, top_k=limit)
        for i in range(len(res_list)):
            res = res_list[i]
            log.info(f"res len {len(res)} res {res}")
            assert len(res) == limit
            text_get = [r.entity.text for r in res]
            text_expected = [results[i][j] for j in range(limit)]
            log.info(f"text_get {text_get}")
            log.info(f"text_expected {text_expected}")
            # get recall
            recall = len(set(text_get).intersection(set(text_expected))) / len(set(text_expected))
            log.info(f"recall {recall}")
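`cf.get_bm25_ground_truth` above is a helper internal to the Milvus test suite. To make the comparison methodology concrete, here is a minimal, self-contained BM25 ranking sketch (whitespace tokenization, Lucene-style IDF); the function name and defaults are illustrative, not the helper's actual API:

```python
import math
from collections import Counter

def bm25_rank(corpus, query, k1=1.5, b=0.75, top_k=100):
    """Return indices of the top_k corpus documents by BM25 score for the query."""
    docs = [doc.split() for doc in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # document frequency per term
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue
            # Lucene-style IDF: log(1 + (n - df + 0.5) / (df + 0.5))
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # BM25 term frequency saturation with length normalization
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])[:top_k]
```

Recall is then the overlap between this ranking and Milvus' top-k results, as computed at the end of the test above.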

Anything else?

No response

zhengbuqian commented 2 weeks ago

How about changing drop_ratio_build to 0? A non-zero drop_ratio_build inevitably leads to information loss and lower accuracy.

Such information loss is even worse with randomly generated data: if supported, try to generate the corpus with a skewed distribution (some words occur more or less frequently than others, as in natural language) instead of an even one (all words have the same chance of occurring).
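The distribution suggestion can be sketched as follows; the vocabulary size, sentence length, and exponent here are hypothetical, and Zipf-like weighting is one common way to get a skewed word-frequency distribution:

```python
import random

# Hypothetical vocabulary for illustration
vocab = [f"word{i}" for i in range(1000)]

def even_sentence(n_words=20):
    # every word has the same chance of occurring
    return " ".join(random.choice(vocab) for _ in range(n_words))

def zipf_sentence(n_words=20, s=1.2):
    # the word at rank r is sampled with weight proportional to 1 / r^s,
    # so a few words dominate and most are rare, as in natural text
    weights = [1.0 / (r + 1) ** s for r in range(len(vocab))]
    return " ".join(random.choices(vocab, weights=weights, k=n_words))
```

With an even distribution every term has a near-identical document frequency, so BM25 term weights cluster tightly and dropping the bottom 30% at build time discards proportionally more of the signal.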

xiaofan-luan commented 2 weeks ago

Maybe we should set drop_ratio_build to 0 by default?

How much would performance drop if drop_ratio_build were set to 0?

zhengbuqian commented 1 week ago

> Maybe we should set drop_ratio_build to 0 by default?
>
> How much would performance drop if drop_ratio_build were set to 0?

@xiaofan-luan the default drop_ratio_build is 0. Wenxing manually set drop_ratio_build to 0.3 in this case.

zhengbuqian commented 1 week ago

/assign @zhuwenxing

zhuwenxing commented 1 week ago

When drop_ratio_build is not set in param, the default parameter values are used. Testing with the BeIR benchmark and comparing against Lucene: for every metric, Milvus' results are consistently lower than Lucene's at larger top-k values.

dataset: nfcorpus

Milvus vs. Lucene full-text search results (test_full_text_search.py:785-786):

    Metric       Milvus     Lucene
    NDCG@1       0.42724    0.42724
    NDCG@10      0.30977    0.30641
    NDCG@100     0.26237    0.26821
    NDCG@1000    0.29521    0.33108
    MAP@1        0.05857    0.05686
    MAP@10       0.12013    0.11683
    MAP@100      0.14085    0.13967
    MAP@1000     0.14576    0.15019
    Recall@1     0.05857    0.05686
    Recall@10    0.15202    0.14505
    Recall@100   0.23542    0.25117
    Recall@1000  0.36877    0.45723
    P@1          0.44272    0.44582
    P@10         0.21734    0.21641
    P@100        0.05864    0.06406
    P@1000       0.01062    0.01728
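For context, NDCG@k (the headline metric above) discounts graded relevance by rank position; a minimal sketch of the computation, not the BeIR evaluator itself:

```python
import math

def dcg_at_k(relevances, k):
    # DCG = sum of rel_i / log2(i + 1) over the top-k positions (1-indexed)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # normalize by the DCG of the ideal (descending-relevance) ordering
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
```

Because the log discount flattens out at deeper ranks, NDCG@1000 is dominated by how many relevant documents are retrieved at all, which is where the Recall@1000 gap (0.36877 vs 0.45723) shows up.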