milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30k stars 2.88k forks source link

[Bug]: When full-text search is enabled (or the schema contains the BM25 function), and dynamic fields are also enabled, inserting correct data will still result in an error. #36986

Open zhuwenxing opened 9 hours ago

zhuwenxing commented 9 hours ago

Is there an existing issue for this?

Environment

- Milvus version:20750c0-dev
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):2.5.0rc96
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024-10-18 14:24:32 - INFO - ci_test]: ################################################################################ (conftest.py:232)
[2024-10-18 14:24:32 - INFO - ci_test]: [initialize_milvus] Log cleaned up, start testing... (conftest.py:233)
[2024-10-18 14:24:32 - INFO - ci_test]: [setup_class] Start setup class... (client_base.py:40)
[2024-10-18 14:24:32 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:46)
[2024-10-18 14:24:32 - INFO - ci_test]: pymilvus version: 2.5.0rc96 (client_base.py:47)
[2024-10-18 14:24:32 - INFO - ci_test]: [setup_method] Start setup test case test_insert_for_full_text_search_enable_dynamic_field. (client_base.py:49)
-------------------------------- live log call ---------------------------------
[2024-10-18 14:24:32 - INFO - ci_test]: server version: 20750c0-dev (client_base.py:165)
[2024-10-18 14:24:34 - ERROR - pymilvus.decorators]: RPC error: [insert_rows], <ParamError: (code=The data fields number is not match with schema., message=)>, <Time:{'RPC start': '2024-10-18 14:24:33.552853', 'RPC error': '2024-10-18 14:24:34.595584'}> (decorators.py:140)
[2024-10-18 14:24:34 - ERROR - ci_test]: Traceback (most recent call last):
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
    res = func(*args, **_kwargs)
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 63, in api_request
    return func(*arg, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 507, in insert
    return conn.insert_rows(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 141, in handler
    raise e from e
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 137, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 176, in handler
    return func(self, *args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 116, in handler
    raise e from e
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 86, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 493, in insert_rows
    request = self._prepare_row_insert_request(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 519, in _prepare_row_insert_request
    return Prepare.row_insert_param(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 587, in row_insert_param
    return cls._parse_row_request(request, fields_info, enable_dynamic, entities)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 481, in _parse_row_request
    raise ParamError(ExceptionsMessage.FieldsNumInconsistent)
pymilvus.exceptions.ParamError: <ParamError: (code=The data fields number is not match with schema., message=)>
 (api_request.py:45)
[2024-10-18 14:24:34 - ERROR - ci_test]: (api_response) : <ParamError: (code=The data fields number is not match with schema., message=)> (api_request.py:46)
FAILED
testcases/test_full_text_search.py:518 (TestInsertWithFullTextSearch.test_insert_for_full_text_search_enable_dynamic_field[default-en-False-True])
self = <test_full_text_search.TestInsertWithFullTextSearch object at 0x12e371be0>
tokenizer = 'default', text_lang = 'en', nullable = False
enable_dynamic_field = True

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("enable_dynamic_field", [True])
    @pytest.mark.parametrize("nullable", [False])
    @pytest.mark.parametrize("text_lang", ["en"])
    @pytest.mark.parametrize("tokenizer", ["default"])
    def test_insert_for_full_text_search_enable_dynamic_field(self, tokenizer, text_lang, nullable, enable_dynamic_field):
        """
        target: test full text search
        method: 1. enable full text search and insert data with varchar
                2. search with text
                3. verify the result
        expected: full text search successfully and result is correct
        """
        tokenizer_params = {
            "tokenizer": tokenizer,
        }
        dim = 128
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
            FieldSchema(
                name="word",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
                is_partition_key=True,
            ),
            FieldSchema(
                name="sentence",
                dtype=DataType.VARCHAR,
                max_length=65535,
                nullable=nullable,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="paragraph",
                dtype=DataType.VARCHAR,
                max_length=65535,
                nullable=nullable,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="text",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="text_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
        ]
        schema = CollectionSchema(fields=fields, description="test collection", enable_dynamic_field=enable_dynamic_field)
        bm25_function = Function(
            name="text_bm25_emb",
            function_type=FunctionType.BM25,
            input_field_names=["text"],
            output_field_names=["text_sparse_emb"],
            params={},
        )
        schema.add_function(bm25_function)
        data_size = 5000
        collection_w = self.init_collection_wrap(
            name=cf.gen_unique_str(prefix), schema=schema
        )
        fake = fake_en
        if text_lang == "zh":
            fake = fake_zh
        elif text_lang == "de":
            fake = Faker("de_DE")
        elif text_lang == "hybrid":
            fake = Faker()

        if nullable:
            data = [
                {
                    "id": i,
                    "word": fake.word().lower(),
                    "sentence": fake.sentence().lower() if random.random() < 0.5 else None,
                    "paragraph": fake.paragraph().lower() if random.random() < 0.5 else None,
                    "text": fake.text().lower(),  # function input should not be None
                    "emb": [random.random() for _ in range(dim)],
                    f"dynamic_field_{i}": f"dynamic_value_{i}"
                }
                for i in range(data_size)
            ]
        else:
            data = [
                {
                    "id": i,
                    "word": fake.word().lower(),
                    "sentence": fake.sentence().lower(),
                    "paragraph": fake.paragraph().lower(),
                    "text": fake.text().lower(),
                    "emb": [random.random() for _ in range(dim)],
                    f"dynamic_field_{i}": f"dynamic_value_{i}"
                }
                for i in range(data_size)
            ]
        if text_lang == "hybrid":
            hybrid_data = []
            for i in range(data_size):
                fake = random.choice([fake_en, fake_zh, Faker("de_DE")])
                tmp = {
                    "id": i,
                    "word": fake.word().lower(),
                    "sentence": fake.sentence().lower(),
                    "paragraph": fake.paragraph().lower(),
                    "text": fake.text().lower(),
                    "emb": [random.random() for _ in range(dim)],
                    f"dynamic_field_{i}": f"dynamic_value_{i}"
                }
                hybrid_data.append(tmp)
            data = hybrid_data + data
        # df = pd.DataFrame(data)
        # log.info(f"dataframe\n{df}")
        batch_size = 5000
        for i in range(0, len(data), batch_size):
>           collection_w.insert(
                data[i: i + batch_size]
                if i + batch_size < len(data)
                else data[i: len(data)]
            )

test_full_text_search.py:638: 

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

It should be a problem with pymilvus's check. The error is thrown by pymilvus, not the server.

zhuwenxing commented 8 hours ago

/assign @zhengbuqian PTAL