milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: With full-text search enabled (or any function added to the schema), upsert still requires a value for the function's output field; otherwise it fails with `Unexpected error: [upsert_rows], 'text_sparse_emb'`. #37021

Open zhuwenxing opened 3 days ago

zhuwenxing commented 3 days ago

Is there an existing issue for this?

Environment

- Milvus version: master-346510e-20241021
- Deployment mode (standalone or cluster):
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2): 2.5.0rc101
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

self = <test_full_text_search.TestUpsertWithFullTextSearchNegative object at 0x1300cb7c0>, tokenizer = 'default', nullable = False

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("nullable", [False, True])
    @pytest.mark.parametrize("tokenizer", ["default"])
    def test_upsert_with_full_text_search(self, tokenizer, nullable):
        """
        target: test full text search
        method: 1. enable full text search and insert data with varchar
                2. search with text
                3. verify the result
        expected: full text search successfully and result is correct
        """
        if nullable:
            pytest.xfail(reason="nullable field not support yet")

        tokenizer_params = {
            "tokenizer": tokenizer,
        }
        dim = 128
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
            FieldSchema(
                name="word",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
                is_partition_key=True,
            ),
            FieldSchema(
                name="sentence",
                dtype=DataType.VARCHAR,
                max_length=65535,
                nullable=nullable,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="paragraph",
                dtype=DataType.VARCHAR,
                max_length=65535,
                nullable=nullable,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="text",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="text_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
        ]
        schema = CollectionSchema(fields=fields, description="test collection")
        bm25_function = Function(
            name="text_bm25_emb",
            function_type=FunctionType.BM25,
            input_field_names=["text"],
            output_field_names=["text_sparse_emb"],
            params={},
        )
        schema.add_function(bm25_function)
        data_size = 5000
        collection_w = self.init_collection_wrap(
            name=cf.gen_unique_str(prefix), schema=schema
        )
        fake = fake_en
        language = "en"
        if tokenizer == "jieba":
            fake = fake_zh
            language = "zh"

        if nullable:
            data = [
                {
                    "id": i,
                    "word": fake.word().lower() if random.random() < 0.5 else None,
                    "sentence": fake.sentence().lower() if random.random() < 0.5 else None,
                    "paragraph": fake.paragraph().lower() if random.random() < 0.5 else None,
                    "text": fake.text().lower(),  # function input should not be None
                    "emb": [random.random() for _ in range(dim)],
                }
                for i in range(data_size // 2, data_size)
            ]
        else:
            data = [
                {
                    "id": i,
                    "word": fake.word().lower(),
                    "sentence": fake.sentence().lower(),
                    "paragraph": fake.paragraph().lower(),
                    "text": fake.text().lower(),
                    "emb": [random.random() for _ in range(dim)],
                }
                for i in range(data_size)
            ]
        df = pd.DataFrame(data)
        log.info(f"dataframe\n{df}")
        batch_size = 5000
        for i in range(0, len(df), batch_size):
            collection_w.insert(
                data[i: i + batch_size]
                if i + batch_size < len(df)
                else data[i: len(df)]
            )
            collection_w.flush()
        collection_w.create_index(
            "emb",
            {"index_type": "HNSW", "metric_type": "L2", "params": {"M": 16, "efConstruction": 500}},
        )
        collection_w.create_index(
            "text_sparse_emb",
            {
                "index_type": "SPARSE_INVERTED_INDEX",
                "metric_type": "BM25",
                "params": {
                    "drop_ratio_build": 0.3,
                    "bm25_k1": 1.5,
                    "bm25_b": 0.75,
                }
            }
        )
        collection_w.create_index("text", {"index_type": "INVERTED"})
        collection_w.load()
        num_entities = collection_w.num_entities
        res, _ = collection_w.query(
            expr="",
            output_fields=["count(*)"]
        )
        count = res[0]["count(*)"]
        assert len(data) == num_entities
        assert len(data) == count

        # upsert in half of the data
        upsert_data = [
            {
                "id": i,
                "word": fake.word().lower(),
                "sentence": fake.sentence().lower(),
                "paragraph": fake.paragraph().lower(),
                "text": fake.text().lower(),
                "emb": [random.random() for _ in range(dim)],
            }
            for i in range(data_size // 2)
        ]
        upsert_data += data[data_size // 2:]
        for i in range(0, len(upsert_data), batch_size):
>           collection_w.upsert(
                upsert_data[i: i + batch_size]
                if i + batch_size < len(upsert_data)
                else upsert_data[i: len(upsert_data)]
            )

testcases/test_full_text_search.py:1383: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
utils/wrapper.py:33: in inner_wrapper
    res, result = func(*args, **kwargs)
base/collection_wrapper.py:338: in upsert
    check_result = ResponseChecker(res, func_name, check_task, check_items, check, **kwargs).run()
check/func_check.py:34: in run
    result = self.assert_succ(self.succ, True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <check.func_check.ResponseChecker object at 0x1300563d0>, actual = False, expect = True

    def assert_succ(self, actual, expect):
>       assert actual is expect, f"Response of API {self.func_name} expect {expect}, but got {actual}"
E       AssertionError: Response of API upsert expect True, but got False

check/func_check.py:116: AssertionError
------------------------------------------------------------------------------------------------------- Captured log setup --------------------------------------------------------------------------------------------------------
[2024-10-21 15:32:36 - INFO - ci_test]: [setup_class] Start setup class... (client_base.py:41)
[2024-10-21 15:32:36 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:47)
[2024-10-21 15:32:36 - INFO - ci_test]: pymilvus version: 2.5.0rc101 (client_base.py:48)
[2024-10-21 15:32:36 - INFO - ci_test]: [setup_method] Start setup test case test_upsert_with_full_text_search. (client_base.py:49)
-------------------------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------------------------
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_response) : False  (api_request.py:37)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.104.20.97', 'port': 19530} (api_request.py:62)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:32:36 - INFO - ci_test]: server version: 346510e-dev (client_base.py:166)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['full_text_search_collection_38VQKkPZ', {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': ......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_response) : <Collection>:
-------------
<name>: full_text_search_collection_38VQKkPZ
<description>: test collection
<schema>: {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'de......  (api_request.py:37)
[2024-10-21 15:32:37 - INFO - ci_test]: dataframe
        id        word                                       sentence                                          paragraph                                               text                                                emb
0        0   scientist  interesting many boy oil be opportunity when.  carry form develop. cost rest dream energy. ch...  image front name provide should work. dark add...  [0.03767875334182125, 0.3989433904957238, 0.10...
1        1        plan      tax leave continue man trouble including.                  ability possible growth shoulder.  remember mrs couple hundred those. future phon...  [0.1468846081354882, 0.8416706721768779, 0.261...
2        2  population           positive later fight once knowledge.  hand hot attorney scene hard and. environmenta...  financial wife no stay listen top. very book r...  [0.31837460695144915, 0.31364729506342315, 0.8...
3        3       smile   partner kitchen floor until anything strong.  ball deal large concern fact institution note....  fly management step relationship officer. atte...  [0.30731171101897714, 0.5574160371688782, 0.87...
4        4   attention                               with black drop.  message too voice night. can control pm politi...  those keep lawyer there article these. somethi...  [0.6528922817343126, 0.8972341979186024, 0.373...
...    ...         ...                                            ...                                                ...                                                ...                                                ...
4995  4995    marriage       develop site live cut threat commercial.       defense may perform. my glass town although.  likely approach answer range. camera carry cat...  [0.25508427017015445, 0.6554567409136012, 0.32...
4996  4996        line                address vote scene eye medical.  society officer decision she bag step. bank st...  eye military dream will protect pass term pare...  [0.6205571146207843, 0.31696026115760145, 0.07...
4997  4997       hotel                           technology hold him.  social debate whether almost same different. f...  human cut table benefit set deep. ability ask ...  [0.23176432966854688, 0.1439607792449047, 0.85...
4998  4998        hair                      color husband over eight.  base safe speak against. call particularly str...  marriage unit office age pm. address under con...  [0.9673645756355516, 0.4313184485314254, 0.842...
4999  4999        wait     listen yes purpose none several walk song.  area across show require until police. easy lo...  strategy animal management base fast. radio dr...  [0.34513015065353203, 0.4114163562913943, 0.01...

[5000 rows x 6 columns] (test_full_text_search.py:1333)
[2024-10-21 15:32:37 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[{'id': 0, 'word': 'scientist', 'sentence': 'interesting many boy oil be opportunity when.', 'paragraph': 'carry form develop. cost rest dream energy. church fill their heart. our personal all.', 'text': 'image front name provide should work. dark address several. determine see magazine hear.\nsens......, kwargs: {'timeout': 180} (api_request.py:62)
[2024-10-21 15:32:38 - DEBUG - ci_test]: (api_response) : (insert count: 5000, delete count: 0, upsert count: 0, timestamp: 453376988575957001, success count: 5000, err count: 0  (api_request.py:37)
[2024-10-21 15:32:38 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-10-21 15:32:41 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:32:41 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['emb', {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 16, 'efConstruction': 500}}, 1200], kwargs: {'index_name': ''} (api_request.py:62)
[2024-10-21 15:33:43 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)
[2024-10-21 15:33:43 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['text_sparse_emb', {'index_type': 'SPARSE_INVERTED_INDEX', 'metric_type': 'BM25', 'params': {'drop_ratio_build': 0.3, 'bm25_k1': 1.5, 'bm25_b': 0.75}}, 1200], kwargs: {'index_name': ''} (api_request.py:62)
[2024-10-21 15:33:59 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)
[2024-10-21 15:33:59 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['text', {'index_type': 'INVERTED'}, 1200], kwargs: {'index_name': ''} (api_request.py:62)
[2024-10-21 15:34:38 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)
[2024-10-21 15:34:38 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ['', ['count(*)'], None, 180], kwargs: {} (api_request.py:62)
[2024-10-21 15:34:43 - DEBUG - ci_test]: (api_response) : data: ["{'count(*)': 5000}"]   (api_request.py:37)
[2024-10-21 15:34:43 - DEBUG - ci_test]: (api_request)  : [Collection.upsert] args: [[{'id': 0, 'word': 'military', 'sentence': 'various couple role structure leader.', 'paragraph': 'long voice our. community bit writer usually camera.', 'text': 'particularly onto market claim. possible above charge admit.\nrequire quality too push few past. weight stage here compare.', 'emb': [0.0......, kwargs: {} (api_request.py:62)
[2024-10-21 15:34:43 - ERROR - pymilvus.decorators]: Unexpected error: [upsert_rows], 'text_sparse_emb', <Time: {'RPC start': '2024-10-21 15:34:43.746189', 'Exception': '2024-10-21 15:34:43.746378'}> (decorators.py:158)
[2024-10-21 15:34:43 - ERROR - ci_test]: Traceback (most recent call last):
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 137, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 176, in handler
    return func(self, *args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 118, in handler
    raise e from e
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 86, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 715, in upsert_rows
    request = self._prepare_row_upsert_request(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 696, in _prepare_row_upsert_request
    return Prepare.row_upsert_param(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 604, in row_upsert_param
    return cls._parse_upsert_row_request(request, fields_info, enable_dynamic, entities)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 520, in _parse_upsert_row_request
    field_info, field_data = field_info_map[key], fields_data[key]
KeyError: 'text_sparse_emb'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
    res = func(*args, **_kwargs)
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 63, in api_request
    return func(*arg, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 635, in upsert
    res = conn.upsert_rows(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 159, in handler
    raise MilvusException(message=f"Unexpected error, message=<{e!s}>") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Unexpected error, message=<'text_sparse_emb'>)>
 (api_request.py:45)
[2024-10-21 15:34:43 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Unexpected error, message=<'text_sparse_emb'>)> (api_request.py:46)

Expected Behavior

Upsert should behave the same as insert: there should be no need to supply a value for the function's output field (here `text_sparse_emb`).
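A minimal sketch of that expectation, assuming a simplified client-side row packer. The names `FieldInfo`, `is_function_output`, and `pack_rows` are hypothetical and are not pymilvus internals; the point is only that a function's output field is skipped rather than required, for insert and upsert alike.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class FieldInfo:
    """Hypothetical, simplified stand-in for a schema field description."""
    name: str
    is_function_output: bool = False


def pack_rows(fields: List[FieldInfo], rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Collect row values column by column, skipping function output fields."""
    # Only fields the user is expected to supply are required.
    required = [f.name for f in fields if not f.is_function_output]
    columns: Dict[str, List[Any]] = {name: [] for name in required}
    for row in rows:
        for name in required:
            if name not in row:
                raise ValueError(f"missing value for field {name!r}")
            columns[name].append(row[name])
    return columns


fields = [
    FieldInfo("id"),
    FieldInfo("text"),
    # Output of the BM25 function: the server fills it in, the client does not.
    FieldInfo("text_sparse_emb", is_function_output=True),
]
rows = [{"id": 0, "text": "hello world"}, {"id": 1, "text": "full text search"}]
columns = pack_rows(fields, rows)
```

Under this assumption the same packer would serve both request types, so rows without `text_sparse_emb` would be accepted by upsert exactly as they are by insert.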

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

zhuwenxing commented 3 days ago

/assign @zhengbuqian

xiaofan-luan commented 2 days ago

nice catch! @zhuwenxing

zhengbuqian commented 1 day ago

The error was introduced in https://github.com/milvus-io/pymilvus/pull/2303, where I tried to simplify the logic that checks whether the insert/request data matches the schema. There were 2 lines I forgot to update, and our CI in both pymilvus and milvus didn't catch that.

Updated in https://github.com/milvus-io/pymilvus/pull/2309.
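For readers following the traceback, the `KeyError: 'text_sparse_emb'` comes from a dictionary lookup keyed by field name. The toy below reproduces that class of failure under an assumption: that the lookup table omits function output fields while the loop still visits every schema field. It is not the actual pymilvus code, and `pack_buggy` and `pack_fixed` are hypothetical names.

```python
from typing import Any, Dict, List

schema_fields = ["id", "text", "text_sparse_emb"]
function_outputs = {"text_sparse_emb"}  # filled in by the BM25 function

row = {"id": 0, "text": "hello world"}


def pack_buggy(row: Dict[str, Any]) -> Dict[str, List[Any]]:
    # Table holds only user-supplied fields...
    fields_data: Dict[str, List[Any]] = {
        name: [] for name in schema_fields if name not in function_outputs
    }
    # ...but the loop still visits the function output field.
    for key in schema_fields:
        fields_data[key].append(row[key])  # KeyError on "text_sparse_emb"
    return fields_data


def pack_fixed(row: Dict[str, Any]) -> Dict[str, List[Any]]:
    fields_data: Dict[str, List[Any]] = {
        name: [] for name in schema_fields if name not in function_outputs
    }
    # Iterate over the same set of fields the table was built from.
    for key in fields_data:
        fields_data[key].append(row[key])
    return fields_data


try:
    pack_buggy(row)
    buggy_error = None
except KeyError as exc:
    buggy_error = exc.args[0]

fixed = pack_fixed(row)
```

A test that upserts rows without the function output field, as the failing case in this issue does, would exercise this path in CI.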

zhengbuqian commented 22 hours ago

/assign @zhuwenxing
/unassign

Please verify, thanks!