milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: When full-text search is enabled, an "Insert missed an field `text_sparse_emb` to collection without set nullable==true or set default_value" error occurs during insertion. #36860

Closed: zhuwenxing closed this issue 1 month ago

zhuwenxing commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 5ec4163-dev
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0rc95
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

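text_sparse_emb is an output field of a function, so the inserted rows should not need to include data for it. The insert is nevertheless rejected with a DataNotMatchException: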

[2024-10-14 20:37:59 - INFO - ci_test]: [initialize_milvus] Log cleaned up, start testing... (conftest.py:233)
[2024-10-14 20:37:59 - INFO - ci_test]: [setup_class] Start setup class... (client_base.py:40)
[2024-10-14 20:37:59 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:46)
[2024-10-14 20:37:59 - INFO - ci_test]: pymilvus version: 2.5.0rc95 (client_base.py:47)
[2024-10-14 20:37:59 - INFO - ci_test]: [setup_method] Start setup test case test_full_text_search_default. (client_base.py:49)
-------------------------------- live log call ---------------------------------
[2024-10-14 20:37:59 - INFO - ci_test]: server version: 5ec4163-dev (client_base.py:165)
[2024-10-14 20:37:59 - INFO - ci_test]: dataframe
        id       word                                           sentence                                          paragraph                                               text                                                emb
0        0     career                                      them at some.  stage us ok office sit rate. think cold marria...  election new risk along. start admit parent be...  [0.38130239080312345, 0.4129195604198841, 0.94...
1        1       same                       whose idea expect party far.  nothing water bank full close. drop strong fiv...  could popular world clearly lot. method star o...  [0.4998943407873816, 0.9610945003226037, 0.609...
2        2    mention  try offer citizen because discuss station arti...  three order rather network fund none. owner co...  month something their. all side focus once onl...  [0.625669641698085, 0.7938970243204132, 0.5171...
3        3  necessary        discuss share month establish they account.  day financial red ahead watch design. notice r...  special moment fire loss best pick. mr full pl...  [0.8634757822550295, 0.1427588550215242, 0.572...
4        4       that                       account guess live continue.  worry page night design. discussion will road ...  field full include five middle goal. specific ...  [0.6197244834058141, 0.5557701233091245, 0.193...
...    ...        ...                                                ...                                                ...                                                ...                                                ...
4995  4995    mention  method finish show present of money everything...  none keep stage at him others herself enjoy. c...  say traditional view term. per admit ability e...  [0.7785692289302282, 0.7533609272719551, 0.788...
4996  4996  agreement                       continue probably per class.  season structure pull defense concern pay figu...  happen what guess and personal year three. fou...  [0.1843558041570924, 0.04934656676373783, 0.26...
4997  4997       game       mrs trial choice evening economy first drug.  word value nation past race have happen. toget...  force go along represent skin. meet threat fly...  [0.31889645245729215, 0.18197038708099578, 0.9...
4998  4998       read                    mean image western detail also.  agent night skill our boy. down real power ite...  themselves writer themselves list realize appr...  [0.37054949583512764, 0.3374745492036224, 0.19...
4999  4999    country  type conference become career value sense scor...  hundred matter tend ground anyone guy now baby...  pass adult effect school while benefit east he...  [0.05222852868069561, 0.6383223384677691, 0.85...

[5000 rows x 6 columns] (test_full_text_search.py:1465)
[2024-10-14 20:38:00 - INFO - ci_test]: Analyze document cost time: 0.20666980743408203 (common_func.py:340)
[2024-10-14 20:38:00 - ERROR - pymilvus.decorators]: RPC error: [insert_rows], <DataNotMatchException: (code=1, message=Insert missed an field `text_sparse_emb` to collection without set nullable==true or set default_value)>, <Time:{'RPC start': '2024-10-14 20:38:00.579745', 'RPC error': '2024-10-14 20:38:00.592640'}> (decorators.py:140)
[2024-10-14 20:38:00 - ERROR - ci_test]: Traceback (most recent call last):
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
    res = func(*args, **_kwargs)
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 63, in api_request
    return func(*arg, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 507, in insert
    return conn.insert_rows(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 141, in handler
    raise e from e
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 137, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 176, in handler
    return func(self, *args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 116, in handler
    raise e from e
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 86, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 493, in insert_rows
    request = self._prepare_row_insert_request(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 519, in _prepare_row_insert_request
    return Prepare.row_insert_param(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 583, in row_insert_param
    return cls._parse_row_request(request, fields_info, enable_dynamic, entities)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 443, in _parse_row_request
    raise DataNotMatchException(
pymilvus.exceptions.DataNotMatchException: <DataNotMatchException: (code=1, message=Insert missed an field `text_sparse_emb` to collection without set nullable==true or set default_value)>
 (api_request.py:45)
[2024-10-14 20:38:00 - ERROR - ci_test]: (api_response) : <DataNotMatchException: (code=1, message=Insert missed an field `text_sparse_emb` to collection without set nullable==true or set default_value)> (api_request.py:46)
FAILED
testcases/test_full_text_search.py:1375 (TestSearchWithFullTextSearch.test_full_text_search_default[default-None-SPARSE_INVERTED_INDEX-True-True-0])
self = <test_full_text_search.TestSearchWithFullTextSearch object at 0x136912220>
tokenizer = 'default', expr = None, enable_inverted_index = True
enable_partition_key = True, empty_percent = 0
index_type = 'SPARSE_INVERTED_INDEX'

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("empty_percent", [0])
    @pytest.mark.parametrize("enable_partition_key", [True])
    @pytest.mark.parametrize("enable_inverted_index", [True])
    @pytest.mark.parametrize("index_type", ["SPARSE_INVERTED_INDEX"])
    @pytest.mark.parametrize("expr", [None, "text_match", "id_range"])
    @pytest.mark.parametrize("tokenizer", ["default"])
    def test_full_text_search_default(
            self, tokenizer, expr, enable_inverted_index, enable_partition_key, empty_percent, index_type
    ):
        """
        target: test full text search
        method: 1. enable full text search and insert data with varchar
                2. search with text
                3. verify the result
        expected: full text search successfully and result is correct
        """
        tokenizer_params = {
            "tokenizer": tokenizer,
        }
        dim = 128
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
            FieldSchema(
                name="word",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
                is_partition_key=enable_partition_key,
            ),
            FieldSchema(
                name="sentence",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="paragraph",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="text",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                enable_match=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="text_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
        ]
        schema = CollectionSchema(fields=fields, description="test collection")
        bm25_function = Function(
            name="text_bm25_emb",
            function_type=FunctionType.BM25,
            input_field_names=["text"],
            output_field_names=["text_sparse_emb"],
            params={},
        )
        schema.add_function(bm25_function)
        data_size = 5000
        collection_w = self.init_collection_wrap(
            name=cf.gen_unique_str(prefix), schema=schema
        )
        fake = fake_en
        if tokenizer == "jieba":
            language = "zh"
            fake = fake_zh
        else:
            language = "en"

        data = [
            {
                "id": i,
                "word": fake.word().lower() if random.random() >= empty_percent else "",
                "sentence": fake.sentence().lower() if random.random() >= empty_percent else "",
                "paragraph": fake.paragraph().lower() if random.random() >= empty_percent else "",
                "text": fake.text().lower() if random.random() >= empty_percent else "",
                "emb": [random.random() for _ in range(dim)],
            }
            for i in range(data_size)
        ]
        df = pd.DataFrame(data)
        corpus = df["text"].to_list()
        log.info(f"dataframe\n{df}")
        texts = df["text"].to_list()
        word_freq = cf.analyze_documents(texts, language=language)
        tokens = list(word_freq.keys())
        if len(tokens) == 0:
            log.info(f"empty tokens, add a dummy token")
            tokens = ["dummy"]
        batch_size = 5000
        for i in range(0, len(df), batch_size):
>           collection_w.insert(
                data[i : i + batch_size]
                if i + batch_size < len(df)
                else data[i : len(df)]
            )

test_full_text_search.py:1474: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../utils/wrapper.py:33: in inner_wrapper
    res, result = func(*args, **kwargs)
../base/collection_wrapper.py:130: in insert
    check_result = ResponseChecker(res, func_name, check_task, check_items, check,
../check/func_check.py:34: in run
    result = self.assert_succ(self.succ, True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <check.func_check.ResponseChecker object at 0x136912460>, actual = False
expect = True

    def assert_succ(self, actual, expect):
>       assert actual is expect, f"Response of API {self.func_name} expect {expect}, but got {actual}"
E       AssertionError: Response of API insert expect True, but got False

../check/func_check.py:116: AssertionError

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

zhuwenxing commented 1 month ago

/assign @zhengbuqian /assign @aoiasd PTAL

yanliang567 commented 1 month ago

/unassign

zhuwenxing commented 1 month ago

Verified; fixed by https://github.com/milvus-io/pymilvus/pull/2298
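
For readers who land here with the same error, the following is a minimal sketch of the kind of client-side check that would avoid it: when looking for fields missing from a row, skip any field that is listed as a function output. It is not the code from the PR above and not pymilvus internals. Plain dicts stand in for the schema, and the helper names `function_output_fields` and `missing_fields` are hypothetical.

```python
from typing import Any, Dict, List, Set


def function_output_fields(functions: List[Dict[str, Any]]) -> Set[str]:
    """Collect the names of all fields that some function writes to."""
    outputs: Set[str] = set()
    for fn in functions:
        outputs.update(fn.get("output_field_names", []))
    return outputs


def missing_fields(
    row: Dict[str, Any],
    fields: List[Dict[str, Any]],
    functions: List[Dict[str, Any]],
) -> List[str]:
    """Return the fields the caller must supply but did not.

    A field is not required in the row if it is a function output,
    is nullable, or has a default value (assumed dict keys, for illustration).
    """
    generated = function_output_fields(functions)
    missing = []
    for field in fields:
        name = field["name"]
        if name in row or name in generated:
            continue
        if field.get("nullable") or "default_value" in field:
            continue
        missing.append(name)
    return missing


# Simplified stand-in for the schema in this issue.
fields = [
    {"name": "id"},
    {"name": "text"},
    {"name": "emb"},
    {"name": "text_sparse_emb"},
]
bm25 = {
    "name": "text_bm25_emb",
    "input_field_names": ["text"],
    "output_field_names": ["text_sparse_emb"],
}
row = {"id": 0, "text": "hello world", "emb": [0.1, 0.2]}

print(missing_fields(row, fields, [bm25]))  # [] : the output field is skipped
print(missing_fields(row, fields, []))      # ['text_sparse_emb'] : no function, so it is required
```

The second call shows the behavior reported in this issue: when the check does not know about the function, the output field is treated as an ordinary required field.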