milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [null & default] Search returns more results than expected when filtering with expression "field_name == 0" on a nullable field containing None data, without flush #37734

Open binbinlv opened 2 hours ago

binbinlv commented 2 hours ago

Is there an existing issue for this?

Environment

- Milvus version: master-20241115-d1596297-amd64
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka): all
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0rc121
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

When searching with the expression "field_name == 0" on a nullable field that contains None data, without flushing first, the search returns more results than expected (10 instead of 1):

search_results_check: limit(topK) searched (10) is not equal with expected (1) (func_check.py:346)

Expected Behavior

The search returns exactly 1 result per query, so search_results_check passes: limit(topK) searched (1) equals expected (1).

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L1)
    # @pytest.mark.skip(reason="issue #37547")
    def test_search_none_data_expr_cache(self, is_flush):
        """
        target: test search case with none data to test expr cache
        method: 1. create collection with double datatype as nullable field
                2. search with expr "nullableFid == 0"
                3. drop this collection
                4. create collection with same collection name and same field name but modify the type of nullable field
                   as varchar datatype
                5. search with expr "nullableFid == 0" again
        expected: 1. search successfully with limit(topK) for the first collection
                  2. report error for the second collection with the same name
        """
        # 1. initialize with data
        collection_w, _, _, insert_ids, time_stamp = \
            self.init_collection_general(prefix, True, is_flush=is_flush)[0:5]
        collection_name = collection_w.name
        # 2. generate search data
        vectors = cf.gen_vectors_based_on_vector_type(default_nq, default_dim)
        # 3. search with expr "nullableFid == 0"
        search_exp = f"{ct.default_float_field_name} == 0"
        output_fields = [default_int64_field_name, default_float_field_name]
        collection_w.search(vectors[:default_nq], default_search_field,
                            default_search_params, default_limit,
                            search_exp,
                            output_fields=output_fields,
                            check_task=CheckTasks.check_search_results,
                            check_items={"nq": default_nq,
                                         "ids": insert_ids,
                                         "limit": 1,
                                         "output_fields": output_fields})
        # 4. drop collection
        collection_w.drop()
        # 5. create the same collection name with same field name but varchar field type
        int64_field = cf.gen_int64_field(is_primary=True)
        string_field = cf.gen_string_field(ct.default_float_field_name)
        json_field = cf.gen_json_field()
        float_vector_field = cf.gen_float_vec_field()
        fields = [int64_field, string_field, json_field, float_vector_field]
        schema = cf.gen_collection_schema(fields)
        collection_w = self.init_collection_wrap(name=collection_name, schema=schema)
        int64_values = pd.Series(data=[i for i in range(default_nb)])
        string_values = pd.Series(data=[str(i) for i in range(default_nb)], dtype="string")
        json_values = [{"number": i, "string": str(i), "bool": bool(i),
                        "list": [j for j in range(i, i + ct.default_json_list_length)]} for i in range(default_nb)]
        float_vec_values = cf.gen_vectors(default_nb, default_dim)
        df = pd.DataFrame({
            ct.default_int64_field_name: int64_values,
            ct.default_float_field_name: string_values,
            ct.default_json_field_name: json_values,
            ct.default_float_vec_field_name: float_vec_values
        })
        collection_w.insert(df)
        collection_w.create_index(ct.default_float_vec_field_name, ct.default_flat_index)
        collection_w.load()
        collection_w.flush()
        collection_w.search(vectors[:default_nq], default_search_field,
                            default_search_params, default_limit,
                            search_exp,
                            output_fields=output_fields,
                            check_task=CheckTasks.err_res,
                            check_items={"err_code": 1100,
                                         "err_msg": "failed to create query plan: cannot parse expression: float == 0, "
                                                    "error: comparisons between VarChar and Int64 are not supported: "
                                                    "invalid parameter"})

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22test-null-master-cjqsw.*%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Anything else?

collection name: search_collection_euwUGGzx

binbinlv commented 2 hours ago

  1. If the search runs after a flush, the result is correct: 1 hit is returned, not 10.
  2. If the search uses the expression "field_name == 1", the result is also correct: 1 hit is returned, not 10.
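The two observations above are consistent with one possible explanation, which is only an assumption and is not confirmed anywhere in this report: on unflushed data, null rows might be compared as if they held a zero placeholder. The pure-Python sketch below models that hypothesis next to the expected semantics. The function names and the sample data are hypothetical and are not part of Milvus or pymilvus.

```python
from typing import List, Optional, Sequence


def match_eq(values: Sequence[Optional[float]], target: float) -> List[int]:
    """Expected semantics: a null (None) row never satisfies `field == target`."""
    return [i for i, v in enumerate(values) if v is not None and v == target]


def match_eq_nulls_as_zero(values: Sequence[Optional[float]], target: float) -> List[int]:
    """Hypothetical faulty variant: nulls are compared as a 0.0 placeholder."""
    return [i for i, v in enumerate(values) if (0.0 if v is None else v) == target]


# Hypothetical data: row 0 holds 0.0, row 1 holds 1.0, the remaining rows are null.
values: List[Optional[float]] = [0.0, 1.0] + [None] * 18
limit = 10  # topK used by the search

expected_hits = match_eq(values, 0.0)[:limit]
faulty_hits = match_eq_nulls_as_zero(values, 0.0)[:limit]

print(len(expected_hits))  # one matching row, as the test expects
print(len(faulty_hits))    # null rows also match, so the result fills up to the limit
```

Under this model, "field_name == 1" yields the same single hit in both variants, which would match observation 2. It does not explain why flushing changes the outcome; that part would have to be confirmed from the server logs.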
binbinlv commented 2 hours ago

This issue surfaced while verifying the fix for the crash issue #37547 (the crash is now fixed); the same test case exposes it.