milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: collection.is_empty sometimes returns the wrong result #35866

Open NicoYuan1986 opened 2 months ago

NicoYuan1986 commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: master-20240823-e8e3544a-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

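collection.is_empty sometimes returns the wrong result. In the CI run below, the assertion `collection_w.is_empty is False` failed even though 2500 rows had just been inserted and the collection flushed.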

[2024-08-24T13:52:14.275Z] ___ TestCollectionSearch.test_search_HNSW_index_with_min_ef[False-10-512-4] ____
...
[2024-08-24T13:52:14.276Z] collection_w = <base.collection_wrapper.ApiCollectionWrapper object at 0x7fb648013bb0>
[2024-08-24T13:52:14.276Z] 
[2024-08-24T13:52:14.276Z]     def init_collection_general(self, prefix="test", insert_data=False, nb=ct.default_nb,
[2024-08-24T13:52:14.276Z]                                 partition_num=0, is_binary=False, is_all_data_type=False,
[2024-08-24T13:52:14.276Z]                                 auto_id=False, dim=ct.default_dim, is_index=True,
[2024-08-24T13:52:14.276Z]                                 primary_field=ct.default_int64_field_name, is_flush=True, name=None,
[2024-08-24T13:52:14.276Z]                                 enable_dynamic_field=False, with_json=True, random_primary_key=False,
[2024-08-24T13:52:14.276Z]                                 multiple_dim_array=[], is_partition_key=None, vector_data_type="FLOAT_VECTOR",
[2024-08-24T13:52:14.276Z]                                 **kwargs):
[2024-08-24T13:52:14.276Z]         """
[2024-08-24T13:52:14.276Z]         target: create specified collections
[2024-08-24T13:52:14.276Z]         method: 1. create collections (binary/non-binary, default/all data type, auto_id or not)
[2024-08-24T13:52:14.276Z]                 2. create partitions if specified
[2024-08-24T13:52:14.276Z]                 3. insert specified (binary/non-binary, default/all data type) data
[2024-08-24T13:52:14.276Z]                    into each partition if any
[2024-08-24T13:52:14.276Z]                 4. not load if specifying is_index as True
[2024-08-24T13:52:14.276Z]         expected: return collection and raw data, insert ids
[2024-08-24T13:52:14.276Z]         """
[2024-08-24T13:52:14.276Z]         log.info("Test case of search interface: initialize before test case")
[2024-08-24T13:52:14.276Z]         if not self.connection_wrap.has_connection(alias=DefaultConfig.DEFAULT_USING)[0]:
[2024-08-24T13:52:14.276Z]             self._connect()
[2024-08-24T13:52:14.276Z]         collection_name = cf.gen_unique_str(prefix)
[2024-08-24T13:52:14.276Z]         if name is not None:
[2024-08-24T13:52:14.276Z]             collection_name = name
[2024-08-24T13:52:14.276Z]         vectors = []
[2024-08-24T13:52:14.276Z]         binary_raw_vectors = []
[2024-08-24T13:52:14.276Z]         insert_ids = []
[2024-08-24T13:52:14.276Z]         time_stamp = 0
[2024-08-24T13:52:14.276Z]         # 1 create collection
[2024-08-24T13:52:14.276Z]         default_schema = cf.gen_default_collection_schema(auto_id=auto_id, dim=dim, primary_field=primary_field,
[2024-08-24T13:52:14.276Z]                                                           enable_dynamic_field=enable_dynamic_field,
[2024-08-24T13:52:14.276Z]                                                           with_json=with_json, multiple_dim_array=multiple_dim_array,
[2024-08-24T13:52:14.276Z]                                                           is_partition_key=is_partition_key,
[2024-08-24T13:52:14.276Z]                                                           vector_data_type=vector_data_type)
[2024-08-24T13:52:14.276Z]         if is_binary:
[2024-08-24T13:52:14.276Z]             default_schema = cf.gen_default_binary_collection_schema(auto_id=auto_id, dim=dim,
[2024-08-24T13:52:14.276Z]                                                                      primary_field=primary_field)
[2024-08-24T13:52:14.276Z]         if vector_data_type == ct.sparse_vector:
[2024-08-24T13:52:14.276Z]             default_schema = cf.gen_default_sparse_schema(auto_id=auto_id, primary_field=primary_field,
[2024-08-24T13:52:14.276Z]                                                                      enable_dynamic_field=enable_dynamic_field,
[2024-08-24T13:52:14.276Z]                                                                      with_json=with_json,
[2024-08-24T13:52:14.276Z]                                                                      multiple_dim_array=multiple_dim_array)
[2024-08-24T13:52:14.276Z]         if is_all_data_type:
[2024-08-24T13:52:14.276Z]             default_schema = cf.gen_collection_schema_all_datatype(auto_id=auto_id, dim=dim,
[2024-08-24T13:52:14.276Z]                                                                    primary_field=primary_field,
[2024-08-24T13:52:14.276Z]                                                                    enable_dynamic_field=enable_dynamic_field,
[2024-08-24T13:52:14.276Z]                                                                    with_json=with_json,
[2024-08-24T13:52:14.276Z]                                                                    multiple_dim_array=multiple_dim_array)
[2024-08-24T13:52:14.276Z]         log.info("init_collection_general: collection creation")
[2024-08-24T13:52:14.276Z]         collection_w = self.init_collection_wrap(name=collection_name, schema=default_schema, **kwargs)
[2024-08-24T13:52:14.276Z]         vector_name_list = cf.extract_vector_field_name_list(collection_w)
[2024-08-24T13:52:14.276Z]         # 2 add extra partitions if specified (default is 1 partition named "_default")
[2024-08-24T13:52:14.276Z]         if partition_num > 0:
[2024-08-24T13:52:14.276Z] [get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs
[2024-08-24T13:52:14.276Z] [create_path] folder(/tmp/ci_logs) is not exist.
[2024-08-24T13:52:14.276Z] [create_path] create path now...
[2024-08-24T13:52:14.276Z]             cf.gen_partitions(collection_w, partition_num)
[2024-08-24T13:52:14.276Z]         # 3 insert data if specified
[2024-08-24T13:52:14.276Z]         if insert_data:
[2024-08-24T13:52:14.276Z]             collection_w, vectors, binary_raw_vectors, insert_ids, time_stamp = \
[2024-08-24T13:52:14.276Z]                 cf.insert_data(collection_w, nb, is_binary, is_all_data_type, auto_id=auto_id,
[2024-08-24T13:52:14.276Z]                                dim=dim, enable_dynamic_field=enable_dynamic_field, with_json=with_json,
[2024-08-24T13:52:14.276Z]                                random_primary_key=random_primary_key, multiple_dim_array=multiple_dim_array,
[2024-08-24T13:52:14.276Z]                                primary_field=primary_field, vector_data_type=vector_data_type)
[2024-08-24T13:52:14.276Z]             if is_flush:
[2024-08-24T13:52:14.276Z] >               assert collection_w.is_empty is False
[2024-08-24T13:52:14.276Z] E               AssertionError
[2024-08-24T13:52:14.276Z] 
[2024-08-24T13:52:14.276Z] ../base/client_base.py:290: AssertionError
[2024-08-24T13:52:14.276Z] ------------------------------ Captured log call -------------------------------
...
[2024-08-24T13:52:14.277Z] [2024-08-24 13:30:53 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[{'float': 2500.0, 'varchar': '2500', 'json_field': {'number': 2500, 'float': 2500.0}, 'float_vector': [0.2555832088574461, 0.18711933125594346, 0.1027027432418517, 0.25024003965536257, 0.05308787333474047, 0.20607145507900107, 0.3754486942435134, 0.2404584083886309, 0.3446717130805288, 0.194732842......, kwargs: {'timeout': 180} (api_request.py:62)
[2024-08-24T13:52:14.277Z] [2024-08-24 13:30:53 - DEBUG - ci_test]: (api_response) : (insert count: 2500, delete count: 0, upsert count: 0, timestamp: 452068967420788737, success count: 2500, err count: 0  (api_request.py:37)
[2024-08-24T13:52:14.277Z] [2024-08-24 13:30:53 - INFO - ci_test]: inserted 2500 data into collection search_collection_fwwxQieH (common_func.py:1841)
[2024-08-24T13:52:14.277Z] [2024-08-24 13:30:53 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-08-24T13:52:14.277Z] [2024-08-24 13:30:56 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

Expected Behavior

collection.is_empty should be False once data has been inserted and flushed.
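
For anyone trying to reproduce this, here is a rough sketch of the sequence the failing test performs (insert, flush, then read `is_empty`), written so it samples the property several times and shows whether the wrong value is transient. The helper name `probe_is_empty_after_flush` and the `FakeCollection` class are hypothetical and not part of the test suite; `FakeCollection` is only an in-memory stand-in so the snippet runs without a server. Against a real deployment you would pass a `pymilvus.Collection`, which exposes `insert`, `flush`, `is_empty` and `num_entities`.

```python
import time


def probe_is_empty_after_flush(collection, rows, timeout=180, retries=5, interval=1.0):
    """Insert rows, flush, then sample (is_empty, num_entities) `retries` times.

    `collection` is assumed to behave like pymilvus.Collection:
    insert(data, timeout=...), flush(timeout=...), and the properties
    is_empty and num_entities. The timeout of 180 mirrors the CI log.
    """
    collection.insert(rows, timeout=timeout)
    collection.flush(timeout=timeout)

    samples = []
    for attempt in range(retries):
        samples.append((collection.is_empty, collection.num_entities))
        if attempt < retries - 1:
            time.sleep(interval)
    return samples


class FakeCollection:
    """Hypothetical in-memory stand-in; NOT the real Milvus behavior."""

    def __init__(self):
        self._rows = []

    def insert(self, rows, timeout=None):
        self._rows.extend(rows)

    def flush(self, timeout=None):
        # Nothing to persist in memory; the real call seals and flushes segments.
        pass

    @property
    def num_entities(self):
        return len(self._rows)

    @property
    def is_empty(self):
        # Assumption for this stand-in only: empty means zero entities.
        return self.num_entities == 0


if __name__ == "__main__":
    rows = [{"id": i} for i in range(3)]
    for is_empty, count in probe_is_empty_after_flush(FakeCollection(), rows, retries=2, interval=0):
        print(f"is_empty={is_empty} num_entities={count}")
```

If a real run shows `is_empty` as True in the first sample and False in later ones, that would suggest the entity count lags behind the flush rather than the data being lost, but this is only a diagnostic aid, not a confirmed explanation.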

Steps To Reproduce

No response

Milvus Log

link: https://qa-jenkins.milvus.io/blue/organizations/jenkins/E2E%20Test/detail/E2E%20Test/825/pipeline/
log: artifacts-e2e-test-825-server-logs.tar.gz

Anything else?

No response

yanliang567 commented 2 months ago

/assign @longjiquan
/unassign

xiaofan-luan commented 3 days ago

@NicoYuan1986 can we still reproduce it?