milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: flush timeout after upgrading from v2.3.15 to master-20240918-23b95aeb-amd64 #36375

Closed. zhuwenxing closed this issue 1 month ago.

zhuwenxing commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version:v2.3.15 to master-20240918-23b95aeb-amd64
- Deployment mode(standalone or cluster):both
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


(Each line below carries the timestamp [2024-09-18T10:43:39.660Z].)

self = <pymilvus.client.grpc_handler.GrpcHandler object at 0x7f7d581a92b0>
segment_ids = [452631689923239759]
collection_name = 'deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000'
flush_ts = 452632487711211523, timeout = None, kwargs = {}, flush_ret = False
start = 1726655913.281913, end = 1726656152.0023956

    def _wait_for_flushed(
        self,
        segment_ids: List[int],
        collection_name: str,
        flush_ts: int,
        timeout: Optional[float] = None,
        **kwargs,
    ):
        flush_ret = False
        start = time.time()
        while not flush_ret:
            flush_ret = self.get_flush_state(
                segment_ids, collection_name, flush_ts, timeout, **kwargs
            )
            end = time.time()
            if timeout is not None and end - start > timeout:
                raise MilvusException(
                    message=f"wait for flush timeout, collection: {collection_name}, flusht_ts: {flush_ts}"
                )

            if not flush_ret:
>               time.sleep(0.5)
E               Failed: Timeout >240.0s
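In the traceback, timeout = None, so the `if timeout is not None` guard never fires and `_wait_for_flushed` keeps polling every 0.5 s for as long as `get_flush_state` returns False. The test was stopped only by the test harness's own 240 s limit; start and end differ by about 238.7 s, just under that limit. The sketch below is a minimal, hypothetical illustration of a deadline-bounded polling loop, assuming a generic zero-argument predicate in place of pymilvus's `get_flush_state`. `wait_until` and `FlushWaitTimeout` are illustrative names, not pymilvus API.

```python
import time
from typing import Callable, Optional


class FlushWaitTimeout(Exception):
    """Hypothetical error raised when the polled condition never becomes true."""


def wait_until(
    check: Callable[[], bool],
    timeout: Optional[float] = 60.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `check` every `interval` seconds until it returns True.

    Returns the elapsed seconds. timeout=None waits forever, which is the
    behaviour that let the flush wait above run until the harness killed it.
    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if check():
            return clock() - start
        elapsed = clock() - start
        if timeout is not None and elapsed >= timeout:
            raise FlushWaitTimeout(f"condition not met after {elapsed:.1f}s")
        # Sleep one interval, but never past the deadline.
        if timeout is None:
            sleep(interval)
        else:
            sleep(min(interval, timeout - elapsed))
```

With a finite timeout the caller gets a clear error instead of a silent hang. In pymilvus itself, the code shown above raises `MilvusException` once an explicit timeout is exceeded, so passing one to the flush call would surface the failure the same way.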

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2798/pipeline

log: artifacts-pulsar-cluster-upgrade-2798-server-logs.tar.gz

Anything else?

No response

zhuwenxing commented 1 month ago

/assign @XuanYang-cn
PTAL, this issue reproduces consistently.
/unassign @yanliang567

bigsheeper commented 1 month ago

The segment cannot be found, and its segmentID is the same as the collectionID. [screenshot]

bigsheeper commented 1 month ago

related to: https://github.com/milvus-io/milvus/pull/36359

bigsheeper commented 1 month ago

/assign @zhuwenxing
/unassign @XuanYang-cn
This should be fixed.

zhuwenxing commented 1 month ago

verified and passed with master-20240920-eb23e23c-amd64