milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.19k stars 2.89k forks source link

[Bug]: After Milvus recovered from the Pulsar pod kill chaos, a flush timeout occurred. #37171

Open zhuwenxing opened 14 hours ago

zhuwenxing commented 14 hours ago

Is there an existing issue for this?

Environment

- Milvus version:2.4-20241025-da897e41-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-10-27T19:16:47.003Z] ------------------------------ Captured log setup ------------------------------

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:44)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - INFO - ci_test]: [setup_method] Start setup test case test_milvus_default. (client_base.py:45)

[2024-10-27T19:16:47.003Z] ------------------------------ Captured log call -------------------------------

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.255.60.164', 'port': 19530} (api_request.py:62)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - DEBUG - ci_test]: (api_response) : True  (api_request.py:37)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['InsertChecker__6kz5Rxiu', {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <Da......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - DEBUG - ci_test]: (api_response) : <Collection>:

[2024-10-27T19:16:47.003Z] -------------

[2024-10-27T19:16:47.003Z] <name>: InsertChecker__6kz5Rxiu

[2024-10-27T19:16:47.003Z] <description>: 

[2024-10-27T19:16:47.003Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT......  (api_request.py:37)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:13:12 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:16:12 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:107)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:16:12 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__6kz5Rxiu, flusht_ts: 453523902911217670)>, <Time:{'RPC start': '2024-10-27 19:13:12.335034', 'RPC error': '2024-10-27 19:16:12.896757'}> (decorators.py:147)

[2024-10-27T19:16:47.003Z] [2024-10-27 19:16:12 - ERROR - ci_test]: Traceback (most recent call last):

[2024-10-27T19:16:47.003Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-10-27T19:16:47.003Z]     res = func(*args, **_kwargs)

[2024-10-27T19:16:47.003Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-10-27T19:16:47.003Z]     return func(*arg, **kwargs)

[2024-10-27T19:16:47.003Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 318, in flush

[2024-10-27T19:16:47.004Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-10-27T19:16:47.004Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 148, in handler

[2024-10-27T19:16:47.004Z]     raise e from e

[2024-10-27T19:16:47.004Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 144, in handler

[2024-10-27T19:16:47.004Z]     return func(*args, **kwargs)

[2024-10-27T19:16:47.004Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 183, in handler

[2024-10-27T19:16:47.004Z]     return func(self, *args, **kwargs)

[2024-10-27T19:16:47.004Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 108, in handler

[2024-10-27T19:16:47.004Z]     raise MilvusException(

[2024-10-27T19:16:47.004Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__6kz5Rxiu, flusht_ts: 453523902911217670)>

[2024-10-27T19:16:47.004Z]  (api_request.py:45)

[2024-10-27T19:16:47.004Z] [2024-10-27 19:16:12 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: InsertChecker__6kz5Rxiu, flusht_ts: 453523902911217670)> (api_request.py:46)

[2024-10-27T19:16:47.004Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/17136/pipeline

log: artifacts-pulsar-pod-kill-17136-server-logs.tar.gz

cluster: 4am ns: chaos-testing pod info

[2024-10-27T19:10:14.369Z] + kubectl get pods -o wide

[2024-10-27T19:10:14.370Z] + grep pulsar-pod-kill-17136

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-bookie-0                                    1/1     Running       0                8m14s   10.104.21.239   4am-node24   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-bookie-1                                    1/1     Running       0                8m14s   10.104.18.174   4am-node25   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-bookie-init-jcm8p                           0/1     Completed     0                31m     10.104.30.212   4am-node38   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-broker-0                                    1/1     Running       0                8m14s   10.104.13.80    4am-node16   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-etcd-0                                      1/1     Running       0                31m     10.104.18.160   4am-node25   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-etcd-1                                      1/1     Running       0                31m     10.104.33.12    4am-node36   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-etcd-2                                      1/1     Running       0                31m     10.104.21.207   4am-node24   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-datanode-7bfc9c4d75-kb7t8            1/1     Running       3 (8m43s ago)    31m     10.104.27.69    4am-node31   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-datanode-7bfc9c4d75-kdrzf            1/1     Running       3 (9m13s ago)    31m     10.104.13.71    4am-node16   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-indexnode-7c99d57856-ldwq4           1/1     Running       2 (31m ago)      31m     10.104.15.44    4am-node20   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-indexnode-7c99d57856-xcskj           1/1     Running       2 (31m ago)      31m     10.104.14.166   4am-node18   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-indexnode-7c99d57856-zhncv           1/1     Running       2 (31m ago)      31m     10.104.9.243    4am-node14   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-mixcoord-bff96dc5d-gmbfj             1/1     Running       2 (31m ago)      31m     10.104.4.194    4am-node11   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-proxy-79569d96b-tnhct                1/1     Running       2 (31m ago)      31m     10.104.4.196    4am-node11   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-querynode-678f5f8-2cb8v              1/1     Running       2 (31m ago)      31m     10.104.16.227   4am-node21   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-querynode-678f5f8-5mjgq              1/1     Running       2 (31m ago)      31m     10.104.4.195    4am-node11   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-milvus-querynode-678f5f8-drwml              1/1     Running       2                31m     10.104.18.153   4am-node25   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-minio-0                                     1/1     Running       0                31m     10.104.21.203   4am-node24   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-minio-1                                     1/1     Running       0                31m     10.104.33.11    4am-node36   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-minio-2                                     1/1     Running       0                31m     10.104.30.214   4am-node38   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-minio-3                                     1/1     Running       0                31m     10.104.18.163   4am-node25   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-proxy-0                                     1/1     Running       0                8m14s   10.104.13.82    4am-node16   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-pulsar-init-lkfj4                           0/1     Completed     0                31m     10.104.21.195   4am-node24   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-recovery-0                                  1/1     Running       0                8m14s   10.104.13.81    4am-node16   <none>           <none>

[2024-10-27T19:10:14.625Z] pulsar-pod-kill-17136-zookeeper-0                                 1/1     Running       0                8m14s   10.104.21.240   4am-node24   <none>           <none>

Anything else?

No response

yanliang567 commented 13 hours ago

/assign @weiliu1031 /unassign