Closed: zhuwenxing closed this issue 4 months ago
/assign @XuanYang-cn PTAL
@XuanYang-cn Stable recurrence: Flush fails with `Unrecoverable error` or `Deadline Exceeded` after removing the chaos (pod-kill or pod-failure) from datacoord and minio.
Flush is stuck here at DataNode FlushChannels.
Looks like a deadlock.
This could be an important issue for 2.3 as well.
Not a deadlock; looks like a network issue.
Proxy received Flush at 20:32:54, but DataCoord received it at 20:35:54, by which point it had already timed out and failed, since the flush timeout is 180s.
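To make that timing concrete, here is a small sketch (the date is assumed; only the times appear above) comparing the gap to the 180s client timeout:

```python
from datetime import datetime

# Times come from the comment above; the date is assumed purely for illustration.
proxy_received = datetime.fromisoformat("2024-02-20 20:32:54")
datacoord_received = datetime.fromisoformat("2024-02-20 20:35:54")
flush_timeout_s = 180  # the test calls Collection.flush(timeout=180)

delay_s = (datacoord_received - proxy_received).total_seconds()
print(delay_s)                     # 180.0
print(delay_s >= flush_timeout_s)  # True: the deadline expires just as DataCoord receives the request
```

So from the client's side the RPC has already hit DEADLINE_EXCEEDED by the time DataCoord even sees the Flush request.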
Could this be caused by some queue inside the system?
Currently, this issue mainly and consistently reproduces in the datacoord pod-kill chaos test.
failed job:
image tag: master-20240220-e5a16050
We need a debug mode to reproduce these failures:
Compaction timeout & some flush timeout reasons:
2.3 has the same problem, but it was never caught in testing, because
@zhuwenxing Please help verify the master branch
It did not happen in the datacoord chaos test, but it did in the etcd chaos test.
It also happened after reinstallation or upgrade.
Reinstallation failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2001/pipeline log: artifacts-pulsar-cluster-reinstall-2001-server-logs.tar.gz
Upgrade failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2003/pipeline log: artifacts-pulsar-cluster-upgrade-2003-server-logs.tar.gz
@XuanYang-cn
It still reproduced in master-20240312-de2c95d0, which should already have L0 compaction enabled.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/12370/pipeline
log:
Found segment loss in the DataNode metacache, which caused insert buffer failures and data loss, and also caused the channel checkpoint to go wrong and flush to block. The reason was that the flowgraph added segments to the metacache created by the previous watch, but looked segments up in the new channel's metacache (channel balancing when a node is killed can cause some channels to be watched more than once on the same node). Will be fixed by @congqixia.
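A rough illustration of that description, as a minimal Python sketch with hypothetical names (the real DataNode code is Go): the flowgraph keeps writing into the metacache created by the previous watch, while lookups go through the metacache of the newly watched channel, so the segment appears lost.

```python
class MetaCache:
    def __init__(self):
        self.segments = {}

class WatchedChannel:
    """Each watch of a channel creates a fresh metacache (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.metacache = MetaCache()

first_watch = WatchedChannel("by-dev-rootcoord-dml_0")
flowgraph_cache = first_watch.metacache          # flowgraph holds the metacache of the first watch

# Channel balancing after a pod kill makes the node watch the same channel again,
# creating a new metacache.
second_watch = WatchedChannel("by-dev-rootcoord-dml_0")

# Bug pattern: new segments are added to the stale metacache from the previous watch ...
flowgraph_cache.segments["seg-1001"] = {"rows": 128}

# ... but reads go through the new channel's metacache, so the segment looks lost,
# the insert buffer fails, and the channel checkpoint / flush gets stuck.
print(second_watch.metacache.segments.get("seg-1001"))  # None
```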
/assign @zhuwenxing Please help verify /unassign
/assign @XuanYang-cn
Still reproduced in image tag master-20240321-09281a07-amd64 after the minio chaos test.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/12681/pipeline log: artifacts-s3-pod-failure-12681-server-logs.tar.gz
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__xuA9G03s] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__vXi9oJPB] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__sj8FMQcA] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__U7kjLPDQ] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[InsertChecker__lYgFQuGk] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DropChecker__hs2q8puA] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[IndexChecker__v54pvpgb] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[CreateChecker__2QIjSMIY] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DeleteChecker__x9QA0oRx] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[FlushChecker__xlx7s4Xu] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__dB4Zakqq] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[HybridSearchChecker__6Yjmdgm0] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__uBtpdNwI] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[UpsertChecker__s82T1rNu] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__JGHbptZU] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] ======================== 15 failed in 888.90s (0:14:48) ========================
cluster: 4am, namespace: chaos-testing, pod info:
[2024-03-21T21:17:35.045Z] + kubectl get pods -o wide
[2024-03-21T21:17:35.047Z] + grep s3-pod-failure-12681
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-etcd-0 1/1 Running 0 38m 10.104.24.144 4am-node29 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-etcd-1 1/1 Running 0 38m 10.104.32.104 4am-node39 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-etcd-2 1/1 Running 0 38m 10.104.20.199 4am-node22 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-datacoord-6d4858cf6c-fshml 1/1 Running 0 38m 10.104.25.62 4am-node30 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-datanode-84655fbfc4-ktlkk 1/1 Running 0 38m 10.104.14.79 4am-node18 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-datanode-84655fbfc4-kvws8 1/1 Running 4 (6m46s ago) 38m 10.104.25.67 4am-node30 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-indexcoord-66cdcc9fd6-t2sqq 1/1 Running 0 38m 10.104.25.61 4am-node30 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-indexnode-554988c759-4p4qr 1/1 Running 0 38m 10.104.29.102 4am-node35 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-indexnode-554988c759-c4flk 1/1 Running 0 38m 10.104.1.248 4am-node10 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-indexnode-554988c759-cc77g 1/1 Running 0 38m 10.104.33.38 4am-node36 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-proxy-5889f7d6f4-kcww5 1/1 Running 0 38m 10.104.25.63 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querycoord-7bf9466ddf-xqz2g 1/1 Running 0 38m 10.104.25.66 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querynode-567cd6cc75-6vw9z 1/1 Running 0 38m 10.104.25.60 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querynode-567cd6cc75-f6wlr 1/1 Running 0 38m 10.104.19.69 4am-node28 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querynode-567cd6cc75-p7mcm 1/1 Running 0 38m 10.104.33.37 4am-node36 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-rootcoord-78497f6b47-55db9 1/1 Running 0 38m 10.104.25.59 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-0 1/1 Running 8 (14m ago) 38m 10.104.17.127 4am-node23 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-1 1/1 Running 8 (15m ago) 38m 10.104.20.196 4am-node22 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-2 1/1 Running 8 (14m ago) 38m 10.104.24.146 4am-node29 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-3 1/1 Running 8 (14m ago) 38m 10.104.32.107 4am-node39 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-bookie-0 1/1 Running 0 38m 10.104.32.102 4am-node39 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-bookie-1 1/1 Running 0 38m 10.104.20.198 4am-node22 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-bookie-init-5kfh6 0/1 Completed 0 38m 10.104.5.170 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-broker-0 1/1 Running 0 38m 10.104.9.171 4am-node14 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-proxy-0 1/1 Running 0 38m 10.104.5.171 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-pulsar-init-k6dzw 0/1 Completed 0 38m 10.104.5.169 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-recovery-0 1/1 Running 0 38m 10.104.5.172 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-zookeeper-0 1/1 Running 0 38m 10.104.17.125 4am-node23 <none> <none>
One of the DataNodes restarted twice after 12:17:35, making flushes during 12:17-12:30 time out.
Consumer busy caused by nodeID=0 for the dispatcher: #31516
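Purely as an illustration of the symptom referenced in #31516 (hypothetical names, not the actual msgdispatcher code): if every consumer registers with the default nodeID=0, a dispatcher that allows only one consumer per node ID rejects the second registration as busy.

```python
class Dispatcher:
    """Toy dispatcher allowing one consumer per node ID (illustrative only)."""
    def __init__(self):
        self.consumers = {}

    def register(self, node_id, channel):
        if node_id in self.consumers:
            # A second registration under the same ID surfaces as a "consumer busy" error.
            raise RuntimeError(f"consumer busy: nodeID={node_id} already registered")
        self.consumers[node_id] = channel

d = Dispatcher()
d.register(0, "by-dev-rootcoord-dml_0")  # first consumer falls back to the default nodeID=0
d.register(0, "by-dev-rootcoord-dml_1")  # same default ID again -> RuntimeError: consumer busy
```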
/assign
Related: #31518 /assign @zhuwenxing Please help verify in master
Still reproduced in 2.4-20240405-841f9e4f-amd64
[2024-04-06T22:02:44.885Z] <name>: Checker__gDhcHH2H
[2024-04-06T22:02:44.885Z] <description>:
[2024-04-06T22:02:44.885Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}...... (api_request.py:37)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:47:57 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:50:57 - ERROR - pymilvus.decorators]: grpc RpcError: [flush], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-04-06 21:47:57.006781', 'gRPC error': '2024-04-06 21:50:57.008366'}> (decorators.py:157)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:50:57 - ERROR - ci_test]: Traceback (most recent call last):
[2024-04-06T22:02:44.885Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-04-06T22:02:44.885Z] res = func(*args, **_kwargs)
[2024-04-06T22:02:44.885Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-04-06T22:02:44.885Z] return func(*arg, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 317, in flush
[2024-04-06T22:02:44.885Z] conn.flush([self.name], timeout=timeout, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 161, in handler
[2024-04-06T22:02:44.885Z] raise e from e
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 143, in handler
[2024-04-06T22:02:44.885Z] return func(*args, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 182, in handler
[2024-04-06T22:02:44.885Z] return func(self, *args, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 91, in handler
[2024-04-06T22:02:44.885Z] raise e from e
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 87, in handler
[2024-04-06T22:02:44.885Z] return func(*args, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1396, in flush
[2024-04-06T22:02:44.885Z] response = future.result()
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 881, in result
[2024-04-06T22:02:44.885Z] raise self
[2024-04-06T22:02:44.885Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
[2024-04-06T22:02:44.885Z] status = StatusCode.DEADLINE_EXCEEDED
[2024-04-06T22:02:44.885Z] details = "Deadline Exceeded"
[2024-04-06T22:02:44.885Z] debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-04-06T21:50:57.007850385+00:00"}"
[2024-04-06T22:02:44.885Z] >
[2024-04-06T22:02:44.885Z] (api_request.py:45)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:50:57 - ERROR - ci_test]: (api_response) : <_MultiThreadedRendezvous of RPC that terminated with:
[2024-04-06T22:02:44.885Z] status = StatusCode.DEADLINE_EXCEEDED
[2024-04-06T22:02:44.885Z] details = "Deadline Exceeded"
[2024-04-06T22:02:44.885Z] debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-04-06T21:50:57.007850385+00:00"}"
[2024-04-06T22:02:44.885Z] > (api_request.py:46)
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/11721/pipeline log: artifacts-rootcoord-pod-kill-11721-server-logs.tar.gz
> Not a deadlock; looks like a network issue. Proxy received Flush at 20:32:54, but DataCoord received it at 20:35:54, already past the 180s flush timeout.

Back to this problem.
Do we have any metrics for the queue size? Maybe flush just cannot catch up and the queue size accumulates in the proxy.
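I'm not sure whether such a metric exists today; as a sketch only (hypothetical metric and queue names, not Milvus's actual proxy code), a queue-size gauge of the kind being asked about could look like this with prometheus_client:

```python
from collections import deque
from prometheus_client import Gauge, start_http_server

# Hypothetical names; this is only an illustration of a queue-size metric.
flush_task_queue = deque()
queue_size = Gauge(
    "proxy_flush_task_queue_size",
    "Number of flush tasks waiting in the proxy task queue",
)

def enqueue_flush_task(task):
    flush_task_queue.append(task)
    queue_size.set(len(flush_task_queue))

def dequeue_flush_task():
    task = flush_task_queue.popleft()
    queue_size.set(len(flush_task_queue))
    return task

if __name__ == "__main__":
    start_http_server(9091)  # expose /metrics so queue accumulation would be visible
```

If the proxy's task queue really were the bottleneck, a gauge like this would climb during the chaos window and stay high until flush catches up.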
File new issues for individual chaos failures; closing this one for now.
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
compact timeout:
flush timeout:
log: artifacts-querynode-pod-kill-10682-server-logs.tar.gz
artifacts-datanode-pod-failure-10690-server-logs.tar.gz
artifacts-datacoord-pod-failure-10766-server-logs.tar.gz
Anything else?
No response