Closed: zhuwenxing closed this issue 4 months ago
/assign @XuanYang-cn PTAL
@XuanYang-cn Stable recurrence: Flush fails with `Unrecoverable error` or `Deadline Exceeded` after removing the chaos (pod-kill or pod-failure) from datacoord and minio.
Flush is stuck here at DataNode FlushChannels.
Looks like a deadlock.
This could be an important issue for 2.3 as well.
Not a deadlock; looks like a network issue.
Proxy received Flush at 20:32:54, but DataCoord received it at 20:35:54, by which point it had already timed out and failed, since the flush timeout is 180s.
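To make that timing concrete, here is a small sketch (the date is assumed; only the times appear above) comparing the gap to the 180s client timeout:

```python
from datetime import datetime

# Times come from the comment above; the date is assumed purely for illustration.
proxy_received = datetime.fromisoformat("2024-02-20 20:32:54")
datacoord_received = datetime.fromisoformat("2024-02-20 20:35:54")
flush_timeout_s = 180  # the test calls Collection.flush(timeout=180)

delay_s = (datacoord_received - proxy_received).total_seconds()
print(delay_s)                     # 180.0
print(delay_s >= flush_timeout_s)  # True: the deadline expires just as DataCoord receives the request
```

So from the client's side the RPC has already hit DEADLINE_EXCEEDED by the time DataCoord even sees the Flush request.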
Could this be caused by some queue inside the system?
Currently, this issue mainly and consistently reproduces in the datacoord pod-kill chaos test.
failed job:
image tag: master-20240220-e5a16050
We need a debug mode to reproduce these failures:
Compaction timeout & some flush timeout reasons:
2.3 has the same problem, but it was never caught in testing, because
@zhuwenxing Please help verify the master branch
It did not happen in the datacoord chaos test, but it did in the etcd chaos test.
It also happened after reinstallation or upgrade.
Reinstallation failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2001/pipeline log: artifacts-pulsar-cluster-reinstall-2001-server-logs.tar.gz
Upgrade failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2003/pipeline log: artifacts-pulsar-cluster-upgrade-2003-server-logs.tar.gz
@XuanYang-cn
It still reproduced in master-20240312-de2c95d0, which should already have L0 compaction enabled.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/12370/pipeline
log:
Found segment loss in the DataNode metacache, which caused insert buffer failures and data loss, and also caused the channel checkpoint to go wrong and flush to block. The reason was that the flowgraph added segments to the metacache created by the previous watch, but looked segments up in the new channel's metacache (channel balancing when a node is killed can cause some channels to be watched more than once on the same node). Will be fixed by @congqixia.
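A rough illustration of that description, as a minimal Python sketch with hypothetical names (the real DataNode code is Go): the flowgraph keeps writing into the metacache created by the previous watch, while lookups go through the metacache of the newly watched channel, so the segment appears lost.

```python
class MetaCache:
    def __init__(self):
        self.segments = {}

class WatchedChannel:
    """Each watch of a channel creates a fresh metacache (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.metacache = MetaCache()

first_watch = WatchedChannel("by-dev-rootcoord-dml_0")
flowgraph_cache = first_watch.metacache          # flowgraph holds the metacache of the first watch

# Channel balancing after a pod kill makes the node watch the same channel again,
# creating a new metacache.
second_watch = WatchedChannel("by-dev-rootcoord-dml_0")

# Bug pattern: new segments are added to the stale metacache from the previous watch ...
flowgraph_cache.segments["seg-1001"] = {"rows": 128}

# ... but reads go through the new channel's metacache, so the segment looks lost,
# the insert buffer fails, and the channel checkpoint / flush gets stuck.
print(second_watch.metacache.segments.get("seg-1001"))  # None
```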
/assign @zhuwenxing Please help verify /unassign
/assign @XuanYang-cn
Still reproduced in image tag master-20240321-09281a07-amd64 after the minio chaos test.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/12681/pipeline log: artifacts-s3-pod-failure-12681-server-logs.tar.gz
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__xuA9G03s] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__vXi9oJPB] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__sj8FMQcA] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__U7kjLPDQ] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[InsertChecker__lYgFQuGk] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DropChecker__hs2q8puA] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[IndexChecker__v54pvpgb] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[CreateChecker__2QIjSMIY] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DeleteChecker__x9QA0oRx] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[FlushChecker__xlx7s4Xu] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__dB4Zakqq] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[HybridSearchChecker__6Yjmdgm0] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__uBtpdNwI] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[UpsertChecker__s82T1rNu] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__JGHbptZU] - AssertionError: Response of API flush expect True, but got False
[2024-03-21T21:32:53.426Z] ======================== 15 failed in 888.90s (0:14:48) ========================
cluster: 4am, namespace: chaos-testing, pod info:
[2024-03-21T21:17:35.045Z] + kubectl get pods -o wide
[2024-03-21T21:17:35.047Z] + grep s3-pod-failure-12681
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-etcd-0 1/1 Running 0 38m 10.104.24.144 4am-node29 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-etcd-1 1/1 Running 0 38m 10.104.32.104 4am-node39 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-etcd-2 1/1 Running 0 38m 10.104.20.199 4am-node22 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-datacoord-6d4858cf6c-fshml 1/1 Running 0 38m 10.104.25.62 4am-node30 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-datanode-84655fbfc4-ktlkk 1/1 Running 0 38m 10.104.14.79 4am-node18 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-datanode-84655fbfc4-kvws8 1/1 Running 4 (6m46s ago) 38m 10.104.25.67 4am-node30 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-indexcoord-66cdcc9fd6-t2sqq 1/1 Running 0 38m 10.104.25.61 4am-node30 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-indexnode-554988c759-4p4qr 1/1 Running 0 38m 10.104.29.102 4am-node35 <none> <none>
[2024-03-21T21:17:35.302Z] s3-pod-failure-12681-milvus-indexnode-554988c759-c4flk 1/1 Running 0 38m 10.104.1.248 4am-node10 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-indexnode-554988c759-cc77g 1/1 Running 0 38m 10.104.33.38 4am-node36 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-proxy-5889f7d6f4-kcww5 1/1 Running 0 38m 10.104.25.63 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querycoord-7bf9466ddf-xqz2g 1/1 Running 0 38m 10.104.25.66 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querynode-567cd6cc75-6vw9z 1/1 Running 0 38m 10.104.25.60 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querynode-567cd6cc75-f6wlr 1/1 Running 0 38m 10.104.19.69 4am-node28 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-querynode-567cd6cc75-p7mcm 1/1 Running 0 38m 10.104.33.37 4am-node36 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-milvus-rootcoord-78497f6b47-55db9 1/1 Running 0 38m 10.104.25.59 4am-node30 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-0 1/1 Running 8 (14m ago) 38m 10.104.17.127 4am-node23 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-1 1/1 Running 8 (15m ago) 38m 10.104.20.196 4am-node22 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-2 1/1 Running 8 (14m ago) 38m 10.104.24.146 4am-node29 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-minio-3 1/1 Running 8 (14m ago) 38m 10.104.32.107 4am-node39 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-bookie-0 1/1 Running 0 38m 10.104.32.102 4am-node39 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-bookie-1 1/1 Running 0 38m 10.104.20.198 4am-node22 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-bookie-init-5kfh6 0/1 Completed 0 38m 10.104.5.170 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-broker-0 1/1 Running 0 38m 10.104.9.171 4am-node14 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-proxy-0 1/1 Running 0 38m 10.104.5.171 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-pulsar-init-k6dzw 0/1 Completed 0 38m 10.104.5.169 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-recovery-0 1/1 Running 0 38m 10.104.5.172 4am-node12 <none> <none>
[2024-03-21T21:17:35.303Z] s3-pod-failure-12681-pulsar-zookeeper-0 1/1 Running 0 38m 10.104.17.125 4am-node23 <none> <none>
One of the DataNodes restarted twice after 12:17:35, making flushes during 12:17-12:30 time out.
Consumer busy caused by nodeID=0 for the dispatcher: #31516
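Purely as an illustration of the symptom referenced in #31516 (hypothetical names, not the actual msgdispatcher code): if every consumer registers with the default nodeID=0, a dispatcher that allows only one consumer per node ID rejects the second registration as busy.

```python
class Dispatcher:
    """Toy dispatcher allowing one consumer per node ID (illustrative only)."""
    def __init__(self):
        self.consumers = {}

    def register(self, node_id, channel):
        if node_id in self.consumers:
            # A second registration under the same ID surfaces as a "consumer busy" error.
            raise RuntimeError(f"consumer busy: nodeID={node_id} already registered")
        self.consumers[node_id] = channel

d = Dispatcher()
d.register(0, "by-dev-rootcoord-dml_0")  # first consumer falls back to the default nodeID=0
d.register(0, "by-dev-rootcoord-dml_1")  # same default ID again -> RuntimeError: consumer busy
```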
/assign
Related: #31518 /assign @zhuwenxing Please help verify in master
Still reproduced in 2.4-20240405-841f9e4f-amd64
[2024-04-06T22:02:44.885Z] <name>: Checker__gDhcHH2H
[2024-04-06T22:02:44.885Z] <description>:
[2024-04-06T22:02:44.885Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}...... (api_request.py:37)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:47:57 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:50:57 - ERROR - pymilvus.decorators]: grpc RpcError: [flush], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-04-06 21:47:57.006781', 'gRPC error': '2024-04-06 21:50:57.008366'}> (decorators.py:157)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:50:57 - ERROR - ci_test]: Traceback (most recent call last):
[2024-04-06T22:02:44.885Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-04-06T22:02:44.885Z] res = func(*args, **_kwargs)
[2024-04-06T22:02:44.885Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-04-06T22:02:44.885Z] return func(*arg, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 317, in flush
[2024-04-06T22:02:44.885Z] conn.flush([self.name], timeout=timeout, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 161, in handler
[2024-04-06T22:02:44.885Z] raise e from e
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 143, in handler
[2024-04-06T22:02:44.885Z] return func(*args, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 182, in handler
[2024-04-06T22:02:44.885Z] return func(self, *args, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 91, in handler
[2024-04-06T22:02:44.885Z] raise e from e
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 87, in handler
[2024-04-06T22:02:44.885Z] return func(*args, **kwargs)
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1396, in flush
[2024-04-06T22:02:44.885Z] response = future.result()
[2024-04-06T22:02:44.885Z] File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 881, in result
[2024-04-06T22:02:44.885Z] raise self
[2024-04-06T22:02:44.885Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
[2024-04-06T22:02:44.885Z] status = StatusCode.DEADLINE_EXCEEDED
[2024-04-06T22:02:44.885Z] details = "Deadline Exceeded"
[2024-04-06T22:02:44.885Z] debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-04-06T21:50:57.007850385+00:00"}"
[2024-04-06T22:02:44.885Z] >
[2024-04-06T22:02:44.885Z] (api_request.py:45)
[2024-04-06T22:02:44.885Z] [2024-04-06 21:50:57 - ERROR - ci_test]: (api_response) : <_MultiThreadedRendezvous of RPC that terminated with:
[2024-04-06T22:02:44.885Z] status = StatusCode.DEADLINE_EXCEEDED
[2024-04-06T22:02:44.885Z] details = "Deadline Exceeded"
[2024-04-06T22:02:44.885Z] debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-04-06T21:50:57.007850385+00:00"}"
[2024-04-06T22:02:44.885Z] > (api_request.py:46)
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/11721/pipeline log: artifacts-rootcoord-pod-kill-11721-server-logs.tar.gz
> Not a deadlock; looks like a network issue. Proxy received Flush at 20:32:54, but DataCoord received it at 20:35:54, already past the 180s flush timeout.

Back to this problem.
Do we have any metrics for the queue size? Maybe flush just cannot catch up and the queue size accumulates in the proxy.
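I'm not sure whether such a metric exists today; as a sketch only (hypothetical metric and queue names, not Milvus's actual proxy code), a queue-size gauge of the kind being asked about could look like this with prometheus_client:

```python
from collections import deque
from prometheus_client import Gauge, start_http_server

# Hypothetical names; this is only an illustration of a queue-size metric.
flush_task_queue = deque()
queue_size = Gauge(
    "proxy_flush_task_queue_size",
    "Number of flush tasks waiting in the proxy task queue",
)

def enqueue_flush_task(task):
    flush_task_queue.append(task)
    queue_size.set(len(flush_task_queue))

def dequeue_flush_task():
    task = flush_task_queue.popleft()
    queue_size.set(len(flush_task_queue))
    return task

if __name__ == "__main__":
    start_http_server(9091)  # expose /metrics so queue accumulation would be visible
```

If the proxy's task queue really were the bottleneck, a gauge like this would climb during the chaos window and stay high until flush catches up.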
File new issues for individual chaos failures; closing this one for now.
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
compact timeout:
flush timeout:
log: artifacts-querynode-pod-kill-10682-server-logs.tar.gz
artifacts-datanode-pod-failure-10690-server-logs.tar.gz
artifacts-datacoord-pod-failure-10766-server-logs.tar.gz
Anything else?
No response