milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.29k stars 2.91k forks source link

[Bug]: Under read-write mixed workload after chaos, RPC timeout issues may occur with search/query/hybrid search requests #37231

Open zhuwenxing opened 1 week ago

zhuwenxing commented 1 week ago

Is there an existing issue for this?

Environment

- Milvus version:2.4-20241028-9c0f5948-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-10-28T15:17:27.144Z] [2024-10-28 15:16:44 - ERROR - pymilvus.decorators]: grpc RpcError: [query], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-10-28 15:16:34.090216', 'gRPC error': '2024-10-28 15:16:44.092060'}> (decorators.py:158)

[2024-10-28T15:17:27.144Z] [2024-10-28 15:16:44 - ERROR - ci_test]: Traceback (most recent call last):

[2024-10-28T15:17:27.144Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-10-28T15:17:27.144Z]     res = func(*args, **_kwargs)

[2024-10-28T15:17:27.144Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-10-28T15:17:27.144Z]     return func(*arg, **kwargs)

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 1076, in query

[2024-10-28T15:17:27.144Z]     return conn.query(

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 162, in handler

[2024-10-28T15:17:27.144Z]     raise e from e

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 144, in handler

[2024-10-28T15:17:27.144Z]     return func(*args, **kwargs)

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 183, in handler

[2024-10-28T15:17:27.144Z]     return func(self, *args, **kwargs)

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 91, in handler

[2024-10-28T15:17:27.144Z]     raise e from e

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 87, in handler

[2024-10-28T15:17:27.144Z]     return func(*args, **kwargs)

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1533, in query

[2024-10-28T15:17:27.144Z]     response = future.result()

[2024-10-28T15:17:27.144Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 881, in result

[2024-10-28T15:17:27.144Z]     raise self

[2024-10-28T15:17:27.144Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:

[2024-10-28T15:17:27.144Z]  status = StatusCode.DEADLINE_EXCEEDED

[2024-10-28T15:17:27.144Z]  details = "Deadline Exceeded"

[2024-10-28T15:17:27.144Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-10-28T15:16:44.091564992+00:00"}"

[2024-10-28T15:17:27.144Z] >

[2024-10-28T15:17:27.144Z]  (api_request.py:45)

[2024-10-28T15:17:27.144Z] [2024-10-28 15:16:44 - ERROR - ci_test]: (api_response) : <_MultiThreadedRendezvous of RPC that terminated with:

[2024-10-28T15:17:27.144Z]  status = StatusCode.DEADLINE_EXCEEDED

[2024-10-28T15:17:27.144Z]  details = "Deadline Exceeded"

[2024-10-28T15:17:27.144Z]  debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-10-28T15:16:44.091564992+00:00"}"

[2024-10-28T15:17:27.144Z] > (api_request.py:46)

[2024-10-28T15:12:28.587Z] =========================== short test summary info ============================

[2024-10-28T15:12:28.587Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[None-1-2] - pytest_assume.plugin.FailedAssumption: 

[2024-10-28T15:12:28.587Z] 3 Failed Assumptions:

[2024-10-28T15:12:28.587Z] 

[2024-10-28T15:12:28.587Z] chaos_commons.py:124: AssumptionFailure

[2024-10-28T15:12:28.587Z] >>   pytest.assume(

[2024-10-28T15:12:28.587Z] AssertionError: Expect Succ: Op.search succ rate 0.35714285714285715, total: 42, average time: 2.5670

[2024-10-28T15:12:28.587Z] assert False

[2024-10-28T15:12:28.587Z] 

[2024-10-28T15:12:28.587Z] chaos_commons.py:124: AssumptionFailure

[2024-10-28T15:12:28.587Z] >>   pytest.assume(

[2024-10-28T15:12:28.588Z] AssertionError: Expect Succ: Op.hybrid_search succ rate 0.30952380952380953, total: 42, average time: 1.4015

[2024-10-28T15:12:28.588Z] assert False

[2024-10-28T15:12:28.588Z] 

[2024-10-28T15:12:28.588Z] chaos_commons.py:124: AssumptionFailure

[2024-10-28T15:12:28.588Z] >>   pytest.assume(

[2024-10-28T15:12:28.588Z] AssertionError: Expect Succ: Op.query succ rate 0.5272727272727272, total: 55, average time: 1.3124

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

proxy pod failure chaos test failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/17669/pipeline

log: artifacts-proxy-pod-failure-17669-server-logs.tar.gz

cluster: 4am ns: chaos-testing pod info

[2024-10-28T15:00:40.667Z] + grep proxy-pod-failure-17669

[2024-10-28T15:00:40.669Z] + kubectl get pods -o wide

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-etcd-0                                    1/1     Running       0                32m     10.104.34.144   4am-node37   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-etcd-1                                    1/1     Running       0                32m     10.104.20.126   4am-node22   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-etcd-2                                    1/1     Running       0                32m     10.104.15.115   4am-node20   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-kafka-0                                   2/2     Running       1 (31m ago)      32m     10.104.34.145   4am-node37   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-kafka-1                                   2/2     Running       0                32m     10.104.20.128   4am-node22   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-kafka-2                                   2/2     Running       0                32m     10.104.18.185   4am-node25   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-kafka-exporter-559768fcc4-f7jxz           1/1     Running       4 (31m ago)      32m     10.104.14.135   4am-node18   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-datanode-5787dfd786-7c2sm          1/1     Running       2 (32m ago)      32m     10.104.13.39    4am-node16   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-datanode-5787dfd786-ljd8c          1/1     Running       2 (32m ago)      32m     10.104.15.97    4am-node20   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-indexnode-5dd78f89fc-4b9qc         1/1     Running       2 (32m ago)      32m     10.104.14.136   4am-node18   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-indexnode-5dd78f89fc-6gpnc         1/1     Running       2 (32m ago)      32m     10.104.18.177   4am-node25   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-indexnode-5dd78f89fc-j5rlb         1/1     Running       2 (32m ago)      32m     10.104.9.99     4am-node14   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-mixcoord-59b859784f-mn2nx          1/1     Running       2 (32m ago)      32m     10.104.13.38    4am-node16   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-proxy-899c545fc-j7fng              1/1     Running       5 (8m12s ago)    32m     10.104.24.49    4am-node29   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-querynode-674b869695-b8hgl         1/1     Running       2 (32m ago)      32m     10.104.24.48    4am-node29   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-querynode-674b869695-cgqgs         1/1     Running       2 (32m ago)      32m     10.104.16.251   4am-node21   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-milvus-querynode-674b869695-d54mz         1/1     Running       2 (32m ago)      32m     10.104.23.192   4am-node27   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-minio-0                                   1/1     Running       0                32m     10.104.34.143   4am-node37   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-minio-1                                   1/1     Running       0                32m     10.104.20.125   4am-node22   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-minio-2                                   1/1     Running       0                32m     10.104.15.105   4am-node20   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-minio-3                                   1/1     Running       0                32m     10.104.30.249   4am-node38   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-zookeeper-0                               1/1     Running       0                32m     10.104.20.127   4am-node22   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-zookeeper-1                               1/1     Running       0                32m     10.104.34.146   4am-node37   <none>           <none>

[2024-10-28T15:00:40.925Z] proxy-pod-failure-17669-zookeeper-2                               1/1     Running       0                32m     10.104.18.186   4am-node25   <none>           <none>

indexnode pod failure chaos test failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/17668/pipeline log: artifacts-indexnode-pod-failure-17668-server-logs.tar.gz

Anything else?

No response

zhuwenxing commented 1 week ago

image

The latency of query/search/hybrid search has increased significantly after chaos elimination.

yanliang567 commented 1 week ago

/assign @weiliu1031 /unassign