milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: After Milvus recovers from etcd pod failure chaos, a querynode crashed during the verification test #37765

Closed zhuwenxing closed 3 days ago

zhuwenxing commented 4 days ago

Is there an existing issue for this?

Environment

- Milvus version:master-20241116-f7c7ac51-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:14 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ['int64 > 0', ['int64'], None, 180], kwargs: {'partition_name': '_default'} (api_request.py:62)

[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:35 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=503, message=failed to query: node offline[node=16]: channel not available[channel=by-dev-rootcoord-dml_5_453966585423805532v0])>, <Time:{'RPC start': '2024-11-16 08:53:14.099525', 'RPC error': '2024-11-16 08:53:35.115113'}> (decorators.py:140)

[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:35 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-16T08:59:16.807Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-16T08:59:16.807Z]     res = func(*args, **_kwargs)

[2024-11-16T08:59:16.807Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-16T08:59:16.807Z]     return func(*arg, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 1076, in query

[2024-11-16T08:59:16.807Z]     return conn.query(

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler

[2024-11-16T08:59:16.807Z]     raise e from e

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-16T08:59:16.807Z]     return func(*args, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-16T08:59:16.807Z]     return func(self, *args, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler

[2024-11-16T08:59:16.807Z]     raise e from e

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-16T08:59:16.807Z]     return func(*args, **kwargs)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1542, in query

[2024-11-16T08:59:16.807Z]     check_status(response.status)

[2024-11-16T08:59:16.807Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status

[2024-11-16T08:59:16.807Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-11-16T08:59:16.807Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to query: node offline[node=16]: channel not available[channel=by-dev-rootcoord-dml_5_453966585423805532v0])>

[2024-11-16T08:59:16.807Z]  (api_request.py:45)

[2024-11-16T08:59:16.807Z] [2024-11-16 08:53:35 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to query: node offline[node=16]: channel not available[channel=by-dev-rootcoord-dml_5_453966585423805532v0])> (api_request.py:46)

[2024-11-16T08:51:23.677Z] [2024-11-16 08:50:56 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: service internal error: target version mismatch, collection: 453966585423805532, channel: by-dev-rootcoord-dml_6_453966585423805532v1,  current target version: 1731746590603413177, leader version: 0: channel not available[channel=by-dev-rootcoord-dml_6_453966585423805532v1])>, <Time:{'RPC start': '2024-11-16 08:50:35.438513', 'RPC error': '2024-11-16 08:50:56.459598'}> (decorators.py:140)

[2024-11-16T08:51:23.677Z] [2024-11-16 08:50:56 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-16T08:51:23.677Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-16T08:51:23.677Z]     res = func(*args, **_kwargs)

[2024-11-16T08:51:23.677Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-16T08:51:23.677Z]     return func(*arg, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 801, in search

[2024-11-16T08:51:23.677Z]     resp = conn.search(

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler

[2024-11-16T08:51:23.677Z]     raise e from e

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-16T08:51:23.677Z]     return func(*args, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-16T08:51:23.677Z]     return func(self, *args, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler

[2024-11-16T08:51:23.677Z]     raise e from e

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-16T08:51:23.677Z]     return func(*args, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 806, in search

[2024-11-16T08:51:23.677Z]     return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 747, in _execute_search

[2024-11-16T08:51:23.677Z]     raise e from e

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 736, in _execute_search

[2024-11-16T08:51:23.677Z]     check_status(response.status)

[2024-11-16T08:51:23.677Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status

[2024-11-16T08:51:23.677Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-11-16T08:51:23.677Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: service internal error: target version mismatch, collection: 453966585423805532, channel: by-dev-rootcoord-dml_6_453966585423805532v1,  current target version: 1731746590603413177, leader version: 0: channel not available[channel=by-dev-rootcoord-dml_6_453966585423805532v1])>

[2024-11-16T08:51:23.677Z]  (api_request.py:45)
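Both failures above are code=503 responses ending in `channel not available`, but the stated causes differ: `node offline` for the query and `target version mismatch` for the search. A minimal sketch for pulling those fields out of such log lines, assuming the message format shown above; `summarize_error` and both regexes are hypothetical helpers written for this report, not part of pymilvus:

```python
import re
from typing import Dict, Optional

# Captures the status code and the full message of a MilvusException repr.
ERROR_RE = re.compile(r"code=(?P<code>\d+), message=(?P<message>.*?)\)>")
# Captures the channel name from the trailing "channel not available[...]" part.
CHANNEL_RE = re.compile(r"channel not available\[channel=(?P<channel>[^\]]+)\]")
# Causes seen in this report; anything else is reported as "other".
KNOWN_REASONS = ("node offline", "target version mismatch")


def summarize_error(line: str) -> Optional[Dict[str, str]]:
    """Return code, reason and channel for a log line, or None if no error is found."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    message = match.group("message")
    channel = CHANNEL_RE.search(message)
    reason = next((r for r in KNOWN_REASONS if r in message), "other")
    return {
        "code": match.group("code"),
        "reason": reason,
        "channel": channel.group("channel") if channel else "",
    }


query_error = (
    "<MilvusException: (code=503, message=failed to query: node offline[node=16]: "
    "channel not available[channel=by-dev-rootcoord-dml_5_453966585423805532v0])>"
)
print(summarize_error(query_error))
```

Grouping the CI log by `reason` and `channel` in this way makes it easier to see which channels were unavailable and for which cause.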

pod info at 2024-11-16T08:49:59.699Z before verification


[2024-11-16T08:49:59.698Z] + kubectl get pods -o wide

[2024-11-16T08:49:59.699Z] + grep etcd-pod-failure-18570

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-0                                          1/1     Running            2 (8m47s ago)      32m     10.104.24.80    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-1                                          1/1     Running            2 (8m47s ago)      32m     10.104.15.206   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-2                                          1/1     Running            2 (8m47s ago)      32m     10.104.16.128   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-0                                    2/2     Running            1 (31m ago)        32m     10.104.15.207   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-1                                    2/2     Running            1 (31m ago)        32m     10.104.24.83    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-2                                    2/2     Running            1 (31m ago)        32m     10.104.16.131   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-kafka-exporter-65d959849b-f4g8c            1/1     Running            4 (31m ago)        32m     10.104.15.194   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-datanode-7986f669c-659qd            1/1     Running            8 (10m ago)        32m     10.104.13.193   4am-node16   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-datanode-7986f669c-dknzg            1/1     Running            8 (10m ago)        32m     10.104.20.205   4am-node22   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-pzspb          1/1     Running            8 (10m ago)        32m     10.104.15.195   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-vl22l          1/1     Running            8 (11m ago)        32m     10.104.30.67    4am-node38   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-zl2q6          1/1     Running            8 (11m ago)        32m     10.104.24.76    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-mixcoord-666bfbdf6b-mmw59           1/1     Running            8 (11m ago)        32m     10.104.9.69     4am-node14   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-proxy-868d8d5966-mkqcm              1/1     Running            8 (10m ago)        32m     10.104.20.204   4am-node22   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t           1/1     Running            8 (11m ago)        32m     10.104.30.68    4am-node38   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv           0/1     CrashLoopBackOff   8 (4m39s ago)      32m     10.104.19.35    4am-node28   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-xxxqw           1/1     Running            8 (11m ago)        32m     10.104.9.70     4am-node14   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-0                                    1/1     Running            0                  32m     10.104.24.81    4am-node29   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-1                                    1/1     Running            0                  32m     10.104.34.243   4am-node37   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-2                                    1/1     Running            0                  32m     10.104.15.208   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-minio-3                                    1/1     Running            0                  32m     10.104.16.132   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-zookeeper-0                                1/1     Running            0                  32m     10.104.15.203   4am-node20   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-zookeeper-1                                1/1     Running            0                  32m     10.104.16.124   4am-node21   <none>           <none>

[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-zookeeper-2                                1/1     Running            0                  32m     10.104.32.237   4am-node39   <none>           <none>

pod info after verification

+ kubectl get pods -o wide
 + grep etcd-pod-failure-18570
 etcd-pod-failure-18570-0                                          1/1     Running            2 (18m ago)        41m     10.104.24.80    4am-node29   <none>           <none>
 etcd-pod-failure-18570-1                                          1/1     Running            2 (18m ago)        41m     10.104.15.206   4am-node20   <none>           <none>
 etcd-pod-failure-18570-2                                          1/1     Running            2 (18m ago)        41m     10.104.16.128   4am-node21   <none>           <none>
 etcd-pod-failure-18570-kafka-0                                    2/2     Running            1 (41m ago)        41m     10.104.15.207   4am-node20   <none>           <none>
 etcd-pod-failure-18570-kafka-1                                    2/2     Running            1 (41m ago)        41m     10.104.24.83    4am-node29   <none>           <none>
 etcd-pod-failure-18570-kafka-2                                    2/2     Running            1 (41m ago)        41m     10.104.16.131   4am-node21   <none>           <none>
 etcd-pod-failure-18570-kafka-exporter-65d959849b-f4g8c            1/1     Running            4 (41m ago)        41m     10.104.15.194   4am-node20   <none>           <none>
 etcd-pod-failure-18570-milvus-datanode-7986f669c-659qd            1/1     Running            8 (20m ago)        41m     10.104.13.193   4am-node16   <none>           <none>
 etcd-pod-failure-18570-milvus-datanode-7986f669c-dknzg            1/1     Running            8 (20m ago)        41m     10.104.20.205   4am-node22   <none>           <none>
 etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-pzspb          1/1     Running            8 (20m ago)        41m     10.104.15.195   4am-node20   <none>           <none>
 etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-vl22l          1/1     Running            8 (20m ago)        41m     10.104.30.67    4am-node38   <none>           <none>
 etcd-pod-failure-18570-milvus-indexnode-68fbf986f8-zl2q6          1/1     Running            8 (20m ago)        41m     10.104.24.76    4am-node29   <none>           <none>
 etcd-pod-failure-18570-milvus-mixcoord-666bfbdf6b-mmw59           1/1     Running            8 (20m ago)        41m     10.104.9.69     4am-node14   <none>           <none>
 etcd-pod-failure-18570-milvus-proxy-868d8d5966-mkqcm              1/1     Running            8 (20m ago)        41m     10.104.20.204   4am-node22   <none>           <none>
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t           1/1     Running            8 (20m ago)        41m     10.104.30.68    4am-node38   <none>           <none>
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv           1/1     Running            9 (14m ago)        41m     10.104.19.35    4am-node28   <none>           <none>
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-xxxqw           1/1     Running            8 (20m ago)        41m     10.104.9.70     4am-node14   <none>           <none>
 etcd-pod-failure-18570-minio-0                                    1/1     Running            0                  41m     10.104.24.81    4am-node29   <none>           <none>
 etcd-pod-failure-18570-minio-1                                    1/1     Running            0                  41m     10.104.34.243   4am-node37   <none>           <none>
 etcd-pod-failure-18570-minio-2                                    1/1     Running            0                  41m     10.104.15.208   4am-node20   <none>           <none>
 etcd-pod-failure-18570-minio-3                                    1/1     Running            0                  41m     10.104.16.132   4am-node21   <none>           <none>
 etcd-pod-failure-18570-zookeeper-0                                1/1     Running            0                  41m     10.104.15.203   4am-node20   <none>           <none>
 etcd-pod-failure-18570-zookeeper-1                                1/1     Running            0                  41m     10.104.16.124   4am-node21   <none>           <none>
 etcd-pod-failure-18570-zookeeper-2                                1/1     Running            0                  41m     10.104.32.237   4am-node39   <none>           <none>

Two issues exist with the querynode here:

  1. Why did one querynode stay in CrashLoopBackOff instead of returning to normal after the etcd pod failure chaos was eliminated?
  2. Why did its restart count increase by one (from 8 to 9) after the verification test, and what caused this restart?
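The restart-count comparison behind the second question can be automated. The following is a rough sketch, assuming the `kubectl get pods -o wide` column layout shown in the listings above (optionally prefixed by the Jenkins console timestamp); `parse_restarts` and `restart_increases` are hypothetical helper names, and the sample rows are abbreviated from the two listings:

```python
import re
from typing import Dict

# NAME, READY, STATUS and RESTARTS columns, with an optional "[timestamp]" prefix.
POD_LINE = re.compile(
    r"^\s*(?:\[[^\]]+\]\s+)?(?P<name>\S+)\s+(?P<ready>\d+/\d+)\s+"
    r"(?P<status>\S+)\s+(?P<restarts>\d+)\b"
)


def parse_restarts(output: str) -> Dict[str, int]:
    """Map pod name to restart count for every pod row in the output."""
    restarts = {}
    for line in output.splitlines():
        match = POD_LINE.match(line)
        if match:  # command echo lines such as "+ kubectl get pods" do not match
            restarts[match.group("name")] = int(match.group("restarts"))
    return restarts


def restart_increases(before: str, after: str) -> Dict[str, int]:
    """Return pods whose restart count grew between the two listings."""
    old, new = parse_restarts(before), parse_restarts(after)
    return {
        name: new[name] - old[name]
        for name in new
        if name in old and new[name] > old[name]
    }


before = """\
[2024-11-16T08:49:59.698Z] + kubectl get pods -o wide
[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t  1/1  Running  8 (11m ago)  32m
[2024-11-16T08:49:59.955Z] etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv  0/1  CrashLoopBackOff  8 (4m39s ago)  32m
"""
after = """\
+ kubectl get pods -o wide
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-9vv7t  1/1  Running  8 (20m ago)  41m
 etcd-pod-failure-18570-milvus-querynode-586f96ddd-ltbrv  1/1  Running  9 (14m ago)  41m
"""
print(restart_increases(before, after))
```

Running the same comparison over the full listings would flag any pod that restarted during verification, not just the querynode noticed by eye.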

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/18570/pipeline

log: artifacts-etcd-pod-failure-18570-server-logs.tar.gz

Anything else?

No response

yanliang567 commented 4 days ago

@liliu-z please help take a look. This is the first time we have seen this error: service internal error: target version mismatch

/unassign

congqixia commented 4 days ago

This error was introduced by a recent PR; see https://github.com/milvus-io/milvus/blob/12ed40e125338169f3fbcb3a1df38e8934f07393/internal/querycoordv2/dist/dist_handler.go#L249-L262 for the relevant code. I will check with the author offline.

liliu-z commented 4 days ago

/assign @congqixia

weiliu1031 commented 4 days ago

This should be fixed by https://github.com/milvus-io/milvus/pull/37748

weiliu1031 commented 4 days ago

/assign

weiliu1031 commented 4 days ago

/assign @zhuwenxing

zhuwenxing commented 3 days ago

Not reproduced in master-20241118-f2a2fd68-amd64