milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: All test cases failed after querynode pod kill chaos test #23676

Status: Closed (zhuwenxing closed this issue 1 year ago)

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230424-72485c9e
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

[2023-04-24T18:56:48.079Z] [2023-04-24 18:56:47 - INFO - ci_test]: assert insert: 0.6714577674865723 (test_data_persistence.py:36)

[2023-04-24T18:56:51.335Z] [2023-04-24 18:56:50 - INFO - ci_test]: [test][2023-04-24T18:56:47Z] [3.01902166s] Hello_Milvus flush -> None (wrapper.py:30)

[2023-04-24T18:56:51.335Z] [2023-04-24 18:56:50 - INFO - ci_test]: [test][2023-04-24T18:56:50Z] [0.00494641s] Hello_Milvus flush -> None (wrapper.py:30)

[2023-04-24T18:56:51.335Z] [2023-04-24 18:56:50 - INFO - ci_test]: [test][2023-04-24T18:56:50Z] [0.00419937s] Hello_Milvus flush -> None (wrapper.py:30)

[2023-04-24T18:56:51.335Z] [2023-04-24 18:56:50 - INFO - ci_test]: assert flush: 3.0265939235687256, entities: 9000 (test_data_persistence.py:46)

[2023-04-24T18:56:51.335Z] [2023-04-24 18:56:50 - INFO - ci_test]: index info: [{'collection': 'Hello_Milvus', 'field': 'float_vector', 'index_name': 'test_DtlGWy46', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Hello_Milvus', 'field': 'varchar', 'index_name': 'test_0KWCcPWj', 'index_param': {'index_type': 'Trie'}}] (test_data_persistence.py:65)

[2023-04-24T18:56:51.335Z] [2023-04-24 18:56:50 - INFO - ci_test]: [test][2023-04-24T18:56:50Z] [0.00577945s] Hello_Milvus load -> None (wrapper.py:30)

[2023-04-24T18:57:13.218Z] [2023-04-24 18:57:10 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-04-24 18:56:50.773422', 'RPC error': '2023-04-24 18:57:10.778102'}> (decorators.py:108)

[2023-04-24T18:57:13.218Z] [2023-04-24 18:57:10 - ERROR - ci_test]: Traceback (most recent call last):

[2023-04-24T18:57:13.218Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-04-24T18:57:13.218Z]     res = func(*args, **_kwargs)

[2023-04-24T18:57:13.218Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-04-24T18:57:13.218Z]     return func(*arg, **kwargs)

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search

[2023-04-24T18:57:13.218Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-04-24T18:57:13.218Z]     raise e

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-04-24T18:57:13.218Z]     return func(*args, **kwargs)

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-04-24T18:57:13.218Z]     ret = func(self, *args, **kwargs)

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-04-24T18:57:13.218Z]     raise e

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-04-24T18:57:13.218Z]     return func(self, *args, **kwargs)

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search

[2023-04-24T18:57:13.218Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests

[2023-04-24T18:57:13.218Z]     raise pre_err

[2023-04-24T18:57:13.218Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests

[2023-04-24T18:57:13.218Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-04-24T18:57:13.219Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: context done during sleep after run#6: context deadline exceeded)>

[2023-04-24T18:57:13.219Z]  (api_request.py:39)

[2023-04-24T18:57:13.219Z] [2023-04-24 18:57:10 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441021067389381699v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-roo...... (api_request.py:40)

[2023-04-24T18:56:51.065Z] [2023-04-24 18:56:50 - INFO - ci_test]: [test][2023-04-24T18:56:50Z] [0.00246628s] e2e__UluO9hhx flush -> None (wrapper.py:30)

[2023-04-24T18:56:51.065Z] [2023-04-24 18:56:50 - INFO - ci_test]: assert flush: 3.0251872539520264, entities: 3000 (test_e2e.py:41)

[2023-04-24T18:56:55.218Z] [2023-04-24 18:56:54 - INFO - ci_test]: [test][2023-04-24T18:56:50Z] [3.51910746s] e2e__UluO9hhx create_index -> Status(code=0, message=) (wrapper.py:30)

[2023-04-24T18:56:56.576Z] [2023-04-24 18:56:56 - INFO - ci_test]: [test][2023-04-24T18:56:54Z] [2.01534920s] e2e__UluO9hhx create_index -> Status(code=0, message=) (wrapper.py:30)

[2023-04-24T18:56:56.576Z] [2023-04-24 18:56:56 - INFO - ci_test]: assert index: 5.534977436065674 (test_e2e.py:53)

[2023-04-24T18:56:56.576Z] [2023-04-24 18:56:56 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=1, message=failed to load collection, err=failed to spawn replica for collection[nodes not enough])>, <Time:{'RPC start': '2023-04-24 18:56:56.348089', 'RPC error': '2023-04-24 18:56:56.353348'}> (decorators.py:108)

[2023-04-24T18:56:56.576Z] [2023-04-24 18:56:56 - ERROR - ci_test]: Traceback (most recent call last):

[2023-04-24T18:56:56.576Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-04-24T18:56:56.576Z]     res = func(*args, **_kwargs)

[2023-04-24T18:56:56.576Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-04-24T18:56:56.576Z]     return func(*arg, **kwargs)

[2023-04-24T18:56:56.576Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 366, in load

[2023-04-24T18:56:56.576Z]     conn.load_collection(self._name, replica_number=replica_number, timeout=timeout, **kwargs)

[2023-04-24T18:56:56.576Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-04-24T18:56:56.576Z]     raise e

[2023-04-24T18:56:56.576Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-04-24T18:56:56.576Z]     return func(*args, **kwargs)

[2023-04-24T18:56:56.576Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-04-24T18:56:56.576Z]     ret = func(self, *args, **kwargs)

[2023-04-24T18:56:56.576Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-04-24T18:56:56.576Z]     raise e

[2023-04-24T18:56:56.576Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-04-24T18:56:56.576Z]     return func(self, *args, **kwargs)

[2023-04-24T18:56:56.576Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 707, in load_collection

[2023-04-24T18:56:56.576Z]     raise MilvusException(response.error_code, response.reason)

[2023-04-24T18:56:56.576Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=failed to load collection, err=failed to spawn replica for collection[nodes not enough])>

[2023-04-24T18:56:56.577Z]  (api_request.py:39)

[2023-04-24T18:56:56.577Z] [2023-04-24 18:56:56 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=failed to load collection, err=failed to spawn replica for collection[nodes not enough])> (api_request.py:40)

[2023-04-24T18:56:56.831Z] FAILED

[2023-04-24T19:01:06.814Z] [2023-04-24 19:00:45 - INFO - ci_test]: [test][2023-04-24T19:00:45Z] [0.00246655s] QueryChecker__olX4X5C5 flush -> None (wrapper.py:30)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:00:45 - INFO - ci_test]: assert flush: 2.5302891731262207, entities: 6000 (test_all_collections_after_chaos.py:58)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:00:45 - INFO - ci_test]: index info: [{'collection': 'QueryChecker__olX4X5C5', 'field': 'float_vector', 'index_name': 'index__T28sZZmh', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:74)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:00:45 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:00:45 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:00:45 - INFO - ci_test]: [test][2023-04-24T19:00:45Z] [0.00486022s] QueryChecker__olX4X5C5 load -> None (wrapper.py:30)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:00:45 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.04917804243695237, 0.04898487763966803, 0.0064336378497726054, 0.048405798139796195, 0.034247030305714056, 0.039842775309311525, 0.10949723212953831, 0.05472925712249067, 0.022514680074013766, 0.00511127358036623, 0.033769396831538845, 0.12624415975344813, 0.08836803470656893, 0.158031363082229......, kwargs: {} (api_request.py:56)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:01:05 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_24_441021067389382774v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_24_441021067389382774v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-04-24 19:00:45.412316', 'RPC error': '2023-04-24 19:01:05.415832'}> (decorators.py:108)

[2023-04-24T19:01:06.814Z] [2023-04-24 19:01:05 - ERROR - ci_test]: Traceback (most recent call last):

[2023-04-24T19:01:06.814Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-04-24T19:01:06.814Z]     res = func(*args, **_kwargs)

[2023-04-24T19:01:06.814Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-04-24T19:01:06.814Z]     return func(*arg, **kwargs)

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search

[2023-04-24T19:01:06.814Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-04-24T19:01:06.814Z]     raise e

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-04-24T19:01:06.814Z]     return func(*args, **kwargs)

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-04-24T19:01:06.814Z]     ret = func(self, *args, **kwargs)

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-04-24T19:01:06.814Z]     raise e

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-04-24T19:01:06.814Z]     return func(self, *args, **kwargs)

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search

[2023-04-24T19:01:06.814Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests

[2023-04-24T19:01:06.814Z]     raise pre_err

[2023-04-24T19:01:06.814Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests

[2023-04-24T19:01:06.814Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-04-24T19:01:06.815Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_24_441021067389382774v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_24_441021067389382774v0 is not available in any replica, err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, err=<nil>: context done during sleep after run#6: context deadline exceeded)>

[2023-04-24T19:01:06.815Z]  (api_request.py:39)

[2023-04-24T19:01:06.815Z] [2023-04-24 19:01:05 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_24_441021067389382774v0 is not available in any replica, err=<nil>: attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-ro...... (api_request.py:40)
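The search errors above pack seven retry attempts into one message, which makes it hard to see which channel each attempt failed on. In the last failure, for example, attempts #0 and #1 name `by-dev-rootcoord-dml_24_441021067389382774v0` while attempts #2 through #6 name `by-dev-rootcoord-dml_25_441021067389382774v1`. A minimal sketch for grouping attempts by channel follows. The helper name `channels_by_attempt` is hypothetical, and the regular expression assumes the exact wording seen in these logs, so it may need adjusting for other Milvus versions.

```python
import re

# Matches one retry entry as it is worded in the log lines above (assumed format).
ATTEMPT_RE = re.compile(
    r"attempt #(\d+): fail to get shard leaders from QueryCoord: "
    r"channel (\S+) is not available in any replica"
)


def channels_by_attempt(message):
    """Map each unavailable channel to the list of attempt numbers that hit it."""
    result = {}
    for attempt, channel in ATTEMPT_RE.findall(message):
        result.setdefault(channel, []).append(int(attempt))
    return result


# Attempts #1 and #2 copied from the last search failure above.
sample = (
    "attempt #1: fail to get shard leaders from QueryCoord: channel "
    "by-dev-rootcoord-dml_24_441021067389382774v0 is not available in any replica, "
    "err=<nil>: attempt #2: fail to get shard leaders from QueryCoord: channel "
    "by-dev-rootcoord-dml_25_441021067389382774v1 is not available in any replica, "
    "err=<nil>"
)

for channel, attempts in channels_by_attempt(sample).items():
    print(channel, attempts)
```

Running the helper on the full message from the pytest log would show whether every channel of a collection stayed unavailable for the whole 20-second RPC window or only some of them.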

Expected Behavior

All test cases pass after the querynode pod kill chaos test.

Steps To Reproduce

No response

Milvus Log

Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/3621/pipeline/288

Logs:

artifacts-querynode-pod-kill-3621-server-logs.tar.gz

artifacts-querynode-pod-kill-3621-pytest-logs.tar.gz

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

Setting a higher priority, since this involves the querynode.

zhuwenxing commented 1 year ago

It also reproduced when Kafka is the MQ.

Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/3621/pipeline

Logs:

artifacts-querynode-pod-kill-3621-server-logs (1).tar.gz

artifacts-querynode-pod-kill-3621-pytest-logs (1).tar.gz

jiaoew1991 commented 1 year ago

/assign @sunby
/unassign

sunby commented 1 year ago

@zhuwenxing PR https://github.com/milvus-io/milvus/pull/23634 fixed this issue.

sunby commented 1 year ago

/assign @zhuwenxing

zhuwenxing commented 1 year ago

Not reproduced in master-20230427-45fbe1d1.
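When re-checking a build after a pod kill, it can help to tell a slow recovery apart from the permanent unavailability reported in this issue, where each search RPC ran for about 20 seconds before hitting its deadline. The sketch below times how long an operation takes to start succeeding. It is an illustration only: `seconds_until_success` and its parameters are hypothetical and not part of the Milvus test suite, and it catches every exception, whereas a real check would likely narrow that to `pymilvus.exceptions.MilvusException`.

```python
import time
from typing import Callable, Optional


def seconds_until_success(
    operation: Callable[[], object],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Call `operation` until it stops raising.

    Returns the elapsed seconds at the first success, or None if the next
    retry would start after `timeout` seconds have passed.
    """
    start = time.monotonic()
    while True:
        try:
            operation()
            return time.monotonic() - start
        except Exception:
            # Assumption: any exception means "not recovered yet".
            if time.monotonic() - start + interval > timeout:
                return None
            sleep(interval)


# Stand-in for a search call that fails twice and then recovers.
calls = {"count": 0}


def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("channel is not available in any replica")
    return "ok"


elapsed = seconds_until_success(flaky_search, sleep=lambda _seconds: None)
print("recovered" if elapsed is not None else "still failing")
```

In a chaos test, `operation` would wrap the `Collection.search` call that failed in the tracebacks above. A `None` result after a generous timeout points to the kind of failure this issue describes rather than to a delayed recovery.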