milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)` without any chaos #21099

Closed: zhuwenxing closed this issue 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20221208-d67e878f
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

```
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:29 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:32 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:32 - INFO - ci_test]: [test][2022-12-08T19:02:29Z] [2.52587279s] Hello_Milvus flush -> None (wrapper.py:30)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:32 - INFO - ci_test]: assert entities: 6000 (test_data_persistence.py:83)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:32 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:36 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:36 - INFO - ci_test]: [test][2022-12-08T19:02:32Z] [4.02993328s] Hello_Milvus load -> None (wrapper.py:30)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:36 - INFO - ci_test]: assert load: 4.030129909515381 (test_data_persistence.py:89)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:36 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.12703210446919408, 0.013564376088692334, 0.1014896811272974, 0.02225330822741768, 0.00897035230558303, 0.13416463036071577, 0.11956470592252191, 0.12026555915426952, 0.11612634560112788, 0.04845878313950011, 0.04440896192319051, 0.009075077715064822, 0.08931770892532133, 0.11401688427444076, 0......., kwargs: {} (api_request.py:56)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:46 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2022-12-08T19:02:46.658Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_437918346556149146v1 is not available in any replica, err=LackSegment(segmentID=437918346556149196)
[2022-12-08T19:02:46.658Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #8:context deadline exceeded
[2022-12-08T19:02:46.658Z] )>, <Time:{'RPC start': '2022-12-08 19:02:36.327997', 'RPC error': '2022-12-08 19:02:46.330702'}> (decorators.py:108)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:46 - ERROR - ci_test]: Traceback (most recent call last):
[2022-12-08T19:02:46.658Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2022-12-08T19:02:46.658Z]     res = func(*args, **_kwargs)
[2022-12-08T19:02:46.658Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2022-12-08T19:02:46.658Z]     return func(*arg, **kwargs)
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 610, in search
[2022-12-08T19:02:46.658Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2022-12-08T19:02:46.658Z]     raise e
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2022-12-08T19:02:46.658Z]     return func(*args, **kwargs)
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2022-12-08T19:02:46.658Z]     ret = func(self, *args, **kwargs)
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2022-12-08T19:02:46.658Z]     raise e
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2022-12-08T19:02:46.658Z]     return func(self, *args, **kwargs)
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 469, in search
[2022-12-08T19:02:46.658Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 438, in _execute_search_requests
[2022-12-08T19:02:46.658Z]     raise pre_err
[2022-12-08T19:02:46.658Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 429, in _execute_search_requests
[2022-12-08T19:02:46.658Z]     raise MilvusException(response.status.error_code, response.status.reason)
[2022-12-08T19:02:46.658Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2022-12-08T19:02:46.658Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_437918346556149146v1 is not available in any replica, err=LackSegment(segmentID=437918346556149196)
[2022-12-08T19:02:46.658Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #8:context deadline exceeded
[2022-12-08T19:02:46.658Z] )>
[2022-12-08T19:02:46.658Z]  (api_request.py:39)
[2022-12-08T19:02:46.658Z] [2022-12-08 19:02:46 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2022-12-08T19:02:46.658Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_437918346556149146v0 is not available in any replica, err=LackSegment(segmentID=437918346556149197)
[2022-12-08T19:02:46.658Z] attempt #2:fail t...... (api_request.py:40)
```
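For context, the failing test drives a plain flush → load → search sequence through pymilvus, matching the `Collection.flush` / `Collection.load` / `Collection.search` calls logged above. A minimal sketch of that sequence, assuming a local endpoint, a `float_vector` field, and dim=128 (none of which are confirmed by the log):

```python
import random

from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # assumed endpoint

collection = Collection("Hello_Milvus")  # collection name taken from the log

# Mirrors the logged calls: flush with a 120 s timeout, then load 1 replica.
collection.flush(timeout=120)
collection.load(partition_names=None, replica_number=1, timeout=120)

# The search that fails above; field name, metric, and dim are assumptions.
vectors = [[random.random() for _ in range(128)]]
res = collection.search(
    data=vectors,
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
    timeout=10,  # the logged RPC gives up after roughly 10 s of retries
)
print(res[0].ids)
```

The `<Time:...>` stamps in the log ('RPC start' 19:02:36.33, 'RPC error' 19:02:46.33) show the proxy retried fetching shard leaders for the full 10-second window before surfacing the error.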

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

- chaos type: pod-failure
- image tag: master-20221208-d67e878f
- target pod: datacoord
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/355/pipeline

log:

- artifacts-datacoord-pod-failure-355-server-logs.tar.gz
- artifacts-datacoord-pod-failure-355-pytest-logs.tar.gz
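The `LackSegment(segmentID=...)` part of the error means QueryCoord believes the named sealed segment is not served by any replica of that DML channel, so it cannot return a shard leader. One way to inspect what the query nodes are actually serving while triaging is pymilvus's utility helpers; a small sketch, assuming the same cluster and collection as the test:

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Load progress as reported by the server; a value stuck below 100% after
# the chaos window is consistent with the LackSegment error above.
print(utility.loading_progress("Hello_Milvus"))

# Per-segment view of what the query nodes currently serve; these segment
# IDs can be compared against the ones named in the error message.
for seg in utility.get_query_segment_info("Hello_Milvus"):
    print(seg)
```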

Anything else?

No response

zhuwenxing commented 1 year ago

- chaos type: pod-kill
- image tag: master-20221208-d67e878f
- target pod: indexcoord
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/346/pipeline

log:

- artifacts-indexcoord-pod-kill-346-server-logs.tar.gz
- artifacts-indexcoord-pod-kill-346-pytest-logs.tar.gz
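For readers unfamiliar with the chaos setup: pod-kill deletes the target pod and lets Kubernetes recreate it. A rough approximation with the official kubernetes Python client (the pod name and namespace below are placeholders, not values from this job):

```python
from kubernetes import client, config

config.load_kube_config()  # assumes kubeconfig access to the test cluster

v1 = client.CoreV1Api()

# Placeholder name/namespace: the chaos job targets the indexcoord pod of
# the Milvus deployment under test.
v1.delete_namespaced_pod(name="milvus-indexcoord-0", namespace="chaos-testing")
```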

zhuwenxing commented 1 year ago

A failed job after the chaos:

- chaos type: pod-failure
- image tag: master-20221208-d67e878f
- target pod: kafka
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/357/pipeline

log:

- artifacts-kafka-pod-failure-357-server-logs.tar.gz
- artifacts-kafka-pod-failure-357-pytest-logs.tar.gz

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

zhuwenxing commented 1 year ago

It is still reproduced in master-20221212-e977e014.

- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/464/pipeline

log:

- artifacts-indexcoord-pod-kill-464-server-logs.tar.gz
- artifacts-indexcoord-pod-kill-464-pytest-logs.tar.gz

czs007 commented 1 year ago

@yah01

yah01 commented 1 year ago

/assign @zhuwenxing

This should be fixed by #21107.

zhuwenxing commented 1 year ago

Not reproduced in master-20221212-56e722f9

zhuwenxing commented 1 year ago

It is reproduced in the 2.2 branch.

- image tag: 2.2.0-20221216-1aa7a9a8
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/531/pipeline

log:

- artifacts-datacoord-pod-kill-531-server-logs.tar.gz
- artifacts-datacoord-pod-kill-531-pytest-logs.tar.gz

zhuwenxing commented 1 year ago

@yah01
Please take a look

yah01 commented 1 year ago

#21280 fixed this.

/assign @zhuwenxing

zhuwenxing commented 1 year ago

Verified and passed in master-20221227-d0d0db8c and 2.2.2-20221222-0fdc1a04.