milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204197)` after datacoord pod kill chaos test #21217

Closed (zhuwenxing closed this issue 1 year ago)

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.0-20221213-437e4430
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus==2.3.0.dev21
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[screenshot attached in the original issue]

[2022-12-13T21:31:18.276Z] testcases/test_data_persistence.py::TestDataPersistence::test_milvus_default 

[2022-12-13T21:31:18.276Z] -------------------------------- live log setup --------------------------------

[2022-12-13T21:31:18.276Z] [2022-12-13 21:31:17 - INFO - ci_test]: ################################################################################ (conftest.py:195)

[2022-12-13T21:31:18.276Z] [2022-12-13 21:31:17 - INFO - ci_test]: [initialize_milvus] Log cleaned up, start testing... (conftest.py:196)

[2022-12-13T21:31:18.276Z] [2022-12-13 21:31:17 - INFO - ci_test]: [setup_class] Start setup class... (client_base.py:30)

[2022-12-13T21:31:18.276Z] [2022-12-13 21:31:17 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:36)

[2022-12-13T21:31:18.276Z] [2022-12-13 21:31:17 - INFO - ci_test]: [setup_method] Start setup test case test_milvus_default. (client_base.py:37)

[2022-12-13T21:31:18.276Z] -------------------------------- live log call ---------------------------------

[2022-12-13T21:31:18.276Z] [2022-12-13 21:31:17 - INFO - ci_test]: [test][2022-12-13T21:31:17Z] [0.00380208s] Hello_Milvus flush -> None (wrapper.py:30)

[2022-12-13T21:31:18.276Z] [2022-12-13 21:31:17 - INFO - ci_test]: assert create collection: 0.012913703918457031, init_entities: 6000 (test_data_persistence.py:28)

[2022-12-13T21:31:18.529Z] [2022-12-13 21:31:18 - INFO - ci_test]: [test][2022-12-13T21:31:18Z] [0.33225150s] Hello_Milvus insert -> (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 438033993796419585, success count: 3000, err count: 0) (wrapper.py:30)

[2022-12-13T21:31:18.529Z] [2022-12-13 21:31:18 - INFO - ci_test]: assert insert: 0.3325011730194092 (test_data_persistence.py:35)

[2022-12-13T21:31:21.032Z] [2022-12-13 21:31:20 - INFO - ci_test]: [test][2022-12-13T21:31:18Z] [2.52602210s] Hello_Milvus flush -> None (wrapper.py:30)

[2022-12-13T21:31:21.032Z] [2022-12-13 21:31:20 - INFO - ci_test]: [test][2022-12-13T21:31:20Z] [0.00386828s] Hello_Milvus flush -> None (wrapper.py:30)

[2022-12-13T21:31:21.032Z] [2022-12-13 21:31:20 - INFO - ci_test]: [test][2022-12-13T21:31:20Z] [0.00412302s] Hello_Milvus flush -> None (wrapper.py:30)

[2022-12-13T21:31:21.032Z] [2022-12-13 21:31:20 - INFO - ci_test]: assert flush: 2.5322277545928955, entities: 9000 (test_data_persistence.py:45)

[2022-12-13T21:31:21.032Z] [2022-12-13 21:31:20 - INFO - ci_test]: index info: [{'collection': 'Hello_Milvus', 'field': 'float_vector', 'index_name': 'test_03jpQMEa', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Hello_Milvus', 'field': 'varchar', 'index_name': 'test_MDsSgDt9', 'index_param': {'index_type': 'Trie'}}] (test_data_persistence.py:64)

[2022-12-13T21:31:21.032Z] [2022-12-13 21:31:20 - INFO - ci_test]: [test][2022-12-13T21:31:20Z] [0.00495407s] Hello_Milvus load -> None (wrapper.py:30)

[2022-12-13T21:31:21.589Z] [2022-12-13 21:31:21 - INFO - ci_test]: [test][2022-12-13T21:31:20Z] [0.29301512s] Hello_Milvus search -> <pymilvus.orm.search.SearchResult object at 0x7fc2cd175b50> (wrapper.py:30)

[2022-12-13T21:31:21.589Z] [2022-12-13 21:31:21 - INFO - ci_test]: assert search: 0.2932462692260742 (test_data_persistence.py:76)

[2022-12-13T21:31:22.946Z] [2022-12-13 21:31:22 - INFO - ci_test]: [test][2022-12-13T21:31:21Z] [1.40825358s] Hello_Milvus release -> None (wrapper.py:30)

[2022-12-13T21:31:23.200Z] [2022-12-13 21:31:23 - INFO - ci_test]: [test][2022-12-13T21:31:22Z] [0.31432312s] Hello_Milvus insert -> (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 438033995015127055, success count: 3000, err count: 0) (wrapper.py:30)

[2022-12-13T21:31:25.703Z] [2022-12-13 21:31:25 - INFO - ci_test]: [test][2022-12-13T21:31:23Z] [2.52952627s] Hello_Milvus flush -> None (wrapper.py:30)

[2022-12-13T21:31:25.703Z] [2022-12-13 21:31:25 - INFO - ci_test]: assert entities: 12000 (test_data_persistence.py:83)

[2022-12-13T21:31:30.926Z] [2022-12-13 21:31:29 - INFO - ci_test]: [test][2022-12-13T21:31:25Z] [4.25980674s] Hello_Milvus load -> None (wrapper.py:30)

[2022-12-13T21:31:30.926Z] [2022-12-13 21:31:29 - INFO - ci_test]: assert load: 4.260053396224976 (test_data_persistence.py:89)

[2022-12-13T21:31:40.846Z] [2022-12-13 21:31:39 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-12-13T21:31:40.846Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.846Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.846Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.846Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.846Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_438033834232188307v0 is not available in any replica, err=LackSegment(segmentID=438033834232204167)

[2022-12-13T21:31:40.846Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_438033834232188307v0 is not available in any replica, err=LackSegment(segmentID=438033834232204201)

[2022-12-13T21:31:40.846Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204197)

[2022-12-13T21:31:40.846Z] attempt #8:context deadline exceeded

[2022-12-13T21:31:40.846Z] )>, <Time:{'RPC start': '2022-12-13 21:31:29.884718', 'RPC error': '2022-12-13 21:31:39.888244'}> (decorators.py:108)

[2022-12-13T21:31:40.846Z] [2022-12-13 21:31:39 - ERROR - ci_test]: Traceback (most recent call last):

[2022-12-13T21:31:40.846Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2022-12-13T21:31:40.846Z]     res = func(*args, **_kwargs)

[2022-12-13T21:31:40.846Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2022-12-13T21:31:40.846Z]     return func(*arg, **kwargs)

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 609, in search

[2022-12-13T21:31:40.846Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2022-12-13T21:31:40.846Z]     raise e

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2022-12-13T21:31:40.846Z]     return func(*args, **kwargs)

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2022-12-13T21:31:40.846Z]     ret = func(self, *args, **kwargs)

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2022-12-13T21:31:40.846Z]     raise e

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-12-13T21:31:40.846Z]     return func(self, *args, **kwargs)

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 470, in search

[2022-12-13T21:31:40.846Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 439, in _execute_search_requests

[2022-12-13T21:31:40.846Z]     raise pre_err

[2022-12-13T21:31:40.846Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 430, in _execute_search_requests

[2022-12-13T21:31:40.846Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-12-13T21:31:40.846Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-12-13T21:31:40.846Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.846Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.846Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.847Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.847Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_438033834232188307v0 is not available in any replica, err=LackSegment(segmentID=438033834232204167)

[2022-12-13T21:31:40.847Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_438033834232188307v0 is not available in any replica, err=LackSegment(segmentID=438033834232204201)

[2022-12-13T21:31:40.847Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204197)

[2022-12-13T21:31:40.847Z] attempt #8:context deadline exceeded

[2022-12-13T21:31:40.847Z] )>

[2022-12-13T21:31:40.847Z]  (api_request.py:39)

[2022-12-13T21:31:40.847Z] [2022-12-13 21:31:39 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2022-12-13T21:31:40.847Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_3_438033834232188307v1 is not available in any replica, err=LackSegment(segmentID=438033834232204168)

[2022-12-13T21:31:40.847Z] attempt #2:fail t...... (api_request.py:40)

[2022-12-13T21:31:40.847Z] FAILED
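
For context, the flow exercised in the log above corresponds roughly to the sketch below. This is not the actual test code: the collection name, entity counts, and index parameters are taken from the log, while the schema, vector dimension, and endpoint are illustrative assumptions.

```python
# Rough reconstruction of the flow in the log above (not the actual test code).
# Collection name and index params come from the log; schema/dim/endpoint are assumed.
import random

from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

dim = 128  # assumed vector dimension
fields = [
    FieldSchema("int64", DataType.INT64, is_primary=True),
    FieldSchema("varchar", DataType.VARCHAR, max_length=256),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection("Hello_Milvus", CollectionSchema(fields))  # assumed schema

# Insert a batch and flush, as the test does repeatedly.
nb = 3000
collection.insert([
    [i for i in range(nb)],
    [str(i) for i in range(nb)],
    [[random.random() for _ in range(dim)] for _ in range(nb)],
])
collection.flush()

# Index parameters match the ones reported in the log.
collection.create_index("float_vector", {
    "index_type": "HNSW", "metric_type": "L2",
    "params": {"M": 48, "efConstruction": 500},
})
collection.create_index("varchar", {"index_type": "Trie"})

collection.load()

# This is the call that fails with
# "fail to get shard leaders from QueryCoord ... LackSegment(...)"
# after the datacoord pod is killed.
res = collection.search(
    data=[[random.random() for _ in range(dim)]],
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=5,
)
print(res)
```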

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

- chaos type: pod-kill
- image tag: 2.2.0-20221213-437e4430
- target pod: datacoord
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/410/pipeline

- log: artifacts-datacoord-pod-kill-410-server-logs.tar.gz, artifacts-datacoord-pod-kill-410-pytest-logs.tar.gz
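
Not part of the original report, but as a triage aid the sketch below shows one way to check whether the failure is transient (the replica still recovering after the pod kill) or persistent, using the pymilvus `utility.loading_progress` helper and a simple retry loop. The endpoint, vector dimension, retry count, and sleep interval are arbitrary assumptions; the collection name comes from the test.

```python
# Hypothetical triage helper: when search fails with
# "is not available in any replica, err=LackSegment(...)",
# check whether the collection reports as loaded and retry a few times
# before treating the failure as permanent.
import random
import time

from pymilvus import connections, utility, Collection, MilvusException

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint
collection = Collection("Hello_Milvus")
dim = 128  # assumed vector dimension

# Loading progress as reported by QueryCoord.
print("loading progress:", utility.loading_progress("Hello_Milvus"))

for attempt in range(5):  # arbitrary retry policy
    try:
        res = collection.search(
            data=[[random.random() for _ in range(dim)]],
            anns_field="float_vector",
            param={"metric_type": "L2", "params": {"ef": 64}},
            limit=5,
        )
        print(f"attempt {attempt}: ok, {len(res[0])} hits")
        break
    except MilvusException as e:
        print(f"attempt {attempt}: {e}")
        time.sleep(10)
```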

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

zhuwenxing commented 1 year ago

- chaos type: pod-kill
- image tag: master-20221215-ee7fddb6
- target pod: indexnode

- log: artifacts-indexnode-pod-kill-461-server-logs.tar.gz, artifacts-indexnode-pod-kill-461-pytest-logs.tar.gz

zhuwenxing commented 1 year ago

It is still reproducible in many chaos scenarios.

- chaos type: pod-kill
- image tag: 2.2.0-20221216-1aa7a9a8
- target pod: querynode
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/538/pipeline
- log: artifacts-querynode-pod-kill-538-server-logs.tar.gz, artifacts-querynode-pod-kill-538-pytest-logs.tar.gz

jiaoew1991 commented 1 year ago

We need more commits cherry-picked into the 2.2.0 branch.

zhuwenxing commented 1 year ago

> We need more commits cherry-picked into the 2.2.0 branch.

@jiaoew1991 Please link the fix PR to this issue; otherwise, I cannot see any progress on it.

yah01 commented 1 year ago

/assign

yah01 commented 1 year ago

/assign @zhuwenxing

#21280 should fix this, as it works on master.

zhuwenxing commented 1 year ago

Not reproduced in master-20221227-d0d0db8c