milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `segment lacks` after upgrading from v2.2.5 to master-20230823-0bb68cac-amd64 #26564

Closed: zhuwenxing closed this issue 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230823-0bb68cac-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-08-23T03:27:48.460Z] 2023-08-23 03:27:48.424 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_2_IVF_SQ8
[2023-08-23T03:27:48.460Z] 2023-08-23 03:27:48.425 | INFO     | MainThread |utils:load_and_search:207 - load collection
[2023-08-23T03:27:48.460Z] 2023-08-23 03:27:48.430 | INFO     | MainThread |utils:load_and_search:211 - load time: 0.0048
[2023-08-23T03:27:48.460Z] 2023-08-23 03:27:48.446 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}
[2023-08-23T03:27:48.460Z] 2023-08-23 03:27:48.447 | INFO     | MainThread |utils:load_and_search:228 - 
[2023-08-23T03:27:48.460Z] Search...
[2023-08-23T03:28:00.655Z] RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #1: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #2: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #3: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #4: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #5: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #6: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-08-23 03:27:48.447234', 'RPC error': '2023-08-23 03:27:58.452779'}>

[2023-08-23T03:28:00.655Z] Traceback (most recent call last):
[2023-08-23T03:28:00.655Z]   File "scripts/action_after_upgrade.py", line 109, in <module>
[2023-08-23T03:28:00.655Z]     task_2(data_size, host)
[2023-08-23T03:28:00.655Z]   File "scripts/action_after_upgrade.py", line 37, in task_2
[2023-08-23T03:28:00.655Z]     load_and_search(prefix)
[2023-08-23T03:28:00.655Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 231, in load_and_search
[2023-08-23T03:28:00.655Z]     res = c.search(
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 778, in search
[2023-08-23T03:28:00.655Z]     res = conn.search(
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 127, in handler
[2023-08-23T03:28:00.655Z]     raise e from e
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 123, in handler
[2023-08-23T03:28:00.655Z]     return func(*args, **kwargs)
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 162, in handler
[2023-08-23T03:28:00.655Z]     return func(self, *args, **kwargs)
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 102, in handler
[2023-08-23T03:28:00.655Z]     raise e from e
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 68, in handler
[2023-08-23T03:28:00.655Z]     return func(*args, **kwargs)
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 766, in search
[2023-08-23T03:28:00.655Z]     return self._execute_search_requests(
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 727, in _execute_search_requests
[2023-08-23T03:28:00.655Z]     raise pre_err from pre_err
[2023-08-23T03:28:00.655Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 718, in _execute_search_requests
[2023-08-23T03:28:00.655Z]     raise MilvusException(response.status.error_code, response.status.reason)
[2023-08-23T03:28:00.655Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #1: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #2: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #3: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #4: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #5: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: attempt #6: fail to get shard leaders from QueryCoord: segment=443746356773971997: segment lacks: channel=by-dev-rootcoord-dml_221_443746356773965551v1: channel not available: context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)>

script returned exit code 1
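For context, the failing call sequence in the log corresponds roughly to the following pymilvus usage. This is a reconstruction from the log above, not the actual test script; the endpoint, vector field name, and dimension are placeholders:

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # placeholder endpoint

c = Collection("task_2_IVF_SQ8")  # collection name taken from the log
c.load()                          # the "load collection" step in the log
res = c.search(
    data=[[0.0] * 128],                 # query vector; dim 128 is a placeholder
    anns_field="float_vector",          # vector field name is a placeholder
    param={"metric_type": "L2", "params": {"nprobe": 10}},  # params from the log
    limit=10,
)
```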

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_cron/detail/deploy_test_kafka_cron/1180/pipeline

logs:
- artifacts-kafka-cluster-upgrade-1180-server-second-deployment-logs.tar.gz
- artifacts-kafka-cluster-upgrade-1180-server-first-deployment-logs.tar.gz
- artifacts-kafka-cluster-upgrade-1180-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing commented 1 year ago

/assign @jiaoew1991

jiaoew1991 commented 1 year ago

/assign @weiliu1031
/unassign

weiliu1031 commented 1 year ago

For now, a Helm rolling upgrade neither upgrades the nodes one by one in order nor supports a graceful stop process. So during a rolling upgrade the segments are not balanced as expected, and when an old node goes down, the shard lacks its segments until they have been loaded on a new query node, which may take tens of seconds.

If you need the service to remain available during a rolling upgrade, please perform the rolling upgrade with the Milvus Operator.
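If staying on Helm anyway, one client-side mitigation is to retry with backoff for the tens of seconds a segment may be missing. This is a minimal sketch, not an official recommendation; `do_search` stands in for whatever search call the application makes:

```python
import time

from pymilvus import MilvusException

def search_with_retry(do_search, retries=6, backoff=5.0):
    """Retry a search while shard leaders are temporarily unavailable."""
    for attempt in range(retries):
        try:
            return do_search()
        except MilvusException:
            # "segment lacks" / "channel not available" are transient while
            # segments are being reloaded onto the new query nodes.
            if attempt == retries - 1:
                raise
            time.sleep(backoff)
```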

zhuwenxing commented 1 year ago

It happened after the upgrade had finished and after waiting 5 minutes to start the tests (see the attached screenshot).

weiliu1031 commented 1 year ago

The upgrade task causes the segment lack, and the query node then tries to load the missing segment.

But a large number of load-segment requests arrive at once, which causes a deadlock between loading binlogs and loading segments, since both kinds of task share the same thread pool.
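For readers unfamiliar with this failure mode, here is a minimal Python sketch of the same-pool nesting pattern (not Milvus's actual code, which is Go): parent tasks occupy every worker while blocking on subtasks queued to the same pool, so nothing can ever finish.

```python
import concurrent.futures as cf

pool = cf.ThreadPoolExecutor(max_workers=2)

def load_binlog(i):
    return f"binlog-{i}"

def load_segment(seg_id):
    # Submits binlog subtasks to the SAME pool, then blocks on their results.
    futs = [pool.submit(load_binlog, i) for i in range(4)]
    return [f.result() for f in futs]

# Two segment loads occupy both workers; the queued binlog subtasks can
# never be scheduled, so every worker waits forever: a circular wait on
# pool capacity rather than on locks.
segs = [pool.submit(load_segment, s) for s in ("seg-1", "seg-2")]
try:
    segs[0].result(timeout=2)  # would block forever without the timeout
except cf.TimeoutError:
    print("deadlocked: all workers are waiting on unscheduled subtasks")
# Cancel the queued subtasks so the script can exit (Python 3.9+).
pool.shutdown(wait=False, cancel_futures=True)
```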

Same as #25781.

/cc @yah01 @MrPresent-Han we should cherry-pick the fix to the 2.2 branch

(screenshot attached)

MrPresent-Han commented 1 year ago

Fine, I will cherry-pick the fix for this.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.