milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][multi-replicas-loadbalance]The Milvus deployed by the operator fails to search during the rolling image upgrade process #25025

Closed by wangting0128 1 year ago

wangting0128 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: from master-20230619-a6310050 to master-20230620-af1d84e5
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.9.dev36
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

- deploy server argo task: fouramf-hdbkc
- client argo task: fouramf-94rlz
- rolling upgrade argo task: fouramf-rkpqg

test configs:

rolling upgrade image from master-20230619-a6310050 to master-20230620-af1d84e5

server:

[2023-06-20 07:33:16,912 -  INFO - fouram]: [get_pods] pod details of release(lb-op-upgrade):  (operator_client.py:298)
[2023-06-20 07:33:16,912 -  INFO - fouram]: 
NAME                                                 STATUS      RESTARTS     AGE        IP                NODE           
lb-op-upgrade-milvus-datacoord-c5c94b57b-chnb2       Running     0            19h40m     10.104.23.69      4am-node27     
lb-op-upgrade-milvus-datanode-547c48646f-tknwl       Running     0            19h40m     10.104.19.41      4am-node28     
lb-op-upgrade-milvus-indexcoord-5cb49bddcb-ftrxm     Running     0            19h40m     10.104.22.134     4am-node26     
lb-op-upgrade-milvus-indexnode-54c6c4c574-hklb5      Running     5            19h40m     10.104.18.236     4am-node25     
lb-op-upgrade-milvus-proxy-55b7b45947-c8d7c          Running     0            19h40m     10.104.19.43      4am-node28     
lb-op-upgrade-milvus-querycoord-65686786d-ztjmg      Running     0            19h40m     10.104.19.42      4am-node28     
lb-op-upgrade-milvus-querynode-57c68d4ccc-llktr      Running     0            19h40m     10.104.22.135     4am-node26     
lb-op-upgrade-milvus-querynode-57c68d4ccc-px8j7      Running     0            19h40m     10.104.4.184      4am-node11     
lb-op-upgrade-milvus-querynode-57c68d4ccc-qv9qq      Running     0            19h40m     10.104.19.44      4am-node28     
lb-op-upgrade-milvus-rootcoord-c64949f48-5c7wd       Running     0            19h40m     10.104.23.68      4am-node27     
lb-op-upgrade-etcd-0                                 Running     0            19h44m     10.104.24.253     4am-node29     
lb-op-upgrade-etcd-1                                 Running     0            19h44m     10.104.19.28      4am-node28     
lb-op-upgrade-etcd-2                                 Running     0            19h44m     10.104.23.63      4am-node27     
lb-op-upgrade-pulsar-bookie-0                        Running     0            19h44m     10.104.24.7       4am-node29     
lb-op-upgrade-pulsar-bookie-1                        Running     0            19h43m     10.104.19.33      4am-node28     
lb-op-upgrade-pulsar-bookie-2                        Running     0            19h43m     10.104.23.67      4am-node27     
lb-op-upgrade-pulsar-broker-0                        Running     0            19h44m     10.104.24.251     4am-node29     
lb-op-upgrade-pulsar-proxy-0                         Running     0            19h44m     10.104.23.61      4am-node27     
lb-op-upgrade-pulsar-recovery-0                      Running     0            19h44m     10.104.21.93      4am-node24     
lb-op-upgrade-pulsar-zookeeper-0                     Running     0            19h44m     10.104.19.32      4am-node28     
lb-op-upgrade-pulsar-zookeeper-1                     Running     0            19h43m     10.104.21.97      4am-node24     
lb-op-upgrade-pulsar-zookeeper-2                     Running     0            19h42m     10.104.22.133     4am-node26     
lb-op-upgrade-minio-0                                Running     0            19h44m     10.104.24.254     4am-node29     
lb-op-upgrade-minio-1                                Running     0            19h44m     10.104.21.95      4am-node24     
lb-op-upgrade-minio-2                                Running     0            19h44m     10.104.23.64      4am-node27     
lb-op-upgrade-minio-3                                Running     0            19h44m     10.104.22.131     4am-node26     
 (common_func.py:407)
[2023-06-20 07:33:16,912 -  INFO - fouram]: [Base] upgrade configs: {'spec': {'components': {'enableRollingUpdate': True, 'imageUpdateMode': 'rollingUpgrade', 'image': 'harbor.milvus.io/milvus/milvus:master-20230620-af1d84e5'}, 'mode': 'cluster'}, 'apiVersion': 'milvus.io/v1beta1', 'kind': 'Milvus'} (base.py:168)
[2023-06-20 07:33:17,146 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:33:47,255 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:34:17,369 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:34:47,516 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:35:17,626 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:35:47,737 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:36:17,846 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:36:47,955 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:37:18,063 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:37:48,184 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:38:18,280 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:38:48,388 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:39:18,498 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:39:48,665 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:40:18,778 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)
[2023-06-20 07:40:48,897 -  INFO - fouram]: [wait_for_healthy] Instance:lb-op-upgrade is healthy. (operator_client.py:185)
[2023-06-20 07:40:48,996 -  INFO - fouram]: [Base] Get pods after upgrade... (base.py:176)
[2023-06-20 07:40:49,564 -  INFO - fouram]: [get_pods] pod details of release(lb-op-upgrade):  (operator_client.py:298)
[2023-06-20 07:40:49,564 -  INFO - fouram]: 
NAME                                                 STATUS      RESTARTS     AGE        IP                NODE           
lb-op-upgrade-milvus-datacoord-65c8976475-d5d4z      Running     0            6m         10.104.18.175     4am-node25     
lb-op-upgrade-milvus-datanode-67c5989ffd-c988z       Running     0            3m         10.104.18.179     4am-node25     
lb-op-upgrade-milvus-indexcoord-5b6c4c6d6c-kpwhr     Running     0            5m         10.104.15.137     4am-node20     
lb-op-upgrade-milvus-indexnode-78759dc65c-g9d6c      Running     0            3m         10.104.19.228     4am-node28     
lb-op-upgrade-milvus-proxy-857f8fddd6-nkqhk          Running     0            1m         10.104.18.180     4am-node25     
lb-op-upgrade-milvus-querycoord-74976fdf59-pdqmb     Running     0            4m         10.104.15.138     4am-node20     
lb-op-upgrade-milvus-querynode-649f68ccf6-4f26c      Running     0            3m         10.104.17.48      4am-node23     
lb-op-upgrade-milvus-querynode-649f68ccf6-7lqrn      Running     0            2m         10.104.15.139     4am-node20     
lb-op-upgrade-milvus-querynode-649f68ccf6-qsgkc      Running     0            2m         10.104.22.67      4am-node26     
lb-op-upgrade-milvus-rootcoord-b9d878d87-66fm5       Running     0            7m         10.104.21.141     4am-node24     
lb-op-upgrade-etcd-0                                 Running     0            19h51m     10.104.24.253     4am-node29     
lb-op-upgrade-etcd-1                                 Running     0            19h51m     10.104.19.28      4am-node28     
lb-op-upgrade-etcd-2                                 Running     0            19h51m     10.104.23.63      4am-node27     
lb-op-upgrade-pulsar-bookie-0                        Running     0            19h51m     10.104.24.7       4am-node29     
lb-op-upgrade-pulsar-bookie-1                        Running     0            19h51m     10.104.19.33      4am-node28     
lb-op-upgrade-pulsar-bookie-2                        Running     0            19h51m     10.104.23.67      4am-node27     
lb-op-upgrade-pulsar-broker-0                        Running     0            19h51m     10.104.24.251     4am-node29     
lb-op-upgrade-pulsar-proxy-0                         Running     0            19h51m     10.104.23.61      4am-node27     
lb-op-upgrade-pulsar-recovery-0                      Running     0            19h51m     10.104.21.93      4am-node24     
lb-op-upgrade-pulsar-zookeeper-0                     Running     0            19h51m     10.104.19.32      4am-node28     
lb-op-upgrade-pulsar-zookeeper-1                     Running     0            19h50m     10.104.21.97      4am-node24     
lb-op-upgrade-pulsar-zookeeper-2                     Running     0            19h50m     10.104.22.133     4am-node26     
lb-op-upgrade-minio-0                                Running     0            19h51m     10.104.24.254     4am-node29     
lb-op-upgrade-minio-1                                Running     0            19h51m     10.104.21.95      4am-node24     
lb-op-upgrade-minio-2                                Running     0            19h51m     10.104.23.64      4am-node27     
lb-op-upgrade-minio-3                                Running     0            19h51m     10.104.22.131     4am-node26     
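For reference, the window in which search can be affected is the time between the upgrade patch being logged and the instance being reported healthy again, about seven and a half minutes in the log above. Below is a minimal sketch for measuring it from fouram-style log lines. The function names are hypothetical, and the sample lines are copied from the log with the config dict abbreviated to `{...}`.

```python
import re
from datetime import datetime

# Matches the leading "[2023-06-20 07:33:16,912 - ..." timestamp of a log line.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TS_RE.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def upgrade_window_seconds(lines):
    """Seconds between the 'upgrade configs' line and the last 'is healthy' line."""
    start = end = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if start is None and "upgrade configs" in line:
            start = ts
        elif "is healthy" in line:
            end = ts
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Sample lines taken from the log above (config dict abbreviated).
sample = [
    "[2023-06-20 07:33:16,912 -  INFO - fouram]: [Base] upgrade configs: {...} (base.py:168)",
    "[2023-06-20 07:33:17,146 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-upgrade health... (operator_client.py:188)",
    "[2023-06-20 07:40:48,897 -  INFO - fouram]: [wait_for_healthy] Instance:lb-op-upgrade is healthy. (operator_client.py:185)",
]
print(upgrade_window_seconds(sample))
```

All search failures reported below fall inside this window.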

client test result:

{'server': {'deploy_tool': '',
            'deploy_mode': '',
            'config_name': '',
            'config': {},
            'host': 'lb-op-upgrade-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '5m',
                                                    'ni_per': 50000},
                                 'load_params': {'replica_number': 3},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 100,
                                                       'during_time': '5h',
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 1000,
                                                                  'top_k': 10,
                                                                  'search_param': {'ef': 64},
                                                                  'timeout': 3600}}]},
            'run_id': 2023062097303256,
            'datetime': '2023-06-20 02:55:30.334588',
            'client_version': '2.2'},
 'result': {'test_result': {'load': {'RT': 0.006},
                            'Locust': {'Aggregated': {'Requests': 112562,
                                                      'Fails': 209,
                                                      'RPS': 6.25,
                                                      'fail_s': 0.0,
                                                      'RT_max': 27328.45,
                                                      'RT_avg': 15502.19,
                                                      'TP50': 15000.0,
                                                      'TP99': 24000.0},
                                       'search': {'Requests': 112562,
                                                  'Fails': 209,
                                                  'RPS': 6.25,
                                                  'fail_s': 0.0,
                                                  'RT_max': 27328.45,
                                                  'RT_avg': 15502.19,
                                                  'TP50': 15000.0,
                                                  'TP99': 24000.0}}}}}
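As a quick sanity check on the numbers above, the sketch below computes the overall failure rate and compares the reported RPS with what a closed-loop load of 100 concurrent users would give at the reported average response time. The closed-loop model (each user sends the next request as soon as the previous one returns) is an assumption about the test harness, not something the result states, and the helper names are hypothetical.

```python
def failure_rate(requests, fails):
    """Fraction of requests that failed."""
    return fails / requests


def expected_rps(concurrency, avg_rt_ms):
    """Little's law for a closed loop: throughput = users / average response time."""
    return concurrency / (avg_rt_ms / 1000.0)


# Values copied from the 'search' entry of the client test result above.
result = {"Requests": 112562, "Fails": 209, "RPS": 6.25, "RT_avg": 15502.19}

rate = failure_rate(result["Requests"], result["Fails"])
print(f"failure rate: {rate:.4%}")
print(f"expected RPS at 100 users: {expected_rps(100, result['RT_avg']):.2f}")
print(f"reported RPS: {result['RPS']}")
```

The failure rate comes out just under 0.2%, and the estimated throughput of roughly 6.45 req/s is close to the reported 6.25, so the 209 failures are a small, concentrated burst rather than a sustained degradation.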
Screenshot 2023-06-20 16 04 15
[2023-06-20 07:34:11,179 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2023-06-20 07:34:11,179 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-20 07:34:11,179 -  INFO - fouram]: grpc     search                                                                        103045     0(0.00%) |  15736    1362   27328  15000 |    7.30        0.00 (stats.py:789)
[2023-06-20 07:34:11,179 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-20 07:34:11,179 -  INFO - fouram]:          Aggregated                                                                    103045     0(0.00%) |  15736    1362   27328  15000 |    7.30        0.00 (stats.py:789)
[2023-06-20 07:34:11,179 -  INFO - fouram]:  (stats.py:790)
[2023-06-20 07:34:11,179 -  INFO - fouram]: Response time percentiles (approximated) (stats.py:819)
[2023-06-20 07:34:11,179 -  INFO - fouram]: Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs (stats.py:819)
[2023-06-20 07:34:11,179 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-20 07:34:11,179 -  INFO - fouram]: grpc     search                                                                              15000  15000  15000  16000  21000  22000  23000  24000  25000  26000  27000 103045 (stats.py:819)
[2023-06-20 07:34:11,180 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-20 07:34:11,180 -  INFO - fouram]:          Aggregated                                                                          15000  15000  15000  16000  21000  22000  23000  24000  25000  26000  27000 103045 (stats.py:819)
[2023-06-20 07:34:11,180 -  INFO - fouram]:  (stats.py:820)
[2023-06-20 07:34:12,362 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)>, <Time:{'RPC start': '2023-06-20 07:33:59.565324', 'RPC error': '2023-06-20 07:34:12.362094'}> (decorators.py:108)
[2023-06-20 07:34:12,364 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (api_request.py:53)
[2023-06-20 07:34:12,364 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (func_check.py:46)
[2023-06-20 07:34:12,364 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)>, <Time:{'RPC start': '2023-06-20 07:33:59.697998', 'RPC error': '2023-06-20 07:34:12.364472'}> (decorators.py:108)
[2023-06-20 07:34:12,364 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (api_request.py:53)
[2023-06-20 07:34:12,364 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (func_check.py:46)
[2023-06-20 07:34:12,378 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)>, <Time:{'RPC start': '2023-06-20 07:33:59.913040', 'RPC error': '2023-06-20 07:34:12.378462'}> (decorators.py:108)
[2023-06-20 07:34:12,378 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (api_request.py:53)
[2023-06-20 07:34:12,379 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (func_check.py:46)
[2023-06-20 07:34:13,475 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-20 07:34:01.233517', 'RPC error': '2023-06-20 07:34:13.475592'}> (decorators.py:108)
[2023-06-20 07:34:13,477 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-20 07:34:13,477 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-20 07:34:13,627 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-20 07:34:01.339476', 'RPC error': '2023-06-20 07:34:13.627742'}> (decorators.py:108)
[2023-06-20 07:34:13,628 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-20 07:34:13,628 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-20 07:34:13,628 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-20 07:34:01.599045', 'RPC error': '2023-06-20 07:34:13.628564'}> (decorators.py:108)
[2023-06-20 07:34:13,628 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-20 07:34:13,628 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-20 07:34:43,849 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-20 07:34:43,849 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-20 07:34:43,849 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-20 07:34:25.313769', 'RPC error': '2023-06-20 07:34:43.849291'}> (decorators.py:108)
[2023-06-20 07:34:43,849 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-20 07:34:43,849 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-20 07:34:51,246 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2023-06-20 07:34:51,246 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-20 07:34:51,246 -  INFO - fouram]: grpc     search                                                                        103323   209(0.20%) |  15732    1362   27328  15000 |    9.00        5.80 (stats.py:789)
[2023-06-20 07:34:51,246 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-20 07:34:51,246 -  INFO - fouram]:          Aggregated                                                                    103323   209(0.20%) |  15732    1362   27328  15000 |    9.00        5.80 (stats.py:789)
[2023-06-20 07:34:51,246 -  INFO - fouram]:  (stats.py:790)
[2023-06-20 07:34:51,247 -  INFO - fouram]: Response time percentiles (approximated) (stats.py:819)
[2023-06-20 07:34:51,247 -  INFO - fouram]: Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs (stats.py:819)
[2023-06-20 07:34:51,247 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-20 07:34:51,247 -  INFO - fouram]: grpc     search                                                                              15000  15000  15000  16000  21000  22000  23000  24000  25000  26000  27000 103323 (stats.py:819)
[2023-06-20 07:34:51,247 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-20 07:34:51,247 -  INFO - fouram]:          Aggregated                                                                          15000  15000  15000  16000  21000  22000  23000  24000  25000  26000  27000 103323 (stats.py:819)
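Each failed search produces three log lines (`RPC error`, `(api_response)`, `[CheckFunc]`), so counting raw ERROR lines overstates the failures. Below is a minimal sketch that counts only the `RPC error` lines and groups them by the three messages seen above. The category labels and function names are hypothetical, and the sample lines are trimmed copies of the log (the `<Time:{...}>` part and the repeated retry attempts are shortened).

```python
import re
from collections import Counter

# Capture the exception message of a search "RPC error" line.
MSG_RE = re.compile(
    r"RPC error: \[search\], <MilvusException: \(code=\d+, message=(.*?)\)>, <Time"
)


def classify(message):
    """Map an exception message to one of the failure kinds seen in this issue."""
    if "syncTimestamp Failed" in message:
        return "syncTimestamp timeout"
    if "service not ready" in message:
        return "QueryCoord not ready"
    if "is not fully loaded" in message:
        return "collection not fully loaded"
    return "other"


def tally(lines):
    """Count failed searches per category, one count per 'RPC error' line."""
    counts = Counter()
    for line in lines:
        m = MSG_RE.search(line)
        if m:
            counts[classify(m.group(1))] += 1
    return counts


# Trimmed sample lines from the log above.
sample = [
    "[2023-06-20 07:34:12,362 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)>, <Time:{...}> (decorators.py:108)",
    "[2023-06-20 07:34:12,364 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (api_request.py:53)",
    "[2023-06-20 07:34:13,475 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{...}> (decorators.py:108)",
    "[2023-06-20 07:34:43,849 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: collection 442282960369483900 is not fully loaded: context deadline exceeded: fail to search on all shard leaders)>, <Time:{...}> (decorators.py:108)",
]
for kind, count in tally(sample).items():
    print(kind, count)
```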

Expected Behavior

Searches should not fail during a rolling upgrade.

Steps To Reproduce

1. Deploy Milvus with the operator.
2. Insert 5m vectors and run concurrent searches with timeout=3600.
3. Perform a rolling image upgrade while the concurrent searches are running <- searches fail during the upgrade.
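For anyone reproducing step 2 without the fouram harness, below is a rough sketch of a single search request with the same parameters (nq=1000, top_k=10, ef=64, timeout=3600), using pymilvus. The collection name and vector field name are hypothetical, the random query vectors stand in for the sift data, and the harness itself runs 100 such clients concurrently through Locust, which is not shown here.

```python
import random


def build_search_kwargs(nq=1000, dim=128, top_k=10, ef=64, timeout=3600,
                        anns_field="float_vector"):
    """Build keyword arguments for Collection.search().

    anns_field is a hypothetical field name; use the vector field of your schema.
    """
    # Random vectors stand in for real sift queries.
    vectors = [[random.random() for _ in range(dim)] for _ in range(nq)]
    return {
        "data": vectors,
        "anns_field": anns_field,
        "param": {"metric_type": "L2", "params": {"ef": ef}},
        "limit": top_k,
        "timeout": timeout,
    }


def search_once(collection_name, host, port="19530"):
    """Run one search against an already loaded collection."""
    # Imported lazily so the helper above works without pymilvus installed.
    from pymilvus import Collection, connections

    connections.connect(alias="default", host=host, port=port)
    collection = Collection(collection_name)
    return collection.search(**build_search_kwargs())


# Example call (collection name is hypothetical):
# search_once("fouram_collection", "lb-op-upgrade-milvus.qa-milvus.svc.cluster.local")
```

Running `search_once` in a loop from several threads while the image is being rolled should be enough to see whether requests fail during the upgrade window.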

Milvus Log

No response

Anything else?

fouramf-server-op-replicas-lb-3qn:

    spec:
      components: 
        queryNode:
          replicas: 3
          resources:
            limits:
              cpu: '8'
              memory: 8Gi
            requests:
              cpu: '4'
              memory: 4Gi
        indexNode:
          resources:
            limits:
              cpu: '4.0'
              memory: 4Gi
            requests:
              cpu: '3.0'
              memory: 3Gi
          replicas: 1
        dataNode:
          resources:
            limits:
              cpu: '2.0'
              memory: 2Gi
            requests:
              cpu: '2.0'
              memory: 2Gi

fouramf-client-sift-replica3-search:

    load_params:
      replica_number: 3
    dataset_params:
      dim: 128
      dataset_name: sift
      dataset_size: 5m
      ni_per: 50000
      metric_type: L2
    index_params:
      index_type: HNSW
      index_param:
        M: 8
        efConstruction: 200
    concurrent_params:
      concurrent_number: 100
      during_time: 5h
      interval: 20
    concurrent_tasks:
      - type: search
        weight: 1
        params:
          nq: 1000
          top_k: 10
          search_param:
            ef: 64
          timeout: 3600
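
A note on sizing for the two configs above: with `replica_number: 3` and three query nodes, each replica holds a full copy of the collection. The sketch below estimates the raw vector size per replica, assuming "5m" means 5,000,000 vectors stored as 4-byte floats. It ignores the HNSW graph and any other overhead, so it is a lower bound rather than a measured figure.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (no index, no metadata)."""
    return num_vectors * dim * bytes_per_value


# 5m vectors of dim 128, as in dataset_params above.
size = raw_vector_bytes(5_000_000, 128)
print(f"{size / 2**30:.2f} GiB of raw vectors per replica")
```

That is about 2.4 GiB per replica, which fits within the 4Gi request and 8Gi limit of a single query node.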
weiliu1031 commented 1 year ago

same as #24904

yanliang567 commented 1 year ago

/unassign

wangting0128 commented 1 year ago

- client argo task: fouramf-lb-op-rolling-upgrade
- rolling upgrade argo task: fouramf-8xx7s

image: master-20230628-31122a68 -> master-20230629-b30517d3

server:

[2023-06-30 03:59:31,849 -  INFO - fouram]: [get_pods] pod details of release(lb-op-rolling-upgrade):  (operator_client.py:301)
[2023-06-30 03:59:31,849 -  INFO - fouram]: 
NAME                                                         STATUS      RESTARTS     AGE     IP                NODE           
lb-op-rolling-upgrade-milvus-datacoord-79765db9cb-wbndw      Running     0            45m     10.104.24.124     4am-node29     
lb-op-rolling-upgrade-milvus-datanode-7778d54d85-p7mfd       Running     0            45m     10.104.20.84      4am-node22     
lb-op-rolling-upgrade-milvus-indexcoord-6b65cb6cd8-wnfxp     Running     0            45m     10.104.20.83      4am-node22     
lb-op-rolling-upgrade-milvus-indexnode-57f7ffd6b5-7n6gl      Running     0            45m     10.104.15.33      4am-node20     
lb-op-rolling-upgrade-milvus-proxy-66cc5ff56c-7fq6f          Running     0            45m     10.104.15.31      4am-node20     
lb-op-rolling-upgrade-milvus-querycoord-7d87d8f5d8-2tddv     Running     0            45m     10.104.24.123     4am-node29     
lb-op-rolling-upgrade-milvus-querynode-9ff6c45c7-2k4cf       Running     0            45m     10.104.24.125     4am-node29     
lb-op-rolling-upgrade-milvus-querynode-9ff6c45c7-6jbt2       Running     0            45m     10.104.18.206     4am-node25     
lb-op-rolling-upgrade-milvus-querynode-9ff6c45c7-lpw5c       Running     0            45m     10.104.15.34      4am-node20     
lb-op-rolling-upgrade-milvus-rootcoord-5478f587c6-bb7pp      Running     0            45m     10.104.20.90      4am-node22     
lb-op-rolling-upgrade-etcd-0                                 Running     0            49m     10.104.15.27      4am-node20     
lb-op-rolling-upgrade-etcd-1                                 Running     0            49m     10.104.20.74      4am-node22     
lb-op-rolling-upgrade-etcd-2                                 Running     0            49m     10.104.6.49       4am-node13     
lb-op-rolling-upgrade-pulsar-bookie-0                        Running     0            49m     10.104.6.51       4am-node13     
lb-op-rolling-upgrade-pulsar-bookie-1                        Running     0            49m     10.104.20.79      4am-node22     
lb-op-rolling-upgrade-pulsar-bookie-2                        Running     0            49m     10.104.24.118     4am-node29     
lb-op-rolling-upgrade-pulsar-broker-0                        Running     0            49m     10.104.15.15      4am-node20     
lb-op-rolling-upgrade-pulsar-proxy-0                         Running     0            49m     10.104.15.17      4am-node20     
lb-op-rolling-upgrade-pulsar-recovery-0                      Running     0            49m     10.104.20.70      4am-node22     
lb-op-rolling-upgrade-pulsar-zookeeper-0                     Running     0            49m     10.104.20.77      4am-node22     
lb-op-rolling-upgrade-pulsar-zookeeper-1                     Running     0            48m     10.104.15.30      4am-node20     
lb-op-rolling-upgrade-pulsar-zookeeper-2                     Running     0            48m     10.104.18.200     4am-node25     
lb-op-rolling-upgrade-minio-0                                Running     0            49m     10.104.15.24      4am-node20     
lb-op-rolling-upgrade-minio-1                                Running     0            49m     10.104.20.71      4am-node22     
lb-op-rolling-upgrade-minio-2                                Running     0            49m     10.104.24.115     4am-node29     
lb-op-rolling-upgrade-minio-3                                Running     0            49m     10.104.6.45       4am-node13     
 (common_func.py:407)
[2023-06-30 03:59:31,849 -  INFO - fouram]: [Base] upgrade configs: {'spec': {'components': {'enableRollingUpdate': True, 'imageUpdateMode': 'rollingUpgrade', 'image': 'harbor.milvus.io/milvus/milvus:master-20230629-b30517d3'}, 'mode': 'cluster'}, 'apiVersion': 'milvus.io/v1beta1', 'kind': 'Milvus'} (base.py:194)
[2023-06-30 03:59:32,082 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:00:02,237 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:00:32,411 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:01:02,671 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:01:32,816 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:02:02,990 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:02:33,149 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:03:03,413 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:03:33,553 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:04:03,718 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:04:33,896 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:05:04,067 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:05:34,320 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:06:04,479 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:06:34,658 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:07:04,815 -  INFO - fouram]: [wait_for_healthy] Waiting for instance:lb-op-rolling-upgrade health... (operator_client.py:188)
[2023-06-30 04:07:35,068 -  INFO - fouram]: [wait_for_healthy] Instance:lb-op-rolling-upgrade is healthy. (operator_client.py:185)
[2023-06-30 04:07:35,226 -  INFO - fouram]: [Base] Get pods after upgrade... (base.py:202)
[2023-06-30 04:07:35,867 -  INFO - fouram]: [get_pods] pod details of release(lb-op-rolling-upgrade):  (operator_client.py:301)
[2023-06-30 04:07:35,867 -  INFO - fouram]: 
NAME                                                         STATUS      RESTARTS     AGE     IP                NODE           
lb-op-rolling-upgrade-milvus-datacoord-86d9ddb668-hqq77      Running     0            7m      10.104.18.207     4am-node25     
lb-op-rolling-upgrade-milvus-datanode-5d6db89745-2zffp       Running     0            3m      10.104.15.38      4am-node20     
lb-op-rolling-upgrade-milvus-indexcoord-695b9699b4-dpvt6     Running     0            5m      10.104.15.36      4am-node20     
lb-op-rolling-upgrade-milvus-indexnode-56658bb4c7-fhghj      Running     0            3m      10.104.24.151     4am-node29     
lb-op-rolling-upgrade-milvus-proxy-6554c99887-xl7ks          Running     0            1m      10.104.18.209     4am-node25     
lb-op-rolling-upgrade-milvus-querycoord-7bddb5fcb-2lj5c      Running     0            4m      10.104.15.37      4am-node20     
lb-op-rolling-upgrade-milvus-querynode-846989fb8f-5cvzs      Running     0            3m      10.104.15.41      4am-node20     
lb-op-rolling-upgrade-milvus-querynode-846989fb8f-g5pwf      Running     0            3m      10.104.24.152     4am-node29     
lb-op-rolling-upgrade-milvus-querynode-846989fb8f-n84rs      Running     0            2m      10.104.20.121     4am-node22     
lb-op-rolling-upgrade-milvus-rootcoord-5c6b64695b-v5k2j      Running     0            8m      10.104.20.117     4am-node22     
lb-op-rolling-upgrade-etcd-0                                 Running     0            57m     10.104.15.27      4am-node20     
lb-op-rolling-upgrade-etcd-1                                 Running     0            57m     10.104.20.74      4am-node22     
lb-op-rolling-upgrade-etcd-2                                 Running     0            57m     10.104.6.49       4am-node13     
lb-op-rolling-upgrade-pulsar-bookie-0                        Running     0            57m     10.104.6.51       4am-node13     
lb-op-rolling-upgrade-pulsar-bookie-1                        Running     0            57m     10.104.20.79      4am-node22     
lb-op-rolling-upgrade-pulsar-bookie-2                        Running     0            57m     10.104.24.118     4am-node29     
lb-op-rolling-upgrade-pulsar-broker-0                        Running     0            57m     10.104.15.15      4am-node20     
lb-op-rolling-upgrade-pulsar-proxy-0                         Running     0            57m     10.104.15.17      4am-node20     
lb-op-rolling-upgrade-pulsar-recovery-0                      Running     0            57m     10.104.20.70      4am-node22     
lb-op-rolling-upgrade-pulsar-zookeeper-0                     Running     0            57m     10.104.20.77      4am-node22     
lb-op-rolling-upgrade-pulsar-zookeeper-1                     Running     0            56m     10.104.15.30      4am-node20     
lb-op-rolling-upgrade-pulsar-zookeeper-2                     Running     0            56m     10.104.18.200     4am-node25     
lb-op-rolling-upgrade-minio-0                                Running     0            57m     10.104.15.24      4am-node20     
lb-op-rolling-upgrade-minio-1                                Running     0            57m     10.104.20.71      4am-node22     
lb-op-rolling-upgrade-minio-2                                Running     0            57m     10.104.24.115     4am-node29     
lb-op-rolling-upgrade-minio-3                                Running     0            57m     10.104.6.45       4am-node13     
 (common_func.py:407)

client test result:

{'server': {'deploy_tool': 'operator',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_8c16m',
            'config': {'spec': {'components': {'queryNode': {'resources': {'limits': {'cpu': '8',
                                                                                      'memory': '8Gi'},
                                                                           'requests': {'cpu': '4',
                                                                                        'memory': '4Gi'}},
                                                             'replicas': 3},
                                               'indexNode': {'resources': {'limits': {'cpu': '4.0',
                                                                                      'memory': '8Gi'},
                                                                           'requests': {'cpu': '3.0',
                                                                                        'memory': '5Gi'}},
                                                             'replicas': 1},
                                               'dataNode': {'resources': {'limits': {'cpu': '2.0',
                                                                                     'memory': '2Gi'},
                                                                          'requests': {'cpu': '2.0',
                                                                                       'memory': '2Gi'}}},
                                               'image': 'harbor.milvus.io/milvus/milvus:master-20230628-31122a68'},
                                'mode': 'cluster',
                                'dependencies': {'etcd': {'inCluster': {'deletionPolicy': 'Delete',
                                                                        'pvcDeletion': True,
                                                                        'values': {'global': {'storageClass': 'local-path'},
                                                                                   'metrics': {'enabled': True,
                                                                                               'podMonitor': {'enabled': True}}}}},
                                                 'pulsar': {'inCluster': {'deletionPolicy': 'Delete',
                                                                          'pvcDeletion': True,
                                                                          'values': {'bookkeeper': {'volumes': {'journal': {'storageClassName': 'local-path'},
                                                                                                                'ledgers': {'storageClassName': 'local-path'}}},
                                                                                     'zookeeper': {'volumes': {'data': {'storageClassName': 'local-path'}}}}}},
                                                 'kafka': {'inCluster': {'deletionPolicy': 'Delete',
                                                                         'pvcDeletion': True,
                                                                         'values': {'persistence': {'storageClass': 'local-path'}}}},
                                                 'storage': {'inCluster': {'deletionPolicy': 'Delete',
                                                                           'pvcDeletion': True,
                                                                           'values': {'persistence': {'storageClass': 'local-path'},
                                                                                      'metrics': {'podMonitor': {'enabled': True}}}}}},
                                'config': {'log': {'level': 'debug'}}},
                       'apiVersion': 'milvus.io/v1beta1',
                       'kind': 'Milvus',
                       'metadata': {'name': 'fouram-op-16-5610'}},
            'host': 'lb-op-rolling-upgrade-milvus.qa-milvus',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '5m',
                                                    'ni_per': 50000},
                                 'load_params': {'replica_number': 3},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 100,
                                                       'during_time': '5h',
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 1000,
                                                                  'top_k': 10,
                                                                  'search_param': {'ef': 64},
                                                                  'timeout': 3600,
                                                                  'random_data': True}}]},
            'run_id': 2023063045896080,
            'datetime': '2023-06-30 03:09:49.854644',
            'client_version': '2.2'},
 'result': {'test_result': {'index': {'RT': 909.0306},
                            'insert': {'total_time': 138.2387,
                                       'VPS': 36169.3216,
                                       'batch_time': 1.3824,
                                       'batch': 50000},
                            'flush': {'RT': 2.5226},
                            'load': {'RT': 6.5599},
                            'Locust': {'Aggregated': {'Requests': 141500,
                                                      'Fails': 369,
                                                      'RPS': 7.86,
                                                      'fail_s': 0.0,
                                                      'RT_max': 19459.52,
                                                      'RT_avg': 12128.43,
                                                      'TP50': 12000.0,
                                                      'TP99': 15000.0},
                                       'search': {'Requests': 141500,
                                                  'Fails': 369,
                                                  'RPS': 7.86,
                                                  'fail_s': 0.0,
                                                  'RT_max': 19459.52,
                                                  'RT_avg': 12128.43,
                                                  'TP50': 12000.0,
                                                  'TP99': 15000.0}}}}} (performance_template.py:141)
[Screenshot: 2023-06-30 16 51 15]
[2023-06-30 04:00:15,406 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2023-06-30 04:00:15,407 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-30 04:00:15,407 -  INFO - fouram]: grpc     search                                                                         12742     0(0.00%) |  11900    1221   14330  12000 |    8.40        0.00 (stats.py:789)
[2023-06-30 04:00:15,407 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-30 04:00:15,407 -  INFO - fouram]:          Aggregated                                                                     12742     0(0.00%) |  11900    1221   14330  12000 |    8.40        0.00 (stats.py:789)
[2023-06-30 04:00:15,407 -  INFO - fouram]:  (stats.py:790)
[2023-06-30 04:00:15,407 -  INFO - fouram]: Response time percentiles (approximated) (stats.py:819)
[2023-06-30 04:00:15,407 -  INFO - fouram]: Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs (stats.py:819)
[2023-06-30 04:00:15,407 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-30 04:00:15,407 -  INFO - fouram]: grpc     search                                                                              12000  12000  12000  12000  12000  13000  13000  13000  14000  14000  14000  12742 (stats.py:819)
[2023-06-30 04:00:15,407 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-30 04:00:15,407 -  INFO - fouram]:          Aggregated                                                                          12000  12000  12000  12000  12000  13000  13000  13000  14000  14000  14000  12742 (stats.py:819)
[2023-06-30 04:00:15,407 -  INFO - fouram]:  (stats.py:820)
[2023-06-30 04:00:23,201 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)>, <Time:{'RPC start': '2023-06-30 04:00:10.232508', 'RPC error': '2023-06-30 04:00:23.200973'}> (decorators.py:108)
[2023-06-30 04:00:23,202 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (api_request.py:53)
[2023-06-30 04:00:23,202 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (func_check.py:46)
[2023-06-30 04:00:23,202 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)>, <Time:{'RPC start': '2023-06-30 04:00:10.339087', 'RPC error': '2023-06-30 04:00:23.202934'}> (decorators.py:108)
[2023-06-30 04:00:23,203 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (api_request.py:53)
[2023-06-30 04:00:23,203 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=syncTimestamp Failed:context deadline exceeded)> (func_check.py:46)
[2023-06-30 04:00:33,325 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-30 04:00:18.280082', 'RPC error': '2023-06-30 04:00:33.325800'}> (decorators.py:108)
[2023-06-30 04:00:33,326 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-30 04:00:33,326 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-30 04:00:33,326 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-30 04:00:18.386501', 'RPC error': '2023-06-30 04:00:33.326694'}> (decorators.py:108)
[2023-06-30 04:00:33,326 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-30 04:00:33,327 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-30 04:00:33,327 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-30 04:00:18.492884', 'RPC error': '2023-06-30 04:00:33.327182'}> (decorators.py:108)
[2023-06-30 04:00:33,327 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-30 04:00:33,327 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-30 04:00:33,327 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-30 04:00:18.615418', 'RPC error': '2023-06-30 04:00:33.327668'}> (decorators.py:108)
[2023-06-30 04:00:33,327 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-30 04:00:33,327 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-30 04:00:33,328 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-30 04:00:18.720946', 'RPC error': '2023-06-30 04:00:33.328131'}> (decorators.py:108)
[2023-06-30 04:00:33,328 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-30 04:01:23,818 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: err: find no available querycoord, check querycoord state
, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:121 github.com/milvus-io/milvus/internal/distributed/querycoord/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:317 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).GetShardLeaders
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:734 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetShards.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:40 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:733 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetShards
/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:187 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:406 github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:457 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask
: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-06-30 04:01:05.902341', 'RPC error': '2023-06-30 04:01:23.818730'}> (decorators.py:108)
[2023-06-30 04:01:23,818 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: err: find no available querycoord, check querycoord state
, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:121 github.com/milvus-io/milvus/internal/distributed/querycoord/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:317 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).GetShardLeaders
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:734 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetShards.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:40 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:733 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetShards
/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:187 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:406 github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:457 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask
: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-06-30 04:01:23,819 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: err: find no available querycoord, check querycoord state
, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:121 github.com/milvus-io/milvus/internal/distributed/querycoord/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:317 github.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).GetShardLeaders
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:734 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetShards.func1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:40 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/proxy/meta_cache.go:733 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetShards
/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:187 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:406 github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute
/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:457 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask
: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-06-30 04:01:37,349 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2023-06-30 04:01:37,349 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-30 04:01:37,349 -  INFO - fouram]: grpc     search                                                                         13348   369(2.76%) |  11948    1221   19459  12000 |    8.00        0.00 (stats.py:789)
[2023-06-30 04:01:37,349 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2023-06-30 04:01:37,349 -  INFO - fouram]:          Aggregated                                                                     13348   369(2.76%) |  11948    1221   19459  12000 |    8.00        0.00 (stats.py:789)
[2023-06-30 04:01:37,349 -  INFO - fouram]:  (stats.py:790)
[2023-06-30 04:01:37,349 -  INFO - fouram]: Response time percentiles (approximated) (stats.py:819)
[2023-06-30 04:01:37,349 -  INFO - fouram]: Type     Name                                                                                  50%    66%    75%    80%    90%    95%    98%    99%  99.9% 99.99%   100% # reqs (stats.py:819)
[2023-06-30 04:01:37,349 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-30 04:01:37,349 -  INFO - fouram]: grpc     search                                                                              12000  12000  12000  12000  13000  13000  13000  15000  18000  19000  19000  13348 (stats.py:819)
[2023-06-30 04:01:37,350 -  INFO - fouram]: --------|--------------------------------------------------------------------------------|--------|------|------|------|------|------|------|------|------|------|------|------ (stats.py:819)
[2023-06-30 04:01:37,350 -  INFO - fouram]:          Aggregated                                                                          12000  12000  12000  12000  13000  13000  13000  15000  18000  19000  19000  13348 (stats.py:819)
[2023-06-30 04:01:37,350 -  INFO - fouram]:  (stats.py:820)
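
To gauge how long searches were failing, the `RPC start` / `RPC error` timestamps in the error lines above can be reduced to a single window. The snippet below is a minimal sketch of that arithmetic. The helper name `failure_window` and the regular expression are illustrative assumptions, and the sample uses only the first and last error lines shown in this excerpt, so it covers the excerpt rather than the full client log.

```python
import re
from datetime import datetime

# Matches the "<Time:{'RPC start': '...', 'RPC error': '...'}>" suffix
# that appears in the error lines of the excerpt above.
PATTERN = re.compile(r"'RPC start': '([^']+)', 'RPC error': '([^']+)'")
FMT = "%Y-%m-%d %H:%M:%S.%f"


def failure_window(lines):
    """Return (earliest RPC start, latest RPC error, seconds between them)."""
    starts, ends = [], []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            starts.append(datetime.strptime(match.group(1), FMT))
            ends.append(datetime.strptime(match.group(2), FMT))
    if not starts:
        return None
    first, last = min(starts), max(ends)
    return first, last, (last - first).total_seconds()


# First and last error lines from the excerpt (timestamps copied verbatim).
sample = [
    "<Time:{'RPC start': '2023-06-30 04:00:10.232508', 'RPC error': '2023-06-30 04:00:23.200973'}>",
    "<Time:{'RPC start': '2023-06-30 04:01:05.902341', 'RPC error': '2023-06-30 04:01:23.818730'}>",
]
first, last, seconds = failure_window(sample)
print(first, last, round(seconds, 1))
```

For the two lines above this gives a window of roughly 73.6 seconds, which sits inside the 04:00 to 04:07 period during which the operator was still waiting for the instance to become healthy.
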
bigsheeper commented 1 year ago

@weiliu1031

The "not fully loaded" issue should now be resolved.

weiliu1031 commented 1 year ago

There are two issues here:

  1. The graceful stop policy on the master branch was broken, so shard leaders became unavailable during the rolling upgrade. This should be fixed by #25226.

  2. During a rolling upgrade, if there is no standby coordinator, qc/dc/rc will be unavailable for a short period, which may affect search/query. This is expected.
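
For the second point, a client that needs to ride through the short coordinator outage could retry the affected requests. The following is a minimal sketch under stated assumptions, not part of Milvus or pymilvus: `call_with_retry`, `is_transient` and `TRANSIENT_MARKERS` are hypothetical names, and the marker strings are copied from the error messages reported in this issue rather than from any official error taxonomy.

```python
import time

# Substrings taken from the errors reported in this issue. Treating them as
# "transient" is an assumption of this sketch, not Milvus-defined behaviour.
TRANSIENT_MARKERS = (
    "service not ready",
    "find no available querycoord",
    "context deadline exceeded",
)


def is_transient(exc: Exception) -> bool:
    """Return True if the error text matches a known rolling-upgrade symptom."""
    text = str(exc)
    return any(marker in text for marker in TRANSIENT_MARKERS)


def call_with_retry(fn, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff.

    Non-transient errors, and the error from the final attempt, are re-raised.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            # Delay doubles each time: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

A benchmark client could wrap its request as `call_with_retry(lambda: do_search())`, where `do_search` stands in for whatever issues the search. Retrying only hides the gap from the caller; the added latency still shows up in the response-time statistics.
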

weiliu1031 commented 1 year ago

/assign @wangting0128

weiliu1031 commented 1 year ago

Please verify this.

elstic commented 1 year ago

deployment mode: operator

argo task: fouramf-5wf5t-glzhh

rollingupgrade argo task: fouramf-vtbtm

image: master-20230630-bc403dbd -> master-20230706-2ae6def3

server:

fouramf-5wf5t-glzhh-op-67-5375-milvus-datacoord-569f474fb-94s8f     Running     0            3h2m      10.104.4.172      4am-node11     
fouramf-5wf5t-glzhh-op-67-5375-milvus-datanode-84cdd75b54-svlcl     Running     0            2h59m     10.104.22.83      4am-node26     
fouramf-5wf5t-glzhh-op-67-5375-milvus-indexcoord-567f6d6556qvtk     Running     0            3h1m      10.104.4.173      4am-node11     
fouramf-5wf5t-glzhh-op-67-5375-milvus-indexnode-db876fd7-mnfcb      Running     0            2h59m     10.104.4.176      4am-node11     
fouramf-5wf5t-glzhh-op-67-5375-milvus-proxy-9c79d6c87-4bxpp         Running     0            2h58m     10.104.21.118     4am-node24     
fouramf-5wf5t-glzhh-op-67-5375-milvus-querycoord-d5994d986vqtmf     Running     0            3h        10.104.4.174      4am-node11     
fouramf-5wf5t-glzhh-op-67-5375-milvus-querynode-5c59ffcfb97vmhs     Running     0            2h59m     10.104.6.168      4am-node13     
fouramf-5wf5t-glzhh-op-67-5375-milvus-querynode-5c59ffcfb9zz74l     Running     0            2h59m     10.104.4.175      4am-node11     
fouramf-5wf5t-glzhh-op-67-5375-milvus-rootcoord-6ffcc5f44-vs6cf     Running     0            3h3m      10.104.22.82      4am-node26     
fouramf-5wf5t-glzhh-op-67-5375-etcd-0                               Running     0            6h50m     10.104.17.183     4am-node23     
fouramf-5wf5t-glzhh-op-67-5375-etcd-1                               Running     0            6h50m     10.104.21.29      4am-node24     
fouramf-5wf5t-glzhh-op-67-5375-etcd-2                               Running     0            6h50m     10.104.6.128      4am-node13     
fouramf-5wf5t-glzhh-op-67-5375-kafka-0                              Running     2            6h50m     10.104.17.188     4am-node23     
fouramf-5wf5t-glzhh-op-67-5375-kafka-1                              Running     2            6h50m     10.104.21.34      4am-node24     
fouramf-5wf5t-glzhh-op-67-5375-kafka-2                              Running     2            6h50m     10.104.6.136      4am-node13     
fouramf-5wf5t-glzhh-op-67-5375-kafka-zookeeper-0                    Running     0            6h50m     10.104.17.187     4am-node23     
fouramf-5wf5t-glzhh-op-67-5375-kafka-zookeeper-1                    Running     0            6h50m     10.104.21.35      4am-node24     
fouramf-5wf5t-glzhh-op-67-5375-kafka-zookeeper-2                    Running     0            6h50m     10.104.6.137      4am-node13     
fouramf-5wf5t-glzhh-op-67-5375-minio-0                              Running     0            6h50m     10.104.17.184     4am-node23     
fouramf-5wf5t-glzhh-op-67-5375-minio-1                              Running     0            6h50m     10.104.21.31      4am-node24     
fouramf-5wf5t-glzhh-op-67-5375-minio-2                              Running     0            6h50m     10.104.6.129      4am-node13     
fouramf-5wf5t-glzhh-op-67-5375-minio-3                              Running     0            6h50m     10.104.22.19      4am-node26   

client test result:

{'server': {'deploy_tool': 'operator',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_2c2m',
            'config': {'spec': {'components': {'queryNode': {'resources': {'limits': {'cpu': '4.0',
                                                                                      'memory': '64Gi'},
                                                                           'requests': {'cpu': '3.0',
                                                                                        'memory': '33Gi'}},
                                                             'replicas': 2},
                                               'indexNode': {'resources': {'limits': {'cpu': '8.0',
                                                                                      'memory': '16Gi'},
                                                                           'requests': {'cpu': '5.0',
                                                                                        'memory': '9Gi'}},
                                                             'replicas': 1},
                                               'dataNode': {'resources': {'limits': {'cpu': '2.0',
                                                                                     'memory': '4Gi'},
                                                                          'requests': {'cpu': '2.0',
                                                                                       'memory': '3Gi'}},
                                                            'replicas': 1},
                                               'image': 'harbor.milvus.io/milvus/milvus:master-20230630-bc403dbd'},
                                'mode': 'cluster',
                                'dependencies': {'etcd': {'inCluster': {'deletionPolicy': 'Delete',
                                                                        'pvcDeletion': True,
                                                                        'values': {'global': {'storageClass': 'local-path'},
                                                                                   'metrics': {'enabled': True,
                                                                                               'podMonitor': {'enabled': True}}}}},
                                                 'pulsar': {'inCluster': {'deletionPolicy': 'Delete',
                                                                          'pvcDeletion': True,
                                                                          'values': {'bookkeeper': {'volumes': {'journal': {'storageClassName': 'local-path'},
                                                                                                                'ledgers': {'storageClassName': 'local-path'}}},
                                                                                     'zookeeper': {'volumes': {'data': {'storageClassName': 'local-path'}}}}}},
                                                 'kafka': {'inCluster': {'deletionPolicy': 'Delete',
                                                                         'pvcDeletion': True,
                                                                         'values': {'persistence': {'storageClass': 'local-path'}}}},
                                                 'storage': {'inCluster': {'deletionPolicy': 'Delete',
                                                                           'pvcDeletion': True,
                                                                           'values': {'persistence': {'storageClass': 'local-path'},
                                                                                      'metrics': {'podMonitor': {'enabled': True}}}}},
                                                 'msgStreamType': 'kafka'},
                                'config': {'log': {'level': 'debug'}}},
                       'apiVersion': 'milvus.io/v1beta1',
                       'kind': 'Milvus',
                       'metadata': {'name': 'fouram-op-79-6657'}},
            'host': 'fouramf-5wf5t-glzhh-op-67-5375-milvus.qa-milvus',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_100m_hnsw_ddl_dql_filter_kafka_cluster',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 100000000,
                                                    'ni_per': 50000},
                                 'collection_params': {'other_fields': ['float_1'],
                                                       'shards_num': 2},
                                 'load_params': {},
                                 'query_params': {},
                                 'search_params': {},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False,
                                                          'reset_db': False},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 20,
                                                       'during_time': '4h',
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 20,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'search_param': {'ef': 16},
                                                                  'expr': {'float_1': {'GT': -1.0,
                                                                                       'LT': 50000000.0}},
                                                                  'guarantee_timestamp': None,
                                                                  'output_fields': None,
                                                                  'ignore_growing': False,
                                                                  'timeout': 60,
                                                                  'random_data': True}},
                                                      {'type': 'query',
                                                       'weight': 10,
                                                       'params': {'ids': [0,
                                                                          1,
                                                                          2,
                                                                          3,
                                                                          4,
                                                                          5,
                                                                          6,
                                                                          7,
                                                                          8,
                                                                          9],
                                                                  'expr': None,
                                                                  'output_fields': None,
                                                                  'ignore_growing': False,
                                                                  'timeout': 60}},
                                                      {'type': 'load',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1,
                                                                  'timeout': 30}},
                                                      {'type': 'scene_test',
                                                       'weight': 2,
                                                       'params': {'dim': 128,
                                                                  'data_size': 3000,
                                                                  'nb': 3000,
                                                                  'index_type': 'IVF_SQ8',
                                                                  'index_param': {'nlist': 2048},
                                                                  'metric_type': 'L2'}}]},
            'run_id': 2023070697145337,
            'datetime': '2023-07-06 02:15:14.781465',
            'client_version': '2.2'},
 'result': {'test_result': {'index': {'RT': 5563.4137},
                            'insert': {'total_time': 3152.0947,
                                       'VPS': 31724.9352,
                                       'batch_time': 1.576,
                                       'batch': 50000},
                            'flush': {'RT': 2.5207},
                            'load': {'RT': 127.7297},
                            'Locust': {'Aggregated': {'Requests': 69711,
                                                      'Fails': 1344,
                                                      'RPS': 4.84,
                                                      'fail_s': 0.02,
                                                      'RT_max': 125165.16,
                                                      'RT_avg': 4119.79,
                                                      'TP50': 23,
                                                      'TP99': 66000.0},
                                       'load': {'Requests': 2148,
                                                'Fails': 32,
                                                'RPS': 0.15,
                                                'fail_s': 0.01,
                                                'RT_max': 30308.24,
                                                'RT_avg': 121.58,
                                                'TP50': 6,
                                                'TP99': 400.0},
                                       'query': {'Requests': 20883,
                                                 'Fails': 377,
                                                 'RPS': 1.45,
                                                 'fail_s': 0.02,
                                                 'RT_max': 20433.55,
                                                 'RT_avg': 55.49,
                                                 'TP50': 6,
                                                 'TP99': 99},
                                       'scene_test': {'Requests': 4354,
                                                      'Fails': 46,
                                                      'RPS': 0.3,
                                                      'fail_s': 0.01,
                                                      'RT_max': 125165.16,
                                                      'RT_avg': 64807.71,
                                                      'TP50': 65000.0,
                                                      'TP99': 72000.0},
                                       'search': {'Requests': 42326,
                                                  'Fails': 889,
                                                  'RPS': 2.94,
                                                  'fail_s': 0.02,
                                                  'RT_max': 15885.63,
                                                  'RT_avg': 85.1,
                                                  'TP50': 30,
                                                  'TP99': 170.0}}}}}
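
The failure ratios implied by the Locust section can be checked with a few lines. This is only a sketch over the Requests/Fails numbers copied from the result above; `fail_ratio` is a hypothetical helper name.

```python
# Requests / Fails copied from the 'Locust' section of the test result above.
locust = {
    "load": {"Requests": 2148, "Fails": 32},
    "query": {"Requests": 20883, "Fails": 377},
    "scene_test": {"Requests": 4354, "Fails": 46},
    "search": {"Requests": 42326, "Fails": 889},
}
aggregated = {"Requests": 69711, "Fails": 1344}


def fail_ratio(stats):
    """Fraction of requests that failed."""
    return stats["Fails"] / stats["Requests"]


# The per-task rows should add up to the Aggregated row.
total_requests = sum(s["Requests"] for s in locust.values())
total_fails = sum(s["Fails"] for s in locust.values())

for name, stats in locust.items():
    print(f"{name:<12}{fail_ratio(stats):.2%}")
print(f"{'Aggregated':<12}{fail_ratio(aggregated):.2%}")
```

The per-task counts sum to the Aggregated row (69711 requests, 1344 fails), and about 2.1% of search requests failed (889 of 42326).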

search fails: [image]

client error log:

[2023-07-06 06:02:10,377 -  INFO - fouram]:          Aggregated                                                                             31     34     37     40     50  65000  65000  65000  76000 111000 118000  16842 (stats.py:819)
[2023-07-06 06:02:10,377 -  INFO - fouram]:  (stats.py:820)
[2023-07-06 06:02:19,864 - ERROR - fouram]: RPC error: [batch_insert], <MilvusException: (code=1, message=code: NotReadyServe, reason: stage=Abnormal: service not ready)>, <Time:{'RPC start': '2023-07-06 06:02:19.799164', 'RPC error': '2023-07-06 06:02:19.864387'}> (decorators.py:108)
[2023-07-06 06:02:19,866 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=code: NotReadyServe, reason: stage=Abnormal: service not ready)> (api_request.py:53)
[2023-07-06 06:02:19,866 - ERROR - fouram]: [CheckFunc] insert request check failed, response:<MilvusException: (code=1, message=code: NotReadyServe, reason: stage=Abnormal: service not ready)> (func_check.py:52)
[2023-07-06 06:02:19,867 - ERROR - fouram]: [func_time_catch] :  (api_request.py:120)
[2023-07-06 06:02:30,383 -  INFO - fouram]:  (stats.py:820)
[2023-07-06 06:02:31,879 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to Search, QueryNode ID=5, reason=err: failed to connect 10.104.21.42:21123, reason: context deadline exceeded
, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:101 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:219 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:123 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:249 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).Search.func2
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:434 github.com/milvus-io/milvus/internal/querynodev2/delegator.executeSubTasks[...].func1
/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit: channel=fouramf-5wf5t-glzhh-op-67-5375-rootcoord-dml_1_442658949782831325v1: fail to access shard delegator: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-07-06 06:02:19.868698', 'RPC error': '2023-07-06 06:02:31.879642'}> (decorators.py:108)
[2023-07-06 06:02:31,880 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to Search, QueryNode ID=5, reason=err: failed to connect 10.104.21.42:21123, reason: context deadline exceeded
, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:101 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:219 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:123 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:249 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).Search.func2
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:434 github.com/milvus-io/milvus/internal/querynodev2/delegator.executeSubTasks[...].func1
/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit: channel=fouramf-5wf5t-glzhh-op-67-5375-rootcoord-dml_1_442658949782831325v1: fail to access shard delegator: fail to search on all shard leaders)> (api_request.py:53)
[2023-07-06 06:02:31,880 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to Search, QueryNode ID=5, reason=err: failed to connect 10.104.21.42:21123, reason: context deadline exceeded
, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:101 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:219 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:123 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).SearchSegments
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:249 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).Search.func2
/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator.go:434 github.com/milvus-io/milvus/internal/querynodev2/delegator.executeSubTasks[...].func1
/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit: channel=fouramf-5wf5t-glzhh-op-67-5375-rootcoord-dml_1_442658949782831325v1: fail to access shard delegator: fail to search on all shard leaders)> (func_check.py:46)
[2023-07-06 06:02:37,344 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-07-06 06:02:31.882265', 'RPC error': '2023-07-06 06:02:37.344561'}> (decorators.py:108)
[2023-07-06 06:02:37,345 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (api_request.py:53)
[2023-07-06 06:02:37,345 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to search on all shard leaders)> (func_check.py:46)
[2023-07-06 06:02:37,349 - ERROR - fouram]: RPC error: [query], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: stage=Initializing: service not ready: unrecoverable error: fail to query on all shard leaders)>, <Time:{'RPC start': '2023-07-06 06:02:37.345428', 'RPC error': '2023-07-06 06:02:37.349563'}> (decorators.py:108)
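
To see which RPCs failed and over what time span, the `RPC error` lines in a log like the one above can be tallied. This is a rough sketch: the regular expression is inferred from the lines shown here, `summarize_rpc_errors` is a hypothetical helper, and the sample messages are abbreviated.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the client log prefix and the RPC name, e.g.
# "[2023-07-06 06:02:19,864 - ERROR - fouram]: RPC error: [batch_insert], ..."
RPC_ERROR = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} - ERROR - fouram\]: "
    r"RPC error: \[(?P<rpc>\w+)\]"
)


def summarize_rpc_errors(lines):
    """Count 'RPC error' lines per RPC name and return the first/last timestamp."""
    counts = Counter()
    first = last = None
    for line in lines:
        match = RPC_ERROR.match(line)
        if match is None:
            continue  # stack-trace continuation lines and other log entries
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        counts[match["rpc"]] += 1
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
    return counts, first, last


# Sample lines with the exception text shortened for readability.
sample = [
    "[2023-07-06 06:02:19,864 - ERROR - fouram]: RPC error: [batch_insert], <MilvusException>",
    "[2023-07-06 06:02:31,879 - ERROR - fouram]: RPC error: [search], <MilvusException>",
    "[2023-07-06 06:02:37,344 - ERROR - fouram]: RPC error: [search], <MilvusException>",
    "[2023-07-06 06:02:37,349 - ERROR - fouram]: RPC error: [query], <MilvusException>",
]
counts, first, last = summarize_rpc_errors(sample)
print(dict(counts), first, last)
```

Only lines that begin with the log prefix are counted, so the multi-line Go stack traces embedded in each exception are skipped.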
weiliu1031 commented 1 year ago

/assign @aoiasd

weiliu1031 commented 1 year ago

/assign @elstic please verify this

weiliu1031 commented 1 year ago

/assign @elstic

elstic commented 1 year ago

The issue still exists.

deployment mode: operator

argo task: fouramf-2hrlt

rollingupgrade argo task: fouramf-6427s

image: master-20230721-32827f53 -> master-20230726-2b9ec565

search concurrency: 20

server:

fouramf-2hrlt-op-52-9084-milvus-datacoord-66895b767-d4ktr       Running     0            39.96429s      10.104.24.150     4am-node29     
fouramf-2hrlt-op-52-9084-milvus-datanode-85b68997d7-lfvrf       Running     0            39.964343s     10.104.23.75      4am-node27     
fouramf-2hrlt-op-52-9084-milvus-indexcoord-6567988f55-8mg9f     Running     0            39.964362s     10.104.24.151     4am-node29     
fouramf-2hrlt-op-52-9084-milvus-indexnode-bffff4744-pwlng       Running     0            39.964378s     10.104.24.153     4am-node29     
fouramf-2hrlt-op-52-9084-milvus-proxy-59b9f655bb-glwg9          Running     0            39.964393s     10.104.23.74      4am-node27     
fouramf-2hrlt-op-52-9084-milvus-querycoord-5f69d6b59-jv4pd      Running     0            39.964408s     10.104.23.73      4am-node27     
fouramf-2hrlt-op-52-9084-milvus-querynode-5957f78754-ddxrj      Running     0            39.964422s     10.104.23.76      4am-node27     
fouramf-2hrlt-op-52-9084-milvus-querynode-5957f78754-jcv7v      Running     0            39.964436s     10.104.20.186     4am-node22     
fouramf-2hrlt-op-52-9084-milvus-rootcoord-6fb5bd55cf-twgdr      Running     0            39.96445s      10.104.23.77      4am-node27     
fouramf-2hrlt-op-52-9084-etcd-0                                 Running     0            3m             10.104.24.139     4am-node29     
fouramf-2hrlt-op-52-9084-etcd-1                                 Running     0            3m             10.104.23.65      4am-node27     
fouramf-2hrlt-op-52-9084-etcd-2                                 Running     0            3m             10.104.16.192     4am-node21     
fouramf-2hrlt-op-52-9084-kafka-0                                Running     2            3m             10.104.24.144     4am-node29     
fouramf-2hrlt-op-52-9084-kafka-1                                Running     1            3m             10.104.23.70      4am-node27     
fouramf-2hrlt-op-52-9084-kafka-2                                Running     1            3m             10.104.20.180     4am-node22     
fouramf-2hrlt-op-52-9084-kafka-zookeeper-0                      Running     0            3m             10.104.23.69      4am-node27     
fouramf-2hrlt-op-52-9084-kafka-zookeeper-1                      Running     0            3m             10.104.24.146     4am-node29     
fouramf-2hrlt-op-52-9084-kafka-zookeeper-2                      Running     0            3m             10.104.20.181     4am-node22     
fouramf-2hrlt-op-52-9084-minio-0                                Running     0            3m             10.104.23.60      4am-node27     
fouramf-2hrlt-op-52-9084-minio-1                                Running     0            3m             10.104.24.138     4am-node29     
fouramf-2hrlt-op-52-9084-minio-2                                Running     0            3m             10.104.20.176     4am-node22     
fouramf-2hrlt-op-52-9084-minio-3                                Running     0            3m             10.104.16.191     4am-node21  

Search failed from 2023-07-26 03:41:05 to 03:41:46, nearly a minute of search failure. The total number of failures is 571.
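
For reference, the length of the failure window and the failure rate inside it follow directly from the two timestamps and the failure count reported above. A small sketch of the arithmetic:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Window boundaries and failure count as reported in this comment.
start = datetime.strptime("2023-07-26 03:41:05", FMT)
end = datetime.strptime("2023-07-26 03:41:46", FMT)
failures = 571

window_s = (end - start).total_seconds()
fails_per_s = failures / window_s
print(f"window: {window_s:.0f}s, failures/s: {fails_per_s:.1f}")
```

That is a 41-second window with roughly 14 failed requests per second.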

client error log: [image]

elstic commented 1 year ago

Deployed with the operator, with enableActiveStandby set to true for each coordinator at deployment time:

    rootCoord:
      enableActiveStandby: true
    dataCoord:
      enableActiveStandby: true
    queryCoord:
      enableActiveStandby: true
    indexCoord:
      enableActiveStandby: true
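
When the deployment spec is built as a Python dict, as in the test config dumped earlier in this thread, the same settings can be merged in programmatically. This is a sketch under assumptions: `deep_merge` is a hypothetical helper, and placing the keys under `spec.config` (beside the `log` settings) is a guess based on that dump, not something confirmed here.

```python
import copy

# The four coordinator settings from the comment above.
ACTIVE_STANDBY = {
    name: {"enableActiveStandby": True}
    for name in ("rootCoord", "dataCoord", "queryCoord", "indexCoord")
}


def deep_merge(base, extra):
    """Return a copy of base with extra merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed-down custom resource using only fields that appear in this issue.
milvus_cr = {
    "apiVersion": "milvus.io/v1beta1",
    "kind": "Milvus",
    "spec": {"mode": "cluster", "config": {"log": {"level": "debug"}}},
}

# Assumption: the settings belong under spec.config, next to the log level.
patched = deep_merge(milvus_cr, {"spec": {"config": ACTIVE_STANDBY}})
print(sorted(patched["spec"]["config"]))
```

The merge returns a new dict and leaves the original spec untouched, so the same base config can be reused for runs with and without active standby.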

argo task: fouramf-b5p7h

rollingupgrade argo task: fouramf-tpztc

image: master-20230721-32827f53 -> master-20230727-b986e3af

search concurrency: 20

It takes 24 seconds before search works properly again.

server:

fouramf-b5p7h-op-36-1412-etcd-0                                   1/1     Running            0               3h17m   10.104.16.212   4am-node21   <none>           <none>
fouramf-b5p7h-op-36-1412-etcd-1                                   1/1     Running            0               3h17m   10.104.15.65    4am-node20   <none>           <none>
fouramf-b5p7h-op-36-1412-etcd-2                                   1/1     Running            0               3h17m   10.104.18.126   4am-node25   <none>           <none>
fouramf-b5p7h-op-36-1412-kafka-0                                  1/1     Running            2 (3h16m ago)   3h17m   10.104.16.216   4am-node21   <none>           <none>
fouramf-b5p7h-op-36-1412-kafka-1                                  1/1     Running            2 (3h16m ago)   3h17m   10.104.18.131   4am-node25   <none>           <none>
fouramf-b5p7h-op-36-1412-kafka-2                                  1/1     Running            1 (3h16m ago)   3h17m   10.104.15.75    4am-node20   <none>           <none>
fouramf-b5p7h-op-36-1412-kafka-zookeeper-0                        1/1     Running            0               3h17m   10.104.16.217   4am-node21   <none>           <none>
fouramf-b5p7h-op-36-1412-kafka-zookeeper-1                        1/1     Running            0               3h17m   10.104.18.132   4am-node25   <none>           <none>
fouramf-b5p7h-op-36-1412-kafka-zookeeper-2                        1/1     Running            0               3h17m   10.104.15.79    4am-node20   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-datacoord-74cbdfd686-wdwv8        1/1     Running            0               146m    10.104.6.153    4am-node13   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-datanode-666c49f9d-tdcsk          1/1     Running            0               143m    10.104.18.192   4am-node25   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-indexcoord-768b8fc474-fhrq7       1/1     Running            0               145m    10.104.6.156    4am-node13   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-indexnode-674bc96f7-xgkvl         1/1     Running            0               143m    10.104.6.158    4am-node13   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-proxy-694fcfc67c-wkwpz            1/1     Running            1 (42m ago)     142m    10.104.19.163   4am-node28   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-querycoord-667cf8cb58-pz7w2       1/1     Running            0               144m    10.104.6.157    4am-node13   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-querynode-5fbf54d7ff-7x8xs        1/1     Running            0               143m    10.104.23.65    4am-node27   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-querynode-5fbf54d7ff-qcspb        1/1     Running            0               143m    10.104.4.228    4am-node11   <none>           <none>
fouramf-b5p7h-op-36-1412-milvus-rootcoord-6d9cc89f4b-mkfs8        1/1     Running            0               147m    10.104.6.151    4am-node13   <none>           <none>
fouramf-b5p7h-op-36-1412-minio-0                                  1/1     Running            0               3h17m   10.104.16.213   4am-node21   <none>           <none>
fouramf-b5p7h-op-36-1412-minio-1                                  1/1     Running            0               3h17m   10.104.18.128   4am-node25   <none>           <none>
fouramf-b5p7h-op-36-1412-minio-2                                  1/1     Running            0               3h17m   10.104.15.68    4am-node20   <none>           <none>
fouramf-b5p7h-op-36-1412-minio-3                                  1/1     Running            0               3h17m   10.104.5.63     4am-node12   <none>           <none>

client error log: [image]

weiliu1031 commented 1 year ago

datacoord is unavailable during the rolling upgrade; see https://github.com/milvus-io/milvus/issues/25648#issuecomment-1664936274

yanliang567 commented 1 year ago

Closing for now; this will be tracked in #25648.