milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][standalone] Milvus standalone restart causes requests to fail under dql and dml scene #26978

Status: Closed (wangting0128 closed this issue 11 months ago)

wangting0128 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230910-2101f2d2
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): rocksmq
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.0.dev81
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-stable-test-wt-1694376000
client pod: fouramf-stable-test-wt-1694376000-902656247

server:

[2023-09-11 01:07:50,581 -  INFO - fouram]: [Base] Deploy initial state: 
I0910 20:06:55.950271     381 request.go:665] Waited for 1.163769399s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/autoscaling/v1?timeout=32s
NAME                                                              READY   STATUS              RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-stable-76000-6-91-8774-etcd-0                             1/1     Running             0               4m43s   10.104.24.158   4am-node29   <none>           <none>
fouramf-stable-76000-6-91-8774-milvus-standalone-7f9cd9f9744w77   1/1     Running             0               4m44s   10.104.4.172    4am-node11   <none>           <none>
fouramf-stable-76000-6-91-8774-minio-869bd78499-qzjsc             1/1     Running             0               4m44s   10.104.9.9      4am-node14   <none>           <none> (base.py:221)
[2023-09-11 01:07:50,581 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|fouramf-stable-76000-6-91-8774-milvus|fouramf-stable-76000-6-91-8774-minio|fouramf-stable-76000-6-91-8774-etcd|fouramf-stable-76000-6-91-8774-pulsar|fouramf-stable-76000-6-91-8774-kafka'  (util_cmd.py:14)
[2023-09-11 01:07:59,929 -  INFO - fouram]: [CliClient] pod details of release(fouramf-stable-76000-6-91-8774): 
 I0911 01:07:51.858060     498 request.go:665] Waited for 1.119707962s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/milvus.io/v1alpha1?timeout=32s
NAME                                                              READY   STATUS        RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-stable-76000-6-91-8774-etcd-0                             1/1     Running       0               5h5m    10.104.24.158   4am-node29   <none>           <none>
fouramf-stable-76000-6-91-8774-milvus-standalone-7f9cd9f9744w77   1/1     Running       1 (68m ago)     5h5m    10.104.4.172    4am-node11   <none>           <none>
fouramf-stable-76000-6-91-8774-minio-869bd78499-qzjsc             1/1     Running       0               5h5m    10.104.9.9      4am-node14   <none>           <none>

client logs: Request failure time: 2023-09-10 23:59:35,783 ~ 2023-09-11 00:01:19,460
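
For reference, the length of the failure window and its relation to the pod restart can be worked out from the timestamps above. This is only a rough sketch: it assumes the client log and the `kubectl get pods` output use the same clock, and that the "68m ago" age printed by kubectl is rounded, so the derived restart time is approximate.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Failure window reported in the client logs.
fail_start = datetime.strptime("2023-09-10 23:59:35,783", FMT)
fail_end = datetime.strptime("2023-09-11 00:01:19,460", FMT)
window_s = (fail_end - fail_start).total_seconds()

# The second `kubectl get pods` call ran at about 01:07:51 and showed
# RESTARTS "1 (68m ago)". Assumption: the age is rounded to whole minutes,
# so this only locates the restart to within roughly a minute.
pods_checked = datetime.strptime("2023-09-11 01:07:51,858", FMT)
approx_restart = pods_checked - timedelta(minutes=68)

print(f"failure window: {window_s:.3f} s")
print(f"approximate restart time: {approx_restart:%Y-%m-%d %H:%M:%S}")
```

Under those assumptions the failures span a bit under two minutes, and the estimated restart time falls close to the start of that window.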

Screenshot 2023-09-11 11 37 46, Screenshot 2023-09-11 11 37 59

test result:

{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_8c16m',
            'config': {'standalone': {'resources': {'limits': {'cpu': '8.0',
                                                               'memory': '16Gi'},
                                                    'requests': {'cpu': '5.0',
                                                                 'memory': '9Gi'}},
                                      'persistence': {'persistentVolumeClaim': {'storageClass': 'local-path'}}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1,
                                'global': {'storageClass': 'local-path'},
                                'metrics': {'enabled': True,
                                            'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone',
                                 'persistence': {'storageClass': 'local-path'},
                                 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/dockerhub/milvusdb/milvus',
                                         'tag': 'master-20230910-2101f2d2'}}},
            'host': 'fouramf-stable-76000-6-91-8774-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_hnsw_dql_filter_insert_standalone',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 100000,
                                                    'ni_per': 50000},
                                 'collection_params': {'other_fields': ['float_1'],
                                                       'shards_num': 2},
                                 'load_params': {},
                                 'query_params': {},
                                 'search_params': {},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False,
                                                          'reset_db': False},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 20,
                                                       'during_time': '5h',
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 30,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'search_param': {'ef': 16},
                                                                  'expr': {'float_1': {'GT': -1.0,
                                                                                       'LT': 50000.0}},
                                                                  'guarantee_timestamp': None,
                                                                  'output_fields': None,
                                                                  'ignore_growing': False,
                                                                  'timeout': 60,
                                                                  'random_data': True}},
                                                      {'type': 'query',
                                                       'weight': 10,
                                                       'params': {'ids': [0,
                                                                          1,
                                                                          2,
                                                                          3,
                                                                          4,
                                                                          5,
                                                                          6,
                                                                          7,
                                                                          8,
                                                                          9],
                                                                  'expr': None,
                                                                  'output_fields': None,
                                                                  'ignore_growing': False,
                                                                  'timeout': 60}},
                                                      {'type': 'flush',
                                                       'weight': 5,
                                                       'params': {'timeout': 30}},
                                                      {'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 1,
                                                                  'timeout': 30,
                                                                  'random_id': True,
                                                                  'random_vector': True,
                                                                  'varchar_filled': False}}]},
            'run_id': 20230910240,
            'datetime': '2023-09-10 20:02:16.420375',
            'client_version': 'master'},
 'result': {'test_result': {'index': {'RT': 6.2128},
                            'insert': {'total_time': 2.7775,
                                       'VPS': 36003.6004,
                                       'batch_time': 1.3887,
                                       'batch': 50000},
                            'flush': {'RT': 3.0191},
                            'load': {'RT': 3.5279},
                            'Locust': {'Aggregated': {'Requests': 973875,
                                                      'Fails': 17,
                                                      'RPS': 54.1,
                                                      'fail_s': 0.0,
                                                      'RT_max': 154885.55,
                                                      'RT_avg': 367.74,
                                                      'TP50': 94,
                                                      'TP99': 3300.0},
                                       'flush': {'Requests': 105807,
                                                 'Fails': 8,
                                                 'RPS': 5.88,
                                                 'fail_s': 0.0,
                                                 'RT_max': 154885.55,
                                                 'RT_avg': 2516.18,
                                                 'TP50': 2600.0,
                                                 'TP99': 7400.0},
                                       'insert': {'Requests': 21085,
                                                  'Fails': 1,
                                                  'RPS': 1.17,
                                                  'fail_s': 0.0,
                                                  'RT_max': 33517.32,
                                                  'RT_avg': 137.27,
                                                  'TP50': 26,
                                                  'TP99': 2200.0},
                                       'query': {'Requests': 211022,
                                                 'Fails': 2,
                                                 'RPS': 11.72,
                                                 'fail_s': 0.0,
                                                 'RT_max': 93503.78,
                                                 'RT_avg': 143.11,
                                                 'TP50': 120.0,
                                                 'TP99': 510.0},
                                       'search': {'Requests': 635961,
                                                  'Fails': 6,
                                                  'RPS': 35.33,
                                                  'fail_s': 0.0,
                                                  'RT_max': 93194.44,
                                                  'RT_avg': 92.47,
                                                  'TP50': 82,
                                                  'TP99': 280.0}}}}}
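
As a quick sanity check on the numbers above, the sketch below compares the configured task weights with the observed request mix and computes the overall failure ratio. It assumes the `weight` values act as relative selection weights (as in Locust task weighting); the dictionaries are copied by hand from the result dump, not read from any fouram API.

```python
# Task weights from 'concurrent_tasks' and counts from the Locust results above.
weights = {"search": 30, "query": 10, "flush": 5, "insert": 1}
requests = {"search": 635961, "query": 211022, "flush": 105807, "insert": 21085}
fails = {"search": 6, "query": 2, "flush": 8, "insert": 1}

total_weight = sum(weights.values())
total_requests = sum(requests.values())
total_fails = sum(fails.values())

for task in weights:
    # Share of requests the weights would predict vs. what was measured.
    expected = weights[task] / total_weight
    observed = requests[task] / total_requests
    print(f"{task:7s} expected {expected:6.2%}  observed {observed:6.2%}  fails {fails[task]}")

print(f"overall failure ratio: {total_fails / total_requests:.6%}")
```

The per-task counts add up to the aggregated 973875 requests and 17 failures, so the failures are a very small fraction of the 5h run and are confined to the restart window reported in the client logs.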

Expected Behavior

No response

Steps To Reproduce

1. Deploy a standalone Milvus
2. Prepare 100k rows of data
3. Send concurrent requests: load, query, search, flush <- Milvus restarts and requests fail

Milvus Log

Milvus log:

Screenshot 2023-09-11 11 42 29, Screenshot 2023-09-11 11 42 37

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991 /unassign

jiaoew1991 commented 1 year ago

/assign @yah01 /unassign

yah01 commented 12 months ago

/assign @wangting0128

27103 fixed this

wangting0128 commented 12 months ago

/assign @wangting0128

27103 fixed this

@elstic please follow up on this issue

elstic commented 11 months ago

Verified with image master-20230926-2d6a9682: this problem no longer occurs.