milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][standalone] Milvus panic `segment not found` in concurrent DML scene #34325

Closed (wangting0128 closed this issue 4 months ago)

wangting0128 commented 5 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20240701-3c5ad499-amd64
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: inverted-corn-1719849600

test case name: test_inverted_locust_partition_key_dml_standalone

server:

[2024-07-01 19:26:33,087 -  INFO - fouram]: [Base] Deploy initial state: 
I0701 16:10:49.442587     420 request.go:665] Waited for 1.176647648s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/policy/v1beta1?timeout=32s
I0701 16:10:59.641718     420 request.go:665] Waited for 6.997821513s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/node.k8s.io/v1beta1?timeout=32s
NAME                                                              READY   STATUS                   RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
inverted-corn-149600-2-79-3547-etcd-0                             1/1     Running                  0                 4m52s   10.104.26.247   4am-node32   <none>           <none>
inverted-corn-149600-2-79-3547-milvus-standalone-7df89f646ds4sl   1/1     Running                  3 (2m32s ago)     4m52s   10.104.26.248   4am-node32   <none>           <none>
inverted-corn-149600-2-79-3547-minio-58c7ccf54f-jxkps             1/1     Running                  0                 4m52s   10.104.16.168   4am-node21   <none>           <none> (base.py:261)
[2024-07-01 19:26:33,087 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|inverted-corn-149600-2-79-3547-milvus|inverted-corn-149600-2-79-3547-minio|inverted-corn-149600-2-79-3547-etcd|inverted-corn-149600-2-79-3547-pulsar|inverted-corn-149600-2-79-3547-zookeeper|inverted-corn-149600-2-79-3547-kafka|inverted-corn-149600-2-79-3547-log|inverted-corn-149600-2-79-3547-tikv'  (util_cmd.py:14)
[2024-07-01 19:26:49,663 -  INFO - fouram]: [CliClient] pod details of release(inverted-corn-149600-2-79-3547): 
 I0701 19:26:34.337262     530 request.go:665] Waited for 1.177881965s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/resolution.tekton.dev/v1beta1?timeout=32s
I0701 19:26:44.537756     530 request.go:665] Waited for 6.997021061s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/dashboard.tekton.dev/v1alpha1?timeout=32s
NAME                                                              READY   STATUS                        RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
inverted-corn-149600-2-79-3547-etcd-0                             1/1     Running                       0                 3h20m   10.104.26.247   4am-node32   <none>           <none>
inverted-corn-149600-2-79-3547-milvus-standalone-7df89f646ds4sl   1/1     Running                       4 (127m ago)      3h20m   10.104.26.248   4am-node32   <none>           <none>
inverted-corn-149600-2-79-3547-minio-58c7ccf54f-jxkps             1/1     Running                       0                 3h20m   10.104.16.168   4am-node21   <none>           <none> (cli_client.py:144)

[image]

[image: Screenshot 2024-07-02 10 52 01]

client pod name: inverted-corn-1719849600-788157396

client error time: 2024-07-01 17:19:07,081 ~ 2024-07-01 19:28:11,949
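The restart column in the second pod listing (`4 (127m ago)`) can be lined up against the first client error time. The sketch below is a rough cross-check only. It assumes that kubectl truncates the restart age to whole minutes, that the listing was taken between the last throttling message (19:26:44) and the fouram log line (19:26:49), and that server and client timestamps share a timezone. The helper name `restart_window` is made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def restart_window(observed_from, observed_to, age_minutes):
    """Return the earliest and latest possible restart time.

    Assumes the reported age is truncated to whole minutes, so the real
    age lies in [age_minutes, age_minutes + 1).
    """
    earliest = observed_from - timedelta(minutes=age_minutes + 1)
    latest = observed_to - timedelta(minutes=age_minutes)
    return earliest, latest


# Values copied from the logs in this issue.
observed_from = datetime.strptime("2024-07-01 19:26:44", FMT)
observed_to = datetime.strptime("2024-07-01 19:26:49", FMT)
first_client_error = datetime.strptime("2024-07-01 17:19:07", FMT)

window = restart_window(observed_from, observed_to, age_minutes=127)
print("possible restart window:", window[0], "to", window[1])
print("first client error inside window:",
      window[0] <= first_client_error <= window[1])
```

Under those assumptions the fourth restart falls in the same minute as the first client error, which is consistent with the panic being the cause of the client-side failures, though it does not prove it.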

Expected Behavior

No response

Steps To Reproduce

Concurrent test that measures RT and QPS

        :purpose:  `partition_key: scalar enable partition_key(num_partitions=128)`
            verify the concurrent DML scenario in which INVERTED indexes are built on the
            scalar fields `id` (pk) and `int64_1`, and partition_key is enabled on `int64_1`

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim,
                'int64_1': is_partition_key
            2. build indexes:
                IVF_FLAT: 'float_vector'
                INVERTED: 'id', 'int64_1'
            3. insert 5 million rows
            4. flush collection
            5. build indexes again using the same params
            6. load collection
            7. concurrent requests:
                - insert
                - delete
                - flush
                - release
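
The request mix in step 7 can be sketched without a Milvus server. This is a minimal illustration, not the fouram or Locust implementation: `StubClient`, `worker`, and `run_mix` are hypothetical names, the stub only counts calls, and the equal weights and the concurrency of 20 are taken from the `concurrent_tasks` and `concurrent_params` entries in the report data of this issue.

```python
import random
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Task mix from step 7; weight 1 per task mirrors the report data.
TASKS = {"insert": 1, "delete": 1, "flush": 1, "release": 1}


class StubClient:
    """Hypothetical stand-in for a Milvus client; it only records calls."""

    def __init__(self):
        self.calls = Counter()
        self._lock = threading.Lock()

    def run(self, task):
        # A real client would issue the gRPC request here.
        with self._lock:
            self.calls[task] += 1


def worker(client, seed, iterations):
    """Pick a task at random by weight and run it, `iterations` times."""
    rng = random.Random(seed)
    names = list(TASKS)
    weights = list(TASKS.values())
    for _ in range(iterations):
        task = rng.choices(names, weights=weights, k=1)[0]
        client.run(task)


def run_mix(concurrency=20, iterations=50):
    """Run `concurrency` workers in parallel and return per-task call counts."""
    client = StubClient()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(worker, client, seed, iterations)
                   for seed in range(concurrency)]
        for future in futures:
            future.result()  # re-raise any worker exception
    return client.calls


calls = run_mix()
print(dict(calls))
```

The point of the sketch is the interleaving: `release` runs against the same collection while other workers are still issuing `insert`, `delete`, and `flush`.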

Milvus Log

No response

Anything else?

test result:

[2024-07-01 19:26:18,617 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-07-01 19:26:18,617 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-07-01 19:26:18,617 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-07-01 19:26:18,617 -  INFO - fouram]: grpc     delete                                                                          2223     3(0.13%) |     46       0   30669      2 |    0.21        0.00 (stats.py:789)
[2024-07-01 19:26:18,617 -  INFO - fouram]: grpc     flush                                                                           2155 1041(48.31%) |  99229     504  361609 136000 |    0.20        0.10 (stats.py:789)
[2024-07-01 19:26:18,618 -  INFO - fouram]: grpc     insert                                                                          2128     0(0.00%) |    239       6  129764     13 |    0.20        0.00 (stats.py:789)
[2024-07-01 19:26:18,618 -  INFO - fouram]: grpc     release                                                                         2255     3(0.13%) |     67       0   30673      2 |    0.21        0.00 (stats.py:789)
[2024-07-01 19:26:18,618 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-07-01 19:26:18,618 -  INFO - fouram]:          Aggregated                                                                      8761 1047(11.95%) |  24495       0  361609      9 |    0.81        0.10 (stats.py:789)
[2024-07-01 19:26:18,618 -  INFO - fouram]:  (stats.py:790)
[2024-07-01 19:26:18,620 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_8c16m',
            'config': {'standalone': {'resources': {'limits': {'cpu': '8.0',
                                                               'memory': '16Gi'},
                                                    'requests': {'cpu': '5.0',
                                                                 'memory': '9Gi'}}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1,
                                'metrics': {'enabled': True,
                                            'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone',
                                 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus',
                                         'tag': '2.4-20240701-3c5ad499-amd64'}}},
            'host': 'inverted-corn-149600-2-79-3547-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_inverted_locust_partition_key_dml_standalone',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_index': {'id': {'index_type': 'INVERTED'},
                                                                      'int64_1': {'index_type': 'INVERTED'}},
                                                    'scalars_params': {'int64_1': {'params': {'is_partition_key': True}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 5000000,
                                                    'ni_per': 50000},
                                 'collection_params': {'other_fields': ['int64_1'],
                                                       'shards_num': 2,
                                                       'num_partitions': 128},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False,
                                                          'reset_db': False},
                                 'index_params': {'index_type': 'IVF_FLAT',
                                                  'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 20,
                                                       'during_time': '3h',
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 10,
                                                                  'timeout': 180,
                                                                  'random_id': True,
                                                                  'random_vector': True,
                                                                  'varchar_filled': False,
                                                                  'start_id': 0}},
                                                      {'type': 'delete',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'delete_length': 9,
                                                                  'timeout': 30}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 180}},
                                                      {'type': 'release',
                                                       'weight': 1,
                                                       'params': {'timeout': 30}}]},
            'run_id': 2024070199707716,
            'datetime': '2024-07-01 16:06:10.584222',
            'client_version': '2.4.0'},
 'result': {'test_result': {'index': {'RT': 709.6943,
                                      'id': {'RT': 1.0496},
                                      'int64_1': {'RT': 1.0175}},
                            'insert': {'total_time': 176.862,
                                       'VPS': 28270.6291,
                                       'batch_time': 1.7686,
                                       'batch': 50000},
                            'flush': {'RT': 6.6009},
                            'load': {'RT': 6.1357},
                            'Locust': {'Aggregated': {'Requests': 8761,
                                                      'Fails': 1047,
                                                      'RPS': 0.81,
                                                      'fail_s': 0.12,
                                                      'RT_max': 361609.5,
                                                      'RT_avg': 24495.41,
                                                      'TP50': 9,
                                                      'TP99': 182000.0},
                                       'delete': {'Requests': 2223,
                                                  'Fails': 3,
                                                  'RPS': 0.21,
                                                  'fail_s': 0.0,
                                                  'RT_max': 30669.76,
                                                  'RT_avg': 46.62,
                                                  'TP50': 2,
                                                  'TP99': 90},
                                       'flush': {'Requests': 2155,
                                                 'Fails': 1041,
                                                 'RPS': 0.2,
                                                 'fail_s': 0.48,
                                                 'RT_max': 361609.5,
                                                 'RT_avg': 99229.53,
                                                 'TP50': 136000.0,
                                                 'TP99': 190000.0},
                                       'insert': {'Requests': 2128,
                                                  'Fails': 0,
                                                  'RPS': 0.2,
                                                  'fail_s': 0.0,
                                                  'RT_max': 129764.23,
                                                  'RT_avg': 239.5,
                                                  'TP50': 13,
                                                  'TP99': 1100.0},
                                       'release': {'Requests': 2255,
                                                   'Fails': 3,
                                                   'RPS': 0.21,
                                                   'fail_s': 0.0,
                                                   'RT_max': 30673.81,
                                                   'RT_avg': 67.1,
                                                   'TP50': 2,
                                                   'TP99': 90}}}}}
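
The figures in the Locust table and in the report data can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers copied from this issue; the helper name `failure_ratio` is made up for illustration. It suggests that the report's `fail_s` values (0.48 for flush, 0.12 aggregated) match fails divided by requests rather than the `failures/s` column of the Locust table (0.10), and that `VPS` and `batch_time` follow from `total_time`. This is an observation about the numbers, not a statement about how fouram computes them.

```python
# (requests, fails) per task, copied from the Locust final stats.
locust = {
    "delete": (2223, 3),
    "flush": (2155, 1041),
    "insert": (2128, 0),
    "release": (2255, 3),
}


def failure_ratio(requests, fails):
    """Fraction of requests that failed."""
    return fails / requests


total_requests = sum(reqs for reqs, _ in locust.values())
total_fails = sum(fails for _, fails in locust.values())

flush_ratio = failure_ratio(*locust["flush"])
aggregate_ratio = failure_ratio(total_requests, total_fails)

# Insert phase figures from the report data.
dataset_size = 5_000_000
ni_per = 50_000
total_time = 176.862  # seconds, as reported

batches = dataset_size // ni_per
vps = dataset_size / total_time
batch_time = total_time / batches

print(f"flush failure ratio: {flush_ratio:.4f}")
print(f"aggregate failure ratio: {aggregate_ratio:.4f}")
print(f"VPS: {vps:.2f}, batch_time: {batch_time:.4f}")
```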
yanliang567 commented 5 months ago

/assign @weiliu1031
/unassign

xiaofan-luan commented 5 months ago

/assign @wangting0128

wangting0128 commented 4 months ago

Same case, different panic: #34376

weiliu1031 commented 4 months ago

Please verify this with the latest image.

weiliu1031 commented 4 months ago

/assign @wangting0128

wangting0128 commented 4 months ago

Verification passed.

argo task: inverted-corn-1720386000

image: 2.4-20240705-326370c1-amd64