milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][standalone] Milvus restart `SIGSEGV: segmentation violation` during concurrent DQL & DML requests #36561

Closed: wangting0128 closed this issue 1 month ago

wangting0128 commented 2 months ago

Environment

- Milvus version: master-20240926-7ff41697-amd64
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): rocksmq
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.5rc7
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

argo task: fouramf-bitmap-scenes-pn8vs

test case name: test_bitmap_locust_dql_dml_upsert_standalone

server:

[2024-09-26 16:14:48,173 -  INFO - fouram]: [Base] Deploy initial state: 
I0926 13:01:15.954954     396 request.go:665] Waited for 1.177268139s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/flowcontrol.apiserver.k8s.io/v1beta1?timeout=32s
I0926 13:01:26.154598     396 request.go:665] Waited for 4.196992043s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/source.toolkit.fluxcd.io/v1beta1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-bitmap-scenes-pn8vs-3-etcd-0                              1/1     Running     0               2m24s   10.104.16.30    4am-node21   <none>           <none>
fouramf-bitmap-scenes-pn8vs-3-milvus-standalone-6cdb86577bsg6p5   1/1     Running     0               2m24s   10.104.21.192   4am-node24   <none>           <none>
fouramf-bitmap-scenes-pn8vs-3-minio-d95f6dd7b-9nk2q               1/1     Running     0               2m24s   10.104.21.193   4am-node24   <none>           <none> (base.py:261)
[2024-09-26 16:14:48,173 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|fouramf-bitmap-scenes-pn8vs-3-milvus|fouramf-bitmap-scenes-pn8vs-3-minio|fouramf-bitmap-scenes-pn8vs-3-etcd|fouramf-bitmap-scenes-pn8vs-3-pulsar|fouramf-bitmap-scenes-pn8vs-3-zookeeper|fouramf-bitmap-scenes-pn8vs-3-kafka|fouramf-bitmap-scenes-pn8vs-3-log|fouramf-bitmap-scenes-pn8vs-3-tikv'  (util_cmd.py:14)
[2024-09-26 16:15:11,818 -  INFO - fouram]: [CliClient] pod details of release(fouramf-bitmap-scenes-pn8vs-3): 
 I0926 16:14:52.426249     545 request.go:665] Waited for 1.165188836s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/operator.victoriametrics.com/v1beta1?timeout=32s
I0926 16:15:02.426574     545 request.go:665] Waited for 3.997586946s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/litmuschaos.io/v1alpha1?timeout=32s
NAME                                                              READY   STATUS             RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-bitmap-scenes-pn8vs-3-etcd-0                              1/1     Running            0                3h16m   10.104.16.30    4am-node21   <none>           <none>
fouramf-bitmap-scenes-pn8vs-3-milvus-standalone-6cdb86577bsg6p5   0/1     Running            11 (6m23s ago)   3h16m   10.104.21.192   4am-node24   <none>           <none>
fouramf-bitmap-scenes-pn8vs-3-minio-d95f6dd7b-9nk2q               1/1     Running            0                3h16m   10.104.21.193   4am-node24   <none>           <none> 
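
For anyone triaging similar runs, the restart count can be pulled out of the `kubectl get pods -o wide` output with a few lines of Python. This is only a sketch: the sample rows are copied from the listing above with the trailing columns trimmed, and the helper name `unhealthy_pods` is made up for illustration.

```python
import re

# Sample rows copied from the pod listing above (first five columns only).
POD_TABLE = """\
NAME                                                              READY   STATUS    RESTARTS         AGE
fouramf-bitmap-scenes-pn8vs-3-etcd-0                              1/1     Running   0                3h16m
fouramf-bitmap-scenes-pn8vs-3-milvus-standalone-6cdb86577bsg6p5   0/1     Running   11 (6m23s ago)   3h16m
fouramf-bitmap-scenes-pn8vs-3-minio-d95f6dd7b-9nk2q               1/1     Running   0                3h16m
"""

# NAME, READY as n/m, STATUS, RESTARTS count; the optional "(... ago)" suffix is ignored.
ROW = re.compile(r"^(\S+)\s+(\d+)/(\d+)\s+(\S+)\s+(\d+)")


def unhealthy_pods(table: str):
    """Return (name, ready, restarts) for pods that restarted or are not fully ready."""
    flagged = []
    for line in table.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header row, throttling messages, blank lines
        name, ready, total, _status, restarts = match.groups()
        if int(restarts) > 0 or ready != total:
            flagged.append((name, f"{ready}/{total}", int(restarts)))
    return flagged


for name, ready, restarts in unhealthy_pods(POD_TABLE):
    print(f"{name}: ready={ready}, restarts={restarts}")
```

On the rows above, only the milvus-standalone pod is flagged; etcd and minio stayed up for the whole run.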

[Image: Screenshot 2024-09-27 10 55 25]

Expected Behavior

No response

Steps To Reproduce

concurrent test and calculation of RT and QPS

        :purpose:  `primary key: INT64`, shards_num=16
            1. building `BITMAP` index on all supported 12 scalar fields
            2. 2 fields of different vector types
            3. verify DQL & DML(upsert) requests

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim
                'sparse_float_vector': sparse_range=[1, 100] <- the range of non-zero values of a sparse vector
                'id': primary key type is INT64

                all scalar fields: varchar max_length=100, array max_capacity=13
            2. build indexes:
                IVF_SQ8: 'float_vector'
                SPARSE_WAND: 'sparse_float_vector'

                BITMAP: all scalar fields
            3. insert 500k data
            4. flush collection
            5. build indexes again using the same params
            6. load collection
            7. concurrent request:
                - search
                - query
                - hybrid_search
                - load
                - upsert: batch=10
                - flush: ignore RateLimiter
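
The index layout from step 2 can be written down as plain data, which makes it easier to see what the server has to build. This is a rough sketch, not the fouram implementation: `build_index_plan` is a hypothetical helper, and the values come from the test parameters reported in this issue. Note that the docstring names `SPARSE_WAND` for the sparse field while the reported parameters use `SPARSE_INVERTED_INDEX`; the sketch follows the reported parameters.

```python
# Scalar base types, taken from the 'scalars_index' keys in the reported test parameters.
BASE_TYPES = ["int8", "int16", "int32", "int64", "varchar", "bool"]


def build_index_plan():
    """Map every indexed field to its index spec (plain dicts, no server required)."""
    plan = {
        # 'index_params' in the report: IVF_SQ8 with nlist=1024, metric L2.
        "float_vector": {
            "index_type": "IVF_SQ8",
            "metric_type": "L2",
            "params": {"nlist": 1024},
        },
        # 'vectors_index' in the report.
        "sparse_float_vector": {
            "index_type": "SPARSE_INVERTED_INDEX",
            "metric_type": "IP",
            "params": {"drop_ratio_build": 0.2},
        },
    }
    # Six scalar fields plus six array fields, all indexed with BITMAP.
    for base in BASE_TYPES:
        plan[f"{base}_1"] = {"index_type": "BITMAP"}
        plan[f"array_{base}_1"] = {"index_type": "BITMAP"}
    return plan


plan = build_index_plan()
bitmap_fields = [name for name, spec in plan.items() if spec["index_type"] == "BITMAP"]
print(len(plan), "indexed fields,", len(bitmap_fields), "with BITMAP")
```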

Milvus Log

No response

Anything else?

test result:

{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_16c16m',
            'config': {'standalone': {'resources': {'limits': {'cpu': '16.0', 'memory': '16Gi'}, 'requests': {'cpu': '9.0', 'memory': '9Gi'}}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1, 'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone', 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20240926-7ff41697-amd64'}}},
            'host': 'fouramf-bitmap-scenes-pn8vs-3-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_bitmap_locust_dql_dml_upsert_standalone',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'max_length': 100,
                                                    'scalars_index': {'int8_1': {'index_type': 'BITMAP'},
                                                                      'int16_1': {'index_type': 'BITMAP'},
                                                                      'int32_1': {'index_type': 'BITMAP'},
                                                                      'int64_1': {'index_type': 'BITMAP'},
                                                                      'varchar_1': {'index_type': 'BITMAP'},
                                                                      'bool_1': {'index_type': 'BITMAP'},
                                                                      'array_int8_1': {'index_type': 'BITMAP'},
                                                                      'array_int16_1': {'index_type': 'BITMAP'},
                                                                      'array_int32_1': {'index_type': 'BITMAP'},
                                                                      'array_int64_1': {'index_type': 'BITMAP'},
                                                                      'array_varchar_1': {'index_type': 'BITMAP'},
                                                                      'array_bool_1': {'index_type': 'BITMAP'}},
                                                    'vectors_index': {'sparse_float_vector': {'index_type': 'SPARSE_INVERTED_INDEX',
                                                                                              'index_param': {'drop_ratio_build': 0.2},
                                                                                              'metric_type': 'IP'}},
                                                    'scalars_params': {'array_int8_1': {'params': {'max_capacity': 13},
                                                                                        'other_params': {'dataset': 'random_algorithm',
                                                                                                         'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                              'specify_range': [-128, 128],
                                                                                                                              'max_capacity': 13}}},
                                                                       'array_int16_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                               'specify_range': [-200, 200],
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_int32_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'specify_scope',
                                                                                                                               'specify_range': [-300, 300],
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_int64_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'fixed_value_range',
                                                                                                                               'specify_range': [-400, 432],
                                                                                                                               'batch': 50,
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_varchar_1': {'params': {'max_capacity': 13},
                                                                                           'other_params': {'dataset': 'random_algorithm',
                                                                                                            'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                                 'specify_range': [-1500, 1500],
                                                                                                                                 'max_capacity': 13}}},
                                                                       'array_bool_1': {'params': {'max_capacity': 13}},
                                                                       'int8_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                   'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                        'specify_range': [-128, 128],
                                                                                                                        'max_capacity': 13}}},
                                                                       'int16_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                         'specify_range': [-200, 200],
                                                                                                                         'max_capacity': 13}}},
                                                                       'int32_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'specify_scope',
                                                                                                                         'specify_range': [-300, 300],
                                                                                                                         'max_capacity': 13}}},
                                                                       'int64_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'fixed_value_range',
                                                                                                                         'specify_range': [-400, 432],
                                                                                                                         'batch': 50,
                                                                                                                         'max_capacity': 13}}},
                                                                       'varchar_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                      'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                           'specify_range': [-1500, 1500],
                                                                                                                           'max_capacity': 13}}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 500000,
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1', 'bool_1',
                                                                        'array_int8_1', 'array_int16_1', 'array_int32_1', 'array_int64_1', 'array_varchar_1',
                                                                        'array_bool_1'],
                                                       'shards_num': 16},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_SQ8', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 20, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 8,
                                                                  'top_k': 10,
                                                                  'search_param': {'nprobe': 16},
                                                                  'expr': 'int8_1 == 100',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'output_fields': ['id', 'float_vector', 'int64_1'],
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 180,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'nq': 8}}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': 10,
                                                                  'ignore_growing': False,
                                                                  'partition_names': None,
                                                                  'timeout': 60,
                                                                  'consistency_level': None,
                                                                  'random_data': False,
                                                                  'random_count': 0,
                                                                  'random_range': [0, 1],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_query_output',
                                                                  'check_items': {'expect_length': 10}}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 3,
                                                                  'top_k': 5,
                                                                  'reqs': [{'search_param': {'nprobe': 128},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': '(array_contains_any(array_int32_1, [0]) || array_contains(array_int64_1, '
                                                                                    '1)) || ((varchar_1 like "1%") and (bool_1 == True))',
                                                                            'top_k': 100},
                                                                           {'search_param': {'drop_ratio_search': 0.1},
                                                                            'anns_field': 'sparse_float_vector',
                                                                            'expr': 'not (int16_1 == int8_1) && ARRAY_CONTAINS_ANY(array_int64_1, [-1, 0, '
                                                                                    '1])'}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'timeout': 180,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'output_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1',
                                                                                                    'int64_1', 'varchar_1', 'bool_1', 'array_int8_1',
                                                                                                    'array_int16_1', 'array_int32_1', 'array_int64_1',
                                                                                                    'array_varchar_1', 'array_bool_1', 'id', 'float_vector'],
                                                                                  'nq': 3}}},
                                                      {'type': 'load',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1, 'timeout': 180, 'check_task': 'check_response', 'check_items': None}},
                                                      {'type': 'upsert',
                                                       'weight': 1,
                                                       'params': {'nb': 10,
                                                                  'timeout': 30,
                                                                  'random_id': True,
                                                                  'random_vector': True,
                                                                  'varchar_filled': False,
                                                                  'start_id': 500000,
                                                                  'shuffle_id': False,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 600,
                                                                  'check_task': 'check_ignore_expected_errors',
                                                                  'check_items': [{'message': 'request is rejected by grpc RateLimiter middleware, please '
                                                                                              'retry later'},
                                                                                  {'message': 'wait for flush timeout'}]}}]},
            'run_id': 2024092655484362,
            'datetime': '2024-09-26 12:59:08.495672',
            'client_version': '2.5.0'},
 'result': {'test_result': {'index': {'RT': 477.1256,
                                      'sparse_float_vector': {'RT': 0.5156},
                                      'int8_1': {'RT': 0.5163},
                                      'int16_1': {'RT': 0.5153},
                                      'int32_1': {'RT': 0.517},
                                      'int64_1': {'RT': 0.516},
                                      'varchar_1': {'RT': 0.5143},
                                      'bool_1': {'RT': 0.5508},
                                      'array_int8_1': {'RT': 0.5135},
                                      'array_int16_1': {'RT': 0.5148},
                                      'array_int32_1': {'RT': 0.5139},
                                      'array_int64_1': {'RT': 0.5132},
                                      'array_varchar_1': {'RT': 0.5109},
                                      'array_bool_1': {'RT': 0.5125}},
                            'insert': {'total_time': 41.716, 'VPS': 11985.8088, 'batch_time': 0.4172, 'batch': 5000},
                            'flush': {'RT': 3.0363},
                            'load': {'RT': 1.8453},
                            'Locust': {'Aggregated': {'Requests': 6021,
                                                      'Fails': 149,
                                                      'RPS': 0.56,
                                                      'fail_s': 0.02,
                                                      'RT_max': 602982.76,
                                                      'RT_avg': 33840.35,
                                                      'TP50': 42,
                                                      'TP99': 601000.0},
                                       'flush': {'Requests': 987,
                                                 'Fails': 36,
                                                 'RPS': 0.09,
                                                 'fail_s': 0.04,
                                                 'RT_max': 602982.76,
                                                 'RT_avg': 198110.83,
                                                 'TP50': 97000.0,
                                                 'TP99': 602000.0},
                                       'hybrid_search': {'Requests': 1051,
                                                         'Fails': 27,
                                                         'RPS': 0.1,
                                                         'fail_s': 0.03,
                                                         'RT_max': 19407.09,
                                                         'RT_avg': 2885.62,
                                                         'TP50': 2400.0,
                                                         'TP99': 12000.0},
                                       'load': {'Requests': 1018,
                                                'Fails': 23,
                                                'RPS': 0.09,
                                                'fail_s': 0.02,
                                                'RT_max': 180899.87,
                                                'RT_avg': 4134.56,
                                                'TP50': 37,
                                                'TP99': 181000.0},
                                       'query': {'Requests': 963,
                                                 'Fails': 21,
                                                 'RPS': 0.09,
                                                 'fail_s': 0.02,
                                                 'RT_max': 6295.35,
                                                 'RT_avg': 101.66,
                                                 'TP50': 23,
                                                 'TP99': 2700.0},
                                       'search': {'Requests': 1015,
                                                  'Fails': 18,
                                                  'RPS': 0.09,
                                                  'fail_s': 0.02,
                                                  'RT_max': 6257.99,
                                                  'RT_avg': 109.57,
                                                  'TP50': 26,
                                                  'TP99': 3100.0},
                                       'upsert': {'Requests': 987,
                                                  'Fails': 24,
                                                  'RPS': 0.09,
                                                  'fail_s': 0.02,
                                                  'RT_max': 30892.44,
                                                  'RT_avg': 776.57,
                                                  'TP50': 19,
                                                  'TP99': 31000.0}}}}}
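
To put the failure counts in context, here is a quick per-request failure-rate calculation. The `Requests` and `Fails` numbers are copied from the Locust section of the result above; everything else (variable and function names) is just for illustration.

```python
# (Requests, Fails) per request type, copied from the Locust result above.
LOCUST = {
    "flush": (987, 36),
    "hybrid_search": (1051, 27),
    "load": (1018, 23),
    "query": (963, 21),
    "search": (1015, 18),
    "upsert": (987, 24),
}
AGGREGATED = (6021, 149)  # the 'Aggregated' entry in the same result


def failure_rates(stats):
    """Return fails / requests for each request type."""
    return {name: fails / requests for name, (requests, fails) in stats.items()}


rates = failure_rates(LOCUST)
total_requests = sum(requests for requests, _ in LOCUST.values())
total_fails = sum(fails for _, fails in LOCUST.values())

# The per-type numbers add up to the aggregated row.
assert (total_requests, total_fails) == AGGREGATED

for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {rate:6.2%}")
print(f"{'Aggregated':14s} {total_fails / total_requests:6.2%}")
```

Failures are spread across every request type rather than concentrated in one, which fits a server that restarts repeatedly during the run.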
wangting0128 commented 2 months ago

different case, same error

argo task: fouramf-bitmap-scenes-pn8vs

test case name: test_bitmap_locust_dql_dml_standalone

image: master-20240926-7ff41697-amd64

server:

[2024-09-26 16:39:46,082 -  INFO - fouram]: [Base] Deploy initial state: 
I0926 13:01:56.333925     406 request.go:665] Waited for 1.178454381s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/authentication.k8s.io/v1?timeout=32s
I0926 13:02:06.334423     406 request.go:665] Waited for 3.998441925s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/dashboard.tekton.dev/v1alpha1?timeout=32s
NAME                                                              READY   STATUS            RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-bitmap-scenes-pn8vs-1-etcd-0                              1/1     Running           0               2m47s   10.104.16.39    4am-node21   <none>           <none>
fouramf-bitmap-scenes-pn8vs-1-milvus-standalone-58bc49cdd7hfqvh   1/1     Running           1 (2m11s ago)   2m47s   10.104.23.67    4am-node27   <none>           <none>
fouramf-bitmap-scenes-pn8vs-1-minio-6b7d55d749-c5kfm              1/1     Running           0               2m47s   10.104.16.38    4am-node21   <none>           <none> (base.py:261)
[2024-09-26 16:39:46,083 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|fouramf-bitmap-scenes-pn8vs-1-milvus|fouramf-bitmap-scenes-pn8vs-1-minio|fouramf-bitmap-scenes-pn8vs-1-etcd|fouramf-bitmap-scenes-pn8vs-1-pulsar|fouramf-bitmap-scenes-pn8vs-1-zookeeper|fouramf-bitmap-scenes-pn8vs-1-kafka|fouramf-bitmap-scenes-pn8vs-1-log|fouramf-bitmap-scenes-pn8vs-1-tikv'  (util_cmd.py:14)
[2024-09-26 16:40:09,666 -  INFO - fouram]: [CliClient] pod details of release(fouramf-bitmap-scenes-pn8vs-1): 
 I0926 16:39:50.347854     543 request.go:665] Waited for 1.172117329s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/certificates.k8s.io/v1?timeout=32s
I0926 16:40:00.547929     543 request.go:665] Waited for 4.198255053s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/storage.k8s.io/v1?timeout=32s
NAME                                                              READY   STATUS             RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-bitmap-scenes-pn8vs-1-etcd-0                              1/1     Running            0               3h40m   10.104.16.39    4am-node21   <none>           <none>
fouramf-bitmap-scenes-pn8vs-1-milvus-standalone-58bc49cdd7hfqvh   1/1     Running            8 (69m ago)     3h40m   10.104.23.67    4am-node27   <none>           <none>
fouramf-bitmap-scenes-pn8vs-1-minio-6b7d55d749-c5kfm              1/1     Running            0               3h40m   10.104.16.38    4am-node21   <none>           <none>

[Image: Screenshot 2024-09-27 10 59 17]

test steps:

        concurrent test and calculation of RT and QPS

        :purpose:  `primary key: INT64 autoID`
            1. building `BITMAP` index on all supported 12 scalar fields
            2. 2 fields of different vector types
            3. verify DQL & DML requests

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim
                'sparse_float_vector': sparse_range=[1, 100] <- the range of non-zero values of a sparse vector
                'id': primary key type is INT64

                all scalar fields: varchar max_length=100, array max_capacity=13
            2. build indexes:
                IVF_SQ8: 'float_vector'
                SPARSE_WAND: 'sparse_float_vector'
                BITMAP: all scalar fields
            3. insert 2 million data
            4. flush collection
            5. build indexes again using the same params
            6. load collection
            7. concurrent request:
                - search
                - query
                - hybrid_search
                - load
                - insert
                - delete: delete all inserted data
                - flush: ignore RateLimiter
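
The shape of the concurrent request mix in step 7 can be sketched with the standard library alone. This is a toy model under stated assumptions, not the Locust-based driver used in the test: `fake_request` is a hypothetical stand-in for a real client call, every task gets weight 1, and the 20 workers mirror `concurrent_number` from the first report (the value for this case is not shown).

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Request types from step 7 of this case.
TASKS = ["search", "query", "hybrid_search", "load", "insert", "delete", "flush"]


def fake_request(kind: str) -> str:
    """Hypothetical stand-in for a real client call; it only echoes the request type."""
    return kind


def run_mix(total_requests: int, workers: int = 20, seed: int = 0) -> Counter:
    """Pick request types with equal weight and run them on a thread pool."""
    rng = random.Random(seed)
    picks = rng.choices(TASKS, weights=[1] * len(TASKS), k=total_requests)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(fake_request, picks))


counts = run_mix(700)
for kind in TASKS:
    print(f"{kind:14s} {counts[kind]}")
```

With equal weights, DQL (search, query, hybrid_search) and DML (insert, delete) requests interleave with load and flush throughout the run, which is the condition under which the crash was observed.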
yanliang567 commented 2 months ago

/assign @zhengbuqian
/unassign

wangting0128 commented 1 month ago

verification passed

argo task: fouramf-bitmap-scenes-wzj7v

test case name: test_bitmap_locust_dql_dml_upsert_standalone

image: master-20241015-aa904be6-amd64