milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][cluster] inserting a partition raises error `partition not found` in concurrent dql & dml scene #36989

Open wangting0128 opened 1 week ago

wangting0128 commented 1 week ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20241017-2bfd22f2-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.5rc7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: multi-vector-corn-1-1729173600
test case name: test_hybrid_search_locust_dql_dml_partition_hybrid_search_cluster

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
multi-vector-corn-1-1729173600-1-etcd-0                           1/1     Running     0                3h9m    10.104.26.98    4am-node32   <none>           <none>
multi-vector-corn-1-1729173600-1-etcd-1                           1/1     Running     0                3h9m    10.104.32.180   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-etcd-2                           1/1     Running     0                3h9m    10.104.19.137   4am-node28   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-datanode-5888f5b48568st   1/1     Running     3 (3h8m ago)     3h9m    10.104.1.225    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6grfpg   1/1     Running     3 (3h8m ago)     3h9m    10.104.5.53     4am-node12   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6lfvqg   1/1     Running     3 (3h9m ago)     3h9m    10.104.4.2      4am-node11   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6ndtjp   1/1     Running     3 (3h8m ago)     3h9m    10.104.1.222    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6v7psv   1/1     Running     3 (3h9m ago)     3h9m    10.104.9.174    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-mixcoord-554d75997bxspn   1/1     Running     3 (3h8m ago)     3h9m    10.104.1.220    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-proxy-57b7d4949f-7gv48    1/1     Running     3 (3h9m ago)     3h9m    10.104.1.223    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-querynode-5c77b5567frhd   1/1     Running     3 (3h9m ago)     3h9m    10.104.6.182    4am-node13   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-querynode-5c77b556xp86q   1/1     Running     3 (3h9m ago)     3h9m    10.104.23.42    4am-node27   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-0                          1/1     Running     0                3h9m    10.104.19.133   4am-node28   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-1                          1/1     Running     0                3h9m    10.104.32.176   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-2                          1/1     Running     0                3h9m    10.104.26.99    4am-node32   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-3                          1/1     Running     0                3h9m    10.104.18.192   4am-node25   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-0                  1/1     Running     0                3h9m    10.104.32.178   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-1                  1/1     Running     0                3h9m    10.104.19.134   4am-node28   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-2                  1/1     Running     0                3h9m    10.104.18.191   4am-node25   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-init-cdqts         0/1     Completed   0                3h9m    10.104.1.224    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-broker-0                  1/1     Running     0                3h9m    10.104.9.173    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-proxy-0                   1/1     Running     0                3h9m    10.104.9.175    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-pulsar-init-zxbrj         0/1     Completed   0                3h9m    10.104.1.219    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-recovery-0                1/1     Running     0                3h9m    10.104.9.172    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-zookeeper-0               1/1     Running     0                3h9m    10.104.32.177   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-zookeeper-1               1/1     Running     0                3h9m    10.104.25.183   4am-node30   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-zookeeper-2               1/1     Running     0                3h8m    10.104.19.140   4am-node28   <none>           <none> 

log query:

{pod=~"multi-vector-corn-1-1729173600-1-milvus-.*"} |~ "partition not found|c2537709715697ef8aa3770fc1962c3a|scene_test_partition_hybrid_search_DX3SsnBo|453295868220979021"

attached log: partition_not_found.log

client log:

[2024-10-17 20:05:14,654 - ERROR - fouram]: RPC error: [batch_insert], <MilvusException: (code=200, message=partition not found[partition=scene_test_partition_hybrid_search_DX3SsnBo])>, <Time:{'RPC start': '2024-10-17 20:05:14.339202', 'RPC error': '2024-10-17 20:05:14.654836'}> (decorators.py:146)
[2024-10-17 20:05:14,655 - ERROR - fouram]: (api_response) : [Collection.insert] <MilvusException: (code=200, message=partition not found[partition=scene_test_partition_hybrid_search_DX3SsnBo])>, [requestId: 1f4c5a1c-8cc3-11ef-b43c-7e05d3331439] (api_request.py:57)
[2024-10-17 20:05:14,655 - ERROR - fouram]: [CheckFunc] insert request check failed, response:<MilvusException: (code=200, message=partition not found[partition=scene_test_partition_hybrid_search_DX3SsnBo])> (func_check.py:106)
[2024-10-17 20:05:14,656 - ERROR - fouram]: [func_time_catch] :  (api_request.py:127)
[2024-10-17 20:05:23,020 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-17 20:05:23,020 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     hybrid_search                                                                   2035     0(0.00%) |   4470      23   35405   2600 |    0.30        0.00 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     query                                                                            255     0(0.00%) |   5339      97   83184    740 |    0.00        0.00 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     scene_test_partition_hybrid_search                                               237     1(0.42%) | 427450    1847  827518 408000 |    0.40        0.10 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     search                                                                          2045     0(0.00%) |  27488    3236   64599  27000 |    0.70        0.00 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]:          Aggregated 
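
When triaging, it can help to pull the error code and partition name out of client log lines like the ones above so they can be matched against server logs. The following is a minimal sketch, assuming the `MilvusException: (code=..., message=...[partition=...])` layout shown in this log; the `PartitionError` type and `parse_partition_error` function are hypothetical names for illustration and are not part of pymilvus or the fouram harness.

```python
import re
from typing import NamedTuple, Optional


class PartitionError(NamedTuple):
    """Fields extracted from one client log line (hypothetical helper type)."""
    code: int
    message: str
    partition: str


# Assumes the exception text layout seen in the client log of this issue.
_PATTERN = re.compile(
    r"MilvusException: \(code=(\d+), message=([^\[]+)\[partition=([^\]]+)\]\)"
)


def parse_partition_error(line: str) -> Optional[PartitionError]:
    """Return the parsed error, or None if the line has no such exception."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return PartitionError(
        code=int(match.group(1)),
        message=match.group(2).strip(),
        partition=match.group(3),
    )
```

Lines without a matching exception return `None`, so the helper can be applied to every line of a log file and the non-matching lines filtered out.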

Expected Behavior

No response

Steps To Reproduce

Concurrent test that measures RT and QPS:

        :purpose:  `DQL & DML(partition)`
            verify the concurrent DQL & DML(partition) scenario on a collection
            with 4 vector fields (IVF_FLAT, HNSW, DISKANN, IVF_SQ8) and scalar fields `int64_1`, `varchar_1`

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim,
                'float_vector_1': 128dim,
                'float_vector_2': 128dim,
                'float_vector_3': 128dim,
                scalar field: int64_1, varchar_1
            2. build indexes:
                IVF_FLAT: 'float_vector'
                HNSW: 'float_vector_1',
                DISKANN: 'float_vector_2'
                IVF_SQ8: 'float_vector_3'
                INVERTED: 'int64_1', 'varchar_1'
                default scalar index: 'id'
            3. insert 1 million rows into 10 partitions
            4. flush collection
            5. build indexes again using the same params
            6. load collection
                replica: 1
            7. concurrent request:
                - scene_test_partition_hybrid_search
                    (partition: create->insert->flush->index again->load->hybrid_search->release->hybrid_search failed->drop)  <- insert raises error
                - search
                - hybrid_search
                - query
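
The partition lifecycle in step 7 can be sketched as below to make the failing step explicit. This is only an illustration under stated assumptions: the `client` object and its method names (`create_partition`, `insert`, `flush`, `build_indexes`, `load`, `hybrid_search`, `release`, `drop_partition`) are a hypothetical duck-typed interface, not the pymilvus API and not the fouram implementation of `scene_test_partition_hybrid_search`.

```python
from typing import Any, List, Optional


def run_partition_scene(client: Any, name: str, rows: List[Any]) -> Optional[str]:
    """Run create->insert->flush->index->load->search->release->search->drop.

    Returns None when every step behaves as expected, otherwise a short
    description of the first step that misbehaved. `client` is a hypothetical
    interface used only for this sketch.
    """
    client.create_partition(name)
    try:
        steps = [
            ("insert", lambda: client.insert(name, rows)),
            ("flush", client.flush),
            ("index", client.build_indexes),
            ("load", lambda: client.load(name)),
            ("hybrid_search", lambda: client.hybrid_search(name)),
            ("release", lambda: client.release(name)),
        ]
        for label, call in steps:
            try:
                call()
            except Exception as exc:
                # In this issue, the "insert" step is the one that failed.
                return f"{label}: {exc}"
        # After release, searching the partition is expected to fail.
        try:
            client.hybrid_search(name)
        except Exception:
            return None
        return "hybrid_search after release: unexpectedly succeeded"
    finally:
        # Always drop the temporary partition, even after a failure.
        client.drop_partition(name)
```

In the reported run, a function shaped like this would return at the `insert` step with `partition not found`, even though `create_partition` for the same name had already returned.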

Milvus Log

No response

Anything else?

test result:

[2024-10-17 20:42:59,475 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-10-17 20:42:59,475 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: grpc     hybrid_search                                                                   2569     0(0.00%) |   4356      23   35405   2600 |    0.24        0.00 (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: grpc     query                                                                            325     0(0.00%) |   4863      90   83184    640 |    0.03        0.00 (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: grpc     scene_test_partition_hybrid_search                                               303     1(0.33%) | 425817    1847  827518 411000 |    0.03        0.00 (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]: grpc     search                                                                          2605     0(0.00%) |  27140    3236   64599  27000 |    0.24        0.00 (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]:          Aggregated                                                                      5802     1(0.02%) |  36624      23  827518  16000 |    0.54        0.00 (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]:  (stats.py:790)
[2024-10-17 20:42:59,479 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_2c8m',
            'config': {'queryNode': {'resources': {'limits': {'cpu': '32.0', 'memory': '32Gi'}, 'requests': {'cpu': '17.0', 'memory': '17Gi'}}, 'replicas': 2},
                       'indexNode': {'resources': {'limits': {'cpu': '8.0', 'memory': '8Gi'}, 'requests': {'cpu': '5.0', 'memory': '5Gi'}}, 'replicas': 4},
                       'dataNode': {'resources': {'limits': {'cpu': '2.0', 'memory': '8Gi'}, 'requests': {'cpu': '2.0', 'memory': '5Gi'}}},
                       'cluster': {'enabled': True},
                       'pulsar': {},
                       'kafka': {},
                       'minio': {'metrics': {'podMonitor': {'enabled': True}}},
                       'etcd': {'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': '2.4-20241017-2bfd22f2-amd64'}}},
            'host': 'multi-vector-corn-1-1729173600-1-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_hybrid_search_locust_dql_dml_partition_hybrid_search_cluster',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_index': {'id': {}, 'int64_1': {'index_type': 'INVERTED'}, 'varchar_1': {'index_type': 'INVERTED'}},
                                                    'vectors_index': {'float_vector_1': {'index_type': 'HNSW',
                                                                                         'index_param': {'M': 8, 'efConstruction': 200},
                                                                                         'metric_type': 'L2'},
                                                                      'float_vector_2': {'index_type': 'DISKANN', 'index_param': {}, 'metric_type': 'IP'},
                                                                      'float_vector_3': {'index_type': 'IVF_SQ8',
                                                                                         'index_param': {'nlist': 2048},
                                                                                         'metric_type': 'L2'}},
                                                    'scalars_params': {'float_vector_1': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}},
                                                                       'float_vector_2': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}},
                                                                       'float_vector_3': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}}},
                                                    'extra_partitions': {'partitions': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                        'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                        'partition_9'],
                                                                         'data_repeated': False},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 1000000,
                                                    'ni_per': 10000},
                                 'collection_params': {'other_fields': ['float_vector_1', 'float_vector_2', 'float_vector_3', 'int64_1', 'varchar_1'],
                                                       'shards_num': 2},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_FLAT', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 20, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'scene_test_partition_hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 1,
                                                                  'top_k': 1,
                                                                  'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
                                                                           {'search_param': {'ef': 64}, 'anns_field': 'float_vector_1', 'top_k': 10},
                                                                           {'search_param': {'search_list': 32}, 'anns_field': 'float_vector_2', 'top_k': 30},
                                                                           {'search_param': {'nprobe': 16}, 'anns_field': 'float_vector_3', 'top_k': 400}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'data_size': 3000,
                                                                  'ni': 3000}},
                                                      {'type': 'search',
                                                       'weight': 8,
                                                       'params': {'nq': 1000,
                                                                  'top_k': 1,
                                                                  'search_param': {'nprobe': 1000},
                                                                  'expr': 'int64_1 >= 0',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'output_fields': None,
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 8,
                                                       'params': {'nq': 1,
                                                                  'top_k': 100,
                                                                  'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
                                                                           {'search_param': {'ef': 64}, 'anns_field': 'float_vector_1', 'top_k': 10},
                                                                           {'search_param': {'search_list': 32}, 'anns_field': 'float_vector_2', 'top_k': 30},
                                                                           {'search_param': {'nprobe': 16}, 'anns_field': 'float_vector_3', 'top_k': 400}],
                                                                  'rerank': {'WeightedRanker': [0.85, 0.95, 0.51, 0.32]},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1 && ',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': None,
                                                                  'ignore_growing': False,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'timeout': 600,
                                                                  'consistency_level': None,
                                                                  'random_data': True,
                                                                  'random_count': 20,
                                                                  'random_range': [0, 100000],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}}]},
            'run_id': 2024101764162756,
            'datetime': '2024-10-17 17:33:36.785500',
            'client_version': '2.4.0'},
 'result': {'test_result': {'index': {'RT': 110.8804,
                                      'float_vector_1': {'RT': 0.5163},
                                      'float_vector_2': {'RT': 6.0471},
                                      'float_vector_3': {'RT': 0.538},
                                      'id': {'RT': 0.5305},
                                      'int64_1': {'RT': 0.5163},
                                      'varchar_1': {'RT': 0.5145}},
                            'insert': {'total_time': 148.2714, 'VPS': 6749.52, 'batch_time': 1.4827, 'batch': 10000.0},
                            'flush': {'RT': 3.0467},
                            'load': {'RT': 4.2753},
                            'Locust': {'Aggregated': {'Requests': 5802,
                                                      'Fails': 1,
                                                      'RPS': 0.54,
                                                      'fail_s': 0.0,
                                                      'RT_max': 827518.71,
                                                      'RT_avg': 36624.95,
                                                      'TP50': 16000.0,
                                                      'TP99': 510000.0},
                                       'hybrid_search': {'Requests': 2569,
                                                         'Fails': 0,
                                                         'RPS': 0.24,
                                                         'fail_s': 0.0,
                                                         'RT_max': 35405.13,
                                                         'RT_avg': 4356.86,
                                                         'TP50': 2600.0,
                                                         'TP99': 27000.0},
                                       'query': {'Requests': 325,
                                                 'Fails': 0,
                                                 'RPS': 0.03,
                                                 'fail_s': 0.0,
                                                 'RT_max': 83184.82,
                                                 'RT_avg': 4863.86,
                                                 'TP50': 640.0,
                                                 'TP99': 66000.0},
                                       'scene_test_partition_hybrid_search': {'Requests': 303,
                                                                              'Fails': 1,
                                                                              'RPS': 0.03,
                                                                              'fail_s': 0.0,
                                                                              'RT_max': 827518.71,
                                                                              'RT_avg': 425817.24,
                                                                              'TP50': 411000.0,
                                                                              'TP99': 776000.0},
                                       'search': {'Requests': 2605,
                                                  'Fails': 0,
                                                  'RPS': 0.24,
                                                  'fail_s': 0.0,
                                                  'RT_max': 64599.44,
                                                  'RT_avg': 27140.8,
                                                  'TP50': 27000.0,
                                                  'TP99': 52000.0}}}}}
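
As a quick sanity check on the figures above, the `# fails` cells in the Locust table can be reproduced from the `Requests` and `Fails` values in the report. This is a small sketch; `format_fail_cell` is a hypothetical helper name, and the `count(percent%)` layout is inferred from the table in this issue rather than taken from the Locust source.

```python
def format_fail_cell(fails: int, requests: int) -> str:
    """Format a failure count as 'count(percent%)' with two decimals."""
    percent = 100.0 * fails / requests if requests else 0.0
    return f"{fails}({percent:.2f}%)"


# Values taken from the final report in this issue.
print(format_fail_cell(1, 303))   # scene_test_partition_hybrid_search
print(format_fail_cell(1, 5802))  # Aggregated
```

The single failed request is therefore rare (1 of 303 scene runs over the 3h test), which is consistent with a timing-dependent problem rather than a deterministic one.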
yanliang567 commented 4 days ago

/assign @congqixia

This sounds similar to an issue we fixed a few days ago?

/unassign