milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][cluster] proxy panic `ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded` under concurrent DQL & DML scene with enabled global mmap #36682

Open wangting0128 opened 1 month ago

wangting0128 commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: master-20240930-94005b71-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.5rc7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-8276l

server:

NAME                                                              READY   STATUS             RESTARTS           AGE     IP              NODE         NOMINATED NODE   READINESS GATES
holiday-mmap-etcd-0                                               1/1     Running            0                  8d      10.104.21.102   4am-node24   <none>           <none>
holiday-mmap-etcd-1                                               1/1     Running            0                  8d      10.104.33.80    4am-node36   <none>           <none>
holiday-mmap-etcd-2                                               1/1     Running            0                  8d      10.104.34.214   4am-node37   <none>           <none>
holiday-mmap-milvus-datanode-65f48b8c5c-5dgs6                     1/1     Running            3 (27h ago)        8d      10.104.34.212   4am-node37   <none>           <none>
holiday-mmap-milvus-indexnode-55989ffbb5-s2bkt                    1/1     Running            0                  8d      10.104.19.213   4am-node28   <none>           <none>
holiday-mmap-milvus-indexnode-55989ffbb5-s2ttg                    1/1     Running            0                  8d      10.104.18.254   4am-node25   <none>           <none>
holiday-mmap-milvus-indexnode-55989ffbb5-swn2l                    1/1     Running            0                  8d      10.104.13.107   4am-node16   <none>           <none>
holiday-mmap-milvus-mixcoord-5c664c7ff4-vhcmt                     1/1     Running            0                  8d      10.104.5.35     4am-node12   <none>           <none>
holiday-mmap-milvus-proxy-65b4cc7c8d-gbw2r                        1/1     Running            1 (27h ago)        8d      10.104.13.105   4am-node16   <none>           <none>
holiday-mmap-milvus-querynode-7fbd999fcc-28m2d                    1/1     Running            0                  8d      10.104.5.36     4am-node12   <none>           <none>
holiday-mmap-milvus-querynode-7fbd999fcc-2tqcw                    1/1     Running            0                  8d      10.104.17.148   4am-node23   <none>           <none>
holiday-mmap-milvus-querynode-7fbd999fcc-675ns                    1/1     Running            0                  8d      10.104.13.106   4am-node16   <none>           <none>
holiday-mmap-milvus-querynode-7fbd999fcc-dfv96                    1/1     Running            0                  8d      10.104.6.6      4am-node13   <none>           <none>
holiday-mmap-milvus-querynode-7fbd999fcc-l4xrq                    1/1     Running            0                  8d      10.104.34.213   4am-node37   <none>           <none>
holiday-mmap-minio-0                                              1/1     Running            0                  8d      10.104.21.82    4am-node24   <none>           <none>
holiday-mmap-minio-1                                              1/1     Running            0                  8d      10.104.17.116   4am-node23   <none>           <none>
holiday-mmap-minio-2                                              1/1     Running            0                  8d      10.104.34.91    4am-node37   <none>           <none>
holiday-mmap-minio-3                                              1/1     Running            0                  8d      10.104.33.183   4am-node36   <none>           <none>
holiday-mmap-pulsar-bookie-0                                      1/1     Running            0                  8d      10.104.21.83    4am-node24   <none>           <none>
holiday-mmap-pulsar-bookie-1                                      1/1     Running            0                  8d      10.104.19.16    4am-node28   <none>           <none>
holiday-mmap-pulsar-bookie-2                                      1/1     Running            0                  8d      10.104.34.96    4am-node37   <none>           <none>
holiday-mmap-pulsar-broker-0                                      1/1     Running            0                  8d      10.104.21.78    4am-node24   <none>           <none>
holiday-mmap-pulsar-proxy-0                                       1/1     Running            0                  8d      10.104.5.173    4am-node12   <none>           <none>
holiday-mmap-pulsar-recovery-0                                    1/1     Running            0                  8d      10.104.5.172    4am-node12   <none>           <none>
holiday-mmap-pulsar-zookeeper-0                                   1/1     Running            0                  8d      10.104.21.85    4am-node24   <none>           <none>
holiday-mmap-pulsar-zookeeper-1                                   1/1     Running            0                  8d      10.104.33.185   4am-node36   <none>           <none>
holiday-mmap-pulsar-zookeeper-2                                   1/1     Running            0                  8d      10.104.17.118   4am-node23   <none>           <none>

enabled global mmap

Screenshot 2024-10-08 14 26 57

proxy panic log: holiday-mmap-milvus-proxy-65b4cc7c8d-gbw2r_panic.log

Screenshot 2024-10-08 14 36 57
Screenshot 2024-10-08 14 37 59

client log:

Screenshot 2024-10-08 14 30 49

Expected Behavior

No response

Steps To Reproduce

1. create a collection with fields:  ['id', 'float_vector', 'float_vector_1', 'float_vector_2', 'sparse_float_vector', 'int64_1', 'float_1', 'bool_1']
2. build index
  HNSW: float_vector
  DISKANN: float_vector_1
  IVF_SQ8: float_vector_2
  SPARSE_WAND: sparse_float_vector
  AUTOINDEX: 'int64_1', 'float_1', 'bool_1'
3. insert 10m data
4. flush collection
5. build index with the same params
6. load collection
7. concurrent request:
   - search
   - query
   - hybrid_search
   - load
   - insert
   - delete
   - scene_test
     (collection: create->insert->flush->index->drop)
   - scene_hybrid_search_test: 4 vector fields, 3 scalar fields
     (collection: create->insert->flush->index->load->hybrid_search->drop)
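
The index setup in step 2 can be sketched as pymilvus-style index parameter dicts. This is only a sketch: the parameter values are copied from the test report under "Anything else?", the helper name `build_index_plan` is hypothetical, the L2 metric for `float_vector` is assumed from the dataset-level `metric_type`, and passing `AUTOINDEX` as the scalar `index_type` is an assumption based on step 2.

```python
# Hypothetical helper: collects the index settings from this report into
# dicts shaped like the `index_params` argument pymilvus expects.

def build_index_plan():
    """Return {field_name: index_params} for the collection in this report."""
    vector_indexes = {
        "float_vector": ("HNSW", "L2", {"M": 8, "efConstruction": 200}),
        "float_vector_1": ("DISKANN", "IP", {}),
        "float_vector_2": ("IVF_SQ8", "L2", {"nlist": 1024}),
        "sparse_float_vector": ("SPARSE_WAND", "IP", {"drop_ratio_build": 0.2}),
    }
    plan = {
        field: {"index_type": index_type, "metric_type": metric, "params": params}
        for field, (index_type, metric, params) in vector_indexes.items()
    }
    # Scalar fields: step 2 lists AUTOINDEX; the report shows empty params.
    for field in ("int64_1", "float_1", "bool_1"):
        plan[field] = {"index_type": "AUTOINDEX"}
    return plan


if __name__ == "__main__":
    for field, params in build_index_plan().items():
        print(field, params)
```

Each entry could then be passed to `Collection.create_index(field_name=..., index_params=...)` against a live cluster; that call is left out here so the sketch runs without a Milvus server.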

Milvus Log

No response

Anything else?

server config:

{
     "extraConfigFiles": {
          "user.yaml": "queryNode:\n  mmap:\n    mmapEnabled: true"
     },
     "queryNode": {
          "resources": {
               "limits": {
                    "cpu": "8.0",
                    "memory": "32Gi"
               },
               "requests": {
                    "cpu": "5.0",
                    "memory": "17Gi"
               }
          },
          "replicas": 5
     },
     "indexNode": {
          "resources": {
               "limits": {
                    "cpu": "8.0",
                    "memory": "16Gi"
               },
               "requests": {
                    "cpu": "4.0",
                    "memory": "3Gi"
               }
          },
          "replicas": 3
     },
     "dataNode": {
          "resources": {
               "limits": {
                    "cpu": "2.0",
                    "memory": "16Gi"
               },
               "requests": {
                    "cpu": "2.0",
                    "memory": "5Gi"
               }
          },
          "replicas": 1
     }
}
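
For readers who want to see what "enabled global mmap" means in this config, the escaped `user.yaml` string can be decoded into its YAML form. A small sketch using only the standard library; `extract_user_yaml` is a hypothetical helper name and the fragment below is copied from the config above.

```python
import json

# The `extraConfigFiles` fragment from the server config above.
values_fragment = (
    '{"extraConfigFiles": '
    '{"user.yaml": "queryNode:\\n  mmap:\\n    mmapEnabled: true"}}'
)


def extract_user_yaml(fragment: str) -> str:
    """Decode the JSON fragment and return the embedded user.yaml text."""
    return json.loads(fragment)["extraConfigFiles"]["user.yaml"]


if __name__ == "__main__":
    # Prints the three-line YAML that sets queryNode.mmap.mmapEnabled.
    print(extract_user_yaml(values_fragment))
```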

test result:

[2024-10-07 05:11:05,999 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-10-07 05:11:06,000 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     delete                                                                         96442     0(0.00%) |     13       2    3818      5 |    0.16        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     hybrid_search                                                                1931188     1(0.00%) |    288     107  153808    200 |    3.19        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     insert                                                                         96523     0(0.00%) |     61       5    6134     19 |    0.16        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     load                                                                           96705     0(0.00%) |     67       4    6562      8 |    0.16        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     query                                                                        1927685     0(0.00%) |     31       3  156398      7 |    3.19        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     scene_hybrid_search_test                                                       96457     0(0.00%) |  50860   11709  381928  34000 |    0.16        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     scene_test                                                                     96614     0(0.00%) |  66698   63072 26643289  66000 |    0.16        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: grpc     search                                                                       1930962     0(0.00%) |     40       3  144274     19 |    3.19        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]:          Aggregated                                                                   6272576     1(0.00%) |   1922       2 26643289     23 |   10.37        0.00 (stats.py:789)
[2024-10-07 05:11:06,000 -  INFO - fouram]:  (stats.py:790)
[2024-10-07 05:11:06,004 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': '',
            'deploy_mode': '',
            'config_name': '',
            'config': {},
            'host': 'holiday-mmap-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_index': {'int64_1': {}, 'float_1': {}, 'bool_1': {}},
                                                    'vectors_index': {'float_vector_1': {'index_type': 'DISKANN', 'index_param': {}, 'metric_type': 'IP'},
                                                                      'float_vector_2': {'index_type': 'IVF_SQ8',
                                                                                         'index_param': {'nlist': 1024},
                                                                                         'metric_type': 'L2'},
                                                                      'sparse_float_vector': {'index_type': 'SPARSE_WAND',
                                                                                              'index_param': {'drop_ratio_build': 0.2},
                                                                                              'metric_type': 'IP'}},
                                                    'scalars_params': {'float_vector_1': {'params': {'dim': 200}, 'other_params': {'dataset': 'text2img'}},
                                                                       'float_vector_2': {'params': {'dim': 768},
                                                                                          'other_params': {'dataset': 'laion2b_multi',
                                                                                                           'column_name': 'float32_vector'}},
                                                                       'int64_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'random_range_custom_size',
                                                                                                                         'specify_range': [-50, 52],
                                                                                                                         'base_size': '9w',
                                                                                                                         'custom_size': {'254000': [-2, -1, 0,
                                                                                                                                                    1, 2]}}}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '10m',
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['float_vector_1', 'float_vector_2', 'sparse_float_vector', 'int64_1', 'float_1',
                                                                        'bool_1'],
                                                       'shards_num': 2},
                                 'index_params': {'index_type': 'HNSW', 'index_param': {'M': 8, 'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 20, 'during_time': '168h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 20,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'search_param': {'ef': 32},
                                                                  'expr': 'bool_1 == true',
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response'}},
                                                      {'type': 'query',
                                                       'weight': 20,
                                                       'params': {'expr': '',
                                                                  'timeout': 600,
                                                                  'limit': 20,
                                                                  'random_data': True,
                                                                  'random_count': 10,
                                                                  'random_range': [-50, 52],
                                                                  'field_name': 'int64_1',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_response'}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 20,
                                                       'params': {'nq': 2,
                                                                  'top_k': 10,
                                                                  'reqs': [{'search_param': {'ef': 32},
                                                                            'anns_field': 'float_vector',
                                                                            'top_k': 20,
                                                                            'expr': 'bool_1 == false'},
                                                                           {'search_param': {'search_list': 30},
                                                                            'anns_field': 'float_vector_1',
                                                                            'top_k': 10,
                                                                            'expr': 'int64_1 > 0'},
                                                                           {'search_param': {'nprobe': 32},
                                                                            'anns_field': 'float_vector_2',
                                                                            'top_k': 12,
                                                                            'expr': 'int64_1 < 0'},
                                                                           {'search_param': {'drop_ratio_search': 0.1},
                                                                            'anns_field': 'sparse_float_vector',
                                                                            'top_k': 8,
                                                                            'expr': 'float_1 > 1000000.0'}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response'}},
                                                      {'type': 'load',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1, 'timeout': 300, 'check_task': 'check_response'}},
                                                      {'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 1, 'timeout': 600, 'random_id': 10000000, 'check_task': 'check_response'}},
                                                      {'type': 'delete',
                                                       'weight': 1,
                                                       'params': {'delete_length': 1, 'timeout': 600, 'check_task': 'check_response'}},
                                                      {'type': 'scene_test',
                                                       'weight': 1,
                                                       'params': {'dim': 128,
                                                                  'data_size': 3000,
                                                                  'nb': 3000,
                                                                  'index_type': 'IVF_SQ8',
                                                                  'index_param': {'nlist': 2048},
                                                                  'metric_type': 'L2'}},
                                                      {'type': 'scene_hybrid_search_test',
                                                       'weight': 1,
                                                       'params': {'nq': 1,
                                                                  'top_k': 1,
                                                                  'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
                                                                           {'search_param': {'nprobe': 32}, 'anns_field': 'float_vector_1', 'top_k': 10},
                                                                           {'search_param': {'ef': 32}, 'anns_field': 'float_vector_2', 'top_k': 5},
                                                                           {'search_param': {'search_list': 20}, 'anns_field': 'float_vector_3', 'top_k': 10}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'dataset': 'local',
                                                                  'dim': 128,
                                                                  'shards_num': 2,
                                                                  'data_size': 3000,
                                                                  'nb': 3000,
                                                                  'index_type': 'IVF_SQ8',
                                                                  'index_param': {'nlist': 2048},
                                                                  'metric_type': 'L2',
                                                                  'other_fields': ['float_vector_1', 'float_vector_2', 'float_vector_3', 'int64_1', 'bool_1',
                                                                                   'varchar_1'],
                                                                  'replica_number': 1,
                                                                  'scalars_params': {'float_vector_1': {'params': {'dim': 128},
                                                                                                        'other_params': {'dataset': 'sift'}},
                                                                                     'float_vector_2': {'params': {'dim': 128},
                                                                                                        'other_params': {'dataset': 'sift'}},
                                                                                     'float_vector_3': {'params': {'dim': 128},
                                                                                                        'other_params': {'dataset': 'sift'}}},
                                                                  'scalars_index': {'int64_1': {},
                                                                                    'bool_1': {'index_type': 'BITMAP'},
                                                                                    'varchar_1': {'index_type': 'INVERTED'}},
                                                                  'vectors_index': {'float_vector_1': {'index_type': 'IVF_FLAT',
                                                                                                       'index_param': {'nlist': 1024},
                                                                                                       'metric_type': 'L2'},
                                                                                    'float_vector_2': {'index_type': 'HNSW',
                                                                                                       'index_param': {'M': 8, 'efConstruction': 200},
                                                                                                       'metric_type': 'L2'},
                                                                                    'float_vector_3': {'index_type': 'DISKANN',
                                                                                                       'index_param': {},
                                                                                                       'metric_type': 'IP'}},
                                                                  'hybrid_search_counts': 10}}]},
            'run_id': 2024093054356789,
            'datetime': '2024-09-30 03:03:55.261743',
            'client_version': '2.2'},
 'result': {'test_result': {'index': {'RT': 2672.2509,
                                      'float_vector_1': {'RT': 1276.034},
                                      'float_vector_2': {'RT': 471.9704},
                                      'sparse_float_vector': {'RT': 51.7185},
                                      'int64_1': {'RT': 1.0275},
                                      'float_1': {'RT': 1.0256},
                                      'bool_1': {'RT': 0.5241}},
                            'insert': {'total_time': 2787.3243, 'VPS': 3587.6701, 'batch_time': 1.3937, 'batch': 5000},
                            'flush': {'RT': 2.5262},
                            'load': {'RT': 18.0837},
                            'Locust': {'Aggregated': {'Requests': 6272576,
                                                      'Fails': 1,
                                                      'RPS': 10.37,
                                                      'fail_s': 0.0,
                                                      'RT_max': 26643289.92,
                                                      'RT_avg': 1922.61,
                                                      'TP50': 23,
                                                      'TP99': 66000.0},
                                       'delete': {'Requests': 96442,
                                                  'Fails': 0,
                                                  'RPS': 0.16,
                                                  'fail_s': 0.0,
                                                  'RT_max': 3818.22,
                                                  'RT_avg': 13.59,
                                                  'TP50': 5,
                                                  'TP99': 190.0},
                                       'hybrid_search': {'Requests': 1931188,
                                                         'Fails': 1,
                                                         'RPS': 3.19,
                                                         'fail_s': 0.0,
                                                         'RT_max': 153808.74,
                                                         'RT_avg': 288.59,
                                                         'TP50': 200.0,
                                                         'TP99': 1200.0},
                                       'insert': {'Requests': 96523,
                                                  'Fails': 0,
                                                  'RPS': 0.16,
                                                  'fail_s': 0.0,
                                                  'RT_max': 6134.92,
                                                  'RT_avg': 61.83,
                                                  'TP50': 19,
                                                  'TP99': 1100.0},
                                       'load': {'Requests': 96705,
                                                'Fails': 0,
                                                'RPS': 0.16,
                                                'fail_s': 0.0,
                                                'RT_max': 6562.1,
                                                'RT_avg': 67.61,
                                                'TP50': 8,
                                                'TP99': 1600.0},
                                       'query': {'Requests': 1927685,
                                                 'Fails': 0,
                                                 'RPS': 3.19,
                                                 'fail_s': 0.0,
                                                 'RT_max': 156398.56,
                                                 'RT_avg': 31.64,
                                                 'TP50': 7,
                                                 'TP99': 410.0},
                                       'scene_hybrid_search_test': {'Requests': 96457,
                                                                    'Fails': 0,
                                                                    'RPS': 0.16,
                                                                    'fail_s': 0.0,
                                                                    'RT_max': 381928.18,
                                                                    'RT_avg': 50860.82,
                                                                    'TP50': 34000.0,
                                                                    'TP99': 211000.0},
                                       'scene_test': {'Requests': 96614,
                                                      'Fails': 0,
                                                      'RPS': 0.16,
                                                      'fail_s': 0.0,
                                                      'RT_max': 26643289.92,
                                                      'RT_avg': 66698.9,
                                                      'TP50': 66000.0,
                                                      'TP99': 82000.0},
                                       'search': {'Requests': 1930962,
                                                  'Fails': 0,
                                                  'RPS': 3.19,
                                                  'fail_s': 0.0,
                                                  'RT_max': 144274.73,
                                                  'RT_avg': 40.21,
                                                  'TP50': 19,
                                                  'TP99': 410.0}}}}}
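
As a sanity check on the numbers above, the per-task request counts can be compared with the configured task weights, and the aggregate req/s with the configured run length. This is a sketch with figures copied from the report; it assumes the `during_time` of 168h is the window the req/s column is computed over.

```python
# Task weights from `concurrent_tasks` and request counts from the Locust stats.
WEIGHTS = {
    "search": 20, "query": 20, "hybrid_search": 20, "load": 1, "insert": 1,
    "delete": 1, "scene_test": 1, "scene_hybrid_search_test": 1,
}
REQUESTS = {
    "search": 1930962, "query": 1927685, "hybrid_search": 1931188,
    "load": 96705, "insert": 96523, "delete": 96442,
    "scene_test": 96614, "scene_hybrid_search_test": 96457,
}
DURATION_S = 168 * 3600  # assumed measurement window, from 'during_time': '168h'


def expected_share(task: str) -> float:
    """Fraction of requests the task weight predicts."""
    return WEIGHTS[task] / sum(WEIGHTS.values())


def observed_share(task: str) -> float:
    """Fraction of requests actually recorded for the task."""
    return REQUESTS[task] / sum(REQUESTS.values())


def aggregate_rps() -> float:
    """Total requests divided by the assumed run length in seconds."""
    return sum(REQUESTS.values()) / DURATION_S


if __name__ == "__main__":
    for task in WEIGHTS:
        print(f"{task}: expected {expected_share(task):.4f}, "
              f"observed {observed_share(task):.4f}")
    print(f"aggregate req/s: {aggregate_rps():.2f}")
```

The observed shares track the weights closely, which suggests the client kept issuing every task type at its configured ratio for the whole run.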

yanliang567 commented 1 month ago

/assign @aoiasd /unassign

aoiasd commented 4 weeks ago

This may be related to https://github.com/milvus-io/milvus/issues/25767. What is the Pulsar config?

aoiasd commented 4 weeks ago

dede1921-f24f-4f46-b3d2-bdee6ec1674f

Milvus-Helm does not change the `backlogQuotaDefaultLimitGB` setting. It seems that some dead subscriptions caused the Pulsar backlog to grow beyond the limit, which made producer creation fail and the proxy panic.
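
One way to look for the dead subscriptions suspected above is to scan a topic's stats for subscriptions that hold a backlog but have no connected consumers. A minimal sketch: it assumes the stats JSON follows the Pulsar topic stats layout (a `subscriptions` map whose entries carry `msgBacklog` and `consumers`), and both the helper name and the sample data are made up for illustration.

```python
def find_dead_subscriptions(topic_stats: dict) -> list:
    """Return (name, msgBacklog) pairs for subscriptions that have a backlog
    but no connected consumers, largest backlog first."""
    dead = []
    for name, sub in topic_stats.get("subscriptions", {}).items():
        backlog = sub.get("msgBacklog", 0)
        if backlog > 0 and not sub.get("consumers"):
            dead.append((name, backlog))
    return sorted(dead, key=lambda item: item[1], reverse=True)


# Hypothetical stats for one topic; not taken from this cluster.
sample_stats = {
    "subscriptions": {
        "active-sub": {"msgBacklog": 12, "consumers": [{"consumerName": "c1"}]},
        "stale-sub-a": {"msgBacklog": 1200000, "consumers": []},
        "stale-sub-b": {"msgBacklog": 5000, "consumers": []},
        "idle-sub": {"msgBacklog": 0, "consumers": []},
    }
}

if __name__ == "__main__":
    for name, backlog in find_dead_subscriptions(sample_stats):
        print(name, backlog)
```

In practice the input would come from the Pulsar admin interface for each Milvus topic; fetching it is omitted here so the sketch stays self-contained.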

xiaofan-luan commented 4 weeks ago

CPU usage keeps increasing, and memory usage is also increasing.

My guess:

  1. Compaction cannot keep up.
  2. More data accumulates in the system, which drives CPU usage higher.
  3. QueryNode stream consumption cannot keep up.
  4. Data accumulates in Pulsar.
  5. This causes the Pulsar backlog issue.
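
The chain guessed above comes down to simple arithmetic: once consumers drain a topic more slowly than producers fill it, the backlog grows at the difference and reaches the quota after a fixed time. A rough sketch with hypothetical rates and limit; none of the numbers are measured from this cluster.

```python
def seconds_until_quota(limit_bytes, produce_bps, consume_bps, current_backlog=0):
    """Estimate seconds until the backlog reaches the quota.

    Returns 0.0 if the backlog is already at or over the limit, and None if
    consumers keep up (the backlog does not grow).
    """
    if current_backlog >= limit_bytes:
        return 0.0
    growth = produce_bps - consume_bps
    if growth <= 0:
        return None
    return (limit_bytes - current_backlog) / growth


if __name__ == "__main__":
    GiB, MiB = 1024 ** 3, 1024 ** 2
    # Hypothetical: 10 GiB quota, 2 MiB/s produced, 1 MiB/s consumed.
    eta = seconds_until_quota(10 * GiB, 2 * MiB, 1 * MiB)
    print(f"quota reached after about {eta / 3600:.1f} hours")
```

A subscription with no consumer at all is the extreme case, where the consume rate is zero and the backlog grows at the full produce rate.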

xiaofan-luan commented 4 weeks ago

The open questions are:

  1. Why can't compaction keep up?
  2. Why does the quota limitation not take effect?