milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][cluster] search RT almost doubled after enabling streamingNode #36804

Open wangting0128 opened 1 month ago

wangting0128 commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: master-20241011-3fe0f829-amd64
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.5rc7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-streaming-node-corn-1728615600
test case name: test_bitmap_locust_dql_dml_partition_key_cluster

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-streami15600-1-63-8844-etcd-0                             1/1     Running     0                6h47m   10.104.17.43    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-etcd-1                             1/1     Running     0                6h47m   10.104.19.190   4am-node28   <none>           <none>
fouramf-streami15600-1-63-8844-etcd-2                             1/1     Running     0                6h47m   10.104.20.152   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-8ddv7   1/1     Running     1 (6h43m ago)    6h47m   10.104.14.20    4am-node18   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-8pfrl   1/1     Running     1 (6h43m ago)    6h47m   10.104.23.25    4am-node27   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-hftd6   1/1     Running     1 (6h43m ago)    6h47m   10.104.19.188   4am-node28   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-jppv7   1/1     Running     1 (6h43m ago)    6h47m   10.104.1.84     4am-node10   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-mzh45   1/1     Running     1 (6h43m ago)    6h47m   10.104.13.167   4am-node16   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-mzzd6   1/1     Running     0                6h47m   10.104.6.193    4am-node13   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-ptgqq   1/1     Running     1 (6h43m ago)    6h47m   10.104.18.162   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-qnq6b   1/1     Running     1 (6h43m ago)    6h47m   10.104.25.250   4am-node30   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-r5vwm   1/1     Running     0                6h47m   10.104.4.30     4am-node11   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-tdvgt   1/1     Running     1 (6h43m ago)    6h47m   10.104.17.35    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-7k9qc   1/1     Running     0                6h47m   10.104.6.194    4am-node13   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-8rgnw   1/1     Running     0                6h47m   10.104.20.147   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-nft65   1/1     Running     0                6h47m   10.104.30.147   4am-node38   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-x45z8   1/1     Running     0                6h47m   10.104.1.85     4am-node10   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-mixcoord-8668dd7f99-24fwm   1/1     Running     1 (6h43m ago)    6h47m   10.104.30.146   4am-node38   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-proxy-654bbf597f-q6kgl      1/1     Running     1 (6h43m ago)    6h47m   10.104.9.223    4am-node14   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf445887gwx9   1/1     Running     0                6h47m   10.104.34.160   4am-node37   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588db2qn   1/1     Running     0                6h47m   10.104.9.224    4am-node14   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588fcsls   1/1     Running     0                6h47m   10.104.21.172   4am-node24   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588lqpgn   1/1     Running     0                6h47m   10.104.18.163   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588q9gbl   1/1     Running     0                6h47m   10.104.4.29     4am-node11   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-streamingnode-7dcd45vjxg7   1/1     Running     1 (6h43m ago)    6h47m   10.104.17.36    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-minio-0                            1/1     Running     0                6h47m   10.104.18.165   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-minio-1                            1/1     Running     0                6h47m   10.104.17.44    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-minio-2                            1/1     Running     0                6h47m   10.104.33.198   4am-node36   <none>           <none>
fouramf-streami15600-1-63-8844-minio-3                            1/1     Running     0                6h47m   10.104.20.153   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-0                    1/1     Running     0                6h47m   10.104.30.148   4am-node38   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-1                    1/1     Running     0                6h47m   10.104.17.45    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-2                    1/1     Running     0                6h47m   10.104.20.154   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-init-ww94p           0/1     Completed   0                6h47m   10.104.17.33    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-broker-0                    1/1     Running     0                6h47m   10.104.13.168   4am-node16   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-proxy-0                     1/1     Running     0                6h47m   10.104.18.161   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-pulsar-init-tbfcf           0/1     Completed   0                6h47m   10.104.9.225    4am-node14   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-recovery-0                  1/1     Running     0                6h47m   10.104.19.187   4am-node28   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-zookeeper-0                 1/1     Running     0                6h47m   10.104.17.42    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-zookeeper-1                 1/1     Running     0                6h47m   10.104.33.207   4am-node36   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-zookeeper-2                 1/1     Running     0                6h46m   10.104.19.196   4am-node28   <none>           <none>

streamingNode enabled 👇

[image]

[Screenshot 2024-10-12 10 55 33]

streamingNode disabled 👇 release name: fouramf-bitmap-scenes-q27w2-7

[image]

[Screenshot 2024-10-12 10 56 55]

client log:

[Screenshot 2024-10-12 10 52 30]

Expected Behavior

No response

Steps To Reproduce

concurrent test and calculation of RT and QPS

        :purpose:  `partition_key on scalar int64_1 field`, shards_num=16
            verify the DQL & DML scenario on a collection
            with 1 vector field (IVF_SQ8) and a `BITMAP` index built on all 12 supported scalar fields

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim

                'int64_1': partition_key, num_partitions=1024
                all scalar fields: varchar max_length=100, array max_capacity=13
            2. build indexes:
                IVF_SQ8: 'float_vector'

                BITMAP: all scalar fields
            3. insert 50 million rows
            4. flush collection
            5. build indexes again using the same params
            6. load collection
                replica: 1
            7. concurrent request:
                - search
                - query
                - hybrid_search
                - load
                - insert
                - delete: delete data 90%
                - flush: ignore RateLimiter
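
Steps 2 and 5 (building the same indexes twice) can be sketched as a small index plan. This is a minimal sketch, not the test's actual code: the field names and index parameters are copied from the report data in this issue, `build_index_plan` and `apply_index_plan` are hypothetical helper names, and `collection` is assumed to be a pymilvus `Collection` (any object exposing `create_index(field_name, index_params)` would work).

```python
# Field names and index parameters are copied from the report data in this issue.
SCALAR_FIELDS = [
    "int8_1", "int16_1", "int32_1", "int64_1", "varchar_1", "bool_1",
    "array_int8_1", "array_int16_1", "array_int32_1", "array_int64_1",
    "array_varchar_1", "array_bool_1",
]


def build_index_plan():
    """Return (field_name, index_params) pairs: IVF_SQ8 on the vector, BITMAP on scalars."""
    plan = [(
        "float_vector",
        {"index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 1024}},
    )]
    plan += [(name, {"index_type": "BITMAP"}) for name in SCALAR_FIELDS]
    return plan


def apply_index_plan(collection, plan):
    """Hypothetical helper: issue one create_index call per planned field."""
    for field_name, index_params in plan:
        collection.create_index(field_name=field_name, index_params=index_params)
```

Calling `apply_index_plan(collection, build_index_plan())` before the insert and again after the flush would mirror steps 2 and 5.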

Milvus Log

No response

Anything else?

test result:

[2024-10-11 09:48:33,214 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     delete                                                                            36     0(0.00%) | 108027       7 1324820     21 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     flush                                                                             31     0(0.00%) | 722038    1130 2007647 344000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     hybrid_search                                                                     16    4(25.00%) |1409925       0 23626801832000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     insert                                                                            28     0(0.00%) |  83340      21 1322431     96 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     load                                                                              19     0(0.00%) |  68352      22 1256742    360 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]: grpc     query                                                                             17     0(0.00%) |1307332  277148 23583891260000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]: grpc     search                                                                            28     0(0.00%) |1532973  766062 23839951592000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]:          Aggregated                                                                       175     4(2.29%) | 672063       0 2383995  12000 |    0.02        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]:  (stats.py:790)
[2024-10-11 09:48:33,218 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_2c2m',
            'config': {'queryNode': {'resources': {'limits': {'cpu': '16.0', 'memory': '32Gi'}, 'requests': {'cpu': '9.0', 'memory': '17Gi'}}, 'replicas': 5},
                       'indexNode': {'resources': {'limits': {'cpu': '4.0', 'memory': '8Gi'}, 'requests': {'cpu': '3.0', 'memory': '5Gi'}}, 'replicas': 4},
                       'dataNode': {'resources': {'limits': {'cpu': '4.0', 'memory': '16Gi'}, 'requests': {'cpu': '3.0', 'memory': '9Gi'}}, 'replicas': 10},
                       'cluster': {'enabled': True},
                       'pulsar': {},
                       'kafka': {},
                       'minio': {'metrics': {'podMonitor': {'enabled': True}}},
                       'etcd': {'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'streaming': {'enabled': True},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20241011-3fe0f829-amd64'}}},
            'host': 'fouramf-streami15600-1-63-8844-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_bitmap_locust_dql_dml_partition_key_cluster',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'max_length': 512,
                                                    'scalars_index': {'int8_1': {'index_type': 'BITMAP'},
                                                                      'int16_1': {'index_type': 'BITMAP'},
                                                                      'int32_1': {'index_type': 'BITMAP'},
                                                                      'int64_1': {'index_type': 'BITMAP'},
                                                                      'varchar_1': {'index_type': 'BITMAP'},
                                                                      'bool_1': {'index_type': 'BITMAP'},
                                                                      'array_int8_1': {'index_type': 'BITMAP'},
                                                                      'array_int16_1': {'index_type': 'BITMAP'},
                                                                      'array_int32_1': {'index_type': 'BITMAP'},
                                                                      'array_int64_1': {'index_type': 'BITMAP'},
                                                                      'array_varchar_1': {'index_type': 'BITMAP'},
                                                                      'array_bool_1': {'index_type': 'BITMAP'}},
                                                    'scalars_params': {'int64_1': {'params': {'is_partition_key': True}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 50000000,
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1', 'bool_1', 'array_int8_1',
                                                                        'array_int16_1', 'array_int32_1', 'array_int64_1', 'array_varchar_1', 'array_bool_1'],
                                                       'shards_num': 16,
                                                       'num_partitions': 1024},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_SQ8', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 15, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 1000,
                                                                  'top_k': 10,
                                                                  'search_param': {'nprobe': 16},
                                                                  'expr': 'int8_1 == 100',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'output_fields': ['id', 'float_vector', 'int64_1'],
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 3000,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'nq': 1000}}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': 10,
                                                                  'ignore_growing': False,
                                                                  'partition_names': None,
                                                                  'timeout': 3000,
                                                                  'consistency_level': None,
                                                                  'random_data': False,
                                                                  'random_count': 0,
                                                                  'random_range': [0, 1],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_query_output',
                                                                  'check_items': {'expect_length': 10}}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'reqs': [{'search_param': {'nprobe': 32},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': '(array_contains_any(array_int32_1, [0]) || array_contains(array_int64_1, '
                                                                                    '1)) || ((varchar_1 like "1%") and (bool_1 == True))',
                                                                            'top_k': 30},
                                                                           {'search_param': {'nprobe': 64},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': 'not (int16_1 == int8_1) && ARRAY_CONTAINS_ANY(array_int64_1, [-1, 0, '
                                                                                    '1])'}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'timeout': 3000,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'output_fields': ['int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1',
                                                                                                    'bool_1', 'array_int8_1', 'array_int16_1', 'array_int32_1',
                                                                                                    'array_int64_1', 'array_varchar_1', 'array_bool_1', 'id',
                                                                                                    'float_vector'],
                                                                                  'nq': 10}}},
                                                      {'type': 'load',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1, 'timeout': 180, 'check_task': 'check_response', 'check_items': None}},
                                                      {'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 10,
                                                                  'timeout': 30,
                                                                  'random_id': True,
                                                                  'random_vector': True,
                                                                  'varchar_filled': False,
                                                                  'start_id': 50000000,
                                                                  'shuffle_id': False,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'delete',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'delete_length': 9,
                                                                  'timeout': 30,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 600,
                                                                  'check_task': 'check_ignore_expected_errors',
                                                                  'check_items': [{'message': 'request is rejected by grpc RateLimiter middleware, please '
                                                                                              'retry later'},
                                                                                  {'message': 'wait for flush timeout'}]}}]},
            'run_id': 2024101156796516,
            'datetime': '2024-10-11 03:01:19.694088',
            'client_version': '2.4.0'},
 'result': {'test_result': {'index': {'RT': 4672.48,
                                      'int8_1': {'RT': 0.9997},
                                      'int16_1': {'RT': 0.5649},
                                      'int32_1': {'RT': 0.6071},
                                      'int64_1': {'RT': 0.548},
                                      'varchar_1': {'RT': 0.5494},
                                      'bool_1': {'RT': 0.7083},
                                      'array_int8_1': {'RT': 0.7709},
                                      'array_int16_1': {'RT': 0.5407},
                                      'array_int32_1': {'RT': 0.5794},
                                      'array_int64_1': {'RT': 0.5391},
                                      'array_varchar_1': {'RT': 0.541},
                                      'array_bool_1': {'RT': 0.547}},
                            'insert': {'total_time': 7252.4755, 'VPS': 6894.1977, 'batch_time': 0.7252, 'batch': 5000},
                            'flush': {'RT': 141.8577},
                            'load': {'RT': 21.2851},
                            'Locust': {'Aggregated': {'Requests': 175,
                                                      'Fails': 4,
                                                      'RPS': 0.02,
                                                      'fail_s': 0.02,
                                                      'RT_max': 2383995.06,
                                                      'RT_avg': 672063.63,
                                                      'TP50': 12000.0,
                                                      'TP99': 2382000.0},
                                       'delete': {'Requests': 36,
                                                  'Fails': 0,
                                                  'RPS': 0.0,
                                                  'fail_s': 0.0,
                                                  'RT_max': 1324820.55,
                                                  'RT_avg': 108027.29,
                                                  'TP50': 24,
                                                  'TP99': 1325000.0},
                                       'flush': {'Requests': 31,
                                                 'Fails': 0,
                                                 'RPS': 0.0,
                                                 'fail_s': 0.0,
                                                 'RT_max': 2007647.31,
                                                 'RT_avg': 722038.97,
                                                 'TP50': 344000.0,
                                                 'TP99': 2008000.0},
                                       'hybrid_search': {'Requests': 16,
                                                         'Fails': 4,
                                                         'RPS': 0.0,
                                                         'fail_s': 0.25,
                                                         'RT_max': 2362680.11,
                                                         'RT_avg': 1409925.17,
                                                         'TP50': 1982000.0,
                                                         'TP99': 2363000.0},
                                       'insert': {'Requests': 28,
                                                  'Fails': 0,
                                                  'RPS': 0.0,
                                                  'fail_s': 0.0,
                                                  'RT_max': 1322431.89,
                                                  'RT_avg': 83340.22,
                                                  'TP50': 99,
                                                  'TP99': 1322000.0},
                                       'load': {'Requests': 19,
                                                'Fails': 0,
                                                'RPS': 0.0,
                                                'fail_s': 0.0,
                                                'RT_max': 1256742.25,
                                                'RT_avg': 68352.56,
                                                'TP50': 360.0,
                                                'TP99': 1257000.0},
                                       'query': {'Requests': 17,
                                                 'Fails': 0,
                                                 'RPS': 0.0,
                                                 'fail_s': 0.0,
                                                 'RT_max': 2358389.62,
                                                 'RT_avg': 1307332.69,
                                                 'TP50': 1260000.0,
                                                 'TP99': 2358000.0},
                                       'search': {'Requests': 28,
                                                  'Fails': 0,
                                                  'RPS': 0.0,
                                                  'fail_s': 0.0,
                                                  'RT_max': 2383995.06,
                                                  'RT_avg': 1532973.61,
                                                  'TP50': 1682000.0,
                                                  'TP99': 2384000.0}}}}}
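
As a sanity check on the report above, the aggregated `RT_avg` can be recomputed as the request-weighted mean of the per-task averages. The sketch below uses only numbers copied from the report. The conversion to minutes assumes the Locust response times are in milliseconds (Locust's usual unit), which the report itself does not state; `weighted_rt_avg` is a hypothetical helper name.

```python
# task: (Requests, RT_avg), copied from the 'Locust' section of the report
per_task = {
    "delete": (36, 108027.29),
    "flush": (31, 722038.97),
    "hybrid_search": (16, 1409925.17),
    "insert": (28, 83340.22),
    "load": (19, 68352.56),
    "query": (17, 1307332.69),
    "search": (28, 1532973.61),
}


def weighted_rt_avg(stats):
    """Request-weighted mean of the per-task average response times."""
    total = sum(n for n, _ in stats.values())
    return sum(n * rt for n, rt in stats.values()) / total


total_requests = sum(n for n, _ in per_task.values())
aggregated = weighted_rt_avg(per_task)
# Assumption: RT values are milliseconds, so divide by 60000 to get minutes.
search_avg_minutes = per_task["search"][1] / 1000 / 60

print(total_requests)                # 175, matching 'Requests' under 'Aggregated'
print(round(aggregated, 2))          # compare with the reported RT_avg of 672063.63
print(round(search_avg_minutes, 1))  # average search RT in minutes
```

Under the millisecond assumption, the average `search` response time in this run is roughly 25.5 minutes, and 175 requests over the 3h run is consistent with the reported RPS of 0.02.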
chyezh commented 1 month ago

It seems that the difference in flush policy makes the final segment sizes different. The workload is also too high, so the root cause may be the scheduling policy of the querynode: most of the cost is queue time rather than execution time.
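
To illustrate how queue time can dominate under overload, here is a toy FIFO model. It is not Milvus's querynode scheduler: the worker count, the execution time, and the `fifo_latencies` helper are all invented for illustration. Only the figure of 15 concurrent clients comes from the test parameters.

```python
import heapq


def fifo_latencies(arrivals, exec_time, workers):
    """Return (queue_time, exec_time) per request for a FIFO pool of identical workers."""
    free_at = [0.0] * workers  # min-heap of times at which each worker becomes idle
    result = []
    for t in sorted(arrivals):
        start = max(t, heapq.heappop(free_at))  # wait for the earliest idle worker
        heapq.heappush(free_at, start + exec_time)
        result.append((start - t, exec_time))
    return result


# 15 requests arrive at once; 2 workers and 10 s per request are made-up values.
lat = fifo_latencies([0.0] * 15, exec_time=10.0, workers=2)
queue_total = sum(q for q, _ in lat)
exec_total = sum(e for _, e in lat)
print(queue_total, exec_total)  # 490.0 150.0
```

In this toy setup the requests spend more than three times as long waiting as executing, even though each individual execution is unchanged.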

chyezh commented 1 month ago

may be related to #36761

yanliang567 commented 1 month ago

/unassign

chyezh commented 1 month ago

#36761 is merged, but I am not sure that it fixes this issue.

@wangting0128 please rerun this test with commit f0f5147aefe581b87e30b7b144dc801d7926322e. Thanks!

wangting0128 commented 1 month ago

> #36761 is merged, but I am not sure that it fixes this issue. @wangting0128 please rerun this test with commit f0f5147aefe581b87e30b7b144dc801d7926322e. Thanks!

Verification failed

argo task: fouramf-kmshk
image: master-20241014-d566b0ce-amd64

server:

NAME                                                              READY   STATUS             RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
verify-36804-rt-etcd-0                                            1/1     Running            0               9h      10.104.18.119   4am-node25   <none>           <none>
verify-36804-rt-etcd-1                                            1/1     Running            0               9h      10.104.34.235   4am-node37   <none>           <none>
verify-36804-rt-etcd-2                                            1/1     Running            0               9h      10.104.19.59    4am-node28   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-2xctx                  1/1     Running            2 (9h ago)      9h      10.104.32.154   4am-node39   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-6mxxn                  1/1     Running            2 (9h ago)      9h      10.104.4.149    4am-node11   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-7x59m                  1/1     Running            1 (9h ago)      9h      10.104.14.14    4am-node18   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-7xz8j                  1/1     Running            2 (9h ago)      9h      10.104.17.25    4am-node23   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-c5gt4                  1/1     Running            2 (9h ago)      9h      10.104.15.61    4am-node20   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-mz7p6                  1/1     Running            2 (9h ago)      9h      10.104.18.115   4am-node25   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-r8nzw                  1/1     Running            2 (9h ago)      9h      10.104.25.133   4am-node30   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-vdsww                  1/1     Running            2 (9h ago)      9h      10.104.9.182    4am-node14   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-xwdv4                  1/1     Running            1 (9h ago)      9h      10.104.13.126   4am-node16   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-z6k7m                  1/1     Running            2 (9h ago)      9h      10.104.19.52    4am-node28   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-msbkh                  1/1     Running            2 (9h ago)      9h      10.104.4.150    4am-node11   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-qm5xp                  1/1     Running            2 (9h ago)      9h      10.104.34.230   4am-node37   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-rwrl6                  1/1     Running            0               9h      10.104.14.16    4am-node18   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-sbp24                  1/1     Running            2 (9h ago)      9h      10.104.5.14     4am-node12   <none>           <none>
verify-36804-rt-milvus-mixcoord-c7cc55b48-xfq49                   1/1     Running            1 (9h ago)      9h      10.104.14.17    4am-node18   <none>           <none>
verify-36804-rt-milvus-proxy-5cb97c6d46-rdb62                     1/1     Running            3 (9h ago)      9h      10.104.4.151    4am-node11   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-86wj5                 1/1     Running            2 (9h ago)      9h      10.104.9.183    4am-node14   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-hqdm5                 1/1     Running            2 (9h ago)      9h      10.104.4.152    4am-node11   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-krb4q                 1/1     Running            0               9h      10.104.14.18    4am-node18   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-s5277                 1/1     Running            2 (9h ago)      9h      10.104.15.62    4am-node20   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-wz7nb                 1/1     Running            1 (9h ago)      9h      10.104.20.147   4am-node22   <none>           <none>
verify-36804-rt-milvus-streamingnode-59dbdd5fc8-brg85             1/1     Running            3 (9h ago)      9h      10.104.4.147    4am-node11   <none>           <none>
verify-36804-rt-minio-0                                           1/1     Running            0               9h      10.104.20.149   4am-node22   <none>           <none>
verify-36804-rt-minio-1                                           1/1     Running            0               9h      10.104.34.236   4am-node37   <none>           <none>
verify-36804-rt-minio-2                                           1/1     Running            0               9h      10.104.17.27    4am-node23   <none>           <none>
verify-36804-rt-minio-3                                           1/1     Running            0               9h      10.104.19.60    4am-node28   <none>           <none>
verify-36804-rt-pulsar-bookie-0                                   1/1     Running            0               9h      10.104.30.6     4am-node38   <none>           <none>
verify-36804-rt-pulsar-bookie-1                                   1/1     Running            0               9h      10.104.34.239   4am-node37   <none>           <none>
verify-36804-rt-pulsar-bookie-2                                   1/1     Running            0               9h      10.104.17.33    4am-node23   <none>           <none>
verify-36804-rt-pulsar-bookie-init-jtskk                          0/1     Completed          0               9h      10.104.4.148    4am-node11   <none>           <none>
verify-36804-rt-pulsar-broker-0                                   1/1     Running            0               9h      10.104.14.15    4am-node18   <none>           <none>
verify-36804-rt-pulsar-proxy-0                                    1/1     Running            0               9h      10.104.5.15     4am-node12   <none>           <none>
verify-36804-rt-pulsar-pulsar-init-hnnh5                          0/1     Completed          0               9h      10.104.14.12    4am-node18   <none>           <none>
verify-36804-rt-pulsar-recovery-0                                 1/1     Running            0               9h      10.104.14.13    4am-node18   <none>           <none>
verify-36804-rt-pulsar-zookeeper-0                                1/1     Running            0               9h      10.104.19.55    4am-node28   <none>           <none>
verify-36804-rt-pulsar-zookeeper-1                                1/1     Running            0               9h      10.104.24.66    4am-node29   <none>           <none>
verify-36804-rt-pulsar-zookeeper-2                                1/1     Running            0               9h      10.104.21.206   4am-node24   <none>           <none>

image

client log: hybrid_search request timeout

Screenshot 2024-10-15 10 54 03

@chyezh

chyezh commented 1 month ago

After the flushing policy was fixed:

First

Milvus with the streaming service ends up with 1.11k sealed segments, while Milvus without the streaming service ends up with 2k sealed segments. So the segments in Milvus with the streaming service are about twice the size of those in Milvus without it. This is the major difference between the two test cases.

Streaming:

image

No Streaming:

image

Second

The message consumer of Milvus with the streaming service works correctly, so the problem is not introduced by the streaming service.

image
[2024/10/14 19:14:50.024 +00:00] [DEBUG] [pipeline/insert_node.go:80] ["pipeline fetch insert msg"] [collectionID=453223040984548132] [segmentID=453223041097817164] [insertRowNum=1] [timestampMin=453229488314253321] [timestampMax=453229488314253321]

The timestamp `453229488314253321` is `2024-10-14 19:14:49.773`
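
The conversion above can be reproduced with a short sketch. It assumes the Milvus hybrid timestamp layout in which the low 18 bits hold a logical counter and the remaining high bits hold the physical time in milliseconds since the Unix epoch; `parse_hybrid_ts` and `LOGICAL_BITS` are names chosen here for illustration, not Milvus identifiers.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the low 18 bits are a logical counter, the rest is epoch milliseconds.
LOGICAL_BITS = 18


def parse_hybrid_ts(ts):
    """Split a hybrid timestamp into (UTC wall-clock time, logical counter)."""
    physical_ms = ts >> LOGICAL_BITS
    logical = ts & ((1 << LOGICAL_BITS) - 1)
    # Integer timedelta arithmetic avoids float rounding on the milliseconds.
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(milliseconds=physical_ms), logical


wall_time, logical = parse_hybrid_ts(453229488314253321)
# Trim microseconds to milliseconds to match the log format.
formatted = wall_time.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
print(formatted, logical)
```

Under that assumption the message was produced about 250 ms before the `pipeline fetch insert msg` log line at 19:14:50.024, which is consistent with the consumer keeping up.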

Third

Found that some requests still wait on tsafe for a long time, with or without streaming, and the ProcessInsert delay increases periodically:

No Streaming:

image image

Streaming:

image image

Fourth

Found that inserting a new message took about 20 min when a new segment was being created (19:14:50 to 19:35:09 in the logs below).

[2024/10/14 19:14:50.024 +00:00] [DEBUG] [pipeline/insert_node.go:80] ["pipeline fetch insert msg"] [collectionID=453223040984548132] [segmentID=453223041097817164] [insertRowNum=1] [timestampMin=453229488314253321] [timestampMax=453229488314253321]
...
[2024/10/14 19:35:09.020 +00:00] [INFO] [delegator/delegator_data.go:341] ["add growing segments to delegator"] [collectionID=453223040984548132] [channel=by-dev-rootcoord-dml_3_453223040984548132v3] [replicaID=453223041145765889] [segmentIDs="[453223041097817164]"]
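
The gap between the two log lines shown above can be checked with a small sketch. It assumes only the timestamp format visible in the logs; `parse_log_time` is a helper name introduced here for illustration.

```python
from datetime import datetime

# Timestamp format as it appears in the log lines, e.g. "2024/10/14 19:14:50.024 +00:00".
LOG_TIME_FORMAT = "%Y/%m/%d %H:%M:%S.%f %z"


def parse_log_time(stamp):
    """Parse the bracketed timestamp of a Milvus log line (brackets already removed)."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


fetched = parse_log_time("2024/10/14 19:14:50.024 +00:00")  # "pipeline fetch insert msg"
added = parse_log_time("2024/10/14 19:35:09.020 +00:00")    # "add growing segments to delegator"
elapsed = added - fetched
print(elapsed)
```

Both lines refer to segmentID 453223041097817164, so `elapsed` is the time between fetching the insert message and the growing segment becoming visible to the delegator.
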
chyezh commented 1 month ago

Found that the insert operation is blocked on acquiring the mutex growingSegmentLock, which is also acquired by ReleaseSegments.

The release of segment 453223041097816587 took 1h3m, because the release operation is blocked by distribution expiration.

[2024/10/14 18:32:37.675 +00:00] [INFO] [querynodev2/services.go:545] ["received release segment request"] [traceID=be09552d1e1e59448c7da62ffb1a9f5f] [collectionID=453223040984548132] [shard=by-dev-rootcoord-dml_3_453223040984548132v3] [segmentIDs="[453223041097816587]"] [currentNodeID=6] [scope=Streaming] [needTransfer=true]
...
[2024/10/14 19:35:09.020 +00:00] [INFO] [segments/segment.go:1467] ["delete segment from memory"] [traceID=be09552d1e1e59448c7da62ffb1a9f5f] [collectionID=453223040984548132] [partitionID=453223040984548984] [segmentID=453223041097816587] [segmentType=Growing] [insertCount=1]
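
The blocking pattern described above can be illustrated with a minimal sketch. This is not Milvus code (Milvus is written in Go); `growing_segment_lock`, `release_segments`, and `try_insert` are hypothetical Python stand-ins, and the event `distribution_ready` stands in for whatever the release operation waits on while holding the lock.

```python
import threading

growing_segment_lock = threading.Lock()   # hypothetical stand-in for growingSegmentLock
release_holding = threading.Event()       # set once the releaser owns the lock
distribution_ready = threading.Event()    # stands in for the condition the release waits on


def release_segments():
    """Hold the lock for the whole release, including the wait."""
    with growing_segment_lock:
        release_holding.set()
        distribution_ready.wait()


def try_insert(timeout):
    """Return True if the insert path could take the lock within `timeout` seconds."""
    acquired = growing_segment_lock.acquire(timeout=timeout)
    if acquired:
        growing_segment_lock.release()
    return acquired


releaser = threading.Thread(target=release_segments)
releaser.start()
release_holding.wait()

blocked_result = try_insert(timeout=0.05)    # release still waiting, insert cannot proceed
distribution_ready.set()
releaser.join()
unblocked_result = try_insert(timeout=1.0)   # lock is free again
print(blocked_result, unblocked_result)
```

As long as the release path waits while holding the lock, every insert on the same lock stalls for the same duration, which matches the insert and release log lines finishing at the same instant (19:35:09.020).
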
chyezh commented 1 month ago

The tsafe timeout should be fixed by PR #36997. Another difference found:

After the flushing policy was fixed, Milvus with the streaming service continuously generates flushed segments, and compaction runs more frequently and smoothly.

image

The flush operation is triggered by the binlog-file-number policy.

image

So Milvus with the streaming service performs more compaction and handoff operations than Milvus without it, and ends up with fewer segments: about 1000 L1 sealed segments.

Meanwhile, Milvus without the streaming service does not generate flushed segments as smoothly.

image

It performs fewer compaction and handoff operations and ends up with about 1750 L1 sealed segments.

So Milvus without the streaming service encounters fewer race conditions during handoff than Milvus with the streaming service, and achieves better RT.

chyezh commented 1 month ago

@wangting0128 please retry the test at commit ac178eeea569cb5c1f86e57ebe448ac4e15f4cb4. thx.

chyezh commented 1 month ago

At the latest commit, the tsafe problem is fixed, but the search latency is still high.

image
chyezh commented 1 month ago

Found that commit f43527e increases the RT, while 03a78ec keeps it unchanged.

chyezh commented 1 month ago

Found that scalar search latency increased: df7070e2-95be-40b5-8a0f-f43e004753f2

b1e520f2-2963-4ca5-b09a-774ebb2e72e4

xiaofan-luan commented 4 weeks ago

Found that scalar search latency increased: df7070e2-95be-40b5-8a0f-f43e004753f2

b1e520f2-2963-4ca5-b09a-774ebb2e72e4

What is master being compared with here? Could this be impacted by null?

wangting0128 commented 3 weeks ago

Found that scalar search latency increased: df7070e2-95be-40b5-8a0f-f43e004753f2 b1e520f2-2963-4ca5-b09a-774ebb2e72e4

What is master being compared with here? Could this be impacted by null?

This is a comparison of the deployment of instances with and without streamingNode on the same case.

chyezh commented 3 weeks ago

This is a comparison of the deployment of instances with and without streamingNode on the same case.

Nope, both tests ran on Milvus without streaming enabled, at different commits.

xiaofan-luan commented 3 weeks ago

The tsafe timeout should be fixed by PR #36997. Another difference found:

After the flushing policy was fixed, Milvus with the streaming service continuously generates flushed segments, and compaction runs more frequently and smoothly. image The flush operation is triggered by the binlog-file-number policy. image So Milvus with the streaming service performs more compaction and handoff operations than Milvus without it, and ends up with fewer segments: about 1000 L1 sealed segments.

Meanwhile, Milvus without the streaming service does not generate flushed segments as smoothly. image It performs fewer compaction and handoff operations and ends up with about 1750 L1 sealed segments.

So Milvus without the streaming service encounters fewer race conditions during handoff than Milvus with the streaming service, and achieves better RT.

Is there a special reason why so many binlogs are actually generated?

chyezh commented 2 weeks ago

Is there a special reason why so many binlogs are actually generated?

There's a binlog-num-based flush policy in Milvus. In the previous implementation:

  1. Milvus without streaming: uses stats-log-num as the "binlog-num".
  2. Milvus with streaming: uses the real bin-log-num as the "binlog-num", so the count is multiplied by the field count.

It has been fixed by #37037; Milvus with streaming is now consistent with Milvus without streaming.
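
The effect of the two counting methods can be sketched with a small calculation. It assumes each sync writes one stats log and one binlog per field, which is what the "multiply (field count)" remark implies. The threshold, the field count, and the function name `syncs_until_flush` are hypothetical values chosen for illustration, not Milvus defaults or identifiers.

```python
import math

# Hypothetical values for illustration only.
MAX_BINLOG_NUM = 32   # threshold of the binlog-num-based flush policy
FIELD_COUNT = 8       # number of fields in the collection schema


def syncs_until_flush(binlogs_counted_per_sync, max_binlog_num):
    """Number of syncs before the counted "binlog-num" reaches the threshold."""
    return math.ceil(max_binlog_num / binlogs_counted_per_sync)


# Without streaming (previous behavior): one stats log counted per sync.
syncs_without_streaming = syncs_until_flush(1, MAX_BINLOG_NUM)
# With streaming (previous behavior): one real binlog per field counted per sync.
syncs_with_streaming = syncs_until_flush(FIELD_COUNT, MAX_BINLOG_NUM)
print(syncs_without_streaming, syncs_with_streaming)
```

With these numbers the streaming path hits the policy after 4 syncs instead of 32, which would explain why it triggered flushes much more often before #37037.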