milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][standalone] TimeTick Lag is very high, causing DQL request timeout #36195

Open wangting0128 opened 1 month ago

wangting0128 commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: master-20240910-f4d0c589-amd64
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.5rc7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-bitmap-scenes-fdgrx
test case name: test_bitmap_locust_dql_dml_standalone

server:

NAME                                                              READY   STATUS             RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-bitmap-scenes-fdgrx-5-etcd-0                              1/1     Running            0                 3h12m   10.104.18.154   4am-node25   <none>           <none>
fouramf-bitmap-scenes-fdgrx-5-milvus-standalone-78f779649fl5ffr   1/1     Running            3 (3h10m ago)     3h12m   10.104.16.101   4am-node21   <none>           <none>
fouramf-bitmap-scenes-fdgrx-5-minio-6dcc448b8c-vnljg              1/1     Running            0                 3h12m   10.104.18.153   4am-node25   <none>           <none> 
[Screenshots: 2024-09-11 19:35:16, 19:35:35, 19:36:22, 19:39:02]

client test result:

[2024-09-10 07:09:41,546 - ERROR - fouram]: grpc RpcError: [search], <_InactiveRpcError: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2024-09-10 07:08:41.544852', 'gRPC error': '2024-09-10 07:09:41.546277'}> (decorators.py:157)
[2024-09-10 07:09:41,547 - ERROR - fouram]: (api_response) : [Collection.search] <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-09-10T07:09:41.545818402+00:00"}"
>, [requestId: 828b5b70-6f43-11ef-99b2-72ddfb74a677] (api_request.py:57)
[2024-09-10 07:09:41,547 - ERROR - fouram]: [CheckFunc] search request check failed, response:<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2024-09-10T07:09:41.545818402+00:00"}"
> (func_check.py:106)
[2024-09-10 07:09:41,548 - ERROR - fouram]: [ClientTask] 
Traceback (most recent call last):
  File "/src/fouram/client/concurrent/locust_client.py", line 28, in wrapper
    result = func(*args, **kwargs)
  File "/src/fouram/client/cases/base.py", line 874, in concurrent_search
    return self.collection_wrap.search(data=_data, **params.obj_params)
  File "/src/fouram/client/client_base/collection_wrapper.py", line 144, in search
    check_result = ResponseChecker(res, func_name, check_task, check_items, res_result, data=data,
  File "/src/fouram/client/check/func_check.py", line 85, in run
    result = self.check_search_output(self.response, self.succ, self.check_items)
  File "/src/fouram/client/check/func_check.py", line 274, in check_search_output
    self.assert_success(actual_res_check, True)
  File "/src/fouram/client/check/func_check.py", line 107, in assert_success
    assert actual is expect
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/user/task.py", line 347, in run
    self.execute_next_task()
  File "/usr/local/lib/python3.8/dist-packages/locust/user/task.py", line 372, in execute_next_task
    self.execute_task(self._task_queue.pop(0))
  File "/usr/local/lib/python3.8/dist-packages/locust/user/task.py", line 493, in execute_task
    task(self.user)
  File "/src/fouram/client/concurrent/locust_client.py", line 46, in search
    self.client.search(self.tasks_params.search.params)
  File "/src/fouram/client/concurrent/locust_client.py", line 36, in wrapper
    raise Exception(f"[ClientTask] {e}")
Exception: [ClientTask] 
 (task.py:366)
[2024-09-10 07:09:44,983 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-09-10 07:09:44,984 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     delete                                                                           541     0(0.00%) |   8094       1  107954      7 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     flush                                                                            573     0(0.00%) | 116451     506  668524  57000 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     hybrid_search                                                                    547  529(96.71%) |    236       0   92795      0 |    0.05        0.05 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     insert                                                                           516     0(0.00%) |   8734       4   71060     26 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     load                                                                             562     0(0.00%) |  15715       3  119987     40 |    0.05        0.00 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     query                                                                            542  496(91.51%) |    389       0   78990      0 |    0.05        0.05 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: grpc     search                                                                           549  495(90.16%) |    481       0   51179      0 |    0.05        0.05 (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-09-10 07:09:44,984 -  INFO - fouram]:          Aggregated                                                                      3830 1520(39.69%) |  22206       0  668524      6 |    0.35        0.14 (stats.py:789)
[2024-09-10 07:09:44,985 -  INFO - fouram]:  (stats.py:790)
[2024-09-10 07:09:44,989 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_16c64m',
            'config': {'standalone': {'resources': {'limits': {'cpu': '16.0', 'memory': '64Gi'}, 'requests': {'cpu': '9.0', 'memory': '33Gi'}}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1, 'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone', 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20240910-f4d0c589-amd64'}}},
            'host': 'fouramf-bitmap-scenes-fdgrx-5-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_bitmap_locust_dql_dml_standalone',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'max_length': 100,
                                                    'scalars_index': {'int8_1': {'index_type': 'BITMAP'},
                                                                      'int16_1': {'index_type': 'BITMAP'},
                                                                      'int32_1': {'index_type': 'BITMAP'},
                                                                      'int64_1': {'index_type': 'BITMAP'},
                                                                      'varchar_1': {'index_type': 'BITMAP'},
                                                                      'bool_1': {'index_type': 'BITMAP'},
                                                                      'array_int8_1': {'index_type': 'BITMAP'},
                                                                      'array_int16_1': {'index_type': 'BITMAP'},
                                                                      'array_int32_1': {'index_type': 'BITMAP'},
                                                                      'array_int64_1': {'index_type': 'BITMAP'},
                                                                      'array_varchar_1': {'index_type': 'BITMAP'},
                                                                      'array_bool_1': {'index_type': 'BITMAP'}},
                                                    'vectors_index': {'sparse_float_vector': {'index_type': 'SPARSE_INVERTED_INDEX',
                                                                                              'index_param': {'drop_ratio_build': 0.2},
                                                                                              'metric_type': 'IP'}},
                                                    'scalars_params': {'array_int8_1': {'params': {'max_capacity': 13},
                                                                                        'other_params': {'dataset': 'random_algorithm',
                                                                                                         'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                              'specify_range': [-128, 128],
                                                                                                                              'max_capacity': 13}}},
                                                                       'array_int16_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                               'specify_range': [-200, 200],
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_int32_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'specify_scope',
                                                                                                                               'specify_range': [-300, 300],
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_int64_1': {'params': {'max_capacity': 13},
                                                                                         'other_params': {'dataset': 'random_algorithm',
                                                                                                          'algorithm_params': {'algorithm_name': 'fixed_value_range',
                                                                                                                               'specify_range': [-400, 432],
                                                                                                                               'batch': 50,
                                                                                                                               'max_capacity': 13}}},
                                                                       'array_varchar_1': {'params': {'max_capacity': 13},
                                                                                           'other_params': {'dataset': 'random_algorithm',
                                                                                                            'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                                 'specify_range': [-1500, 1500],
                                                                                                                                 'max_capacity': 13}}},
                                                                       'array_bool_1': {'params': {'max_capacity': 13}},
                                                                       'int8_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                   'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                        'specify_range': [-128, 128],
                                                                                                                        'max_capacity': 13}}},
                                                                       'int16_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                         'specify_range': [-200, 200],
                                                                                                                         'max_capacity': 13}}},
                                                                       'int32_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'specify_scope',
                                                                                                                         'specify_range': [-300, 300],
                                                                                                                         'max_capacity': 13}}},
                                                                       'int64_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                    'algorithm_params': {'algorithm_name': 'fixed_value_range',
                                                                                                                         'specify_range': [-400, 432],
                                                                                                                         'batch': 50,
                                                                                                                         'max_capacity': 13}}},
                                                                       'varchar_1': {'other_params': {'dataset': 'random_algorithm',
                                                                                                      'algorithm_params': {'algorithm_name': 'random_range',
                                                                                                                           'specify_range': [-1500, 1500],
                                                                                                                           'max_capacity': 13}}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 2000000,
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1', 'bool_1',
                                                                        'array_int8_1', 'array_int16_1', 'array_int32_1', 'array_int64_1', 'array_varchar_1',
                                                                        'array_bool_1'],
                                                       'shards_num': 1,
                                                       'auto_id': True},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_SQ8', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 20, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'search_param': {'nprobe': 16},
                                                                  'expr': 'int8_1 == 100',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'output_fields': ['id', 'float_vector', 'int64_1'],
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 60,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'nq': 10}}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': 10,
                                                                  'ignore_growing': False,
                                                                  'partition_names': None,
                                                                  'timeout': 60,
                                                                  'consistency_level': None,
                                                                  'random_data': False,
                                                                  'random_count': 0,
                                                                  'random_range': [0, 1],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_query_output',
                                                                  'check_items': {'expect_length': 10}}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 1,
                                                                  'reqs': [{'search_param': {'nprobe': 128},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': '(array_contains_any(array_int32_1, [0]) || array_contains(array_int64_1, '
                                                                                    '1)) || ((varchar_1 like "1%") and (bool_1 == True))',
                                                                            'top_k': 100},
                                                                           {'search_param': {'drop_ratio_search': 0.1},
                                                                            'anns_field': 'sparse_float_vector',
                                                                            'expr': 'not (int16_1 == int8_1) && ARRAY_CONTAINS_ANY(array_int64_1, [-1, 0, '
                                                                                    '1])'}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'timeout': 120,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'output_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1',
                                                                                                    'int64_1', 'varchar_1', 'bool_1', 'array_int8_1',
                                                                                                    'array_int16_1', 'array_int32_1', 'array_int64_1',
                                                                                                    'array_varchar_1', 'array_bool_1', 'id', 'float_vector'],
                                                                                  'nq': 10}}},
                                                      {'type': 'load',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1, 'timeout': 180, 'check_task': 'check_response', 'check_items': None}},
                                                      {'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 10,
                                                                  'timeout': 30,
                                                                  'random_id': True,
                                                                  'random_vector': True,
                                                                  'varchar_filled': False,
                                                                  'start_id': 2000000,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'delete',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'delete_length': 10,
                                                                  'timeout': 30,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 600,
                                                                  'check_task': 'check_ignore_expected_errors',
                                                                  'check_items': [{'message': 'request is rejected by grpc RateLimiter middleware, please '
                                                                                              'retry later'},
                                                                                  {'message': 'wait for flush timeout'}]}}]},
            'run_id': 2024091006944126,
            'datetime': '2024-09-10 03:58:14.896097',
            'client_version': '2.5.0'},
 'result': {'test_result': {'index': {'RT': 163.2507,
                                      'sparse_float_vector': {'RT': 2.0226},
                                      'int8_1': {'RT': 0.5125},
                                      'int16_1': {'RT': 12.0945},
                                      'int32_1': {'RT': 0.5128},
                                      'int64_1': {'RT': 1.0224},
                                      'varchar_1': {'RT': 0.5131},
                                      'bool_1': {'RT': 0.5117},
                                      'array_int8_1': {'RT': 0.5106},
                                      'array_int16_1': {'RT': 0.5156},
                                      'array_int32_1': {'RT': 0.512},
                                      'array_int64_1': {'RT': 0.512},
                                      'array_varchar_1': {'RT': 0.5108},
                                      'array_bool_1': {'RT': 0.5117}},
                            'insert': {'total_time': 178.2534, 'VPS': 11219.9823, 'batch_time': 0.4456, 'batch': 5000},
                            'flush': {'RT': 3.0197},
                            'load': {'RT': 4.2674},
                            'Locust': {'Aggregated': {'Requests': 3830,
                                                      'Fails': 1520,
                                                      'RPS': 0.35,
                                                      'fail_s': 0.4,
                                                      'RT_max': 668524.17,
                                                      'RT_avg': 22206.06,
                                                      'TP50': 6,
                                                      'TP99': 439000.0},
                                       'delete': {'Requests': 541,
                                                  'Fails': 0,
                                                  'RPS': 0.05,
                                                  'fail_s': 0.0,
                                                  'RT_max': 107954.74,
                                                  'RT_avg': 8094.23,
                                                  'TP50': 7,
                                                  'TP99': 60000.0},
                                       'flush': {'Requests': 573,
                                                 'Fails': 0,
                                                 'RPS': 0.05,
                                                 'fail_s': 0.0,
                                                 'RT_max': 668524.17,
                                                 'RT_avg': 116451.01,
                                                 'TP50': 57000.0,
                                                 'TP99': 611000.0},
                                       'hybrid_search': {'Requests': 547,
                                                         'Fails': 529,
                                                         'RPS': 0.05,
                                                         'fail_s': 0.97,
                                                         'RT_max': 92795.97,
                                                         'RT_avg': 236.31,
                                                         'TP50': 0,
                                                         'TP99': 2400.0},
                                       'insert': {'Requests': 516,
                                                  'Fails': 0,
                                                  'RPS': 0.05,
                                                  'fail_s': 0.0,
                                                  'RT_max': 71060.58,
                                                  'RT_avg': 8734.32,
                                                  'TP50': 26,
                                                  'TP99': 60000.0},
                                       'load': {'Requests': 562,
                                                'Fails': 0,
                                                'RPS': 0.05,
                                                'fail_s': 0.0,
                                                'RT_max': 119987.21,
                                                'RT_avg': 15715.3,
                                                'TP50': 41,
                                                'TP99': 107000.0},
                                       'query': {'Requests': 542,
                                                 'Fails': 496,
                                                 'RPS': 0.05,
                                                 'fail_s': 0.92,
                                                 'RT_max': 78990.47,
                                                 'RT_avg': 389.64,
                                                 'TP50': 0,
                                                 'TP99': 12000.0},
                                       'search': {'Requests': 549,
                                                  'Fails': 495,
                                                  'RPS': 0.05,
                                                  'fail_s': 0.9,
                                                  'RT_max': 51179.59,
                                                  'RT_avg': 481.69,
                                                  'TP50': 0,
                                                  'TP99': 31000.0}}}}}
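
A rough cross-check of the Locust summary above, using only the request and failure counts printed in the log. It suggests that the failures are confined to the three DQL request types, and that `fail_s` in the report looks like the failure fraction rounded to two decimals (an inference from the numbers, not something the log states). The helper name `failure_ratio` is made up for this sketch.

```python
# Request and failure counts copied from the Locust summary in the log.
locust_counts = {
    "delete": (541, 0),
    "flush": (573, 0),
    "hybrid_search": (547, 529),
    "insert": (516, 0),
    "load": (562, 0),
    "query": (542, 496),
    "search": (549, 495),
}


def failure_ratio(requests, fails):
    """Return the fraction of failed requests, or 0.0 when nothing was sent."""
    return fails / requests if requests else 0.0


ratios = {name: failure_ratio(r, f) for name, (r, f) in locust_counts.items()}

# Recompute the "Aggregated" row from the per-type rows.
total_requests = sum(r for r, _ in locust_counts.values())
total_fails = sum(f for _, f in locust_counts.values())
aggregated = failure_ratio(total_requests, total_fails)

# Request types with at least one failure (expected: only DQL types).
failing_types = sorted(name for name, ratio in ratios.items() if ratio > 0)

for name in sorted(ratios):
    print(f"{name:<14} {ratios[name]:.2%}")
print(f"{'Aggregated':<14} {aggregated:.2%}")
print("types with failures:", failing_types)
```

All DML-side requests (insert, delete, flush, load) report zero failures, even though their maximum latencies are very high.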

Expected Behavior

No response

Steps To Reproduce

concurrent test and calculation of RT and QPS

        :purpose:  `primary key: INT64 autoID`
            1. building `BITMAP` index on all supported 12 scalar fields
            2. 2 fields of different vector types
            3. verify DQL & DML requests

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim
                'sparse_float_vector': sparse_range=[1, 100] <- the range of non-zero values of a sparse vector
                'id': primary key type is INT64

                all scalar fields: varchar max_length=100, array max_capacity=13
            2. build indexes:
                IVF_SQ8: 'float_vector'
                SPARSE_WAND: 'sparse_float_vector'
                BITMAP: all scalar fields
            3. insert 2 million data
            4. flush collection
            5. build indexes again using the same params
            6. load collection
            7. concurrent request:
                - search
                - query
                - hybrid_search
                - load
                - insert
                - delete: delete all inserted data
                - flush: ignore RateLimiter

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 1 month ago
image

The CPU is full and requests take hours. I think timing out is just fine: any system pushed beyond its capacity will time out.

xiaofan-luan commented 1 month ago

As long as the service didn't crash, I think it's fine.

wangting0128 commented 1 month ago

> As long as the service didn't crash, I think it's fine.

The normal average DQL time is < 500ms. During the period when DQL requests time out (60s), the CPU is not fully utilized. I think this is a problem that needs to be investigated. 🤔️

wangting0128 commented 1 month ago

Only 2M of data was inserted, but the Queryable Entity Num showed 44.8M, and the memory increased from 5G to 57+G.

[Screenshot 2024-09-14 10 50 56] [Screenshot 2024-09-14 10 50 46]

xiaofan-luan commented 1 month ago

> Only 2M of data was inserted, but the Queryable Entity Num showed 44.8M, and the memory increased from 5G to 57+G.

How do you determine that only 2M was inserted? It seems that you have both deletes and inserts in the test. I think most of the 44.8M entities have been deleted but not compacted in time. Is this what you are trying to test: whether compaction can catch up with deletes?

wangting0128 commented 1 month ago

> > Only 2M of data was inserted, but the Queryable Entity Num showed 44.8M, and the memory increased from 5G to 57+G.
>
> How do you determine that only 2M was inserted? It seems that you have both deletes and inserts in the test. I think most of the 44.8M entities have been deleted but not compacted in time. Is this what you are trying to test: whether compaction can catch up with deletes?

  1. 2M rows were inserted in the preparation phase, and 516 inserts were performed in the concurrent test phase, with 10 rows per insert. The inserted IDs were incremented from 2000000, so the total number of rows inserted was 2M + 5160 = 2005160.
  2. Deletion was performed 541 times, 10 rows each time, and the deleted IDs were the IDs of the inserted data. When the number of deletions exceeded the number of insertions, IDs 0 to 9 were used to fill the gap, so the visible data was ~2M.
  3. This case verifies concurrent DQL with low DML: it deletes the inserted incremental data and verifies compaction.

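
The row accounting in the list above can be written out as a short calculation. This is only a sketch: it assumes every incrementally inserted row is eventually deleted and that all surplus delete calls target the same IDs 0 to 9, so they remove at most 10 distinct rows from the prepared data. All constant and variable names are illustrative.

```python
PREPARED_ROWS = 2_000_000   # rows inserted in the preparation phase
BATCH = 10                  # rows per insert/delete call
INSERT_CALLS = 516          # inserts during the concurrent phase
DELETE_CALLS = 541          # deletes during the concurrent phase

inserted_incremental = INSERT_CALLS * BATCH             # 5160
total_inserted = PREPARED_ROWS + inserted_incremental   # 2005160
surplus_delete_calls = DELETE_CALLS - INSERT_CALLS      # 25 calls reuse IDs 0..9

# Assumption: all incremental rows are deleted, and the surplus calls only
# ever hit IDs 0..9, i.e. at most BATCH distinct prepared rows disappear.
visible_rows = total_inserted - inserted_incremental - BATCH

REPORTED_QUERYABLE = 44_800_000  # "Queryable Entity Num" from the dashboard
amplification = REPORTED_QUERYABLE / visible_rows

if __name__ == "__main__":
    print(f"total inserted:       {total_inserted}")
    print(f"expected visible:     {visible_rows}")
    print(f"reported / expected:  {amplification:.1f}x")
```

Under these assumptions, the reported entity count is about 22 times the expected visible data.
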
XuanYang-cn commented 3 weeks ago

Here's what I see: searches time out (but cannot be cancelled) -> segments stay pinned -> memory and segment count rise.

  1. Did DataNode process normally? Yes, the L0 and L1 segment counts stayed within a safe range. (image)
  2. Did QueryNode exchange targets normally? Yes, the target keeps up with the DataNode's processing speed. (image)
  3. Why did QueryNode hold so many more segments in memory? They are pinned in memory, waiting for the submitted search/query tasks to finish. (image) Offline segment tasks are queued, waiting for the searches to finish. (image)

XuanYang-cn commented 3 weeks ago

This is how search works: once tasks are submitted to C++, the C++ part continues to run even after the Go side times out and returns (after about 1 min).

CPU usage drops once all search/query tasks in C++ have finished. During that short window, some search/query requests complete. The rest of the time, search/query requests simply fail with a timeout. (image) (image)
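
The pattern described above, where the caller times out while the submitted work keeps running, can be reproduced with a small Python analogy. This is not Milvus code: `long_search` is a hypothetical stand-in for a task already handed to the C++ side, and the durations are arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def long_search(duration_s: float, finished: threading.Event) -> str:
    """Stand-in for a search task that cannot be cancelled once started."""
    time.sleep(duration_s)
    finished.set()
    return "done"


def call_with_deadline(duration_s: float, deadline_s: float) -> tuple[str, threading.Event]:
    """Submit the task, wait at most deadline_s, and report what the caller saw."""
    finished = threading.Event()
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(long_search, duration_s, finished)
    try:
        status = future.result(timeout=deadline_s)
    except TimeoutError:
        # The caller gives up here, but nothing stops the worker thread.
        status = "timeout"
    finally:
        executor.shutdown(wait=False)
    return status, finished


if __name__ == "__main__":
    status, finished = call_with_deadline(duration_s=0.3, deadline_s=0.05)
    print("caller saw:", status, "| worker finished yet:", finished.is_set())
    finished.wait(timeout=2)
    print("worker finished later:", finished.is_set())
```

The caller reports a timeout, yet the worker still runs to completion and holds its resources until then, which matches the segment pinning described earlier in the thread.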

XuanYang-cn commented 3 weeks ago

The behavior is expected and nothing is abnormal, except that perhaps we need smaller search tasks that take less than 1 hour.

Also, we might need to be able to cancel C++ tasks from the Go side to avoid such long pins. QueryNode's memory state looks fragile: long search tasks could easily exhaust QueryNode's memory, causing write limiting or even OOM.
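
One common way to make such tasks cancellable is a cooperative flag that the worker checks between chunks of work. The sketch below shows the idea in Python under that assumption. It is not how Milvus passes cancellation between Go and C++, and `cancellable_search` and `run_with_deadline` are hypothetical names.

```python
import threading
import time


def cancellable_search(cancel: threading.Event, chunks: int, chunk_s: float) -> int:
    """Process up to `chunks` chunks, checking the cancel flag before each one."""
    processed = 0
    for _ in range(chunks):
        if cancel.is_set():
            break  # stop early so pinned resources can be released
        time.sleep(chunk_s)  # stand-in for scanning one chunk of a segment
        processed += 1
    return processed


def run_with_deadline(chunks: int, chunk_s: float, deadline_s: float) -> int:
    """Run the search in a thread and signal cancellation when the deadline passes."""
    cancel = threading.Event()
    result = []
    worker = threading.Thread(
        target=lambda: result.append(cancellable_search(cancel, chunks, chunk_s))
    )
    worker.start()
    worker.join(timeout=deadline_s)
    if worker.is_alive():
        cancel.set()   # caller timed out: ask the worker to stop
        worker.join()  # returns after at most one more chunk
    return result[0]


if __name__ == "__main__":
    done = run_with_deadline(chunks=100, chunk_s=0.01, deadline_s=0.05)
    print(f"processed {done} of 100 chunks before cancellation")
```

The trade-off is granularity: the worker only notices the cancellation at chunk boundaries, so smaller chunks release resources sooner at the cost of more frequent checks.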

/unassign /assign @wangting0128

XuanYang-cn commented 3 weeks ago

@wangting0128 Judging from the metrics, your test most likely has 99% VERY LONG DQL requests and 1% quick ones.

wangting0128 commented 3 weeks ago

> The behavior is expected and nothing is abnormal, except that perhaps we need smaller search tasks that take less than 1 hour.
>
> Also, we might need to be able to cancel C++ tasks from the Go side to avoid such long pins. QueryNode's memory state looks fragile: long search tasks could easily exhaust QueryNode's memory, causing write limiting or even OOM.
>
> /unassign /assign @wangting0128

(image) With 2M rows of data and only 10 entities involved in each DQL request, it takes 1 hour. Is this reasonable?

XuanYang-cn commented 3 weeks ago

Doesn't seem reasonable. I'll look into this.

zhagnlu commented 5 days ago

(image) (image) Currently, for expressions that compare two columns, such as A < B, if A and B are indexed, the raw data has to be reverse-looked-up from the index one row at a time, which is slow.
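
A toy model can illustrate why a row-by-row reverse lookup is expensive. It assumes a bitmap-style index is a map from each distinct value to the set of row IDs holding it. This is a deliberate simplification rather than Milvus's actual index layout, and all function names are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to the set of row IDs that hold it (toy bitmap)."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return dict(index)


def reverse_lookup(index, row_id):
    """Recover one row's raw value by probing every value's row set in turn."""
    for value, rows in index.items():
        if row_id in rows:
            return value
    raise KeyError(row_id)


def compare_raw(a, b):
    """A < B on raw columns: one pass over the rows."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x < y]


def compare_via_index(index_a, index_b, num_rows):
    """A < B when only indexes exist: two reverse lookups per row."""
    return [
        i for i in range(num_rows)
        if reverse_lookup(index_a, i) < reverse_lookup(index_b, i)
    ]


if __name__ == "__main__":
    a = [3, 1, 4, 1, 5]
    b = [2, 7, 1, 8, 2]
    ia, ib = build_bitmap_index(a), build_bitmap_index(b)
    print(compare_raw(a, b))
    print(compare_via_index(ia, ib, len(a)))
```

Both paths return the same rows, but in this model the indexed path costs up to one probe per distinct value for every row, roughly rows × cardinality instead of a single pass.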

wangting0128 commented 3 days ago

Verified the scalar field comparison. argo task: fouramf-9j5lj-query-expr-3

Scalar fields without an index:

[2024-11-05 07:48:21,400 -  INFO - fouram]: [Base] expr of query: "int16_1 == int8_1", kwargs:{'limit': 10000} (base.py:548)
[2024-11-05 07:48:21,447 -  INFO - fouram]: [Time] Collection.query run in 0.0464s (api_request.py:49)

Scalar fields with an INVERTED index:

[2024-11-05 07:58:24,558 -  INFO - fouram]: [Base] expr of query: "int16_inverted == int8_inverted", kwargs:{'limit': 10000} (base.py:548)
[2024-11-05 07:58:24,629 -  INFO - fouram]: [Time] Collection.query run in 0.0703s (api_request.py:49)

Scalar fields with a BITMAP index:

[2024-11-05 07:48:24,825 -  INFO - fouram]: [Base] expr of query: "int16_bitmap == int8_bitmap", kwargs:{'limit': 10000} (base.py:548)
[2024-11-05 07:48:29,795 -  INFO - fouram]: [Time] Collection.query run in 4.9695s (api_request.py:49)
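
To put the three timings above side by side, the sketch below computes each run's slowdown relative to the unindexed query. The durations are copied from the `[Time] Collection.query run in ...` log lines. The variable names are illustrative.

```python
# Durations (seconds) copied from the three Collection.query log lines above.
timings_s = {
    "no index": 0.0464,
    "INVERTED": 0.0703,
    "BITMAP": 4.9695,
}

baseline = timings_s["no index"]
slowdown = {name: t / baseline for name, t in timings_s.items()}

if __name__ == "__main__":
    for name, factor in slowdown.items():
        print(f"{name:>8}: {timings_s[name]:.4f}s  ({factor:.1f}x the unindexed query)")
```

The INVERTED index run is about 1.5 times slower than the unindexed one, while the BITMAP index run is roughly two orders of magnitude slower for the same two-column comparison.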