milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.48k stars 2.92k forks

[Bug]: [benchmark][standalone][multipleChunkedEnable] hybrid_search raises error `incomplete query result, missing id %!s(int64=6408144), len(searchIDs) = 100, len(queryIDs) = 93, collection=453463256739087164: inconsistent requery result` in concurrent DQL scene #37143

Closed: wangting0128 closed this issue 6 days ago

wangting0128 commented 2 weeks ago

Is there an existing issue for this?

Environment

- Milvus version: master-20241025-ad2df904-amd64
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): rocksmq
- SDK version (e.g. pymilvus v2.0.0rc2): 2.5.0rc97
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: memory-opt-scenes-mn2cd

The test case succeeds when the parameter multipleChunkedEnable is turned off.

server:

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
memory-opt-scenes-mn2cd-1-etcd-0                                  1/1     Running     0               3h16m   10.104.19.37    4am-node28   <none>           <none>
memory-opt-scenes-mn2cd-1-milvus-standalone-586f69c78f-qg4kv      1/1     Running     1 (3h15m ago)   3h16m   10.104.34.205   4am-node37   <none>           <none>
memory-opt-scenes-mn2cd-1-minio-9b8fd7bcb-q5gg8                   1/1     Running     0               3h16m   10.104.30.75    4am-node38   <none>           <none>

client log:

[2024-10-25 06:12:31,186 - ERROR - fouram]: RPC error: [hybrid_search], <MilvusException: (code=2200, message=incomplete query result, missing id %!s(int64=6408144), len(searchIDs) = 100, len(queryIDs) = 93, collection=453463256739087164: inconsistent requery result)>, <Time:{'RPC start': '2024-10-25 06:12:10.021510', 'RPC error': '2024-10-25 06:12:31.186452'}> (decorators.py:140)
[2024-10-25 06:12:31,187 - ERROR - fouram]: (api_response) : [Collection.hybrid_search] <MilvusException: (code=2200, message=incomplete query result, missing id %!s(int64=6408144), len(searchIDs) = 100, len(queryIDs) = 93, collection=453463256739087164: inconsistent requery result)>, [requestId: 11a0ddea-9298-11ef-ad0d-32a2e3b7d54b] (api_request.py:57)
[2024-10-25 06:12:31,187 - ERROR - fouram]: [CheckFunc] hybrid_search request check failed, response:<MilvusException: (code=2200, message=incomplete query result, missing id %!s(int64=6408144), len(searchIDs) = 100, len(queryIDs) = 93, collection=453463256739087164: inconsistent requery result)> (func_check.py:101)
[2024-10-25 06:12:44,388 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-10-25 06:12:44,388 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-25 06:12:44,388 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-25 06:12:44,388 -  INFO - fouram]: grpc     hybrid_search                                                                   4736 4736(100.00%) |  21220   21024   21696  21024 |    0.44        0.44 (stats.py:789)
[2024-10-25 06:12:44,388 -  INFO - fouram]: grpc     query                                                                           4808     0(0.00%) |    807       6   21259     13 |    0.45        0.00 (stats.py:789)
[2024-10-25 06:12:44,388 -  INFO - fouram]: grpc     search                                                                          4648     0(0.00%) |    742       7   21012     12 |    0.43        0.00 (stats.py:789)
[2024-10-25 06:12:44,388 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-25 06:12:44,388 -  INFO - fouram]:          Aggregated                                                                     14192 4736(33.37%) |   7598       6   21696     18 |    1.32        0.44 (stats.py:789)
[2024-10-25 06:12:44,388 -  INFO - fouram]:  (stats.py:790)
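The error text reads like a consistency check between two steps: the IDs returned by the search step (100, consistent with nq=10 and top_k=10 in the test parameters) and the rows returned by a follow-up query on those IDs (93). The `%!s(int64=6408144)` fragment is what Go's fmt package prints when an int64 is formatted with `%s`, so the missing ID is 6408144. Below is a minimal Python sketch of such a check, assuming this reading of the message. The function name, exception name, and sample IDs are hypothetical, and this is not Milvus's actual code.

```python
class InconsistentRequeryError(Exception):
    """Raised when a requery by ID does not return every searched ID."""


def check_requery(search_ids, query_ids):
    """Verify that every ID from the search step came back from the requery."""
    returned = set(query_ids)
    for pk in search_ids:
        if pk not in returned:
            raise InconsistentRequeryError(
                f"incomplete query result, missing id {pk}, "
                f"len(searchIDs) = {len(search_ids)}, "
                f"len(queryIDs) = {len(query_ids)}"
            )


# Illustrative data only: 100 searched IDs, 7 of which the requery did not return.
search_ids = list(range(6408100, 6408200))
query_ids = [pk for pk in search_ids if not 6408144 <= pk < 6408151]

try:
    check_requery(search_ids, query_ids)
except InconsistentRequeryError as exc:
    print(exc)
```

With this reading, the failure means the second step could not find rows that the first step had just returned, which points at retrieval rather than at the vector search itself.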

Expected Behavior

No response

Steps To Reproduce

1. deploy a standalone milvus with queryNode.segcore.multipleChunkedEnable=true
2. create a collection with fields ['id', 'float_vector', 'varchar_1', 'varchar_2', 'json_1', 'int64_1']
3. build index of vector field 'float_vector': IVF_SQ8
4. insert 10m data
5. flush
6. build index again
7. load collection
8. concurrent request:
   - query
   - search
   - hybrid_search <- raises error
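For anyone reproducing the last step by hand, here is a rough pymilvus sketch of a single hybrid_search request shaped like the failing task (one ANN request on `float_vector` with nprobe 64, RRF reranking, nq=10, top_k=10, and the four scalar output fields from the test parameters). The collection name, the URI, and the helper names are assumptions; the real test harness (fouram) is not shown in this issue and may build the request differently.

```python
import random

DIM = 128  # 'dim' from the reported dataset_params


def random_queries(nq, dim=DIM, seed=None):
    """Generate nq random float vectors (a stand-in for the test's random_data)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(nq)]


def hybrid_search_once(collection_name, uri="http://localhost:19530", nq=10, top_k=10):
    """Send one hybrid_search like the failing task (needs a running Milvus)."""
    # Imported lazily so random_queries can be used without pymilvus installed.
    from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

    connections.connect(uri=uri)  # URI is an assumption, adjust to your deployment
    collection = Collection(collection_name)  # collection must exist and be loaded
    req = AnnSearchRequest(
        data=random_queries(nq),
        anns_field="float_vector",
        param={"metric_type": "L2", "params": {"nprobe": 64}},
        limit=top_k,
    )
    return collection.hybrid_search(
        reqs=[req],
        rerank=RRFRanker(),
        limit=top_k,
        output_fields=["varchar_1", "varchar_2", "json_1", "int64_1"],
    )
```

Requesting output fields matters here: the error message mentions a requery, which suggests the failure happens while fetching those fields for the returned IDs.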

Milvus Log

No response

Anything else?

test result:

[2024-10-25 06:12:44,390 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_8c16m',
            'config': {'standalone': {'resources': {'limits': {'cpu': '64.0', 'memory': '64Gi'}, 'requests': {'cpu': '16.0', 'memory': '32Gi'}}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1, 'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone', 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'extraConfigFiles': {'user.yaml': 'queryNode:\n  segcore:\n    multipleChunkedEnable: true\n'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20241025-ad2df904-amd64'}}},
            'host': 'memory-opt-scenes-mn2cd-1-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_params': {'varchar_1': {'params': {'max_length': 65535},
                                                                                     'other_params': {'dataset': 'laion2b_url'}},
                                                                       'varchar_2': {'params': {'max_length': 65535},
                                                                                     'other_params': {'dataset': 'laion2b_caption'}},
                                                                       'json_1': {'other_params': {'dataset': 'laion2b_json'}},
                                                                       'int64_1': {'other_params': {'dataset': 'laion2b_int64'}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '10m',
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['varchar_1', 'varchar_2', 'json_1', 'int64_1'], 'shards_num': 1},
                                 'index_params': {'index_type': 'IVF_SQ8', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 10, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'query',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'output_fields': ['varchar_1', 'varchar_2', 'json_1', 'int64_1'],
                                                                  'timeout': 3000,
                                                                  'random_data': True,
                                                                  'random_count': 100,
                                                                  'random_range': [0, 5000000],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64'}},
                                                      {'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'search_param': {'nprobe': 32},
                                                                  'output_fields': ['varchar_1', 'varchar_2', 'json_1', 'int64_1'],
                                                                  'timeout': 3000,
                                                                  'random_data': True}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'reqs': [{'anns_field': 'float_vector', 'search_param': {'nprobe': 64}}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['varchar_1', 'varchar_2', 'json_1', 'int64_1'],
                                                                  'timeout': 3000,
                                                                  'random_data': True}}]},
            'run_id': 2024102550222166,
            'datetime': '2024-10-25 02:57:02.238890',
            'client_version': '2.5.0'},
 'result': {'test_result': {'index': {'RT': 153.9338},
                            'insert': {'total_time': 492.7142, 'VPS': 20295.7414, 'batch_time': 0.2464, 'batch': 5000},
                            'flush': {'RT': 3.0174},
                            'load': {'RT': 6.4894},
                            'Locust': {'Aggregated': {'Requests': 14192,
                                                      'Fails': 4736,
                                                      'RPS': 1.32,
                                                      'fail_s': 0.33,
                                                      'RT_max': 21696.05,
                                                      'RT_avg': 7598.02,
                                                      'TP50': 18,
                                                      'TP99': 21000.0},
                                       'hybrid_search': {'Requests': 4736,
                                                         'Fails': 4736,
                                                         'RPS': 0.44,
                                                         'fail_s': 1.0,
                                                         'RT_max': 21696.05,
                                                         'RT_avg': 21220.45,
                                                         'TP50': 21000.0,
                                                         'TP99': 21000.0},
                                       'query': {'Requests': 4808,
                                                 'Fails': 0,
                                                 'RPS': 0.45,
                                                 'fail_s': 0.0,
                                                 'RT_max': 21259.65,
                                                 'RT_avg': 807.07,
                                                 'TP50': 13,
                                                 'TP99': 20000.0},
                                       'search': {'Requests': 4648,
                                                  'Fails': 0,
                                                  'RPS': 0.43,
                                                  'fail_s': 0.0,
                                                  'RT_max': 21012.5,
                                                  'RT_avg': 742.38,
                                                  'TP50': 12,
                                                  'TP99': 20000.0}}}}}
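As a quick sanity check on the report, the per-task failure percentages can be recomputed from the request and failure counts. The snippet below is a small sketch that copies only those counts from the 'Locust' section above; the function name is made up for illustration.

```python
# Request and failure counts copied from the 'Locust' section of the report.
locust = {
    "hybrid_search": {"Requests": 4736, "Fails": 4736},
    "query": {"Requests": 4808, "Fails": 0},
    "search": {"Requests": 4648, "Fails": 0},
}


def failure_summary(stats):
    """Return per-task failure percentages plus the aggregate across tasks."""
    summary = {
        name: round(100.0 * s["Fails"] / s["Requests"], 2)
        for name, s in stats.items()
    }
    total_reqs = sum(s["Requests"] for s in stats.values())
    total_fails = sum(s["Fails"] for s in stats.values())
    summary["Aggregated"] = round(100.0 * total_fails / total_reqs, 2)
    return summary


summary = failure_summary(locust)
for name, pct in summary.items():
    print(f"{name}: {pct}% failed")
```

Every hybrid_search request failed while query and search had no failures, so the aggregate failure rate of about one third comes entirely from hybrid_search.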

xiaofan-luan commented 2 weeks ago

This seems to be a retrieve issue.

I thought @congqixia was working on retrieving by segmentID and offset instead of retrieving directly by ID. How is that going?
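To picture the difference between the two retrieval styles mentioned here, a toy sketch follows. It only illustrates the general idea: looking a row up again by primary key can come back empty, whereas reading by (segmentID, offset) goes straight to the position the search step already found. The data layout and function names are hypothetical and do not reflect Milvus internals.

```python
# Toy in-memory "segments": each holds primary keys and one scalar field.
segments = {
    1: {"ids": [10, 11, 12], "int64_1": [100, 110, 120]},
    2: {"ids": [20, 21], "int64_1": [200, 210]},
}


def retrieve_by_id(pk, field):
    """Look a row up again by primary key; returns None if no segment has it."""
    for seg in segments.values():
        if pk in seg["ids"]:
            return seg[field][seg["ids"].index(pk)]
    return None


def retrieve_by_offset(segment_id, offset, field):
    """Read the row directly at the position the search step already found."""
    return segments[segment_id][field][offset]


print(retrieve_by_id(21, "int64_1"))         # lookup by primary key
print(retrieve_by_offset(2, 1, "int64_1"))   # same row, addressed by position
```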

wangting0128 commented 2 weeks ago

same case, same error

image: master-20241029-7dd66511-amd64

server:

NAME                                                            READY   STATUS             RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
memory-opt-scenes-q2wkh-1-etcd-0                                1/1     Running            0                3h27m   10.104.24.202   4am-node29   <none>           <none>
memory-opt-scenes-q2wkh-1-milvus-standalone-84c54dc586-7r525    1/1     Running            1 (3h25m ago)    3h27m   10.104.32.66    4am-node39   <none>           <none>
memory-opt-scenes-q2wkh-1-minio-6c8f7d8984-djpf5                1/1     Running            0                3h27m   10.104.21.41    4am-node24   <none>           <none>

client log:

[Image: Screenshot 2024-10-30 10 50 06]

sunby commented 2 weeks ago

/assign

xiaofan-luan commented 2 weeks ago

/assign @wangting0128 please help on verifying

wangting0128 commented 1 week ago

> /assign @wangting0128 please help on verifying

verification failed

image: 2.5-20241031-6b9b6999-amd64

test case name: test_hybrid_search_locust_multi_ddl_dql_hybrid_search_cluster

client log:

[Image: Screenshot 2024-11-01 11 53 07]

sunby commented 1 week ago

> > /assign @wangting0128 please help on verifying
>
> verification failed
>
> image: 2.5-20241031-6b9b6999-amd64
>
> test case name: test_hybrid_search_locust_multi_ddl_dql_hybrid_search_cluster
>
> client log: [Image: Screenshot 2024-11-01 11 53 07]

Oh sorry, I tested with a dataset of 100,000 and did not notice this problem, but it appeared when I tested with 1 million. I have found the root cause and will fix it in another PR.

xiaofan-luan commented 1 week ago

/assign @wangting0128 please help on verifying it

wangting0128 commented 6 days ago

verification passed

argo task: memory-opt-scenes-7vrcm

image: master-20241108-a0315783-amd64