milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
28.4k stars 2.73k forks source link

[Bug]: [benchmark][standalone][LRU] search and query raiase error in concurrent DQL and insert scene #32453

Closed wangting0128 closed 2 months ago

wangting0128 commented 3 months ago

Is there an existing issue for this?

Environment

- Milvus version: milvus-io-lru-dev-fbd02f7-20240418
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: lru-fouramf-q2hm8

server:

NAME                                                              READY   STATUS                            RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
lru-verify-32099-insert-dql-etcd-0                                1/1     Running                           0                 12h     10.104.26.135   4am-node32   <none>           <none>
lru-verify-32099-insert-dql-milvus-standalone-698cc7bfb6-2t774    1/1     Running                           0                 12h     10.104.33.245   4am-node36   <none>           <none>
lru-verify-32099-insert-dql-minio-7786766f7d-9nr4m                1/1     Running                           0                 12h     10.104.33.244   4am-node36   <none>           <none>
lru-verify-32099-insert-dql-pulsar-bookie-0                       1/1     Running                           0                 12h     10.104.26.134   4am-node32   <none>           <none>
lru-verify-32099-insert-dql-pulsar-bookie-1                       1/1     Running                           0                 12h     10.104.20.184   4am-node22   <none>           <none>
lru-verify-32099-insert-dql-pulsar-bookie-2                       1/1     Running                           0                 12h     10.104.32.141   4am-node39   <none>           <none>
lru-verify-32099-insert-dql-pulsar-bookie-init-d8kgk              0/1     Completed                         0                 12h     10.104.13.128   4am-node16   <none>           <none>
lru-verify-32099-insert-dql-pulsar-broker-0                       1/1     Running                           0                 12h     10.104.26.128   4am-node32   <none>           <none>
lru-verify-32099-insert-dql-pulsar-proxy-0                        1/1     Running                           0                 12h     10.104.13.127   4am-node16   <none>           <none>
lru-verify-32099-insert-dql-pulsar-pulsar-init-ws9nb              0/1     Completed                         0                 12h     10.104.26.127   4am-node32   <none>           <none>
lru-verify-32099-insert-dql-pulsar-recovery-0                     1/1     Running                           0                 12h     10.104.4.152    4am-node11   <none>           <none>
lru-verify-32099-insert-dql-pulsar-zookeeper-0                    1/1     Running                           0                 12h     10.104.26.133   4am-node32   <none>           <none>
lru-verify-32099-insert-dql-pulsar-zookeeper-1                    1/1     Running                           0                 12h     10.104.31.115   4am-node34   <none>           <none>
lru-verify-32099-insert-dql-pulsar-zookeeper-2                    1/1     Running                           0                 12h     10.104.27.86    4am-node31   <none>           <none>
截屏2024-04-19 11 48 21

client pod name: lru-fouramf-q2hm8-339575076 client log: {pod=~"lru-fouramf-q2hm8-339575076"} |~ "ERROR"

[2024-04-18 12:44:45,567 - ERROR - fouram]: (api_response) : [Collection.query] <MilvusException: (code=65535, message=fail to Query on QueryNode 1: worker(1) query failed: segment released: segment not loaded[segment=449168875793244817])>, [requestId: 6efac4d0-fd81-11ee-ab19-f67a001bbf9e] (api_request.py:57)
[2024-04-18 12:44:45,567 - ERROR - fouram]: [CheckFunc] query request check failed, response:<MilvusException: (code=65535, message=fail to Query on QueryNode 1: worker(1) query failed: segment released: segment not loaded[segment=449168875793244817])> (func_check.py:48)
[2024-04-18 12:44:46,130 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65535, message=fail to search on QueryNode 1: worker(1) query failed: segment released: segment not loaded[segment=449168875793244818])>, <Time:{'RPC start': '2024-04-18 12:44:45.523389', 'RPC error': '2024-04-18 12:44:46.130285'}> (decorators.py:146)
[2024-04-18 12:44:46,130 - ERROR - fouram]: (api_response) : [Collection.search] <MilvusException: (code=65535, message=fail to search on QueryNode 1: worker(1) query failed: segment released: segment not loaded[segment=449168875793244818])>, [requestId: 6f50a2e2-fd81-11ee-ab19-f67a001bbf9e] (api_request.py:57)
[2024-04-18 12:44:46,130 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=65535, message=fail to search on QueryNode 1: worker(1) query failed: segment released: segment not loaded[segment=449168875793244818])> (func_check.py:48)
[2024-04-18 12:44:46,170 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65535, message=fail to search on QueryNode 1: worker(1) query failed: segment released: segment not loaded[segment=449168875793244818])>, <Time:{'RPC start': '2024-04-18 12:44:45.566187', 'RPC error': '2024-04-18 12:44:46.170768'}> (decorators.py:146)

[2024-04-18 12:45:13,711 - ERROR - fouram]: RPC error: [query], <MilvusException: (code=65535, message=fail to Query on QueryNode 1: worker(1) query failed: Operator::GetOutput failed for [Operator:FilterBits, plan node id: 0] : Assert "get_bit(field_data_ready_bitset_, field_id)" at /go/src/github.com/milvus-io/milvus/internal/core/src/segcore/SegmentSealedImpl.cpp:675
 => Can't get bitset element at 102)>, <Time:{'RPC start': '2024-04-18 12:45:13.104562', 'RPC error': '2024-04-18 12:45:13.711940'}> (decorators.py:146)
[2024-04-18 12:45:13,712 - ERROR - fouram]: (api_response) : [Collection.query] <MilvusException: (code=65535, message=fail to Query on QueryNode 1: worker(1) query failed: Operator::GetOutput failed for [Operator:FilterBits, plan node id: 0] : Assert "get_bit(field_data_ready_bitset_, field_id)" at /go/src/github.com/milvus-io/milvus/internal/core/src/segcore/SegmentSealedImpl.cpp:675
 => Can't get bitset element at 102)>, [requestId: 7fc142bc-fd81-11ee-ab19-f67a001bbf9e] (api_request.py:57)
[2024-04-18 12:45:13,712 - ERROR - fouram]: [CheckFunc] query request check failed, response:<MilvusException: (code=65535, message=fail to Query on QueryNode 1: worker(1) query failed: Operator::GetOutput failed for [Operator:FilterBits, plan node id: 0] : Assert "get_bit(field_data_ready_bitset_, field_id)" at /go/src/github.com/milvus-io/milvus/internal/core/src/segcore/SegmentSealedImpl.cpp:675
 => Can't get bitset element at 102)> (func_check.py:48)
[2024-04-18 12:45:13,721 - ERROR - fouram]: RPC error: [query], <MilvusException: (code=65535, message=fail to Query on QueryNode 1: worker(1) query failed: segment released: segment not loaded[segment=449168875793244817])>, <Time:{'RPC start': '2024-04-18 12:45:13.113292', 'RPC error': '2024-04-18 12:45:13.721147'}> (decorators.py:146)

[2024-04-18 12:46:28,536 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65535, message=fail to search on QueryNode 1: worker(1) query failed:  => failed to load row ID or timestamp, potential missing bin logs or empty segments. Segment ID = 449168875793244818)>, <Time:{'RPC start': '2024-04-18 12:46:27.932114', 'RPC error': '2024-04-18 12:46:28.536859'}> (decorators.py:146)
[2024-04-18 12:46:28,537 - ERROR - fouram]: (api_response) : [Collection.search] <MilvusException: (code=65535, message=fail to search on QueryNode 1: worker(1) query failed:  => failed to load row ID or timestamp, potential missing bin logs or empty segments. Segment ID = 449168875793244818)>, [requestId: ac5af926-fd81-11ee-ab19-f67a001bbf9e] (api_request.py:57)
[2024-04-18 12:46:28,537 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=65535, message=fail to search on QueryNode 1: worker(1) query failed:  => failed to load row ID or timestamp, potential missing bin logs or empty segments. Segment ID = 449168875793244818)> (func_check.py:48)

Expected Behavior

No response

Steps To Reproduce

1. create a collection with fields: id(primary key), float_vector(768dim), int64_1(partitionKey=64)
2. build HNSW index
3. insert 10w data
4. flush collection
5. build index with the same param again
6. load collection
7. concurrent requests:
   - insert
   - search
   - query

Milvus Log

No response

Anything else?

client config:

{
     "dataset_params": {
          "dataset_name": "laion",
          "dim": 768,
          "dataset_size": "10w",
          "ni_per": 10000,
          "metric_type": "L2",
          "scalars_params": {
               "int64_1": {
                    "params": {
                         "is_partition_key": true
                    }
               }
          }
     },
     "collection_params": {
          "other_fields": [
               "int64_1"
          ],
          "num_partitions": 64
     },
     "load_params": {
          "prepare_load": true
     },
     "index_params": {
          "index_type": "HNSW",
          "index_param": {
               "M": 30,
               "efConstruction": 360
          }
     },
     "concurrent_tasks": [
          {
               "type": "insert",
               "weight": 1,
               "params": {
                    "nb": 5,
                    "timeout": 60,
                    "random_vector": true
               }
          },
          {
               "type": "search",
               "weight": 1,
               "params": {
                    "top_k": 1,
                    "nq": 1,
                    "search_param": {
                         "ef": 64
                    },
                    "expr": "int64_1 == 1",
                    "timeout": 3000,
                    "random_data": true
               }
          },
          {
               "type": "query",
               "weight": 1,
               "params": {
                    "expr": "int64_1 != 2",
                    "timeout": 3000,
                    "offset": 0,
                    "limit": 10
               }
          }
     ],
     "concurrent_params": {
          "interval": 20,
          "during_time": "12h",
          "concurrent_number": 3
     }
}

test result:

[2024-04-19 00:33:35,807 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-04-19 00:33:35,807 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-04-19 00:33:35,807 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-04-19 00:33:35,808 -  INFO - fouram]: grpc     insert                                                                        222747     0(0.00%) |     30      17    3901     26 |    5.16        0.00 (stats.py:789)
[2024-04-19 00:33:35,808 -  INFO - fouram]: grpc     query                                                                         221787 101162(45.61%) |    281       2    3804     15 |    5.13        2.34 (stats.py:789)
[2024-04-19 00:33:35,808 -  INFO - fouram]: grpc     search                                                                        222089 96980(43.67%) |    269       1    3820     13 |    5.14        2.24 (stats.py:789)
[2024-04-19 00:33:35,808 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-04-19 00:33:35,808 -  INFO - fouram]:          Aggregated                                                                    666623 198142(29.72%) |    193       1    3901     24 |   15.43        4.59 (stats.py:789)
[2024-04-19 00:33:35,808 -  INFO - fouram]:  (stats.py:790)
[2024-04-19 00:33:35,810 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_8c16m',
            'config': {'standalone': {'resources': {'limits': {'cpu': '16',
                                                               'memory': '8Gi',
                                                               'ephemeral-storage': '70Gi'},
                                                    'requests': {'cpu': '16',
                                                                 'memory': '8Gi'}},
                                      'messageQueue': 'pulsar',
                                      'extraEnv': [{'name': 'LOCAL_STORAGE_SIZE',
                                                    'value': '70'}],
                                      'disk': {'size': {'enabled': True}}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1,
                                'metrics': {'enabled': True,
                                            'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone',
                                 'metrics': {'podMonitor': {'enabled': True}},
                                 'persistence': {'size': '320Gi'}},
                       'pulsar': {'enabled': True},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'extraConfigFiles': {'user.yaml': 'queryNode:\n'
                                                         '  '
                                                         'diskCacheCapacityLimit: '
                                                         '51539607552\n'
                                                         '  mmap:\n'
                                                         '    mmapEnabled: '
                                                         'true\n'
                                                         '  lazyloadEnabled: '
                                                         'true\n'
                                                         '  '
                                                         'useStreamComputing: '
                                                         'true\n'
                                                         '  cache:\n'
                                                         '    warmup: sync\n'
                                                         '  '
                                                         'lazyloadWaitTimeout: '
                                                         '300000\n'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus',
                                         'tag': 'milvus-io-lru-dev-fbd02f7-20240418'}}},
            'host': 'lru-verify-32099-insert-dql-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 768,
                                                    'scalars_params': {'int64_1': {'params': {'is_partition_key': True}}},
                                                    'dataset_name': 'laion',
                                                    'dataset_size': '10w',
                                                    'ni_per': 10000},
                                 'collection_params': {'other_fields': ['int64_1'],
                                                       'num_partitions': 64},
                                 'load_params': {'prepare_load': True},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 30,
                                                                  'efConstruction': 360}},
                                 'concurrent_params': {'concurrent_number': 3,
                                                       'during_time': '12h',
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 5,
                                                                  'timeout': 60,
                                                                  'random_vector': True}},
                                                      {'type': 'search',
                                                       'weight': 1,
                                                       'params': {'top_k': 1,
                                                                  'nq': 1,
                                                                  'search_param': {'ef': 64},
                                                                  'expr': 'int64_1 '
                                                                          '== '
                                                                          '1',
                                                                  'timeout': 3000,
                                                                  'random_data': True}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'expr': 'int64_1 '
                                                                          '!= '
                                                                          '2',
                                                                  'timeout': 3000,
                                                                  'offset': 0,
                                                                  'limit': 10}}]},
            'run_id': 2024041831986777,
            'datetime': '2024-04-18 12:26:38.129694',
            'client_version': '2.2'},
 'result': {'test_result': {'index': {'RT': 51.3859},
                            'load': {'RT': 0.0047},
                            'insert': {'total_time': 16.7375,
                                       'VPS': 5974.6079,
                                       'batch_time': 1.6738,
                                       'batch': 10000},
                            'flush': {'RT': 5.5475},
                            'Locust': {'Aggregated': {'Requests': 666623,
                                                      'Fails': 198142,
                                                      'RPS': 15.43,
                                                      'fail_s': 0.3,
                                                      'RT_max': 3901.97,
                                                      'RT_avg': 193.58,
                                                      'TP50': 24,
                                                      'TP99': 610.0},
                                       'insert': {'Requests': 222747,
                                                  'Fails': 0,
                                                  'RPS': 5.16,
                                                  'fail_s': 0.0,
                                                  'RT_max': 3901.97,
                                                  'RT_avg': 30.25,
                                                  'TP50': 26,
                                                  'TP99': 120.0},
                                       'query': {'Requests': 221787,
                                                 'Fails': 101162,
                                                 'RPS': 5.13,
                                                 'fail_s': 0.46,
                                                 'RT_max': 3804.32,
                                                 'RT_avg': 281.71,
                                                 'TP50': 15,
                                                 'TP99': 610.0},
                                       'search': {'Requests': 222089,
                                                  'Fails': 96980,
                                                  'RPS': 5.14,
                                                  'fail_s': 0.44,
                                                  'RT_max': 3820.17,
                                                  'RT_avg': 269.39,
                                                  'TP50': 13,
                                                  'TP99': 610.0}}}}}
chyezh commented 3 months ago

Bug introduced by recent commit: #32438. Mixture the growing and sealed segments together.

chyezh commented 2 months ago

@wangting0128 please verify it with latest lru-dev image.

wangting0128 commented 2 months ago

verification passed

image: master-20240513-9e3f3d99 argo task: lru-fouramf-q2hm8-0513