milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Querynode oomkilled when concurrent upserting data into 1024 partitions #34058

Open ThreadDao opened 4 months ago

ThreadDao commented 4 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20240621-7d1d5a83-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):   pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Deploy Milvus with the following config:

  components:
    dataNode:
      replicas: 1
      resources:
        limits:
          cpu: "8"
          memory: 16Gi
        requests:
          cpu: "4"
          memory: 8Gi
    indexNode:
      replicas: 3
      resources:
        limits:
          cpu: "8"
          memory: 8Gi
        requests:
          cpu: "4"
          memory: 2Gi
    mixCoord:
      replicas: 1
      resources:
        limits:
          cpu: "4"
          memory: 16Gi
        requests:
          cpu: "2" 
          memory: 8Gi 
    proxy:
      resources:
        limits:
          cpu: "1" 
          memory: 8Gi 
    queryNode:
      replicas: 2
      resources:
        limits:
          cpu: "16"
          memory: 72Gi
        requests:
          cpu: "4" 
          memory: 64Gi
  config:
    dataCoord:
      segment:
        sealProportion: 1.52e-05
    log:
      level: debug
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1

Test steps

  1. create a collection with 1 shard, enable partition-key with 1024 partitions
  2. create hnsw index {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 8, 'efConstruction': 200}}
  3. insert 10m-128d entities -> flush
  4. concurrent requests: search + upsert + flush
    'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2', 
                                                    'dim': 128,
                                                    'scalars_params': {'int64_1': {'params': {'is_partition_key': True}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '10m',
                                                    'ni_per': 50000},
                                 'collection_params': {'other_fields': ['int64_1'],
                                                       'shards_num': 1,
                                                       'num_partitions': 1024},
                                 'load_params': {},
                                 'release_params': {'release_of_reload': False},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 30,
                                                       'during_time': '3h', 
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 10,
                                                       'params': {'nq': 100,
                                                                  'top_k': 100,
                                                                  'output_fields': ['int64_1'],
                                                                  'search_param': {'ef': 128}, 
                                                                  'timeout': 120}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 120}},
                                                      {'type': 'upsert',
                                                       'weight': 19,
                                                       'params': {'nb': 200,
                                                                  'timeout': 120,
                                                                  'start_id': 0,
                                                                  'random_id': True, 
                                                                  'random_vector': True}}]},
            'run_id': 2024062191801273,
            'datetime': '2024-06-21 03:06:20.115933',
            'client_version': '2.2'},
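
For reference, a minimal pymilvus sketch of the schema/index setup from steps 1-3 and the upsert pattern from step 4. This is my own reconstruction under stated assumptions, not the actual test harness: the collection name, field names, and connection details below are invented for illustration.

```python
# Hedged sketch of the reproduction setup; names/endpoints are assumptions.
import numpy as np
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="localhost", port="19530")  # assumed endpoint

dim = 128
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("int64_1", DataType.INT64, is_partition_key=True),  # partition key
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
]
collection = Collection(
    "partition_key_oom_repro",   # assumed collection name
    CollectionSchema(fields),
    shards_num=1,
    num_partitions=1024,         # 1024 partition-key partitions
)
collection.create_index(
    "float_vector",
    {"index_type": "HNSW", "metric_type": "L2",
     "params": {"M": 8, "efConstruction": 200}},
)

# Step-4 upsert pattern: small batches with random ids, as in the locust task above.
nb = 200
collection.upsert([
    np.random.randint(0, 10_000_000, nb).tolist(),  # primary keys (random ids)
    np.random.randint(0, 1024, nb).tolist(),        # partition-key values
    np.random.random((nb, dim)).tolist(),           # random 128-d vectors
])
```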

queryNode oomkilled

The querynode was OOM-killed after about two minutes of concurrent requests, at around 2024-06-21 03:40:52 (see the attached screenshot).

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log



Anything else?

No response
xiaofan-luan commented 4 months ago

With so many partitions, we might need to change the compaction concurrency and add more datanodes. Currently I think that if we can add more datanodes and let compaction catch up, then it would work for us.

XuanYang-cn commented 4 months ago

Even though there are 50K segments, the question is why 2 × 64 GB querynodes cannot hold 7 GB of data in memory.
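
For context, a rough back-of-the-envelope estimate of the data volume (my own arithmetic, assuming float32 vectors and ignoring scalar columns, HNSW graph links, and per-segment overhead):

```python
# Back-of-the-envelope size of the raw vectors from the test (10M x 128d, float32 assumed).
num_entities = 10_000_000
dim = 128
bytes_per_float32 = 4

raw_vectors_gib = num_entities * dim * bytes_per_float32 / 1024**3
print(f"raw vectors: ~{raw_vectors_gib:.1f} GiB")  # ~4.8 GiB

# HNSW (M=8) graph links, the int64 primary-key / partition-key columns, and
# per-segment bookkeeping add a few more GiB, which lands in the ~7 GB range
# quoted above -- far below the 2 x 64 GiB of querynode memory available.
```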

XuanYang-cn commented 1 week ago

/assign @ThreadDao /unassign

Is this still reproducible?

xiaofan-luan commented 1 week ago

Can we still reproduce this? I think this might be due to flush not being able to catch up, and we need to improve flush performance.
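
As a side note, a hedged sketch of how one might check what the querynodes are holding while re-running the test; the collection name and endpoint are assumptions, and `utility.get_query_segment_info` reports the segments currently loaded for query:

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")  # assumed endpoint

# Segments currently loaded by the querynodes; with 1024 partitions and a very
# small sealProportion this can grow into tens of thousands of tiny segments.
segments = utility.get_query_segment_info("partition_key_oom_repro")
print(f"loaded segments: {len(segments)}")
print(f"total rows:      {sum(s.num_rows for s in segments)}")
```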