milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: QueryNode oomkilled due to sudden increase in the number of growing segments #34554

Open ThreadDao opened 4 months ago

ThreadDao commented 4 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20240709-0d8defb1
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

deploy milvus

Deploy the level-zero-insert-op-96-4610 instance with 4 queryNodes and the following config:

    queryNode:
      paused: false
      replicas: 4
      resources:
        limits:
          cpu: "8" 
          memory: 24Gi
        requests:
          cpu: "4" 
          memory: 16Gi
  config:
    dataCoord:
      enableActiveStandby: true
      segment:
        enableLevelZero: true
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    queryCoord:
      enableActiveStandby: true
    rootCoord:
      enableActiveStandby: true
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1

test

  1. create collection with 2 shards -> index
  2. insert 3M 128-dimensional vectors -> flush
  3. index -> load
  4. concurrent requests: search + insert + delete + flush
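
For a sense of scale, the raw size of the data inserted in step 2 can be estimated with a quick calculation. This is only a rough sketch: it assumes "3m-128d" means 3 million 128-dimensional vectors stored as float32 (4 bytes per component, which the issue does not state), and it ignores primary keys, indexes, and everything inserted later by the concurrent requests in step 4.

```python
# Rough estimate of the raw vector payload from step 2.
# Assumption (not stated in the issue): vectors are float32, 4 bytes per component.
NUM_VECTORS = 3_000_000
DIM = 128
BYTES_PER_FLOAT32 = 4

raw_bytes = NUM_VECTORS * DIM * BYTES_PER_FLOAT32
raw_gib = raw_bytes / 1024 ** 3

print(f"raw vector data: {raw_bytes} bytes (~{raw_gib:.2f} GiB)")
```

Under these assumptions the initial dataset is about 1.43 GiB in total, well below the 24Gi memory limit of a single queryNode.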

results

One of the delegator queryNodes was OOMKilled. QueryNode metrics of level-zero-insert-op-96-4610: [image]

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/dff50b01-bbe4-4995-970d-db19181d2f12?nodeId=level-zero-stable-1720537200-171318015

Milvus Log

pods:

level-zero-insert-op-96-4610-etcd-0                               1/1     Running            0               12h     10.104.18.78    4am-node25   <none>           <none>
level-zero-insert-op-96-4610-etcd-1                               1/1     Running            0               12h     10.104.23.150   4am-node27   <none>           <none>
level-zero-insert-op-96-4610-etcd-2                               1/1     Running            0               12h     10.104.19.156   4am-node28   <none>           <none>
level-zero-insert-op-96-4610-milvus-datanode-76bd6668f6-kcmmt     1/1     Running            1 (12h ago)     12h     10.104.5.81     4am-node12   <none>           <none>
level-zero-insert-op-96-4610-milvus-datanode-76bd6668f6-wdl76     1/1     Running            1 (12h ago)     12h     10.104.18.92    4am-node25   <none>           <none>
level-zero-insert-op-96-4610-milvus-indexnode-6cf6fbfbfb-6vztj    1/1     Running            0               12h     10.104.23.159   4am-node27   <none>           <none>
level-zero-insert-op-96-4610-milvus-indexnode-6cf6fbfbfb-q8kz8    1/1     Running            0               12h     10.104.18.108   4am-node25   <none>           <none>
level-zero-insert-op-96-4610-milvus-mixcoord-5c757dfd46-xx2zc     1/1     Running            0               12h     10.104.16.170   4am-node21   <none>           <none>
level-zero-insert-op-96-4610-milvus-proxy-67c78785-nm4b5          1/1     Running            1 (12h ago)     12h     10.104.25.85    4am-node30   <none>           <none>
level-zero-insert-op-96-4610-milvus-querynode-0-67d5895bd4dklpj   1/1     Running            0               12h     10.104.19.167   4am-node28   <none>           <none>
level-zero-insert-op-96-4610-milvus-querynode-0-67d5895bd4hk25l   1/1     Running            0               12h     10.104.30.197   4am-node38   <none>           <none>
level-zero-insert-op-96-4610-milvus-querynode-0-67d5895bd4t2v99   0/1     CrashLoopBackOff   25 (8s ago)     12h     10.104.20.65    4am-node22   <none>           <none>
level-zero-insert-op-96-4610-milvus-querynode-0-67d5895bd4zdbg6   1/1     Running            0               12h     10.104.30.196   4am-node38   <none>           <none>
level-zero-insert-op-96-4610-minio-0                              1/1     Running            0               12h     10.104.18.81    4am-node25   <none>           <none>
level-zero-insert-op-96-4610-minio-1                              1/1     Running            0               12h     10.104.15.70    4am-node20   <none>           <none>
level-zero-insert-op-96-4610-minio-2                              1/1     Running            0               12h     10.104.30.178   4am-node38   <none>           <none>
level-zero-insert-op-96-4610-minio-3                              1/1     Running            0               12h     10.104.27.211   4am-node31   <none>           <none>
level-zero-insert-op-96-4610-pulsar-bookie-0                      1/1     Running            0               12h     10.104.23.153   4am-node27   <none>           <none>
level-zero-insert-op-96-4610-pulsar-bookie-1                      1/1     Running            0               12h     10.104.21.230   4am-node24   <none>           <none>
level-zero-insert-op-96-4610-pulsar-bookie-2                      1/1     Running            0               12h     10.104.24.84    4am-node29   <none>           <none>
level-zero-insert-op-96-4610-pulsar-bookie-init-pw674             0/1     Completed          0               12h     10.104.18.70    4am-node25   <none>           <none>
level-zero-insert-op-96-4610-pulsar-broker-0                      1/1     Running            0               12h     10.104.14.113   4am-node18   <none>           <none>
level-zero-insert-op-96-4610-pulsar-proxy-0                       1/1     Running            0               12h     10.104.27.204   4am-node31   <none>           <none>
level-zero-insert-op-96-4610-pulsar-pulsar-init-q4kcs             0/1     Completed          0               12h     10.104.26.77    4am-node32   <none>           <none>
level-zero-insert-op-96-4610-pulsar-recovery-0                    1/1     Running            0               12h     10.104.1.195    4am-node10   <none>           <none>
level-zero-insert-op-96-4610-pulsar-zookeeper-0                   1/1     Running            0               12h     10.104.18.82    4am-node25   <none>           <none>
level-zero-insert-op-96-4610-pulsar-zookeeper-1                   1/1     Running            0               12h     10.104.25.84    4am-node30   <none>           <none>
level-zero-insert-op-96-4610-pulsar-zookeeper-2                   1/1     Running            0               12h     10.104.23.157   4am-node27   <none>           <none>

Anything else?

No response

ThreadDao commented 4 months ago

/assign @congqixia

congqixia commented 4 months ago

The target failed to sync, which caused growing segments to get stuck in the delegator. Digging into why the target update failed.

ThreadDao commented 4 months ago

Previous test result:

- argo: https://argo-workflows.zilliz.cc/archived-workflows/qa/8c535d6e-68ea-4dc3-ac49-13772107181f?nodeId=level-zero-stable-1720105200-try-2485191624
- image: 2.4-20240705-261b61e8
- metrics of level-zero-insert-op-40-4759: [image] [image]

xiaofan-luan commented 3 months ago

/assign @bigsheeper

Let's add a seal policy that keeps the growing segments of each shard below 4GB.
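
To make the proposal concrete, here is a minimal sketch of what such a per-shard seal policy could look like. It is not Milvus code: `GrowingSegment`, `pick_segments_to_seal`, and the largest-first selection order are all hypothetical and chosen only for illustration, and 4GB is interpreted here as 4 GiB.

```python
from dataclasses import dataclass

# Assumption: "4GB" is read as 4 GiB; the issue does not say which unit is meant.
SHARD_GROWING_LIMIT_BYTES = 4 * 1024 ** 3


@dataclass
class GrowingSegment:
    """Hypothetical stand-in for a growing segment tracked per shard."""
    segment_id: int
    shard: str
    size_bytes: int


def pick_segments_to_seal(segments, limit_bytes=SHARD_GROWING_LIMIT_BYTES):
    """Return sorted IDs of segments to seal so that the growing data
    remaining in each shard is below limit_bytes."""
    by_shard = {}
    for seg in segments:
        by_shard.setdefault(seg.shard, []).append(seg)

    to_seal = []
    for shard_segments in by_shard.values():
        total = sum(seg.size_bytes for seg in shard_segments)
        # Seal the largest segments first so that as few as possible are sealed.
        for seg in sorted(shard_segments, key=lambda s: s.size_bytes, reverse=True):
            if total < limit_bytes:
                break
            to_seal.append(seg.segment_id)
            total -= seg.size_bytes
    return sorted(to_seal)


GIB = 1024 ** 3
example = [
    GrowingSegment(1, "shard-0", 3 * GIB),
    GrowingSegment(2, "shard-0", 2 * GIB),
    GrowingSegment(3, "shard-0", 1 * GIB),
    GrowingSegment(4, "shard-1", 1 * GIB),
]
print(pick_segments_to_seal(example))
```

In this example shard-0 holds 6 GiB of growing data, so its largest segment is selected for sealing, while shard-1 stays untouched. A real policy would also have to decide when the check runs and how it interacts with the existing seal policies.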

bigsheeper commented 5 hours ago

/assign @ThreadDao

This should be fixed now; please help verify.

bigsheeper commented 5 hours ago

/unassign