milvus-io / milvus


[Bug]: L0 compaction cannot keep up with upsert of one billion data #34670

Open ThreadDao opened 3 months ago

ThreadDao commented 3 months ago

Is there an existing issue for this?

- I have searched the existing issues

Environment

- Milvus version: 2.4-20240711-1d2062a6-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

deploy

deploy a cluster with 2 dataNodes:

  dataNode:
    replicas: 2
    resources:
      limits:
        cpu: "4"
        memory: 16Gi
      requests:
        cpu: "4"
        memory: 16Gi
  config:
    log:
      level: debug
    minio:
      accessKeyID: miniozong
      bucketName: bucket-zong
      rootPath: compact
      secretAccessKey: miniozong
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1

tests

  1. create collection -> index -> insert 1b rows of 128-dim data -> flush -> index

  2. upsert continuously (a pymilvus sketch of steps 1-2 follows this list) [screenshot]

    results

    grafana link: metrics of compact-no-flush-1b-5

  3. many client upsert requests failed with <MilvusException: (code=65535, message=message send timeout: TimeoutError)> [screenshot]

  4. L0 compaction cannot keep up with the billion-row upsert, and the latency of L0 compaction appears slightly higher. [screenshot]

  5. dataNode compact-no-flush-1b-5-milvus-datanode-b964bd999-8x55b was OOMKilled during compaction [screenshot]
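
The workflow in steps 1-2 can be sketched with pymilvus roughly as below. The collection name, schema, index parameters, and batch sizes are illustrative assumptions, not taken from the original argo workflow.

import random
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

dim = 128
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=dim),
])
coll = Collection("compact_no_flush_1b", schema)  # hypothetical name

# Step 1: index, bulk insert 1b rows of 128-dim vectors, flush, index.
index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 200},  # illustrative values
}
coll.create_index("vec", index_params)

batch, total = 50_000, 1_000_000_000
for start in range(0, total, batch):
    ids = list(range(start, start + batch))
    vecs = [[random.random() for _ in range(dim)] for _ in ids]
    coll.insert([ids, vecs])
coll.flush()
coll.create_index("vec", index_params)
coll.load()

# Step 2: continuous upserts over existing primary keys. Each upserted
# row also produces a delete record, which accumulates in L0 segments
# that L0 compaction must later apply to the sealed segments.
while True:
    start = random.randrange(0, total - batch)
    ids = list(range(start, start + batch))
    vecs = [[random.random() for _ in range(dim)] for _ in ids]
    coll.upsert([ids, vecs])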

Expected Behavior

No response

Steps To Reproduce

argo: 
1. https://argo-workflows.zilliz.cc/archived-workflows/qa/8c35a58b-84d4-4e8f-80d6-ac66c256a15b?nodeId=compact-opt-1b-no-flush-5
2. https://argo-workflows.zilliz.cc/archived-workflows/qa/686b993b-bf01-47aa-a049-7e5fe0ac38a1?nodeId=compact-opt-1b-no-flush-6

Milvus Log

pods:

compact-no-flush-1b-5-etcd-0                                      1/1     Running     0                12d     10.104.18.242   4am-node25   <none>           <none>
compact-no-flush-1b-5-etcd-1                                      1/1     Running     0                11d     10.104.30.182   4am-node38   <none>           <none>
compact-no-flush-1b-5-etcd-2                                      1/1     Running     0                12d     10.104.23.182   4am-node27   <none>           <none>
compact-no-flush-1b-5-milvus-datanode-b964bd999-8x55b             1/1     Running     1 (25h ago)      3d      10.104.23.32    4am-node27   <none>           <none>
compact-no-flush-1b-5-milvus-datanode-b964bd999-wrm87             1/1     Running     0                3d      10.104.6.70     4am-node13   <none>           <none>
compact-no-flush-1b-5-milvus-indexnode-56879c85d4-7r859           1/1     Running     0                3d      10.104.1.118    4am-node10   <none>           <none>
compact-no-flush-1b-5-milvus-indexnode-56879c85d4-sgfgd           1/1     Running     0                3d      10.104.23.29    4am-node27   <none>           <none>
compact-no-flush-1b-5-milvus-indexnode-56879c85d4-xrfsq           1/1     Running     0                3d      10.104.6.71     4am-node13   <none>           <none>
compact-no-flush-1b-5-milvus-mixcoord-865ffd89dd-pnh2d            1/1     Running     0                3d      10.104.23.31    4am-node27   <none>           <none>
compact-no-flush-1b-5-milvus-proxy-7fc7f9cc65-rf589               1/1     Running     0                3d      10.104.6.69     4am-node13   <none>           <none>
compact-no-flush-1b-5-milvus-querynode-0-6869bd768b-j6jjn         1/1     Running     0                3d      10.104.23.30    4am-node27   <none>           <none>
compact-no-flush-1b-5-pulsar-bookie-0                             1/1     Running     0                12d     10.104.18.243   4am-node25   <none>           <none>
compact-no-flush-1b-5-pulsar-bookie-1                             1/1     Running     0                11d     10.104.30.195   4am-node38   <none>           <none>
compact-no-flush-1b-5-pulsar-bookie-2                             1/1     Running     0                12d     10.104.23.183   4am-node27   <none>           <none>
compact-no-flush-1b-5-pulsar-bookie-init-vjzcf                    0/1     Completed   0                12d     10.104.18.237   4am-node25   <none>           <none>
compact-no-flush-1b-5-pulsar-broker-0                             1/1     Running     0                12d     10.104.18.238   4am-node25   <none>           <none>
compact-no-flush-1b-5-pulsar-proxy-0                              1/1     Running     0                12d     10.104.18.236   4am-node25   <none>           <none>
compact-no-flush-1b-5-pulsar-pulsar-init-w4ntj                    0/1     Completed   0                12d     10.104.18.235   4am-node25   <none>           <none>
compact-no-flush-1b-5-pulsar-recovery-0                           1/1     Running     0                12d     10.104.18.234   4am-node25   <none>           <none>
compact-no-flush-1b-5-pulsar-zookeeper-0                          1/1     Running     0                12d     10.104.18.244   4am-node25   <none>           <none>
compact-no-flush-1b-5-pulsar-zookeeper-1                          1/1     Running     0                12d     10.104.23.193   4am-node27   <none>           <none>
compact-no-flush-1b-5-pulsar-zookeeper-2                          1/1     Running     0                11d     10.104.30.176   4am-node38   <none>           <none>

Anything else?

No response

xiaofan-luan commented 3 months ago

/assign @XuanYang-cn

At least we need some quota limitation to backpressure the delete/upsert when we cannot catch up?

We also need to investigate the root causes and possible optimizations.
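
Until server-side backpressure exists, one client-side workaround is to back off when the proxy reports the send timeout from step 3 above. A minimal sketch, assuming pymilvus and that the overload surfaces as that MilvusException; the function name and retry parameters are illustrative.

import time
from pymilvus import MilvusException

def upsert_with_backoff(coll, data, max_retries=8, base_delay=1.0):
    """Retry coll.upsert with exponential backoff on send timeouts."""
    for attempt in range(max_retries):
        try:
            return coll.upsert(data)
        except MilvusException as exc:
            # Only retry the overload case seen in this issue;
            # re-raise anything else immediately.
            if "message send timeout" not in str(exc):
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("upsert still timing out after backoff; "
                       "L0 compaction is likely far behind")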