ThreadDao opened this issue 3 months ago (status: Open)
Environment

- Milvus version: 2.4-20240711-1d2062a6-amd64
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior

deploy

Deploy a cluster with 2 dataNodes, using the following configuration:
```yaml
dataNode:
  replicas: 2
  resources:
    limits:
      cpu: "4"
      memory: 16Gi
    requests:
      cpu: "4"
      memory: 16Gi
config:
  log:
    level: debug
  minio:
    accessKeyID: miniozong
    bucketName: bucket-zong
    rootPath: compact
    secretAccessKey: miniozong
  trace:
    exporter: jaeger
    jaeger:
      url: http://tempo-distributor.tempo:14268/api/traces
    sampleFraction: 1
```
tests

1. create collection -> create index -> insert 1B rows of 128-dim data -> flush -> create index
2. upsert continuously (a rough pymilvus sketch of this workload follows)
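For orientation only, here is a minimal pymilvus sketch of this workload. It is an assumption-laden illustration: the endpoint, collection and field names, index parameters, and batch sizes are made up here and are not taken from the actual test, which runs via the argo workflows linked below.

```python
import random
from pymilvus import (
    connections, Collection, CollectionSchema, DataType, FieldSchema, utility,
)

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

DIM = 128
schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
])
coll = Collection("compact_no_flush_1b", schema)  # hypothetical collection name

# Illustrative index parameters (the real test's index type/params are not in the issue).
coll.create_index("vec", {"index_type": "HNSW", "metric_type": "L2",
                          "params": {"M": 8, "efConstruction": 200}})

# Insert 1B x 128-dim rows in batches (batch size is illustrative).
BATCH, TOTAL = 50_000, 1_000_000_000
for start in range(0, TOTAL, BATCH):
    ids = list(range(start, start + BATCH))
    vecs = [[random.random() for _ in range(DIM)] for _ in ids]
    coll.insert([ids, vecs])

coll.flush()
# Second "index" step: wait until the index build on flushed segments completes.
utility.wait_for_index_building_complete(coll.name)

# Continuous upsert of existing primary keys; each upsert produces deletes
# that L0 compaction must later apply to sealed segments.
for _ in range(10_000):  # runs much longer in the real test
    ids = random.sample(range(TOTAL), BATCH)
    vecs = [[random.random() for _ in range(DIM)] for _ in ids]
    coll.upsert([ids, vecs])
```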
results

grafana link: metrics of compact-no-flush-1b-5
Many client upsert requests failed with `<MilvusException: (code=65535, message=message send timeout: TimeoutError)>`.
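For reference, a minimal client-side sketch of how such a failure could be caught and retried with backoff, assuming the pymilvus ORM API; `upsert_with_retry` is a hypothetical helper and not part of the test code:

```python
import time
from pymilvus import MilvusException

def upsert_with_retry(coll, rows, retries=5, base_delay=1.0):
    """Hypothetical helper: retry an upsert with exponential backoff on errors."""
    for attempt in range(retries):
        try:
            return coll.upsert(rows)
        except MilvusException as exc:
            # Observed failure in this incident: code=65535, "message send timeout: TimeoutError"
            if attempt == retries - 1:
                raise
            print(f"upsert failed ({exc}); retrying in {base_delay * 2 ** attempt:.1f}s")
            time.sleep(base_delay * (2 ** attempt))
```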
L0 compaction cannot keep up with the billion-row upsert workload, and L0 compaction latency appears slightly elevated.
dataNode compact-no-flush-1b-5-milvus-datanode-b964bd999-8x55b was OOMKilled during compaction.
Expected Behavior

No response
Steps To Reproduce

argo:
1. https://argo-workflows.zilliz.cc/archived-workflows/qa/8c35a58b-84d4-4e8f-80d6-ac66c256a15b?nodeId=compact-opt-1b-no-flush-5
2. https://argo-workflows.zilliz.cc/archived-workflows/qa/686b993b-bf01-47aa-a049-7e5fe0ac38a1?nodeId=compact-opt-1b-no-flush-6
Milvus Log

pods:

```
compact-no-flush-1b-5-etcd-0                                1/1   Running     0             12d   10.104.18.242   4am-node25   <none>   <none>
compact-no-flush-1b-5-etcd-1                                1/1   Running     0             11d   10.104.30.182   4am-node38   <none>   <none>
compact-no-flush-1b-5-etcd-2                                1/1   Running     0             12d   10.104.23.182   4am-node27   <none>   <none>
compact-no-flush-1b-5-milvus-datanode-b964bd999-8x55b       1/1   Running     1 (25h ago)   3d    10.104.23.32    4am-node27   <none>   <none>
compact-no-flush-1b-5-milvus-datanode-b964bd999-wrm87       1/1   Running     0             3d    10.104.6.70     4am-node13   <none>   <none>
compact-no-flush-1b-5-milvus-indexnode-56879c85d4-7r859     1/1   Running     0             3d    10.104.1.118    4am-node10   <none>   <none>
compact-no-flush-1b-5-milvus-indexnode-56879c85d4-sgfgd     1/1   Running     0             3d    10.104.23.29    4am-node27   <none>   <none>
compact-no-flush-1b-5-milvus-indexnode-56879c85d4-xrfsq     1/1   Running     0             3d    10.104.6.71     4am-node13   <none>   <none>
compact-no-flush-1b-5-milvus-mixcoord-865ffd89dd-pnh2d      1/1   Running     0             3d    10.104.23.31    4am-node27   <none>   <none>
compact-no-flush-1b-5-milvus-proxy-7fc7f9cc65-rf589         1/1   Running     0             3d    10.104.6.69     4am-node13   <none>   <none>
compact-no-flush-1b-5-milvus-querynode-0-6869bd768b-j6jjn   1/1   Running     0             3d    10.104.23.30    4am-node27   <none>   <none>
compact-no-flush-1b-5-pulsar-bookie-0                       1/1   Running     0             12d   10.104.18.243   4am-node25   <none>   <none>
compact-no-flush-1b-5-pulsar-bookie-1                       1/1   Running     0             11d   10.104.30.195   4am-node38   <none>   <none>
compact-no-flush-1b-5-pulsar-bookie-2                       1/1   Running     0             12d   10.104.23.183   4am-node27   <none>   <none>
compact-no-flush-1b-5-pulsar-bookie-init-vjzcf              0/1   Completed   0             12d   10.104.18.237   4am-node25   <none>   <none>
compact-no-flush-1b-5-pulsar-broker-0                       1/1   Running     0             12d   10.104.18.238   4am-node25   <none>   <none>
compact-no-flush-1b-5-pulsar-proxy-0                        1/1   Running     0             12d   10.104.18.236   4am-node25   <none>   <none>
compact-no-flush-1b-5-pulsar-pulsar-init-w4ntj              0/1   Completed   0             12d   10.104.18.235   4am-node25   <none>   <none>
compact-no-flush-1b-5-pulsar-recovery-0                     1/1   Running     0             12d   10.104.18.234   4am-node25   <none>   <none>
compact-no-flush-1b-5-pulsar-zookeeper-0                    1/1   Running     0             12d   10.104.18.244   4am-node25   <none>   <none>
compact-no-flush-1b-5-pulsar-zookeeper-1                    1/1   Running     0             12d   10.104.23.193   4am-node27   <none>   <none>
compact-no-flush-1b-5-pulsar-zookeeper-2                    1/1   Running     0             11d   10.104.30.176   4am-node38   <none>   <none>
```

Anything else?

No response
/assign @XuanYang-cn
At the very least we need some quota limitation to back-pressure the deletes/upserts when compaction cannot catch up?
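To illustrate the back-pressure idea, here is a client-side sketch of the principle only: a token-bucket cap on the upsert row rate, so deletes are produced no faster than L0 compaction can absorb them. This is not the server-side quota mechanism suggested above, and all names and numbers are hypothetical.

```python
import time

class UpsertRateLimiter:
    """Token-bucket cap on upserted rows per second (all values illustrative)."""
    def __init__(self, rows_per_sec: float):
        self.rate = rows_per_sec
        self.tokens = rows_per_sec
        self.last = time.monotonic()

    def acquire(self, rows: int) -> None:
        assert rows <= self.rate, "batch must fit the per-second budget"
        while True:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at one second's budget.
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= rows:
                self.tokens -= rows
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((rows - self.tokens) / self.rate)

# Usage with the workload sketch above (names hypothetical):
# limiter = UpsertRateLimiter(rows_per_sec=20_000)
# limiter.acquire(len(ids))
# coll.upsert([ids, vecs])
```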
We also need to investigate the root causes and possible optimizations.