Closed: elstic closed this issue 1 week ago.
This is mainly due to excessive memory usage.
@SimFG But Milvus did not enable the DML quota and limit, so why does it report that error?
/assign @SimFG /unassign
The concurrency test afterward did not trigger compaction, so segments accumulated and memory usage grew to nearly 100%.
Could be related to PR #32326.
This issue is not fixed. Verify image: master-20240430-5bb672d7. Server:
fouram-disk-sta72800-3-40-5704-etcd-0 1/1 Running 0 5h8m 10.104.23.27 4am-node27 <none> <none>
fouram-disk-sta72800-3-40-5704-etcd-1 1/1 Running 0 5h8m 10.104.15.121 4am-node20 <none> <none>
fouram-disk-sta72800-3-40-5704-etcd-2 1/1 Running 0 5h8m 10.104.34.187 4am-node37 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-datacoord-746568b9c88z2df 1/1 Running 4 (5h6m ago) 5h8m 10.104.25.156 4am-node30 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-datanode-6dc8b86fbc-sxg6s 1/1 Running 4 (5h6m ago) 5h8m 10.104.16.153 4am-node21 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-indexcoord-56c6d79b4xr885 1/1 Running 0 5h8m 10.104.25.154 4am-node30 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-indexnode-5954d9b694h4vpp 1/1 Running 4 (5h6m ago) 5h8m 10.104.21.37 4am-node24 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-proxy-64768cd77-fckmh 1/1 Running 4 (5h6m ago) 5h8m 10.104.25.155 4am-node30 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-querycoord-5699467b9tcxqf 1/1 Running 4 (5h6m ago) 5h8m 10.104.25.153 4am-node30 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-querynode-865c5b87c66vjjn 1/1 Running 4 (5h6m ago) 5h8m 10.104.18.8 4am-node25 <none> <none>
fouram-disk-sta72800-3-40-5704-milvus-rootcoord-594c9cc978l78t6 1/1 Running 4 (5h6m ago) 5h8m 10.104.25.152 4am-node30 <none> <none>
fouram-disk-sta72800-3-40-5704-minio-0 1/1 Running 0 5h8m 10.104.33.109 4am-node36 <none> <none>
fouram-disk-sta72800-3-40-5704-minio-1 1/1 Running 0 5h8m 10.104.24.247 4am-node29 <none> <none>
fouram-disk-sta72800-3-40-5704-minio-2 1/1 Running 0 5h8m 10.104.34.184 4am-node37 <none> <none>
fouram-disk-sta72800-3-40-5704-minio-3 1/1 Running 0 5h8m 10.104.23.30 4am-node27 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-bookie-0 1/1 Running 0 5h8m 10.104.25.170 4am-node30 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-bookie-1 1/1 Running 0 5h8m 10.104.23.28 4am-node27 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-bookie-2 1/1 Running 0 5h8m 10.104.34.185 4am-node37 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-bookie-init-rhlxf 0/1 Completed 0 5h8m 10.104.4.17 4am-node11 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-broker-0 1/1 Running 0 5h8m 10.104.13.203 4am-node16 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-proxy-0 1/1 Running 0 5h8m 10.104.4.18 4am-node11 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-pulsar-init-z9nx2 0/1 Completed 0 5h8m 10.104.13.204 4am-node16 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-recovery-0 1/1 Running 0 5h8m 10.104.14.192 4am-node18 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-zookeeper-0 1/1 Running 0 5h8m 10.104.25.169 4am-node30 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-zookeeper-1 1/1 Running 0 5h6m 10.104.23.32 4am-node27 <none> <none>
fouram-disk-sta72800-3-40-5704-pulsar-zookeeper-2 1/1 Running 0 5h5m 10.104.32.67 4am-node39 <none> <none>
Issue fixed. Verify image: master-20240511-8a9a4219.
Hello @elstic! I met the same issue in Milvus v2.4.1 (deployed by helm using chart version milvus-4.1.30). I am wondering about the reason for this error. And for now, to fix it, do I need to re-deploy Milvus with the image master-20240511-8a9a4219? Thank you if you could offer any help!
Hello, the essence of the problem I'm documenting is that there is no compaction, so segments are never merged. For example, there is no data for compaction latency in the graph.
But as far as I know, v2.4.1 should not have this problem. Can you describe your problem in detail? Is it either of the following?
1) Error: "quota exceeded[reason=rate type: DMLInsert]". Please check whether your memory usage is high or whether you have configured rate limiting: quotaAndLimits.dml.enabled.
2) Check whether your instance has been compacted; if not, then your problem is the same as mine.
Thanks for your reply first!
I am testing the bottleneck of Milvus insert performance. My scenario is continuous batch inserts (10,000 rows of 768-dim random data at a time), and the insert API reported the error `RPC error: [batch_insert], <MilvusException: (code=9, message=quota exceeded[reason=rate type: DMLInsert])>` when the number of entities reached the 200 million level. I didn't set `quotaAndLimits.dml` at deployment; I guess the default value is false.
I checked the rootcoord log at that time; it reported `QueryNode memory to low water level, limit writing rate`, and the querynode log also reported `no sufficient resource to load segments`. So I tried to allocate more memory quota for the querynodes and restarted the pods, but it still raises the same error when inserting.
I think I should upgrade the whole Milvus cluster then.
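For reference, the memory-based write protection that produces the `memory to low water level, limit writing rate` log lives under `quotaAndLimits.limitWriting` in `milvus.yaml` and can throttle inserts even when `quotaAndLimits.dml.enabled` is false. A sketch of the relevant section (key names from recent Milvus versions; the water-level values shown are assumed defaults and may differ by release):

```yaml
quotaAndLimits:
  dml:
    enabled: false        # explicit DML rate limits (off by default)
  limitWriting:
    memProtection:
      enabled: true       # back-pressure on writes based on node memory usage
      queryNodeMemoryLowWaterLevel: 0.85   # throttle writes above this ratio
      queryNodeMemoryHighWaterLevel: 0.95  # deny writes above this ratio
```

This would explain seeing the DMLInsert quota error despite never enabling the DML quota yourself: the write throttling kicks in once query node memory crosses the low water level.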
You need to calculate how much memory is needed to load 200 million vectors. This could cost a couple of hundred gigabytes.
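To make that estimate concrete, here is a minimal back-of-the-envelope sketch (plain Python; the helper name is hypothetical, and it counts only the raw float32 vector payload, ignoring index structures, replicas, and growing-segment overhead):

```python
def raw_vector_memory_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Return the raw vector payload size in GiB (float32 by default)."""
    return num_vectors * dim * bytes_per_value / 1024**3

# 200 million 768-dim float32 vectors, as in the test scenario above:
size_gib = raw_vector_memory_gib(200_000_000, 768)
print(f"{size_gib:.1f} GiB")  # roughly 572 GiB of raw vector data alone
```

So 16 query nodes with 64 Gi each (1 TiB total) is close to this workload's raw footprint before any index or runtime overhead, which is consistent with hitting the memory water levels.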
Thanks for your advice!
I estimated the requested resource quota before deployment (I tried 16 querynodes of 16 CPU / 64 Gi each first). I guess the cause is that the high write rate leads to high memory usage.
Actually, I am wondering whether Milvus supports resizing the resource quota of worker nodes dynamically. In this scenario, I tried helm commands like `helm upgrade -f values_custom.yaml my_milvus zilliztech/milvus --reuse-values --set queryNode.resources.requests.memory=128Gi` or `helm upgrade -f values_custom.yaml my_milvus zilliztech/milvus --reuse-values --set queryNode.replicas=32` to resize and restart the querynodes. They did get more resources, but the same error was still raised when inserting. So I uninstalled and then installed the whole Milvus cluster again, and it finally works.
I have two questions, hoping you could give some advice :)
1) How does Milvus resize its scale correctly as the data grows toward its limit?
2) When components crash and restart (due to disconnecting from etcd, for example) during insertion, how does Milvus recover and sync the data? Could you explain or offer some reference documents showing the pipeline? I've run into this scenario a few times before; sometimes it restarts and recovers automatically, and sometimes it crashes completely, making the insert API unavailable.
Can you explain your use case a little bit? It seems to be a large deployment. I'm glad to set up an offline meeting and offer some help. Please contact me at xiaofan.luan@zilliz.com.
Is there an existing issue for this?
Environment
Current Behavior
case: test_concurrent_locust_diskann_compaction_standalone, test_concurrent_locust_diskann_compaction_cluster
server:
client pod: fouram-disk-stab-1714413600-861861569. Client error log:
client result: About 30% of insert requests fail.
Expected Behavior
Milvus can insert normally and is not rate limited.
Steps To Reproduce
Milvus Log
No response
Anything else?
No response