Open wangting0128 opened 5 months ago
argo task:multi-vector-corn-576nt image: 2.4-20240407-c18193cb-amd64
server:
NAME                                                            READY   STATUS    RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
multi-vector-corn-576nt-32-etcd-0                               1/1     Running   0          7h51m   10.104.27.15   4am-node31   <none>           <none>
multi-vector-corn-576nt-32-milvus-standalone-698c66d746-l8kbx   1/1     Running   0          7h51m   10.104.29.72   4am-node35   <none>           <none>
multi-vector-corn-576nt-32-minio-68957bbcfc-nlrnh               1/1     Running   0          7h51m   10.104.27.16   4am-node31   <none>           <none>
{pod=~"multi-vector-corn-576nt-32-milvus-standalone-698c66d746-l8kbx"} |~ "load"
milvus_load.log
client:
/unassign
The behavior may be related to the loading order of segments.
PredictedMemUsageAfterLoad = MemUsage + Predict(segment), and Predict(segment) grows with the segment size, so the prediction made for each load depends on which segments are already in memory. That can cause this issue.
Taking a very simple example, suppose we load all the segments sequentially, the sizes of the 5 segments are 1g, 2g, 3g, 4g, 5g, and Predict(segment) = 2.5 * segment size while ActualMemUsage(segment) = 1 * segment size. This assumption is reasonable because we have 48 segments, and if every segment really needed 2.5 times its size in memory, 64G would be far from enough.
So if we load them in descending order by segment size, the usage sequence will be [5, 9, 12, 14, 15] and the predict sequence will be [12.5, 15, 16.5, 17, 16.5]; the final predicted memory usage is 16.5g (the peak is 17g). On the contrary, if we load them in ascending order by segment size, the usage sequence will be [1, 3, 6, 10, 15], but the predict sequence will be [2.5, 6, 10.5, 16, 22.5]; the final predicted memory usage is 22.5g.
I think the above example can explain this issue.
@longjiquan how about changing the loading order to descending, and then verifying this case?
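For anyone who wants to check the arithmetic in the example above, here is a minimal Python sketch. It is not Milvus code: `simulate_load` and its parameters are hypothetical names, and the 2.5x prediction factor and 1x actual factor are just the assumptions from the example, not measured values.

```python
def simulate_load(segment_sizes_gb, predict_factor=2.5, actual_factor=1.0):
    """Simulate loading segments one by one.

    Returns (actual_usage, predicted_usage), where each list holds the value
    observed at every load step. Both factors are assumptions taken from the
    example in this thread, not real Milvus parameters.
    """
    mem_usage = 0.0
    actual, predicted = [], []
    for size in segment_sizes_gb:
        # PredictedMemUsageAfterLoad = MemUsage + Predict(segment)
        predicted.append(mem_usage + predict_factor * size)
        # After the load completes, only the actual footprint stays resident.
        mem_usage += actual_factor * size
        actual.append(mem_usage)
    return actual, predicted


sizes = [1, 2, 3, 4, 5]
desc_actual, desc_predicted = simulate_load(sorted(sizes, reverse=True))
asc_actual, asc_predicted = simulate_load(sorted(sizes))

print("descending:", desc_actual, desc_predicted, "peak:", max(desc_predicted))
print("ascending: ", asc_actual, asc_predicted, "peak:", max(asc_predicted))
```

Under these assumptions, the highest prediction is 17g in descending order and 22.5g in ascending order, although the real usage ends at 15g either way. So with a memory limit somewhere between those two values, the ascending order would be rejected by the check while the descending order would pass.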
@longjiquan any updates
Is there an existing issue for this?
Environment
Current Behavior
argo task: multi-vector-corn-sp4ff test case name: test_hybrid_search_locust_shard16_float_dql_ivf_flat_standalone
server:
Segment Loaded Num
memory usage
OOM log
load failed segment: 448956254485300637
client log:
Expected Behavior
No response
Steps To Reproduce
Milvus Log
No response
Anything else?
step 1 test result: test_hybrid_search_locust_shard16_float_dql_hnsw_standalone