pycui opened this issue 2 months ago
@pycui please share more info about what you did before the issue popped up: how much data/collections/entities are you running, and what is the schema? Also please attach all the Milvus pods' log files for investigation. /assign @pycui /unassign
1 collection, 750M rows at the time it happened. Was probably writing 5K rows/sec for a while before the issue. Unfortunately I cannot share the schema, but it's very simple (1024-dim vector, DiskANN index, a few scalar fields, no index on the scalar fields, collection not loaded).
The datanode's log is already gone in k8s. Let me know which other specific pods' logs you want.
If you have only one shard, then only one datanode can be used.
For 750 million rows we recommend using 4-8 shards to start.
We also recommend using bulk insert instead of row-by-row insertion when you want to import large collections; see the sketch below.
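A minimal sketch of both suggestions, assuming pymilvus 2.2+; the collection name, field layout, and file path are placeholders, and the data file must already be in the object storage Milvus reads from (e.g. its MinIO/S3 bucket):

```python
# Sketch only: create a collection with multiple shards and import via bulk insert.
from pymilvus import (
    connections, utility, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

# Spread write load across channels/datanodes by creating the collection
# with more than one shard (4-8 for ~750M rows, per the recommendation above).
schema = CollectionSchema(fields=[
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
])
collection = Collection(name="my_collection", schema=schema, shards_num=8)

# Import from a file that already sits in Milvus' object storage, instead of
# streaming individual inserts through the datanodes.
task_id = utility.do_bulk_insert(
    collection_name="my_collection",
    files=["my_collection/data.json"],  # placeholder path inside the bucket
)
print(utility.get_bulk_insert_state(task_id))
```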
There is no reason a datanode should use 100GB of memory or more. The largest cluster we manage uses at most 16GB per datanode and it works perfectly. You need to figure out where that memory is being used by running pprof.
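As a rough sketch of how to capture a heap profile: Milvus components are written in Go and typically expose the standard pprof handlers on the metrics port (9091 here is an assumption; adjust the host/port to whatever your `kubectl port-forward` gives you for the datanode pod):

```python
# Sketch: download a heap profile from the datanode's pprof endpoint.
import urllib.request

PPROF_URL = "http://127.0.0.1:9091/debug/pprof/heap"  # assumed port-forwarded address

with urllib.request.urlopen(PPROF_URL, timeout=30) as resp:
    profile = resp.read()

with open("datanode_heap.pb.gz", "wb") as f:
    f.write(profile)

# Analyze the dump with the Go toolchain, e.g.:
#   go tool pprof -top datanode_heap.pb.gz
```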
@pycui do you want to set up a meeting with me so we can learn more details about your use case and help?
Your use case seems to be an interesting one and I'm sure we can help. My email is james.luan@zilliz.com.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
Is there an existing issue for this?
Environment
Current Behavior
One datanode's memory consumption increases very rapidly and always results in an OOM error. See screenshot: 7 of the 8 datanodes are using very little memory, but one is using 330G+ only 8 minutes after a restart.
I suspect this is due to a backlog caused by a period of heavy writes, but this is still unreasonable. It is not recovering on its own (it keeps crash looping). I had to increase the memory limit to 512G to resolve this. However, this is very fragile, so I would like better handling going forward.
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
However, there is no error in the log itself (the pod was killed by the OOM killer externally).
Anything else?
No response