kdabbir closed this issue 6 months ago.
/assign @MrPresent-Han could you please take a look? /unassign
@yanliang567 @xiaofan-luan would appreciate your inputs on this issue. We are unable to run even 1000 collections with 2.3.9. I've attached a log export of the current state of the cluster. If you need more logs, I can gather them from Splunk and provide the ones from when the errors started occurring.
We plan to add up to 10k collections per cluster, so this instability will cause issues in our workflows.
I will handle this
Hi @kdabbir, I checked the two log files provided and found it odd that there are hardly any Milvus logs in them; most of the output is associated with 'exporter_metrics.go'.
Can you confirm the content of the files provided? I also recommend using the dedicated tool for exporting Milvus pod logs: https://github.com/milvus-io/milvus/tree/master/deployments/export-log
@MrPresent-Han let me re-export the logs; not sure why it exported only metrics.
@MrPresent-Han I've re-uploaded the logs in the original post/comment. I had to export the logs from Splunk as there was an issue with the k9s export, but I made sure all the respective pod logs are included. Let me know if you need any other logs.
@kdabbir I took a quick look. It looks like the coordinators were overwhelmed by certain requests/tasks. Could you check the resource usage (CPU & memory) of the coordinators during those time periods?
@yiwangdr I see datacoord spiking up to 12 GB of memory during this period. Why does datacoord need so much memory? The other nodes' metrics seem fine in terms of CPU/memory. Attaching snapshots from our Grafana.
Data coord:
Data worker:
Index-coord:
Index-worker:
proxy-worker:
Query-coord:
Query-worker:
root-coord:
@kdabbir Yes, the datacoord memory usage pattern looks bizarre. We can't reproduce it. Could you pprof it when the memory usage is high? Milvus has pprof enabled by default at port 9091.
Sample commands:
curl http://{ip_of_datacoord}:9091/debug/pprof/heap > heap.out
curl http://{ip_of_datacoord}:9091/debug/pprof/goroutine > goroutine.out
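If it's easier to script than to run curl by hand, here is a minimal Python sketch that fetches the same two pprof endpoints (the datacoord IP below is a placeholder you would need to replace):

```python
# Minimal sketch: fetch datacoord pprof dumps over HTTP, equivalent to the
# curl commands above. DATACOORD_IP is a placeholder for your pod/service IP.
import urllib.request

DATACOORD_IP = "10.0.0.1"  # placeholder, replace with the datacoord pod IP
BASE = f"http://{DATACOORD_IP}:9091/debug/pprof"

for profile, out_file in [("heap", "heap.out"), ("goroutine", "goroutine.out")]:
    url = f"{BASE}/{profile}"
    urllib.request.urlretrieve(url, out_file)  # saves the raw pprof dump
    print(f"saved {url} -> {out_file}")
```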
Most likely it is due to a bottleneck in datacoord meta; see #30837 and #30823. Since there is a single lock on meta operations, many of the datacoord requests might be blocked by that lock.
A pprof would be really helpful. From the current info, datacoord definitely hit timeouts.
@kdabbir I am trying to reproduce this issue in-house, but it seems I ran into a different situation than you did. I did not see much memory increase on the datacoord pod, but I did see significant CPU and memory increases on the datanode pods. So if convenient, could you please share some code snippets to reproduce this issue? /assign @kdabbir
@yanliang567 @MrPresent-Han can we have a call to discuss this? I can show my test setup and walk through the logs/dashboards of the issue. Let me know what time works for you tomorrow. Test code might not give a full picture of this issue.
@jaime0815 @xiaofan-luan I have debugged this flow and I can see that the UpdateChannelCheckPoint meta lock in datacoord is causing the issues in the bug I've reported. I have also seen that this PR is closed.
> Could you pprof it when the memory usage is high? Milvus has pprof enabled by default at port 9091.
@xiaofan-luan @yanliang567 @MrPresent-Han Attaching pprof dumps of goroutine and heap taken when datacoord memory usage was around 8 GB: goroutine and heap compressed.zip
> could you please share some code snippets to reproduce this issue?
Attached is the Python Milvus SDK scale script we are using: milvus_scale_script.py.zip
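Per the issue description, the workload is: 1000 one-shard collections, with 100k-row inserts into randomly chosen collections until each holds 500k rows. A minimal pymilvus sketch of that workload (not the attached script; the field names, vector dimension, and connection details below are made up) looks roughly like this:

```python
# Minimal sketch of the described workload (not the attached script):
# create 1000 one-shard collections and insert 100k-row batches into randomly
# chosen collections until each holds ~500k rows. Field names and the vector
# dimension are made up; the issue does not specify them.
import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")  # adjust to your cluster

NUM_COLLECTIONS = 1000
DIM = 8            # dimension not given in the issue; kept small for the sketch
BATCH = 100_000    # rows per insert call
TARGET = 500_000   # target rows per collection

schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
])

cols = [Collection(f"scale_test_{i}", schema, shards_num=1)
        for i in range(NUM_COLLECTIONS)]
inserted = [0] * NUM_COLLECTIONS

while any(n < TARGET for n in inserted):
    i = random.choice([k for k, n in enumerate(inserted) if n < TARGET])
    vectors = [[random.random() for _ in range(DIM)] for _ in range(BATCH)]
    cols[i].insert([vectors])   # only the vector column; pk is auto_id
    inserted[i] += BATCH
```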
I will take a look at this. The overwhelming memory usage in this case is due to blocked RPC connections: most UpdateChannelCP requests are blocked by the lock and cannot be terminated, so the follow-up requests accumulate, reaching 373,277, and the memory consumed by those goroutines crushed the datacoord.
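As a toy illustration of that failure mode (Milvus itself is Go; this Python snippet is not Milvus code and the names are invented): when every checkpoint update has to take one global lock, concurrent requests serialize behind it, and the number of blocked requests, each holding memory, keeps growing.

```python
# Toy illustration (not Milvus code): a single global lock serializes all
# "UpdateChannelCP-like" requests, so under load the queue of blocked,
# memory-holding requests grows far faster than it drains.
import threading
import time

meta_lock = threading.Lock()      # stands in for datacoord's global meta lock
pending = 0                       # requests currently waiting on the lock
pending_guard = threading.Lock()

def update_channel_checkpoint(channel: str) -> None:
    global pending
    with pending_guard:
        pending += 1
    with meta_lock:               # every request funnels through this one lock
        time.sleep(0.01)          # pretend the meta update takes 10 ms
    with pending_guard:
        pending -= 1

threads = [threading.Thread(target=update_channel_checkpoint, args=(f"ch-{i}",))
           for i in range(500)]
for t in threads:
    t.start()

time.sleep(0.5)                   # 0.5 s in, only ~50 requests have finished
with pending_guard:
    print(f"requests still blocked on the global lock: {pending}")
for t in threads:
    t.join()
```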
@kdabbir, this problem is still caused by the global lock; another PR, https://github.com/milvus-io/milvus/pull/30941, will fix this error.
Thanks. When can we expect a release for this? I would like to test this fix out and report back.
Version 2.3.11 is on the way, next Tuesday or Wednesday, I suppose.
Milvus v2.3.11 was released last night, please enjoy it. :P
/assign @yanliang567 mark it as fixed?
@kdabbir any chance you could try with Milvus 2.3.11?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen
Is there an existing issue for this?
Environment
Current Behavior
As a follow-up to the previous issue we raised, we re-ran our tests with 2 Milvus clusters, one per setting: 1) normal insert, 2) bulk insert.
We created 1000 collections (no partitions) with 1 shard each and started ingesting 500k records per collection. Every insert call carries 100k records, and we insert randomly into these 1000 collections until each reaches 500k records.
After we inserted around 20 million vectors, we saw the data nodes for both bulk and normal insert throwing errors:
rpc error: code = DeadlineExceeded desc = context deadline exceeded
retry func failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
failed to update channel checkpoint
I've attached the current log exports for both normal insert and bulk insert. Please help us understand the issue and suggest a fix.
Expected Behavior
The cluster should be able to handle 1 billion vectors across 1000 collections.
Steps To Reproduce
No response
Milvus Log
Bulk Insert Milvus Error Export Logs.zip
Standard Insert Milvus Error Export Logs.zip
Anything else?
No response