milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Data nodes crashing with error: empty grpc client: context canceled in milvus version 2.3.9 #30905

Closed. kdabbir closed this issue 6 months ago.

kdabbir commented 8 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.3.9
- Deployment mode (standalone or cluster): cluster, deployed in Kubernetes
- MQ type (rocksmq, pulsar or kafka): external Kafka (AWS MSK)
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.3.6
- num_shards per collection: 1
- Total of 1000 collections, 1 shard each
- Vector dim: 768
- Scale of up to 1 billion vectors
- CPU/Memory:
  - Worker nodes: data nodes 15 CPU / 54 GB memory, index nodes 15 CPU / 25 GB memory, query nodes 14 CPU / 118 GB memory
  - Referred to the Milvus sizing tool and scaled up to 8 index nodes, 8 data nodes, 6 proxies, and 25 query nodes
  - Coordinator nodes: 8 CPU / 16 GB memory, based on c5a.2xlarge
- Also increased etcd ETCD_MAX_TXN_OPS to 10000 to support more operations

Current Behavior

As a follow-up to the previous issue we raised, we re-ran our tests with two Milvus clusters: 1) one using normal inserts and 2) one using bulk inserts.

We created 1000 collections (no partitions) with 1 shard each and started ingesting 500k records per collection. Each insert call carries 100k records, and we insert into randomly chosen collections among these 1000 until each reaches 500k records.
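
For reference, the ingestion pattern is roughly the sketch below (simplified, not our exact script; the connection endpoint and the id + 768-dim vector schema are placeholders, and the batch/target sizes match the description above):

```python
import random
import numpy as np
from pymilvus import (
    connections, utility, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="127.0.0.1", port="19530")  # placeholder endpoint

DIM, NUM_COLLECTIONS, BATCH, TARGET = 768, 1000, 100_000, 500_000

schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),
])

# 1000 collections, 1 shard each, no partitions.
names = [f"coll_{i}" for i in range(NUM_COLLECTIONS)]
for name in names:
    if not utility.has_collection(name):
        Collection(name, schema, shards_num=1)

# Insert 100k-row batches into randomly chosen collections
# until every collection reaches 500k rows.
inserted = {name: 0 for name in names}
while any(count < TARGET for count in inserted.values()):
    name = random.choice([n for n, c in inserted.items() if c < TARGET])
    vectors = np.random.random((BATCH, DIM)).astype(np.float32).tolist()
    Collection(name).insert([vectors])  # single column; auto_id pk is generated server-side
    inserted[name] += BATCH
```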

After we had inserted around 20 million vectors, data nodes in both clusters (bulk and normal insert) started throwing errors such as `rpc error: code = DeadlineExceeded desc = context deadline exceeded`, `retry func failed rpc error: code = DeadlineExceeded desc = context deadline exceeded`, and `failed to update channel checkpoint`.

I've attached the current log exports for both normal insert and bulk insert. Please help us understand the issue and suggest a fix.

Expected Behavior

The cluster should be able to handle 1 billion vectors across 1000 collections.

Steps To Reproduce

No response

Milvus Log

Bulk Insert Milvus Error Export Logs.zip
Standard Insert Milvus Error Export Logs.zip

Anything else?

No response

yanliang567 commented 8 months ago

/assign @MrPresent-Han could you please help take a look? /unassign

kdabbir commented 8 months ago

@yanliang567 @xiaofan-luan would appreciate your input on this issue. We are unable to run even 1000 collections with 2.3.9. I've attached a log export of the current state of the cluster. If you need more logs, I can gather them from Splunk for the period when the errors started occurring.

We plan to add up to 10k collections per cluster, so this instability will cause issues in our workflows.

MrPresent-Han commented 8 months ago

I will handle this

Hi @kdabbir, I checked the two log files provided and, oddly, I could not find logs related to Milvus; most of the output is associated with 'exporter_metrics.go'.

Can you confirm the contents of the files you provided? I also recommend using the dedicated tool for exporting Milvus pod logs: https://github.com/milvus-io/milvus/tree/master/deployments/export-log

kdabbir commented 8 months ago

@MrPresent-Han let me re-export the logs; not sure why only metrics were exported.

kdabbir commented 8 months ago

@MrPresent-Han I've re-uploaded the logs in the original post/comment. I had to export the logs from Splunk because there was an issue with the k9s export, but I made sure all the respective pod logs are included. Let me know if you need any other logs.

yiwangdr commented 8 months ago

@kdabbir I took a quick look. It looks like the coordinators were overwhelmed by certain requests/tasks. Could you check the resource usage (CPU and memory) of the coordinators during those time periods?

kdabbir commented 8 months ago

@yiwangdr I see the data coord spiking up to 12 GB of memory during this period. Why does data coord need so much memory? The metrics for the other nodes look fine in terms of CPU/memory. Attaching snapshots from our Grafana.

Data coord: (Grafana screenshot)

Data worker: (Grafana screenshot)

Index coord: (Grafana screenshot)

Index worker: (Grafana screenshot)

Proxy worker: (Grafana screenshot)

Query coord: (Grafana screenshot)

Query worker: (Grafana screenshot)

Root coord: (Grafana screenshot)

yiwangdr commented 8 months ago

@kdabbir Yes, the datacoord memory usage pattern looks bizarre. We can't reproduce it. Could you pprof it when the memory usage is high? Milvus has pprof enabled by default at port 9091.

Sample commands:
curl http://{ip_of_datacoord}:9091/debug/pprof/heap > heap.out
curl http://{ip_of_datacoord}:9091/debug/pprof/goroutine > goroutine.out
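
If it helps, the heap profile can then be inspected offline with the standard Go tooling, for example `go tool pprof -top heap.out` (assuming a Go toolchain is available on your side), and appending `?debug=2` to the goroutine endpoint returns a human-readable stack dump that shows what the goroutines are blocked on.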

xiaofan-luan commented 8 months ago

Most likely this is due to a bottleneck in datacoord meta. See #30837 and #30823. Since there is a single lock on meta operations, many of the datacoord requests might be blocked by that lock.

xiaofan-luan commented 8 months ago

A pprof would be really helpful. From the current info, datacoord definitely hit timeouts. (screenshot)

yanliang567 commented 8 months ago

@kdabbir I am trying to reproduce this issue in house, but it seems I am hitting a different situation than you. I did not see much memory increase on the datacoord pod, while I saw significant CPU and memory increases on the datanode pods. So, if convenient, could you please share some code snippets to reproduce this issue? /assign @kdabbir

kdabbir commented 8 months ago

@yanliang567 @MrPresent-Han can we have a call to discuss this? I can show my test setup and walk through the logs/dashboards for the issue. Let me know what time works for you tomorrow. The test code alone might not give a full picture of this issue.

kdabbir commented 8 months ago

@jaime0815 @xiaofan-luan I debugged this flow and I can see that the UpdateChannelCheckPoint meta lock in datacoord is causing the issues in the bug I reported. I have seen that this PR is closed.

kdabbir commented 8 months ago

> Could you pprof it when the memory usage is high? Milvus has pprof enabled by default at port 9091.

@xiaofan-luan @yanliang567 @MrPresent-Han Attaching pprof of goroutine and heap, captured when datacoord memory usage was around 8 GB: goroutine and heap compressed.zip

> could you please share some code snippets to reproduce this issue?

Attached is the Python Milvus SDK scale script we are using: milvus_scale_script.py.zip

MrPresent-Han commented 8 months ago

I will take a look at this. The overwhelming memory usage in this case is due to blocked RPC connections: most UpdateChannelCP requests are blocked by the lock and cannot be terminated, so subsequent requests accumulate (reaching 373277), and the memory consumed by those goroutines crushed the datacoord.
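
To make that concrete, here is a toy sketch (plain Python, not Milvus code) of the failure mode: a single global lock shared by all checkpoint-update handlers turns a slow meta update into a growing backlog of blocked handlers, each of which keeps consuming memory until it is served.

```python
import threading
import time

meta_lock = threading.Lock()        # stand-in for datacoord's global meta lock
waiting = 0
waiting_guard = threading.Lock()

def update_channel_checkpoint(channel: str) -> None:
    """Toy handler: every request serializes on the same lock."""
    global waiting
    with waiting_guard:
        waiting += 1
    with meta_lock:
        time.sleep(0.05)            # pretend the meta update is slow
    with waiting_guard:
        waiting -= 1

# Fire requests faster than the lock can serve them: the backlog only grows,
# analogous to the ~373k accumulated UpdateChannelCP requests seen in the pprof.
for i in range(200):
    threading.Thread(target=update_channel_checkpoint,
                     args=(f"channel-{i}",), daemon=True).start()

time.sleep(1)
print(f"handlers still waiting on the lock: {waiting}")
```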

MrPresent-Han commented 8 months ago

@kdabbir, this problem is still caused by the global lock; another PR, https://github.com/milvus-io/milvus/pull/30941, will fix this error.

kdabbir commented 8 months ago

Thanks. When can we expect a release with this fix? I would like to test it out and report back.

MrPresent-Han commented 8 months ago

> Thanks. When can we expect a release with this fix? I would like to test it out and report back.

Version 2.3.11 is on the way; next Tuesday or Wednesday, I suppose.

yanliang567 commented 8 months ago

Milvus v2.3.11 was released last night, please enjoy it. :P

xiaofan-luan commented 8 months ago

/assign @yanliang567 mark it as fixed?

yanliang567 commented 8 months ago

@kdabbir any chance you could try with Milvus 2.3.11?

stale[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.