milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Database crashing into a degraded state #36217

Open nairan-deshaw opened 2 months ago

nairan-deshaw commented 2 months ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.4.7
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0
- OS(Ubuntu or CentOS): RHEL
- CPU/Memory: 512 CPU / 1TB K8s cluster
- GPU: NA
- Others: NA

Current Behavior

Since upgrading our Milvus deployment to v2.4.7, we have been facing crashes where the database goes into a degraded state. This has happened twice in the last 30 days. The component pods stay up, but none of the DB operations work. The Attu UI gives gRPC "context deadline exceeded" errors. Usually restarting the mixcoord component fixes the issue.

We took a deeper look at the logs from before and during the issue, but could not find a trigger point or an explanation for why the database was unable to recover from the failures. Attaching the logs for the window of this issue. Can you help with what could be causing this, and why the mixcoord needs to be restarted for things to become functional again?

Expected Behavior

Database should continue working normally

Steps To Reproduce

We have not found a reliable reproducer, but this has been happening on v2.4.7.

Milvus Log

lines.txt

Anything else?

From the logs, the errors continue to cascade and the database becomes inoperable. While we do see the same error logs every now and then, they do not cause DB crashes every time.

yanliang567 commented 2 months ago

@nairan-deshaw quick questions:

  1. Which version of Milvus did you upgrade from?
  2. The logs you attached seem to contain only rootcoord pod info from after the crash, but we need the complete logs from all Milvus pods. Could you please help collect the logs with the tool below after the crash happens? Please refer to this doc to export the whole set of Milvus logs for investigation (with this script, it also collects the logs from before the crash, which helps us understand what caused it); a rough collection sketch follows below.
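
For illustration only (the official export script linked above is the preferred route): a minimal sketch of pulling logs from every pod of a Milvus release with the Kubernetes Python client. The namespace, release name, and the `app.kubernetes.io/instance` label selector are assumptions; adjust them to your deployment.

```python
# Hedged sketch: dump current logs from all containers of a Milvus release's pods.
from kubernetes import client, config

NAMESPACE = "milvus"      # assumption: your namespace
RELEASE = "my-release"    # assumption: your Helm release / instance name

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    NAMESPACE, label_selector=f"app.kubernetes.io/instance={RELEASE}"
)
for pod in pods.items:
    for container in pod.spec.containers:
        # pass previous=True instead to fetch logs from a restarted container
        log = v1.read_namespaced_pod_log(
            pod.metadata.name, NAMESPACE, container=container.name, timestamps=True
        )
        with open(f"{pod.metadata.name}_{container.name}.log", "w") as out:
            out.write(log)
```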

/assign @nairan-deshaw

nairan-deshaw commented 2 months ago
yanliang567 commented 2 months ago

@nairan-deshaw logs from the 30 min window before the crash are okay, but I suggest setting the Milvus log level to debug to provide more info to us. In case you need to know how to change the Milvus config: https://milvus.io/docs/configure-helm.md?tab=component

nairan-deshaw commented 2 months ago

@yanliang567 attaching a file with all the logs (DEBUG level) in the 30 min window. Let me know if that helps. logs.tar.gz

yanliang567 commented 2 months ago

/assign @weiliu1031
please help to take a look
/unassign

xiaofan-luan commented 2 months ago

> [quotes the original issue description above in full]

I think from the current log there is no obvious cause, except for some timeouts.

What do you mean by "none of the DB operations work"? Does that mean all the coords crashed?

Could you try pprof and capture goroutine info and CPU usage of the coordinator so we can analyze it?

nairan-deshaw commented 2 months ago

> What do you mean by "none of the DB operations work"? Does that mean all the coords crashed?

The inserts, queries, etc. all time out and we end up seeing the below errors on the client side:

E0819 01:13:41.110673102 1719680 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:14:41.120204252 1720476 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:15:41.133475728 1721014 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:16:41.166404019 1721536 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:17:41.259244512 1722012 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
[20240819 01:17:41:259 decorators.py:100 WARNING] [release_partitions] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>

The mixcoord itself does not crash, which is why k8s does not restart the pod; instead it goes into a degraded state from which it does not recover. In addition, Attu also becomes unresponsive and keeps loading for a long time before erroring out with "context deadline exceeded".
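
For illustration, a minimal pymilvus sketch of the kind of lightweight call that also times out in this state; the host, port, and timeout values are placeholders, not taken from the original report.

```python
# Hedged sketch: probe the cluster with a bounded timeout to confirm the hang from the client side.
from pymilvus import connections, utility

connections.connect("default", host="milvus.example.internal", port="19530")  # placeholder address

try:
    # even this metadata-only call exceeds its deadline when the coords are degraded
    print(utility.list_collections(timeout=5))
except Exception as exc:
    print("call failed or timed out:", exc)
```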

> Could you try pprof and capture goroutine info and CPU usage of the coordinator so we can analyze it?

Any steps / references on how to do this?

xiaofan-luan commented 2 months ago

To get pprof, check this script: https://github.com/milvus-io/milvus/blob/3b01b7dc9a2c72bdfcca87bc0420ae9121f41f3f/tests/scripts/export_pprof_goroutine.sh#L26

These commands capture goroutine, mutex, CPU profile, and block info, which will give us more details to analyze:

curl -o goroutine.txt http://10.28.191.52:9091/debug/pprof/goroutine
curl -o mutex.txt http://10.28.191.52:9091/debug/pprof/mutex
curl -o profile.txt http://10.28.191.52:9091/debug/pprof/profile
curl -o block.txt http://10.28.191.52:9091/debug/pprof/block

To collect logs, check this script: https://github.com/milvus-io/milvus/tree/master/deployments/export-log

Most likely there is a very slow operation in the coordinator, but we need to know what that operation is; most likely it will be etcd.
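
For illustration, a hedged Python equivalent of the curl commands above that also saves a human-readable goroutine dump (`?debug=2` and `?seconds=30` are standard net/http/pprof parameters); the host and port are placeholders copied from the example commands.

```python
# Hedged sketch: fetch pprof data from the coordinator's metrics port (placeholder address).
import urllib.request

BASE = "http://10.28.191.52:9091/debug/pprof"

targets = {
    "goroutine.pb.gz": "/goroutine",
    "goroutine.txt": "/goroutine?debug=2",   # plain-text stack dump of every goroutine
    "mutex.pb.gz": "/mutex",
    "block.pb.gz": "/block",
    "profile.pb.gz": "/profile?seconds=30",  # 30-second CPU profile (the request blocks that long)
}
for filename, path in targets.items():
    with urllib.request.urlopen(BASE + path) as resp, open(filename, "wb") as out:
        out.write(resp.read())
    print("saved", filename)
```

The binary profiles can then be inspected offline with `go tool pprof <file>`.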

nairan-deshaw commented 1 month ago

Hi @xiaofan-luan, attaching the logs for the suggested endpoints from the current mixcoord component. Let me know if that helps or if more logs are needed: logs.tar.gz

xiaofan-luan commented 1 month ago

From the logs there isn't anything suspicious except for one index node failure: [datacoord/indexnode_manager.go:131] ["get IndexNode slots failed"] [nodeID=5293]

Is there any warn/error or any node crash?

From the initial log, the query filters on many section_ids, and this is not recommended (it could be very slow). There are many timeouts at that moment, but we don't have enough detail to tell whether this is a querynode timeout, a crash, or something else.

xiaofan-luan commented 1 month ago

Right now the datacoord works fine; CPU usage is high on the operation shown in the profile below:

image

One guess is that we have a file-leak bug, see https://github.com/milvus-io/milvus/pull/36161. What might happen is that after running for a while there is enough leakage to make the datacoord slower. This is fixed in 2.4.12.

My question is: how many collections/partitions are there in your use case? The files seem to be accumulating very fast, and we don't see this happen even on large clusters.

nairan-deshaw commented 1 month ago

> Is there any warn/error or any node crash?

We checked that no other pod crashes right before this state happens.

> How many collections/partitions are there in your use case?

We have 12 collections with about 1300 partitions, which increase at regular intervals. We currently have under 150 million embeddings.
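
For reference, a hedged pymilvus sketch of how one might tally the collections and partitions reported above; the connection details are placeholders.

```python
# Hedged sketch: count collections and partitions in the cluster.
from pymilvus import Collection, connections, utility

connections.connect("default", host="milvus.example.internal", port="19530")  # placeholder address

names = utility.list_collections()
total_partitions = sum(len(Collection(name).partitions) for name in names)
print(f"{len(names)} collections, {total_partitions} partitions")
```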

xiaofan-luan commented 1 month ago

> Is there any warn/error or any node crash?
>
> We checked that no other pod crashes right before this state happens.
>
> How many collections/partitions are there in your use case?
>
> We have 12 collections with about 1300 partitions, which increase at regular intervals. We currently have under 150 million embeddings.

Try upgrading to 2.4.12 and see if it helps. Maybe also try increasing the resources of the coords. Did you see any increase in mixcoord's CPU usage?

nairan-deshaw commented 1 month ago

> Did you see any increase in mixcoord's CPU usage?

We did see some normal spikes on the pods, but nothing unusual. Currently we have allocated 4 CPUs and 8 GB for the mixcoord.

We will try upgrading to v2.4.12. In addition, we are planning to deploy a standby mixcoord. Will that provide any benefits in this scenario? @xiaofan-luan

xiaofan-luan commented 1 month ago

> Did you see any increase in mixcoord's CPU usage?
>
> We did see some normal spikes on the pods, but nothing unusual. Currently we have allocated 4 CPUs and 8 GB for the mixcoord.
>
> We will try upgrading to v2.4.12. In addition, we are planning to deploy a standby mixcoord. Will that provide any benefits in this scenario? @xiaofan-luan

I doubt that adding a standby mixcoord would help. We probably need to figure out the reason for the panic before taking the next step.

Please check:

  1. Are there any warn/error logs when the coord exits?
  2. Is there any etcd slow message? (typical markers are sketched below)
  3. Why is the CPU or memory usage high? Is there a trend of something going up? That might be a signal of leakage.
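
For item 2, a hedged sketch of scanning a coordinator or etcd log file for the warnings that usually indicate a slow etcd; the marker strings are the common etcd slow-request messages, and the file path is a placeholder.

```python
# Hedged sketch: flag lines that suggest etcd slowness in an exported log file.
import sys

SLOW_MARKERS = (
    "apply request took too long",                    # etcd: slow apply
    "slow fdatasync",                                 # etcd: slow disk sync
    "waiting for ReadIndex response took too long",   # etcd: slow linearizable read
    "etcdserver: request timed out",                  # etcd timeout surfaced to clients
)

path = sys.argv[1] if len(sys.argv) > 1 else "rootcoord.log"  # placeholder file name
with open(path, errors="replace") as logfile:
    for lineno, line in enumerate(logfile, 1):
        if any(marker in line for marker in SLOW_MARKERS):
            print(f"{path}:{lineno}: {line.rstrip()}")
```
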
nairan-deshaw commented 1 month ago

> Are there any warn/error logs when the coord exits?

Logs before and during the crash are attached in https://github.com/milvus-io/milvus/issues/36217#issuecomment-2357502225

> Is there any etcd slow message?

Any specific filters for looking at these logs?

> Why is the CPU or memory usage high? Is there a trend of something going up? That might be a signal of leakage.

Below are the usage patterns for CPU and memory from our cluster during the crash window:

image

image

xiaofan-luan commented 1 month ago

> [quotes the previous comment in full, including the CPU and memory usage screenshots]

Are you using bulk insertions for all the inserted data?

nairan-deshaw commented 1 month ago

Yes, we are using bulk inserts.
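
For context, a hedged sketch of the pymilvus bulk-insert path (collection name and file are placeholders); each import task is tracked by the coordinator, which is where the imported-segment leak mentioned below shows up.

```python
# Hedged sketch: submit a bulk-insert (import) task and check its state.
from pymilvus import connections, utility

connections.connect("default", host="milvus.example.internal", port="19530")  # placeholder address

# the source file must already exist in the object storage bucket Milvus is configured with
task_id = utility.do_bulk_insert(
    collection_name="example_collection",  # placeholder
    files=["example_data.json"],           # placeholder
)
print(utility.get_bulk_insert_state(task_id))
```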

xiaofan-luan commented 1 month ago

This might be a known bug where some of the imported segment info is leaked and CPU usage accumulates over time until the coord reboots. @bigsheeper is working on solving it.

stale[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

yanliang567 commented 1 week ago

@bigsheeper were all the fixes for this issue merged? Is it related to a PR that we reverted recently?

bigsheeper commented 1 week ago

> Were all the fixes for this issue merged?

Yes, the imported segment leak issue has been resolved, which should help prevent high CPU usage in mixcoord.

> Is it related to a PR that we reverted recently?

Nope.

nairan-deshaw commented 1 week ago

Thanks, does 2.4.15 have the potential fix in place?