nairan-deshaw opened this issue 2 months ago
@nairan-deshaw quick questions: could you please attach logs from all pods in the said window?
/assign @nairan-deshaw
@nairan-deshaw logs in a 30 min window before the crash are okay, but I suggest setting the Milvus log level to debug to provide more info to us. In case you need to know how to change the Milvus config: https://milvus.io/docs/configure-helm.md?tab=component
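For reference, a minimal sketch of turning on debug logs with the milvus-helm chart could look like the line below; the release name, namespace, and the log.level values key are assumptions, so double-check them against the doc linked above.
# bump Milvus log verbosity to debug across components (values key assumed from the chart defaults)
helm upgrade my-release milvus/milvus -n milvus --reuse-values --set log.level=debug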
@yanliang567 attaching a file with all the logs (DEBUG level) in the 30 min window. Let me know if that helps.
logs.tar.gz
/assign @weiliu1031 please help to take a look /unassign
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: 2.4.7
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0
- OS(Ubuntu or CentOS): RHEL
- CPU/Memory: 512 CPU / 1TB K8s cluster
- GPU: NA
- Others: NA
Current Behavior
Since upgrading our Milvus deployment to v2.4.7, we are facing crashes where the database goes into a degraded state. This has happened twice in the last 30 days. The components themselves stay functional, but none of the DB operations work. The Attu UI gives gRPC context deadline exceeded errors. Usually restarting the mixcoord component fixes this issue.
While we took a deeper look at the logs prior to and during the issue, we could not find a trigger point or why the database was not able to recover from the failures. Attaching the logs for the window with this issue. Can you help with what could be causing this and why the database needs the mixcoord to be restarted for things to be functional again?
Expected Behavior
Database should continue working normally
Steps To Reproduce
Reproducers are unclear, but it has been happening on v2.4.7.
Milvus Log
lines.txt
Anything else?
From the logs, the errors continue to cascade and the database becomes inoperable. While we do see the same error logs every now and then, they do not cause DB crashes every time.
I think from the current log there is no obvious reason for the failure, except for some timeouts.
What do you mean by "none of the DB operations work"? Does that mean all the coords crashed?
Could you try pprof and capture goroutine info and CPU usage of the coordinator so we can analyze it?
What do you mean by "none of the DB operations work"? Does that mean all the coords crashed?
The inserts, queries, etc. all time out and we end up seeing the below errors on the client side:
E0819 01:13:41.110673102 1719680 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0819 01:14:41.120204252 1720476 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0819 01:15:41.133475728 1721014 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0819 01:16:41.166404019 1721536 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
E0819 01:17:41.259244512 1722012 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
[20240819 01:17:41:259 decorators.py:100 WARNING] [release_partitions] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>
The mixcoord itself does not crash, which is why k8s does not restart the pod; instead it goes into a degraded state from which it cannot recover. In addition to that, Attu also becomes unresponsive and keeps loading for a long time before erroring out with Context deadline exceeded.
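For reference, one way to do that restart manually is a rollout restart of the mixcoord workload, roughly like the commands below; the deployment name is an assumption based on the default helm naming.
# restart only the mixcoord component and wait for it to come back (deployment name assumed)
kubectl -n milvus rollout restart deployment my-release-milvus-mixcoord
kubectl -n milvus rollout status deployment my-release-milvus-mixcoord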
Could you try pprof and capture goroutine info and CPU usage of the coordinator so we can analyze it?
Any steps / references on how to do this?
To get pprof data, check this script: https://github.com/milvus-io/milvus/blob/3b01b7dc9a2c72bdfcca87bc0420ae9121f41f3f/tests/scripts/export_pprof_goroutine.sh#L26
The following commands collect goroutine, mutex, CPU profiling, and blocking-point data, which will give us more details to analyze:
curl -o goroutine.txt http://10.28.191.52:9091/debug/pprof/goroutine
curl -o mutex.txt http://10.28.191.52:9091/debug/pprof/mutex
curl -o profile.txt http://10.28.191.52:9091/debug/pprof/profile
curl -o block.txt http://10.28.191.52:9091/debug/pprof/block
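Once collected, the profiles can be summarized locally with the Go toolchain, for example (a sketch, assuming the files above were fetched from the coordinator's pprof port as shown):
# top functions in the CPU profile
go tool pprof -top profile.txt
# goroutine counts grouped by stack, useful for spotting anything stuck or leaking
go tool pprof -top goroutine.txt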
To collect logs, check the scripts at https://github.com/milvus-io/milvus/tree/master/deployments/export-log
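A hedged example of running it (the export-milvus-log.sh entry point and the -i flag are assumptions based on that directory's README; adjust the release name):
# export logs from all pods of the Milvus release named my-release
./export-milvus-log.sh -i my-release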
Most likely there is a very slow operation in the coordinator, but we need to know what the operation is. Most likely this will be etcd.
Hi @xiaofan-luan, attaching the logs for the suggested endpoints from the current mixcoord component. Let me know if that helps or if more logs are needed: logs.tar.gz
From the log, there isn't anything suspicious except for one index node failure:
[datacoord/indexnode_manager.go:131] ["get IndexNode slots failed"] [nodeID=5293]
Is there any warn/error or any node crash?
From the initial log, the query filters on many section_ids, which is not recommended (it could be very slow). There are many timeouts at that moment, but we don't have many details, so we have no idea whether this is a querynode timeout, a crash, or something else.
right now the datacoord works fine, cpu usage is high on doing
One guess is that we have a bug leaking files, see https://github.com/milvus-io/milvus/pull/36161. What might happen is that after running for a while there may be a lot of leakage, causing the datacoord to become slower. This is fixed in 2.4.12.
My question is: how many collections/partitions are there in your use case? The files seem to be accumulating very fast, because we don't see this happen even on large clusters.
Is there any warn/error or any node crash?
Checked that no other pod crashes right before this state happens
how many collections/partitions are there in your use case?
We have 12 collections and about 1300 partitions, which increase at regular intervals. We currently have under 150 million embeddings.
Try upgrading to 2.4.12 and see if it helps. Maybe also try to increase the resources of the coords. Did you see any increase in mixcoord's CPU usage?
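An in-place upgrade with the helm chart would look roughly like the line below; the release name, namespace, and the image.all.tag values key are assumptions, so verify them against your values file first.
# bump only the Milvus image tag to 2.4.12, keeping all existing values
helm upgrade my-release milvus/milvus -n milvus --reuse-values --set image.all.tag=v2.4.12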
Did you see any increase in mixcoord's CPU usage?
We did see some normal spikes on the pods, but nothing unusual. Currently we have allocated 4 CPUs and 8 GB of memory for the mixcoord.
We will try upgrading to v2.4.12. In addition, we are planning to deploy a standby mixcoord. Will that provide any benefits in this scenario? @xiaofan-luan
I doubt that adding a standby mixcoord would help. We probably need to figure out the reason for the panic before taking the next step.
Please check:
- if there are any warn/error logs when the coord exits
- if there are any etcd slow messages
- why CPU or memory usage is high. Is there a trend of something going up? That might be a signal of a leak.
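For the etcd and resource checks, a rough starting point could be the commands below; the grep patterns and namespace are assumptions rather than exact Milvus message strings.
# scan the exported logs for anything that looks like etcd slowness or timeouts
grep -riE "slow|took too long|etcdserver|context deadline exceeded" ./milvus-logs/
# spot-check coordinator CPU/memory (assumes metrics-server; use dashboards for the long-term trend)
kubectl -n milvus top pod | grep -i coord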
if there are any warn/error logs when the coord exits
Logs before and during the crash are attached in https://github.com/milvus-io/milvus/issues/36217#issuecomment-2357502225
if there are any etcd slow messages
Any specific filters for looking at these logs?
why CPU or memory usage is high. Is there a trend of something going up?
Below are the usage patterns for CPU and memory from our cluster during the crash window:
Are you using bulk insertions for all the inserted data?
Yes, we are using bulk inserts.
This might be a known bug: some of the imported segment info is leaked, and the CPU cost accumulates for a while until the coord reboots. @bigsheeper is working on solving it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
@bigsheeper were all the fixes for this issue merged? Is it related to a PR that we reverted recently?
@bigsheeper were all the fixes for this issue merged?
Yes, the imported segment leak issue has been resolved, which should help prevent high CPU usage in mixcoord.
Is it related to a PR that we reverted recently?
Nope.
Thanks, does 2.4.15 have the potential fix in place?