Open syang1997 opened 2 months ago
quick questions:
- are you running all the query nodes on dedicated underlying hardware?
- do you happen to have any metrics about network and disk performance at that moment?
- please share the complete milvus logs
/assign @congqixia @syang1997
Partial node monitoring metrics @yanliang567
/assign @congqixia please help to take a look
@yanliang567 Who can help with the investigation? Recently it appeared again.
No clues yet. We need more info from the logs to know what was happening at that moment. Please provide the full milvus pod logs for that period.
milvus-log (3).tar.gz @yanliang567 I got the latest debug logs for the p99 timeout. After filtering them, I found proxy log entries showing a long wait for the search result, as follows:
Line 13498: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/task_search.go:436] ["tr/proxy execute search 451345685543387137"] [traceID=02c29e210100a6076c41537ce36985cb] [msg=done] [duration=3.780114072s]
Line 13511: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/impl.go:2634] [tr/Search] [traceID=02c29e210100a6076c41537ce36985cb] [msg="wait search result"] [duration=3.780748241s]
But I did not find a similarly long query time in the QueryNode logs. I initially suspected an MQ problem, but the monitoring shows Pulsar's resource usage was very low and the nodes were not abnormal at the time.
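For anyone reproducing the filtering step, a small script along these lines (a minimal sketch; the duration field format is assumed from the sample lines above, so adjust the pattern if your log layout differs) can pull the slow "wait search result" entries out of the proxy log:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
	"time"
)

func main() {
	// Usage: go run filter_slow_search.go proxy.log
	// Prints every "wait search result" line whose reported duration exceeds 1s.
	// The duration format (e.g. duration=3.780748241s) is assumed from the samples above.
	re := regexp.MustCompile(`duration=([0-9.]+(?:µs|ms|s))`)

	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1<<20), 1<<20) // allow very long log lines
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "wait search result") {
			continue
		}
		if m := re.FindStringSubmatch(line); m != nil {
			if d, err := time.ParseDuration(m[1]); err == nil && d > time.Second {
				fmt.Println(line)
			}
		}
	}
}
```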
@yanliang567 Can you help me analyze the cause of the timeout?
okay, let me check the logs
Hello @syang1997 , could you please provide us the monitoring for wait tsafe latency?
Additionally, please attach metric screenshots from around 2024/07/23 15:05 (±1h); it would be helpful for us to address the issue. @syang1997
@bigsheeper The wait-for-search delay is long, but the QueryNode delay is not.
Is there any way to export a specific time period? Running the log collection script over the full 24-hour window takes a long time and the resulting logs are too large.
@syang1997 The wait tsafe latency monitoring is like this:
I didn't find this panel; which version of the Grafana dashboard are you using?
It's ok, if the querynode search request latency is low, then the wait tsafe latency is likely to be low as well.
@bigsheeper @yanliang567 What more information do you need from me?
@syang1997 quick question, have you detected any network issues while this problem was happening?
Judging from the node network monitoring, there is no problem with the network
what about your host machine CPU? Sometimes a burst on another pod will also affect your performance.
you may also check the go garbage collection and K8s throttle info
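For the K8s throttle part, one quick check is whether the container's CPU cgroup reports throttling even though average CPU looks low. A minimal sketch, assuming the cgroup v1 layout (on cgroup v2 the file is /sys/fs/cgroup/cpu.stat and the counter is throttled_usec):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// cgroup v1 path inside the container; cgroup v2 exposes /sys/fs/cgroup/cpu.stat instead.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.stat")
	if err != nil {
		panic(err)
	}
	// If nr_throttled / throttled_time keep growing, the pod is being CPU-throttled
	// by its limit even when average CPU utilization looks low.
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		if strings.HasPrefix(line, "nr_periods") ||
			strings.HasPrefix(line, "nr_throttled") ||
			strings.HasPrefix(line, "throttled_time") {
			fmt.Println(line)
		}
	}
}
```

Go GC pauses can be surfaced separately by running the pod with GODEBUG=gctrace=1 or by checking the runtime panels in the Milvus dashboard.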
The nodes' metrics are normal: CPU utilization is very low and network bandwidth usage is not high. I also uploaded node monitoring information earlier.
I checked the GC status of the Milvus components in monitoring and found no problems. As for k8s resources, there is no overcommitment or throttling.
@yanliang567 My preliminary judgment is that the problem lies with Pulsar, but Pulsar's CPU and memory usage are very low. Could it be caused by Pulsar's PVC being exhausted and cleaned up? I haven't checked Pulsar's PVC usage yet.
Hello @syang1997 , it seems your monitoring dashboard version is quite outdated. Could you please update your dashboard by importing this configuration file? (https://github.com/milvus-io/milvus/blob/master/deployments/monitor/grafana/milvus-dashboard.json) Then, provide us with the updated monitoring information to help us further investigate the issue. Thank you very much.
Okay, I have deployed the latest panel. What metric information do I need to provide?
@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged.
Okay, please do so as soon as possible, because the monitoring data is only retained for 7 days.
@bigsheeper I upgraded Grafana. Which monitoring metrics do I need to provide?
@syang1997 It would be best to have all the dashboards for the proxy and querynode.
@bigsheeper
@syang1997 The monitoring looks fine; I still think it might be a network issue between the proxy and the querynode. I mean stability, not throughput.
I found some warning logs that back up my point. These logs appeared during a slow search and show that the proxy timed out connecting to querynode-53 (address=10.37.113.183:21123).
[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:408] ["call received grpc error"] [clientRole=querynode-53] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [retry/retry.go:102] ["retry func failed"] [retried=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:554] ["ClientBase Call grpc call get error"] [role=querynode-53] [address=10.37.113.183:21123] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n/usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567\n | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88\n | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102\n | github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 
github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n | /go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n | /go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]
Please check if there are frequent connection timeout logs with querynode-53 during slow search. If so, I suggest killing this querynode and redeploying it on a different physical machine. @syang1997
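For context, the DeadlineExceeded in those warnings is the standard gRPC behavior when a call's context deadline expires before the peer responds. A minimal sketch of the same pattern (it uses the generic gRPC health service rather than Milvus's GetComponentStates RPC, and the address is just copied from the warnings above):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/status"
)

func main() {
	// Address copied from the warning logs above.
	const target = "10.37.113.183:21123"

	conn, err := grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// A short per-call deadline, similar to the proxy's health probe.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	start := time.Now()
	_, err = grpc_health_v1.NewHealthClient(conn).Check(ctx, &grpc_health_v1.HealthCheckRequest{})
	st, _ := status.FromError(err)
	// If the network stalls past the deadline, the call fails with code
	// DeadlineExceeded -- the same error shown in the proxy warnings.
	fmt.Printf("health check took %v, code=%v\n", time.Since(start), st.Code())
}
```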
I thought the 2.4 architecture sends requests directly from Proxy to QueryNode, while 2.3 goes through MQ. My Milvus version is 2.3.15. @bigsheeper
@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself.
Yes, according to the monitoring information the timeout occurs in the proxy-to-querynode path rather than in the Search operation itself. So for 2.3, my guess is that the problem is in the transmission through the MQ.
@bigsheeper So does the log mean the request is sent directly from the proxy to the QueryNode?
From the monitoring data, the MQ has no effect on the slow search, so there is no need to suspect the MQ for this issue.
Yep, in milvus, search requests are sent directly from proxy to querynode, so the problem is due to network instability between the proxy and the querynode. @syang1997
Okay, I will continue to investigate in this direction.
@bigsheeper The query link in the 2.3 architecture diagram is easy to misunderstand. Could you consider clarifying it?
@bigsheeper Recently, there was another timeout, and only one request was abnormal. Here is the monitoring information and logs. We contacted the cloud provider to investigate, and they did not find any network abnormality. Do you have a clue?
I only saw errors on the gRPC link in the Proxy log, and I did not find any abnormal logs in the QueryNode. There were only three gRPC error log lines for that trace, but from the timestamps and the monitoring this request took up to 3 seconds.
Yep, still the network issue; because the connection timeout is set to 3 seconds.
@bigsheeper The result of the cloud provider's investigation is that there is no problem with the network, which is giving us a real headache.
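One way to sanity-check the link independently of the cloud provider is to run a small probe from the proxy's host against the querynode address seen in the warnings and watch for dials that stall around the problem window. A rough sketch (the address and the slow-dial threshold are only illustrative):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Querynode address taken from the warning logs above; adjust as needed.
	const addr = "10.37.113.183:21123"
	const slow = 500 * time.Millisecond

	// Dial once per second and report failed or unusually slow connection attempts.
	for {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		elapsed := time.Since(start)
		now := time.Now().Format(time.RFC3339)
		if err != nil {
			fmt.Printf("%s dial failed after %v: %v\n", now, elapsed, err)
		} else {
			if elapsed > slow {
				fmt.Printf("%s slow dial: %v\n", now, elapsed)
			}
			conn.Close()
		}
		time.Sleep(time.Second)
	}
}
```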
@syang1997 do you want to set up an offline meeting so we can get more details to help?
My email xiaofan.luan@zilliz.com
from the monitoring metrics, it's hard to find more information; deploying the new tracing might be helpful
Very willing; I hope we can arrange it as soon as possible.
@xiaofan-luan I sent an email to you, looking forward to your reply
proxy-57.log querynode-70.log querynode-71.log
traceID=e8d19598e7a4b42dfaddd2ea28565acd
Is there an existing issue for this?
Environment
Current Behavior
While requests were running smoothly, p99 suddenly increased to 15k ms, but resources were sufficient and CPU and memory usage were low. What could be the reason?
The following is the monitoring
The following is the querynode log
Expected Behavior
No response
Steps To Reproduce
Milvus Log
No response
Anything else?
No response