milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Very high p99 with low requests and sufficient resources #34476

Open syang1997 opened 2 months ago

syang1997 commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: v2.3.15
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):  Java sdk v2.3.4
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: query node 8c 16g
- GPU: 
- Others:

Current Behavior

While request traffic was steady, p99 latency suddenly spiked to about 15k ms, even though resources were sufficient and CPU and memory usage were low. What could be the reason?

The following is the monitoring:

(monitoring screenshots attached)

The following is the querynode log:

[newVersion=1720315568890138213] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:18.893 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315558893267636] [newVersion=1720315568890093833] [growingSegmentNum=0] [sealedSegmentNum=2]
[2024/07/07 01:26:28.887 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=5f972dd954fb612893c39b167c22d19c] [collectionID=449980176319159710] [channel=milvus-c-customer-service-rootcoord-dml_4_449980176319159710v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895106277]
[2024/07/07 01:26:28.887 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889907859] [newVersion=1720315578895106277] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=f99fdc60cdb2dfa46725f861cad74639] [collectionID=449980176295686266] [channel=milvus-c-customer-service-rootcoord-dml_10_449980176295686266v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895383796]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=37a70ecd2a87c776cf6b344bec75bfc4] [collectionID=449980176309032169] [channel=milvus-c-customer-service-rootcoord-dml_15_449980176309032169v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578894055581]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890093833] [newVersion=1720315578894055581] [growingSegmentNum=0] [sealedSegmentNum=2]
[2024/07/07 01:26:28.888 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890025022] [newVersion=1720315578895383796] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.889 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=70147845a02f2277c5442f83e17b61f5] [collectionID=449980176317610505] [channel=milvus-c-customer-service-rootcoord-dml_3_449980176317610505v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578894246706]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889805018] [newVersion=1720315578894246706] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=a588afc25adcb0b5d29fb43e1eaeb6a3] [collectionID=449980176308582545] [channel=milvus-c-customer-service-rootcoord-dml_13_449980176308582545v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895006088]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568890138213] [newVersion=1720315578895006088] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=dd7f1cc82b6dcd01add1589bb330dec3] [collectionID=449980176316006753] [channel=milvus-c-customer-service-rootcoord-dml_1_449980176316006753v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895190401]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889576391] [newVersion=1720315578895190401] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=6a896409a6ffd785eac14e3e97cdd30b] [collectionID=450884966347940353] [channel=milvus-c-customer-service-rootcoord-dml_2_450884966347940353v0] [currentNodeID=39] [Action=UpdateVersion] [TargetVersion=1720315578895063008]
[2024/07/07 01:26:28.889 +00:00] [INFO] [delegator/distribution.go:294] ["Update readable segment version"] [oldVersion=1720315568889949859] [newVersion=1720315578895063008] [growingSegmentNum=0] [sealedSegmentNum=1]
[2024/07/07 01:26:28.888 +00:00] [INFO] [querynodev2/services.go:1385] ["sync action"] [traceID=798958dbfb9c98e936505940cacb1e67] [collectionID=449980176314540038] [channel=milvus-c-customer-service-rootcoord-dml_0_449980176314540038v0] [currentNodeID=39] 

Expected Behavior

No response

Steps To Reproduce

The problem occurs occasionally and cannot be reproduced on demand.

Milvus Log

No response

Anything else?

No response

yanliang567 commented 2 months ago

quick questions:

  1. are you running all the query nodes on dedicated (exclusive) underlying hardware?
  2. do you happen to have any metrics about network and disk performance at that moment?
  3. please share the complete milvus logs

/assign @congqixia @syang1997

syang1997 commented 2 months ago

quick questions:

  1. are you running all the query nodes on dedicated (exclusive) underlying hardware?
  2. do you happen to have any metrics about network and disk performance at that moment?
  3. please share the complete milvus logs

/assign @congqixia @syang1997

  1. The k8s cluster is dedicated to Milvus, and the underlying node of this querynode shows no fluctuation in CPU usage; resources are available.
  2. We use an in-memory index, and no writes or other operations were performed during the abnormal time period. No obvious abnormalities were found on the network.
  3. There are too many logs across all the components. I checked the querynode logs and found nothing valuable. Tell me which component logs you need to narrow this down, and I can extract the logs for that time period and upload them.

syang1997 commented 2 months ago

Some node-level monitoring metrics @yanliang567 (screenshots attached)

yanliang567 commented 2 months ago

/assign @congqixia please help to take a look

syang1997 commented 2 months ago

@yanliang567 Who can help with the investigation? It appeared again recently.

yanliang567 commented 2 months ago

No clues yet. We need more information from the logs to know what was happening at that moment. Please provide the full Milvus pod logs for that period.

syang1997 commented 1 month ago

milvus-log (3).tar.gz @yanliang567 I captured the latest debug log of a p99 timeout. Filtering the log information, I found proxy log entries that show a long wait for the search result, as follows:

Line 13498: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/task_search.go:436] ["tr/proxy execute search 451345685543387137"] [traceID=02c29e210100a6076c41537ce36985cb] [msg=done] [duration=3.780114072s]
Line 13511: [2024/07/23 15:05:56.323 +00:00] [DEBUG] [proxy/impl.go:2634] [tr/Search] [traceID=02c29e210100a6076c41537ce36985cb] [msg="wait search result"] [duration=3.780748241s]

But I could not find a similarly long query time in the QueryNode logs. I initially suspected an MQ problem, but Pulsar's monitored resource usage was very low and its nodes showed no abnormalities at the time.
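
For reference, a minimal Go sketch (not part of this thread) of this kind of log triage: scan a proxy log for "wait search result" entries whose duration exceeds a threshold and print their traceIDs, so the same traces can then be looked up in the querynode logs. The file name, threshold, and regular expressions are illustrative assumptions.

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
	"time"
)

func main() {
	const threshold = 2 * time.Second // assumed cutoff for a "slow" search
	traceRe := regexp.MustCompile(`traceID=([0-9a-f]+)`)
	durRe := regexp.MustCompile(`duration=([0-9.]+[a-zµ]+)`)

	f, err := os.Open("proxy.log") // hypothetical file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // some proxy log lines are very long
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, `msg="wait search result"`) {
			continue // only look at the proxy's wait-for-result entries
		}
		m := durRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil || d < threshold {
			continue
		}
		trace := ""
		if t := traceRe.FindStringSubmatch(line); t != nil {
			trace = t[1]
		}
		fmt.Printf("slow search: traceID=%s duration=%s\n", trace, d)
	}
}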

syang1997 commented 1 month ago

@yanliang567 Can you help me analyze the cause of the timeout?

yanliang567 commented 1 month ago

okay, let me check the logs

bigsheeper commented 1 month ago

Hello @syang1997, could you please provide us with the monitoring for the wait tsafe latency?

syang1997 commented 1 month ago

Hello @syang1997, could you please provide us with the monitoring for the wait tsafe latency?

(screenshots attached)

yanliang567 commented 1 month ago

Additionally, please attach metric screenshots around 2024/07/23 15:05 (±1h); that would be helpful for us to address the issue. @syang1997

syang1997 commented 1 month ago

@bigsheeper The wait-for-search-result delay is long, but the QueryNode delay is not.

syang1997 commented 1 month ago

Additionally, please attach metric screenshots around 2024/07/23 15:05 (±1h); that would be helpful for us to address the issue. @syang1997

Is there a way to export logs for a specified time period? Running the log collection script takes a long time, and a full 24-hour log is too large.

bigsheeper commented 1 month ago

@bigsheeper The wait-for-search-result delay is long, but the QueryNode delay is not.

@syang1997 The wait tsafe latency monitoring looks like this:

(screenshot attached)

syang1997 commented 1 month ago

@bigsheeper The wait-for-search-result delay is long, but the QueryNode delay is not. @syang1997 The wait tsafe latency monitoring looks like this:

(screenshot)

I can't find this panel. Which version of the Grafana dashboard are you using?

bigsheeper commented 1 month ago

I can't find this panel. Which version of the Grafana dashboard are you using?

That's OK; if the querynode search request latency is low, then the wait tsafe latency is likely to be low as well.

syang1997 commented 1 month ago

@bigsheeper @yanliang567 What more information do you need from me?

congqixia commented 1 month ago

@syang1997 Quick question: did you detect any network issues while this problem was happening?

syang1997 commented 1 month ago

Quick question: did you detect any network issues while this problem was happening?

Judging from the node network monitoring, there is no problem with the network

xiaofan-luan commented 1 month ago

What about your host machine CPU? Sometimes a burst on another pod can also affect your performance.

xiaofan-luan commented 1 month ago

You may also check the Go garbage collection and K8s throttling info.
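
As a concrete (hedged) way to check the K8s throttling part of this suggestion, the sketch below reads the container's cgroup cpu.stat from inside the querynode pod (for example via kubectl exec). A growing nr_throttled count means CFS is throttling the pod even when average CPU usage looks low. The two paths cover cgroup v2 and v1 and are assumptions about the node's cgroup layout.

package main

import (
	"fmt"
	"os"
)

func main() {
	// Candidate locations for the CPU controller statistics; which one
	// exists depends on whether the node runs cgroup v2 or v1 (assumed paths).
	paths := []string{
		"/sys/fs/cgroup/cpu.stat",     // cgroup v2
		"/sys/fs/cgroup/cpu/cpu.stat", // cgroup v1
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		// Watch nr_throttled and throttled_usec/throttled_time: if they
		// increase between samples, the pod is being CPU-throttled.
		fmt.Printf("%s:\n%s", p, data)
		return
	}
	fmt.Println("no cpu.stat found under the assumed paths")
}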

syang1997 commented 1 month ago

What about your host machine CPU? Sometimes a burst on another pod can also affect your performance.

The nodes' indicators are normal: CPU utilization is very low and network bandwidth usage is not high. I also uploaded node monitoring information earlier.

syang1997 commented 1 month ago

You may also check the Go garbage collection and K8s throttling info.

I checked the GC status of the Milvus components in the monitoring and found no problems. As for k8s resources, nothing is oversubscribed or throttled.

syang1997 commented 1 month ago

@yanliang567 My preliminary judgment is that the problem lies with Pulsar, but Pulsar's CPU and memory usage are very low. Could it be caused by Pulsar's PVC being exhausted and cleaned up? I haven't checked Pulsar's PVC usage yet.

syang1997 commented 1 month ago

Hello @syang1997 , it seems your monitoring dashboard version is quite outdated. Could you please update your dashboard by importing this configuration file? (https://github.com/milvus-io/milvus/blob/master/deployments/monitor/grafana/milvus-dashboard.json) Then, provide us with the updated monitoring information to help us further investigate the issue. Thank you very much.

Okay, I have deployed the latest dashboard. Which metrics do you need from me?

bigsheeper commented 1 month ago

Okay, I have deployed the latest dashboard. Which metrics do you need from me?

@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged.

syang1997 commented 1 month ago

Okay, I have deployed the latest dashboard. Which metrics do you need from me?

@syang1997 Sorry, our colleague is currently submitting a PR to the master branch to update the configuration. We will notify you once the PR is merged.

Okay, please do so as soon as possible, because the monitoring data is only retained for 7 days.

syang1997 commented 1 month ago

@bigsheeper I upgraded Grafana. Which monitoring metrics do I need to provide?

bigsheeper commented 1 month ago

@bigsheeper I upgraded Grafana. Which monitoring metrics do I need to provide?

@syang1997 It would be best to have all the dashboards for the proxy and querynode.

syang1997 commented 1 month ago

(dashboard screenshots attached)

@bigsheeper

bigsheeper commented 1 month ago

@syang1997 The monitoring looks fine. I still think it might be a network issue between the proxy and the querynode; I mean stability, not throughput.

bigsheeper commented 1 month ago

I found some warning logs that back up my point. These logs appeared during a slow search, and show that the proxy timed out connecting to querynode-53 (address=10.37.113.183:21123).

[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:408] ["call received grpc error"] [clientRole=querynode-53] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [retry/retry.go:102] ["retry func failed"] [retried=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
[2024/07/23 15:05:55.145 +00:00] [WARN] [grpcclient/client.go:554] ["ClientBase Call grpc call get error"] [role=querynode-53] [address=10.37.113.183:21123] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n/usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553\n  | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88\n  | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n  | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102\n  | github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n  | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n  | \t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:88 
github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:102 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).GetComponentStates\n  | /go/src/github.com/milvus-io/milvus/internal/proxy/look_aside_balancer.go:229 github.com/milvus-io/milvus/internal/proxy.(*LookAsideBalancer).checkQueryNodeHealthLoop.func1.1\n  | /go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81 github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 github.com/panjf2000/ants/v2.(*goWorker).run.func1\n  | /usr/local/go/src/runtime/asm_amd64.s:1598 runtime.goexit\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]

Please check whether there are frequent connection-timeout logs for querynode-53 during slow searches. If so, I suggest killing this querynode and redeploying it on a different physical machine. @syang1997
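
To probe the "stability, not throughput" hypothesis more directly, here is a minimal Go sketch (an assumption for illustration, not a tool from this thread) that repeatedly dials the querynode gRPC port from the proxy's node and reports how long each connection attempt takes. Sporadic multi-second dials or timeouts, with a low average, would match the DeadlineExceeded warnings above. The address is the one from the proxy log; the 3-second timeout mirrors the proxy's connection timeout; the loop count and interval are arbitrary.

package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const addr = "10.37.113.183:21123" // querynode-53, taken from the proxy WARN log

	for i := 0; i < 60; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		start := time.Now()
		conn, err := grpc.DialContext(ctx, addr,
			grpc.WithTransportCredentials(insecure.NewCredentials()),
			grpc.WithBlock(), // block until the connection is actually established
		)
		elapsed := time.Since(start)
		if err != nil {
			fmt.Printf("dial %s failed after %v: %v\n", addr, elapsed, err)
		} else {
			fmt.Printf("dial %s ok in %v\n", addr, elapsed)
			conn.Close()
		}
		cancel()
		time.Sleep(time.Second)
	}
}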

syang1997 commented 1 month ago

(architecture diagram screenshots attached)

I think in the 2.4 architecture the proxy talks to the QueryNode directly, while in 2.3 it goes through the MQ. My Milvus version is 2.3.15. @bigsheeper

bigsheeper commented 1 month ago

@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself.

syang1997 commented 1 month ago

@syang1997 In the 2.3 architecture diagram, "query" refers to the process of the querynode consuming data from the message storage, rather than the query operation itself.

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So my guess is that in 2.3 the problem is in the MQ transmission path.

syang1997 commented 1 month ago

I found some warning logs that back up my point. These logs appeared during a slow search, and show that the proxy timed out connecting to querynode-53 (address=10.37.113.183:21123).

(gRPC DeadlineExceeded warning logs quoted from the previous comment)

Please check whether there are frequent connection-timeout logs for querynode-53 during slow searches. If so, I suggest killing this querynode and redeploying it on a different physical machine. @syang1997

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

bigsheeper commented 1 month ago

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So my guess is that in 2.3 the problem is in the MQ transmission path.

From the monitoring data, the MQ has no effect on the slow searches, so there is no need to suspect the MQ in this issue.

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

Yep, in Milvus, search requests are sent directly from the proxy to the querynode, so the problem points to network instability between the proxy and the querynode. @syang1997

syang1997 commented 1 month ago

Yes, according to the monitoring, the timeout happens in the proxy-to-querynode path rather than in the Search operation itself. So my guess is that in 2.3 the problem is in the MQ transmission path.

From the monitoring data, the MQ has no effect on the slow searches, so there is no need to suspect the MQ in this issue.

@bigsheeper So does this log mean the request is sent directly from the proxy to the QueryNode?

Yep, in Milvus, search requests are sent directly from the proxy to the querynode, so the problem points to network instability between the proxy and the querynode. @syang1997

Okay, I will continue to investigate in this direction.

syang1997 commented 1 month ago

@bigsheeper The "query" link in the 2.3 architecture diagram is easy to misunderstand. Could you consider improving it?

syang1997 commented 1 month ago

(monitoring screenshots attached)

@bigsheeper Recently there was another timeout, and only one request was abnormal. Here are the monitoring information and logs. We contacted the cloud provider for an investigation, and they did not find any network abnormality. Do you have any clues?

syang1997 commented 1 month ago

milvus-log-8-4.tar.gz

syang1997 commented 1 month ago

I only saw a gRPC connection error in the proxy log and did not find any abnormal logs on the QueryNode side. There are only three lines of gRPC error logs for that one trace, but from the timestamps and the monitoring the request took up to 3 seconds.

bigsheeper commented 1 month ago

I only saw a gRPC connection error in the proxy log and did not find any abnormal logs on the QueryNode side. There are only three lines of gRPC error logs for that one trace, but from the timestamps and the monitoring the request took up to 3 seconds.

Yep, it is still the network issue, because the connection timeout is set to 3 seconds.

syang1997 commented 1 month ago

@bigsheeper The cloud provider's investigation concluded that there is no problem with the network, which gives us a real headache.

xiaofan-luan commented 1 month ago

@syang1997 do you want to set up an offline meeting so we can get more details and help?

My email is xiaofan.luan@zilliz.com

xiaofan-luan commented 1 month ago

From the monitoring metrics it's hard to find more information; deploying the new tracing might be helpful.

syang1997 commented 1 month ago

@syang1997 do you want to set up an offline meeting so we can get more details and help?

My email is xiaofan.luan@zilliz.com

Very willing; I hope we can arrange it as soon as possible.

syang1997 commented 1 month ago

@xiaofan-luan I sent you an email; looking forward to your reply.

syang1997 commented 1 month ago

proxy-57.log querynode-70.log querynode-71.log

traceID=e8d19598e7a4b42dfaddd2ea28565acd