milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: msg: failed to get delegator 624 for channel base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624 #37710

Open sunwsh opened 1 day ago

sunwsh commented 1 day ago

Is there an existing issue for this?

Environment

- Milvus version:  v2.4.14
- Deployment mode(standalone or cluster):   cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2): milvus-sdk-java v2.3.3
- OS(Ubuntu or CentOS):  ubuntu20.04
- CPU/Memory:    24c/64g
- GPU:    null
- Others:

Current Behavior

Occasionally we see "failed to get delegator" errors. We upgraded to v2.4.14 only two days ago, and the error has occurred on both of those days. We previously ran 2.3.21 and never saw anything similar.

Expected Behavior

For load balancing we use:
proxy.replicaSelectionPolicy: round_robin

[screenshot attached]
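To show what round_robin means for this bug, here is a rough Go sketch; the names (`roundRobinBalancer`, `CandidateNode`) are made up for illustration and are not the actual Milvus source. The point is that the proxy simply rotates through whatever node IDs its shard cache currently holds, so if the cache still lists node 624, roughly 1/N of the requests keep being routed to it.

```go
// Minimal sketch of round-robin replica selection over a cached shard-leader
// list. Hypothetical types; not the real internal/proxy implementation.
package main

import (
	"fmt"
	"sync/atomic"
)

// CandidateNode stands in for a cached shard-leader entry (query node ID + address).
type CandidateNode struct {
	NodeID  int64
	Address string
}

// roundRobinBalancer picks candidates in strict rotation, regardless of whether
// the proxy can still reach them.
type roundRobinBalancer struct {
	counter atomic.Uint64
}

func (b *roundRobinBalancer) SelectNode(candidates []CandidateNode) CandidateNode {
	idx := b.counter.Add(1) % uint64(len(candidates))
	return candidates[idx]
}

func main() {
	// Suppose the shard cache still lists node 624 even though the query node
	// behind it is gone; round_robin keeps selecting it until the cache is refreshed.
	cached := []CandidateNode{
		{NodeID: 624, Address: "stale"},
		{NodeID: 701, Address: "10.0.0.2:21123"},
	}
	b := &roundRobinBalancer{}
	for i := 0; i < 4; i++ {
		fmt.Println("selected node:", b.SelectNode(cached).NodeID)
	}
}
```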

Steps To Reproduce

No response

Milvus Log

The query-time logs are all on the proxy side. The earliest "failed to get delegator 624" log for a query appeared at 2024/11/15 10:32:45:

[2024/11/15 10:32:45.729 +08:00] [ACCESS] <root: tcp-10.151.41.250:12682> Search [status: Failed] [code: 65535] [sdk: Java-Unknown] [msg: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624] [traceID: dc11a62b1e3a406ae3fda09c4f1230a2] [timeCost: 1.657523ms] [database: seareco] [collection: m185_128_dssm_v2_1] [partitions: [0]] [expr: ]

[2024/11/15 10:32:45.728 +08:00] [WARN] [retry/retry.go:46] ["retry func failed"] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [retried=0] [error="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (3) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString"]

[2024/11/15 10:32:45.729 +08:00] [WARN] [proxy/task_search.go:552] ["search execute failed"] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [nq=1] [error="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (3) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString"]

[2024/11/15 10:32:45.729 +08:00] [WARN] [proxy/task_scheduler.go:477] ["Failed to execute task: "] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [error="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_search.go:553\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:474\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:553\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/conc/pool.go:81\n  | [...repeated from below...]\nWraps: (2) failed to search\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (4) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (5) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString"]

[2024/11/15 10:32:45.729 +08:00] [WARN] [proxy/impl.go:3014] ["Search failed to WaitToFinish"] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [role=proxy] [db=seareco] [collection=m185_128_dssm_v2_1] [partitions="[0]"] [dsl=] [len(PlaceholderGroup)=524] [OutputFields="[]"] [search_params="[{\"key\":\"anns_field\",\"value\":\"vec\"},{\"key\":\"topk\",\"value\":\"3000\"},{\"key\":\"metric_type\",\"value\":\"IP\"},{\"key\":\"round_decimal\",\"value\":\"-1\"},{\"key\":\"ignore_growing\",\"value\":\"false\"},{\"key\":\"offset\",\"value\":\"0\"},{\"key\":\"params\",\"value\":\"{\\\"nprobe\\\":20}\"}]"] [ConsistencyLevel=Strong] [useDefaultConsistency=true] [nq=1] [error="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_search.go:553\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:474\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:553\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/conc/pool.go:81\n  | [...repeated from below...]\nWraps: (2) failed to search\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (4) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (5) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString"]

On node 624 there are no abnormal logs for this collection, and it kept logging queries right up until it was killed, e.g. at 2024/11/15 11:54:09:

[2024/11/15 11:54:09.136 +08:00] [INFO] [querynodev2/services.go:821] ["Received SearchRequest done"] [traceID=7dd04234bfc44f6b14bf5d7c1d069b5b] [collectionID=453896578611993902] [channels="[gd16-seareco-base-rootcoord-dml_14_453896578611993902v0]"] [nq=1] [segmentIDs="[]"] [reduceElapse=0] [requestElapse=23]

Looking at the proxy logs: we run 30 proxy processes, but only 8 of them report this "can not find client" error. The proxies sit behind a layer-7 load balancer, so every proxy receives an equal share of traffic; this suggests that only some proxies hold an inconsistent shard-leader cache.
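To make the stack trace above concrete: the path LBPolicyImpl.ExecuteWithRetry -> get client of the selected node fails when the proxy no longer has a client for the cached delegator. Below is a rough sketch of that failure shape; the names (`clientManager`, `executeOnce`) are hypothetical stand-ins, not the real internal/proxy code.

```go
// Sketch of the "can not find client of node N" failure path implied by the
// proxy stack trace. Hypothetical names; illustrative only.
package main

import "fmt"

var errNoClient = fmt.Errorf("can not find client")

// clientManager mimics the proxy-side map from query-node ID to a gRPC client.
type clientManager struct {
	clients map[int64]string // nodeID -> client handle (string for brevity)
}

func (m *clientManager) GetClient(nodeID int64) (string, error) {
	c, ok := m.clients[nodeID]
	if !ok {
		// Surfaces as "can not find client of node 624" in the proxy log.
		return "", fmt.Errorf("%w of node %d", errNoClient, nodeID)
	}
	return c, nil
}

// executeOnce uses the delegator node recorded in the (possibly stale) shard
// cache and tries to obtain its client; a retry loop around this call cannot
// succeed while the cache keeps returning the same stale node ID.
func executeOnce(mgr *clientManager, cachedDelegator int64, channel string) error {
	if _, err := mgr.GetClient(cachedDelegator); err != nil {
		return fmt.Errorf("failed to get delegator %d for channel %s: %w",
			cachedDelegator, channel, err)
	}
	return nil
}

func main() {
	mgr := &clientManager{clients: map[int64]string{701: "alive"}}
	err := executeOnce(mgr, 624, "base-rootcoord-dml_14_453896578611993902v0")
	fmt.Println(err) // same shape as the error in the proxy logs above
}
```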

Counting the failures per proxy, the distribution is also uneven:

| proxy pod ip | count |
| --- | --- |
| proxy-76f4ccd59f-djzdc@10.142.68.246 | 921 |
| proxy-76f4ccd59f-ndk8r@10.151.200.71 | 957 |
| proxy-76f4ccd59f-rldqc@10.142.248.58 | 914 |
| proxy-76f4ccd59f-8xtbw@10.147.20.176 | 931 |
| proxy-76f4ccd59f-qcpdt@10.147.88.54 | 893 |
| proxy-76f4ccd59f-tt7wr@10.151.227.228 | 930 |
| proxy-76f4ccd59f-8qn7c@10.142.123.211 | 30241 |
| proxy-76f4ccd59f-475cf@10.142.39.120 | 952 |

Shard-cache refresh requests found on the 10.142.123.211 pod:
[2024/11/15 10:54:16.688 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=9e4453573fdf71e2a9eb334bc3a895e9] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:03:21.151 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=26fbf23dd2a15edd1f69a8c4d2aa55d8] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:06:04.896 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=81eff78d220cfa78cf5b8afb212b28fe] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:52:46.585 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=aea7c1acf8876da60fd5b86e97196c1b] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:52:51.864 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=6783b3df043cbb52295df11675753d0d] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:54:22.087 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=13d4302a1b92b39c5f47f2b87d481b22] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
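This matches our understanding of the per-proxy shard cache, sketched below. The names (`shardCache`, `queryCoordClient`, `InvalidateShardLeaderCache`) are hypothetical, not the actual meta_cache.go implementation: shard leaders are only re-fetched from QueryCoord on a cache miss or after an invalidation, so a proxy that never drops its entry can keep routing to a delegator that no longer exists.

```go
// Simplified sketch of the per-proxy shard cache behaviour suggested by the
// "no shard cache for collection, try to get shard leaders from QueryCoord"
// log lines above. Hypothetical names; illustrative only.
package main

import "fmt"

type shardLeaders map[string][]int64 // channel -> delegator node IDs

// queryCoordClient stands in for the RPC to QueryCoord that returns the
// current shard leaders.
type queryCoordClient struct{ latest shardLeaders }

func (q *queryCoordClient) GetShardLeaders(collectionID int64) shardLeaders {
	return q.latest
}

type shardCache struct {
	entries map[int64]shardLeaders // collectionID -> cached leaders
	qc      *queryCoordClient
}

func (c *shardCache) GetShards(collectionID int64) shardLeaders {
	if leaders, ok := c.entries[collectionID]; ok {
		return leaders // served from cache, possibly stale
	}
	// Cache miss: fetch fresh leaders from QueryCoord, as in the log lines above.
	leaders := c.qc.GetShardLeaders(collectionID)
	c.entries[collectionID] = leaders
	return leaders
}

// InvalidateShardLeaderCache drops the entry so the next request refreshes it.
func (c *shardCache) InvalidateShardLeaderCache(collectionID int64) {
	delete(c.entries, collectionID)
}

func main() {
	qc := &queryCoordClient{latest: shardLeaders{"dml_14_...v0": {701}}}
	cache := &shardCache{
		entries: map[int64]shardLeaders{
			453896578611993902: {"dml_14_...v0": {624}}, // stale delegator 624
		},
		qc: qc,
	}
	fmt.Println("before invalidate:", cache.GetShards(453896578611993902))
	cache.InvalidateShardLeaderCache(453896578611993902)
	fmt.Println("after invalidate: ", cache.GetShards(453896578611993902))
}
```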

Anything else?

No response

yanliang567 commented 19 hours ago

@sunwsh could you please provide the complete Milvus logs for investigation? Please refer to this doc to export the whole Milvus logs. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @sunwsh