shenkonghui / issue

Issue log

operator list-watch is very slow #191

Closed shenkonghui closed 1 year ago

shenkonghui commented 1 year ago

Symptom: middleware resources are created by the operator, one controller per middleware type. Creation is currently very slow.

On the server

controller_runtime_active_workers{controller="middleware"} is 0, which means no worker goroutine of that controller is currently running.

Every 2.0s: curl 10.244.11.203:8080/metrics|grep -v second |grep worker    Mon May 29 09:21:18 2023

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100 82300    0 82300    0     0  27.7M      0 --:--:-- --:--:-- --:--:-- 39.2M
# HELP controller_runtime_active_workers Number of currently used workers per controller
# TYPE controller_runtime_active_workers gauge
controller_runtime_active_workers{controller="deployment"} 0
controller_runtime_active_workers{controller="escluster"} 1
controller_runtime_active_workers{controller="middleware"} 0
controller_runtime_active_workers{controller="mysqlcluster"} 2
controller_runtime_active_workers{controller="persistentvolume"} 1
controller_runtime_active_workers{controller="postgresql"} 2
controller_runtime_active_workers{controller="rediscluster"} 2
controller_runtime_active_workers{controller="statefulset"} 2

In addition, workqueue_depth{name="mysqlcluster"} is 7, which means items are backing up in the queue.

Every 2.0s: curl 10.244.11.203:8080/metrics|grep -v second |grep mysqlcluster    Mon May 29 09:22:07 2023

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100 82294    0 82294    0     0  28.9M      0 --:--:-- --:--:-- --:--:-- 39.2M
controller_runtime_active_workers{controller="mysqlcluster"} 2
controller_runtime_max_concurrent_reconciles{controller="mysqlcluster"} 1
controller_runtime_reconcile_total{controller="mysqlcluster",result="requeue_after"} 829
controller_runtime_reconcile_total{controller="mysqlcluster",result="success"} 2464
workqueue_adds_total{name="mysqlcluster"} 3302
workqueue_depth{name="mysqlcluster"} 7
workqueue_retries_total{name="mysqlcluster"} 829
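
The grep check above can also be scripted. A minimal sketch in Go, assuming the metrics text has already been fetched from the /metrics endpoint; the helper names `parseSample` and `backlog` are hypothetical, and the parser only handles simple labelled lines like the ones pasted here, not the full Prometheus text format.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseSample parses one line shaped like `metric{key="value",...} 7` and
// returns the metric name, the first label value, and the sample value.
// Comment lines, blank lines, and anything without labels are skipped.
func parseSample(line string) (name, label string, value float64, ok bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", "", 0, false
	}
	open := strings.Index(line, "{")
	end := strings.LastIndex(line, "}")
	if open <= 0 || end < open {
		return "", "", 0, false
	}
	// Splitting on the quote character puts the first label value at index 1.
	parts := strings.SplitN(line[open+1:end], `"`, 3)
	if len(parts) < 3 {
		return "", "", 0, false
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(line[end+1:]), 64)
	if err != nil {
		return "", "", 0, false
	}
	return line[:open], parts[1], v, true
}

// backlog returns the queue depth of every workqueue whose
// workqueue_depth sample is greater than zero.
func backlog(metrics string) map[string]float64 {
	out := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		name, label, value, ok := parseSample(sc.Text())
		if ok && name == "workqueue_depth" && value > 0 {
			out[label] = value
		}
	}
	return out
}

func main() {
	// Sample input; the "deployment" line is made up for illustration.
	metrics := `# TYPE workqueue_depth gauge
controller_runtime_active_workers{controller="middleware"} 1
workqueue_depth{name="middleware"} 13
workqueue_depth{name="mysqlcluster"} 7
workqueue_depth{name="deployment"} 0`

	queues := backlog(metrics)
	names := make([]string, 0, len(queues))
	for n := range queues {
		names = append(names, n)
	}
	sort.Strings(names) // stable output order
	for _, n := range names {
		fmt.Printf("%s: %v items waiting\n", n, queues[n])
	}
}
```

A queue that stays in this list across several scrapes is the one to look at; a single non-zero reading can just be normal churn.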

Debugging in a local environment

Here controller_runtime_active_workers is not 0, yet workqueue_depth is still greater than 0:

controller_runtime_active_workers{controller="middleware"} 1
workqueue_depth{name="middleware"} 13
shenkonghui commented 1 year ago

qps

QPS is about 20: rate(rest_client_requests_total{method="GET",endpoint="test"}[1m])

{code="200", endpoint="test", host="10.96.0.1:443", instance="10.244.11.203:8080", job="middleware-controller", method="GET", namespace="middleware-operator", pod="middleware-controller-manager-77bd9d9bc9-9bvvl", service="middleware-controller"} | 20.911111111111108

Latency

Median latency is about 0.5 seconds

histogram_quantile(0.5, rate(rest_client_request_latency_seconds_bucket{verb="GET"}[5m]))

{endpoint="test", instance="10.244.11.203:8080", job="middleware-controller", namespace="middleware-operator", pod="middleware-controller-manager-77bd9d9bc9-9bvvl", service="middleware-controller", url="https://10.96.0.1:443/%7Bprefix%7D", verb="GET"} | 0.512
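
A note on reading the 0.512 figure: when the requested quantile falls into the highest (+Inf) bucket, histogram_quantile returns the upper bound of the second-highest bucket. If this histogram's largest finite bucket is 0.512 (an assumption, the bucket layout is not shown in this issue), then 0.512 may be a floor rather than the actual median. A minimal sketch of that behaviour in Go, with made-up bucket counts and a hypothetical `quantile` helper that only approximates what Prometheus does:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// posInf is the upper bound of the +Inf bucket.
var posInf = math.Inf(1)

// bucket is one cumulative histogram bucket.
type bucket struct {
	upperBound float64 // the "le" label; posInf for the last bucket
	count      float64 // observations less than or equal to upperBound
}

// quantile estimates the q-quantile (0 < q <= 1) from cumulative buckets by
// linear interpolation inside the bucket that contains the target rank.
// It assumes the bucket with the largest bound is the +Inf bucket.
func quantile(q float64, buckets []bucket) float64 {
	if q <= 0 || q > 1 || len(buckets) < 2 {
		return math.NaN()
	}
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	last := len(buckets) - 1
	total := buckets[last].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Index of the first finite bucket whose cumulative count reaches the rank.
	i := sort.Search(last, func(i int) bool { return buckets[i].count >= rank })
	if i == last {
		// The quantile lies in the +Inf bucket: report the largest finite bound.
		return buckets[last-1].upperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].upperBound, buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	return lower + (buckets[i].upperBound-lower)*(rank-below)/inBucket
}

func main() {
	// Made-up counts: 55 of 100 requests took longer than 0.512s.
	slow := []bucket{
		{0.256, 40},
		{0.512, 45},
		{posInf, 100},
	}
	fmt.Println(quantile(0.5, slow))
}
```

With these counts the sketch returns 0.512 even though more than half of the requests were slower than that, so checking a higher quantile or the raw bucket counts is worth doing before trusting the number.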

Strangely, the other controllers behave normally.

shenkonghui commented 1 year ago

rate(rest_client_requests_total{method="GET",endpoint="test"}[1m]) >0

Comparing the two, on newer Kubernetes versions list-watch is performed against all API groups.

Kubernetes v1.26.3: image

Older Kubernetes versions such as v1.21: image

shenkonghui commented 1 year ago

This is a bug in controller-runtime; upgrading to the latest version fixes it.