kubernetes / dashboard

General-purpose web UI for Kubernetes clusters
Apache License 2.0

ui updates seem to be too slow #8835

Closed: rgl closed this 7 months ago

rgl commented 7 months ago

What happened?

Updating any resource view in the UI takes too long (> 1 s), which is substantially slower than the apparently equivalent kubectl command.

What did you expect to happen?

Expected to see deployments displayed in roughly the same amount of time as kubectl get deployments -A.

How can we reproduce it (as minimally and precisely as possible)?

Observe the time taken with kubectl, 0.068s:

$ time kubectl get deployments -A
NAMESPACE            NAME                                                  READY   UP-TO-DATE   AVAILABLE   AGE
cert-manager         cert-manager                                          1/1     1            1           9h
cert-manager         cert-manager-cainjector                               1/1     1            1           9h
cert-manager         cert-manager-webhook                                  1/1     1            1           9h
kube-dashboard       kube-dashboard-kong                                   1/1     1            1           9h
kube-dashboard       kube-dashboard-kubernetes-dashboard-api               1/1     1            1           9h
kube-dashboard       kube-dashboard-kubernetes-dashboard-auth              1/1     1            1           9h
kube-dashboard       kube-dashboard-kubernetes-dashboard-metrics-scraper   1/1     1            1           9h
kube-dashboard       kube-dashboard-kubernetes-dashboard-web               1/1     1            1           9h
kube-external-dns    kube-external-dns                                     1/1     1            1           9h
kube-external-dns    pdns                                                  1/1     1            1           9h
kube-metallb         kube-metallb-controller                               1/1     1            1           9h
kube-system          coredns                                               1/1     1            1           9h
kube-traefik         kube-traefik                                          1/1     1            1           9h
local-path-storage   local-path-provisioner                                1/1     1            1           9h
pgadmin              pgadmin-pgadmin4                                      1/1     1            1           9h

real    0m0,068s
user    0m0,065s
sys 0m0,016s

Getting and displaying the entire YAML, 0.140s:

$ time kubectl get deployments -A -o yaml
...
real    0m0,140s
user    0m0,112s
sys 0m0,036s

Observe the time taken with the browser, 1.2s:

[screenshot: browser network timing showing ~1.2 s]

Anything else we need to know?

This was tested in a kind cluster, with the traefik ingress controller, sending data to kong over HTTP (without TLS), and with all resource limits lifted (also note that modifying the API replicas does not seem to make much difference):

- name: Install kubernetes-dashboard
  kubernetes.core.helm:
    name: kube-dashboard
    chart_ref: kubernetes-dashboard/kubernetes-dashboard
    chart_version: '{{ kind_kubernetes_dashboard_chart_version }}'
    release_namespace: kube-dashboard
    create_namespace: true
    update_repo_cache: true
    values:
      kong:
        proxy:
          http:
            enabled: true
      app:
        settings:
          global:
            logsAutoRefreshTimeInterval: 0
            resourceAutoRefreshTimeInterval: 30
      api:
        scaling:
          replicas: 1
        containers:
          resources:
            requests:
              cpu: 0
              memory: 0
            limits:
              cpu: 0
              memory: 0
      auth:
        containers:
          resources:
            requests:
              cpu: 0
              memory: 0
            limits:
              cpu: 0
              memory: 0
      web:
        containers:
          resources:
            requests:
              cpu: 0
              memory: 0
            limits:
              cpu: 0
              memory: 0
      metricsScraper:
        containers:
          resources:
            requests:
              cpu: 0
              memory: 0
            limits:
              cpu: 0
              memory: 0

The entire ansible playbook is at:

https://github.com/rgl/my-ubuntu-ansible-playbooks/tree/upgrade-kubernetes-dashboard

Have a look at the last commit in that branch to see just the kubernetes-dashboard changes.

What browsers are you seeing the problem on?

No response

Kubernetes Dashboard version

7.1.2

Kubernetes version

1.29.2

Dev environment

No response

floreks commented 7 months ago

Could you try disabling metrics and checking if it improves anything? Pass --metrics-provider=none arg to the API. That's the only thing I can think of that could be a bottleneck here.

rgl commented 7 months ago

@floreks ah, that did the trick! Now it's pretty fast!

floreks commented 7 months ago

I will have to investigate that at some point. There were no real changes to metrics gathering. Maybe there is an issue with metrics server responsiveness.

rgl commented 7 months ago

Hmm, I do not have metrics-server installed in my kind cluster. Without metrics-server, is this expected to be slow?

If so, maybe the FAQ should make it more explicit?

The chart values.yaml comments seem to be more explicit; maybe put that in the FAQ?

sushain97 commented 7 months ago

:wave: I'm experiencing a similar issue after upgrading from a much earlier version. I added:

api:
  containers:
    args:
      - --metrics-provider=none

and things are substantially better on most pages.

However, some pages still struggle to load quickly (especially the Workloads page), and I have fewer than 150 pods.

I'm running k3s with the built-in metrics-server.

Some request timings:

[screenshots: browser request timings]

edit: Eventually, things got super slow again after I clicked around a bunch. Then I restarted the API pod and things got snappy again...

floreks commented 7 months ago

If you start clicking too much and spamming the API server with requests, throttling will kick in and significantly slow down your responses. Restarting the API server can 'reset' throttling and it will work faster. Normal use should be ok.

sushain97 commented 7 months ago

If you start clicking too much and spamming the API server with requests, throttling will kick in and significantly slow down your responses. Restarting the API server can 'reset' throttling and it will work faster. Normal use should be ok.

Hmm, I'm still a bit surprised that I can cause throttling by clicking around at human scale. To be clear, I wasn't trying to stress the system, just viewing different panels in the UI :)

Here's how the requests from /#/workloads?namespace=_all look after ~6 hours of not accessing the dashboard at all:

[screenshot: request timings for /#/workloads?namespace=_all]

There aren't any timeouts but this is still really slow, right?

floreks commented 7 months ago

That is definitely unexpected. What device are you using for your k3s installation?

sushain97 commented 7 months ago

That is definitely unexpected. What device are you using for your k3s installation?

4 cores of an AMD EPYC 7371.

Some quick benchmarks:

sushain@vesuvianite ~ ❯❯❯ hyperfine 'kubectl get pods -A'           18:20:19
Benchmark 1: kubectl get pods -A
  Time (mean ± σ):     220.6 ms ±   4.1 ms    [User: 207.0 ms, System: 71.1 ms]
  Range (min … max):   214.3 ms … 227.9 ms    13 runs

sushain@vesuvianite ~ ❯❯❯ hyperfine 'kubectl describe pods -A'      18:20:29
Benchmark 1: kubectl describe pods -A
  Time (mean ± σ):      1.231 s ±  0.031 s    [User: 0.533 s, System: 0.123 s]
  Range (min … max):    1.188 s …  1.294 s    10 runs

sushain@vesuvianite ~ ❯❯❯ hyperfine 'kubectl get deployments -A'    18:20:43
Benchmark 1: kubectl get deployments -A
  Time (mean ± σ):     177.5 ms ±   7.1 ms    [User: 175.0 ms, System: 58.1 ms]
  Range (min … max):   169.8 ms … 195.3 ms    16 runs

sushain@vesuvianite ~ ❯❯❯ hyperfine 'kubectl describe deployments -A'
Benchmark 1: kubectl describe deployments -A
  Time (mean ± σ):      1.021 s ±  0.032 s    [User: 0.426 s, System: 0.121 s]
  Range (min … max):    0.980 s …  1.097 s    10 runs

So I guess my timings in the UI aren't that much slower if it's calling the equivalent of kubectl describe...

floreks commented 7 months ago

We also can't directly compare kubectl to the UI, as we have to make more calls than kubectl to get some extra information and apply additional logic such as server-side pagination, sorting, and filtering. It will always be slower.

sushain97 commented 7 months ago

We also can't directly compare kubectl to the UI, as we have to make more calls than kubectl to get some extra information and apply additional logic such as server-side pagination, sorting, and filtering. It will always be slower.

Yep, that makes sense. FWIW I jumped from docker.io/kubernetesui/dashboard-api:v1.0.0 to docker.io/kubernetesui/dashboard-api:1.4.1 so there might be a bunch of changes... maybe I'll try bisecting through the Helm chart versions at some point.

floreks commented 7 months ago

@sushain97 I have been further debugging the performance issue and pinned it down exactly. Add the --sidecar-host arg to the dashboard API deployment. Example: --sidecar-host=kubernetes-dashboard-metrics-scraper.dashboard, where kubernetes-dashboard-metrics-scraper is the metrics-scraper service name and dashboard is the namespace where Dashboard is deployed.

I honestly have no idea what is causing the in-cluster service proxy to be super slow compared to accessing the metrics scraper with an HTTP client directly through the service. I don't see anything that changed there recently.
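
For readers unfamiliar with the two access paths being compared here, a rough sketch of the difference, assuming client-go and net/http (this is not the dashboard's actual metrics client; the service name, namespace, and request path below are placeholders):

// Illustrative only: contrasts direct sidecar access with API-server service-proxy access.
// The service name, namespace, and request path are placeholders, not the real ones.
package metricspaths

import (
	"context"
	"net/http"

	"k8s.io/client-go/kubernetes"
)

// direct talks to the metrics-scraper service DNS name with a plain HTTP client,
// bypassing the Kubernetes API server entirely (roughly what --sidecar-host enables).
func direct() (*http.Response, error) {
	return http.Get("http://kubernetes-dashboard-metrics-scraper.kubernetes-dashboard/api/v1/dashboard/...")
}

// proxied routes the same request through the API server's service proxy
// (/api/v1/namespaces/<ns>/services/<name>/proxy/...), so it goes through the
// in-cluster client and is subject to its client-side rate limiting.
func proxied(ctx context.Context, clientset kubernetes.Interface) error {
	return clientset.CoreV1().RESTClient().Get().
		Namespace("kubernetes-dashboard").
		Resource("services").
		Name("kubernetes-dashboard-metrics-scraper").
		SubResource("proxy").
		Suffix("api/v1/dashboard/...").
		Do(ctx).
		Error()
}

The proxied path is also the one visible in the throttled request URLs in the logs further below.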

sushain97 commented 7 months ago

Hm, it doesn't feel too different to me:

[screenshot: request timings]

Here's what I have:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: kubernetes-dashboard
  namespace: kube-system
spec:
  repo: https://kubernetes.github.io/dashboard/
  chart: kubernetes-dashboard
  targetNamespace: kubernetes-dashboard
  version: 7.2.0
  valuesContent: |-
    app:
      scheduling:
        nodeSelector:
            kubernetes.io/hostname: kube.local.skc.name
    # https://github.com/kubernetes/dashboard/issues/8835
    api:
      containers:
        args:
          - --metrics-provider=none
          - --sidecar-host=kubernetes-dashboard-metrics-scraper.kubernetes-dashboard
    kong:
      proxy:
        http:
          enabled: true

bnabholz commented 7 months ago

I encountered a similar thing once I upgraded to the newer versions of kubernetes-dashboard (lots of requests timing out). The API server logs showed this, with client-side throttling in effect:

2024/04/10 04:24:53 Getting list of namespaces
2024/04/10 04:24:54 Getting list of all jobs in the cluster
2024/04/10 04:24:55 Getting list of all pods in the cluster
I0410 04:24:56.623578       1 request.go:697] Waited for 1.199406392s due to client-side throttling, not priority and fairness, request: GET:https://10.152.183.1:443/api/v1/namespaces/kubernetes-dashboard/services/kubernetes-dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/kube-system/pod-list/kube-multus-ds-5cc2s,kubed-57f78db5b6-2hvch,external-dns-54fdb56c7-llcp8,kube-api-proxy-5769f97cdf-mkhgz,kube-proxy-m5l4b,kube-scheduler-swerver,kube-controller-manager-swerver,kube-apiserver-swerver,etcd-swerver,coredns-76f75df574-wt796,coredns-76f75df574-q99kd,openebs-lvm-controller-0,openebs-lvm-node-w5vhp,calico-node-sngft,smarter-device-manager-gs99r,metrics-server-85bc948865-b7xrv,calico-kube-controllers-9d77f677d-m84kv/metrics/cpu/usage_rate
2024/04/10 04:25:01 Getting pod metrics
2024/04/10 04:25:03 Getting list of namespaces
2024/04/10 04:25:04 Getting list of all pods in the cluster
I0410 04:25:06.823448       1 request.go:697] Waited for 2.783877882s due to client-side throttling, not priority and fairness, request: GET:https://10.152.183.1:443/api/v1/namespaces/kubernetes-dashboard/services/kubernetes-dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/kubernetes-dashboard/pod-list/kubernetes-dashboard-api-774cf68885-sqdbw,kubernetes-dashboard-web-5b8d87bf85-n2smh,kubernetes-dashboard-auth-6cf78cdd47-5qb2h,kubernetes-dashboard-kong-6cf54d7fcf-74ltv,kubernetes-dashboard-metrics-scraper-9758854f6-gpzlb,kubernetes-dashboard-proxy-5c7cd7d76c-dxdw9/metrics/memory/usage
2024/04/10 04:25:13 Getting list of namespaces
2024/04/10 04:25:14 Getting pod metrics
2024/04/10 04:25:14 Getting list of all pods in the cluster
I0410 04:25:16.824112       1 request.go:697] Waited for 1.992075126s due to client-side throttling, not priority and fairness, request: GET:https://10.152.183.1:443/api/v1/namespaces/kubernetes-dashboard/services/kubernetes-dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/kubernetes-dashboard/pod-list/kubernetes-dashboard-api-774cf68885-sqdbw,kubernetes-dashboard-web-5b8d87bf85-n2smh,kubernetes-dashboard-auth-6cf78cdd47-5qb2h,kubernetes-dashboard-kong-6cf54d7fcf-74ltv,kubernetes-dashboard-metrics-scraper-9758854f6-gpzlb,kubernetes-dashboard-proxy-5c7cd7d76c-dxdw9/metrics/cpu/usage_rate
2024/04/10 04:25:23 Getting list of namespaces
2024/04/10 04:25:24 Getting list of all pods in the cluster
I0410 04:25:27.023139       1 request.go:697] Waited for 5.387414496s due to client-side throttling, not priority and fairness, request: GET:https://10.152.183.1:443/api/v1/namespaces/kubernetes-dashboard/services/kubernetes-dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/guacamole/pod-list/postgres-77659c7c89-2n55q,guacamole-64b6c4c56-vxz9d,oauth2-proxy-64968f4c7f-df6cn/metrics/memory/usage
2024/04/10 04:25:28 Getting pod metrics
2024/04/10 04:25:33 Getting list of namespaces
2024/04/10 04:25:34 Getting list of all pods in the cluster
I0410 04:25:37.023852       1 request.go:697] Waited for 4.792209339s due to client-side throttling, not priority and fairness, request: GET:https://10.152.183.1:443/api/v1/namespaces/kubernetes-dashboard/services/kubernetes-dashboard-metrics-scraper/proxy/api/v1/dashboard/namespaces/kubevirt/pod-list/virt-handler-cgnwr,virt-api-54666f869-q8sg9,virt-controller-c67776ccb-949z4,virt-operator-67d55bb884-rwmjs,virtvnc-65986fb5d7-6ghmr,virt-operator-67d55bb884-nl2k8,kvpanel-d46bb99dd-7ss6d,virt-controller-c67776ccb-82pjg/metrics/memory/usage

Setting metrics-provider=none does seem to help:

2024/04/10 04:28:26 Getting list of namespaces
2024/04/10 04:28:36 Getting list of namespaces
2024/04/10 04:28:46 Getting list of namespaces
2024/04/10 04:28:52 Getting list of all pods in the cluster
2024/04/10 04:28:53 Getting pod metrics
2024/04/10 04:28:56 Getting list of namespaces
2024/04/10 04:29:02 Getting list of all pods in the cluster
2024/04/10 04:29:03 Getting pod metrics
2024/04/10 04:29:06 Getting list of namespaces
2024/04/10 04:29:06 Getting list of all pods in the cluster
2024/04/10 04:29:06 Getting pod metrics
2024/04/10 04:29:09 Getting list of all deployments in the cluster
2024/04/10 04:29:12 Getting list of all pods in the cluster
2024/04/10 04:29:12 Getting pod metrics
2024/04/10 04:29:16 Getting list of namespaces
2024/04/10 04:29:22 Getting list of all pods in the cluster
2024/04/10 04:29:22 Getting pod metrics
2024/04/10 04:29:26 Getting list of namespaces

...but that wasn't the first thing that I tried because I wanted to keep metrics.

What I found is that in https://github.com/kubernetes/dashboard/blob/567a38f476b33542534a94f622e1f7aa18a635e0/modules/common/client/init.go#L48, if the in-cluster config is being used (the common case?), it is returned immediately and the default request limits at https://github.com/kubernetes/dashboard/blob/567a38f476b33542534a94f622e1f7aa18a635e0/modules/common/client/init.go#L64 are never applied. I think buildBaseConfig needs to fetch its config from whatever source it can, but then also apply its default settings on top of that, specifically the queries-per-second limit.

Below is a compare of what I ended up using for my own use case; I feel like I could clean up the pointer usage, so I'm happy for any advice.

https://github.com/kubernetes/dashboard/compare/master...bnabholz:kubernetes-dashboard:fixes/qps
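
As a rough illustration of the idea described above (this is not the actual patch from the compare link; it assumes client-go's rest and clientcmd packages, and the default values are made up), applying the limits after loading either config source could look like this:

// Sketch only: apply client-side rate limits on top of whichever base config was loaded,
// including the in-cluster one. Names and default values here are illustrative.
package client

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	defaultQPS   = 200 // illustrative, not the dashboard's real default
	defaultBurst = 400 // illustrative, not the dashboard's real default
)

func buildBaseConfig(kubeconfigPath string) (*rest.Config, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		// Not running in-cluster: fall back to a kubeconfig file.
		config, err = clientcmd.BuildConfigFromFlags("", kubeconfigPath)
		if err != nil {
			return nil, err
		}
	}

	// Apply the defaults regardless of which source the config came from,
	// so the in-cluster path is not left with client-go's low default QPS.
	config.QPS = defaultQPS
	config.Burst = defaultBurst
	return config, nil
}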

floreks commented 7 months ago

Hm, it doesn't feel too different to me: [...]

You could actually re-enable metrics with that sidecar host change. If that doesn't help, then it might be your machine. When I was testing locally on my kind cluster, response times went down from 1-3 seconds to 100 ms on average for every view with all namespaces selected.

floreks commented 7 months ago

I encountered a similar thing once I upgraded to the newer versions of kubernetes-dashboard (lots of requests timing out). [...]

Yeah, I have pinned it down to the in-cluster client too, but I actually ended up using a fake rate limiter, as e.g. the internal REST client derived from the client was also overriding some configuration for me. I will create a PR with a bunch of changes, including this fix, a bit later today.

Thanks for your help anyway!
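
For reference, a minimal sketch of the fake-rate-limiter approach mentioned above, assuming client-go's flowcontrol package (the actual PR may differ):

// Sketch only: disable client-side throttling entirely by attaching a fake rate limiter
// to the rest.Config instead of tuning QPS/Burst.
package client

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

func withFakeRateLimiter(config *rest.Config) *rest.Config {
	// A non-nil RateLimiter on the config takes precedence over QPS/Burst
	// when clients are built from it, so requests are never throttled client-side.
	config.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
	return config
}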