lensapp / lens

Lens - The way the world runs Kubernetes
https://k8slens.dev/
MIT License

Resources capacity CPU and memory calculated wrong for nodes (multiplied 3 times) #6625

Open zagr0 opened 1 year ago

zagr0 commented 1 year ago

Describe the bug
Lens calculates the resource capacity for CPU and memory incorrectly for nodes; in my case it is 3 times larger than the actual value. The instance type is t3a.2xlarge (8 cores / 32 GB), but Lens shows 24 cores / 96 GB.

To Reproduce
I don't have particular steps to reproduce; I have a kops-managed Kubernetes cluster in AWS with the Prometheus operator installed.

Expected behavior
8 cores / 32 GB to be displayed for instance type t3a.2xlarge.

Screenshots
(screenshot attached)

Environment (please complete the following information):

Nokel81 commented 1 year ago

This might be caused by having multiple instances of Prometheus data collection.

What prometheus configuration are you using?

zagr0 commented 1 year ago

We have the Prometheus operator with a Prometheus instance, plus Kubecost, which runs its own Prometheus server, but that does not explain why the metrics are tripled... How exactly does OpenLens discover Prometheus, and can it be tweaked in the configuration somehow?

Nokel81 commented 1 year ago

The code for discovering the prometheus installation is contained within the src/main/prometheus folder. Each provider exposes a getPrometheusService method which attempts to list the services in some or all namespaces based on some label selectors.

Once a service is found, we then query that service as if it were a Prometheus backend.
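
For illustration, a minimal sketch (not the actual Lens source; the label selector is an assumption, and the response shape differs between @kubernetes/client-node versions) of how a provider could locate a Prometheus service by labels:

import { KubeConfig, CoreV1Api, V1Service } from "@kubernetes/client-node";

// Hypothetical selector; each Lens provider ships its own label selectors.
const wantedLabels: Record<string, string> = {
  app: "prometheus",
  component: "server",
};

async function findPrometheusService(): Promise<V1Service | undefined> {
  const kubeConfig = new KubeConfig();
  kubeConfig.loadFromDefault();
  const coreApi = kubeConfig.makeApiClient(CoreV1Api);

  // List services in all namespaces and keep the first one whose labels
  // match the selector above (pre-1.0 client-node returns { body }).
  const { body } = await coreApi.listServiceForAllNamespaces();
  return body.items.find((service) =>
    Object.entries(wantedLabels).every(
      ([key, value]) => service.metadata?.labels?.[key] === value,
    ),
  );
}

// Usage: print namespace/name:port of whatever was discovered.
findPrometheusService().then((service) => {
  if (!service) {
    console.log("no matching prometheus service found");
    return;
  }
  const port = service.spec?.ports?.[0]?.port;
  console.log(`${service.metadata?.namespace}/${service.metadata?.name}:${port}`);
});

With several services carrying Prometheus-style labels in one cluster (as in the service listing further down), whichever service matches first is the one that gets queried, which is why autodiscovery can land on the wrong backend.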

zagr0 commented 1 year ago

Not sure how to trace this properly, but in stdout I only have:

[CONTEXT-HANDLER]: using helm as prometheus provider for clusterId=682a6fc81eb66334d80b3ecd68bb8420

zagr0 commented 1 year ago

When I changed the metrics setting from auto-detect to manual "Prometheus Operator" with the exact namespace/service:port, it shows the metrics correctly. So for some reason auto-detection for kube-prometheus-stack v36.6.2 deployed with the Helm chart does not work as expected in my cluster.

Overall, we have several Prometheus-related services that could confuse Lens autodiscovery:

$ kubectl get service|grep 9090
kube-prometheus-prometheus             ClusterIP      100.69.233.177   <none>         9090/TCP
kubecost-cost-analyzer                 ClusterIP      100.68.49.220    <none>         9003/TCP,9090/TCP
prometheus-operated                    ClusterIP      None             <none>         9090/TCP,10901/TCP

pniederlag commented 1 year ago

This is a general problem with multiple instances of Prometheus. Without user interaction, Lens aggregates them instead of warning the user or taking other measures to prevent the user from seeing wrong numbers.

One option could be to use avg() instead of sum() here: https://github.com/lensapp/lens/blob/f1a960fd785b62a118acd8b1525d879f39917e21/packages/technical-features/prometheus/src/helm-provider.injectable.ts

Kubecost has documented the problem here: https://docs.kubecost.com/architecture/ksm-metrics#external-ksm-deployments-resulting-in-duplicated-metrics
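
To illustrate the avg() vs sum() point: when kube-state-metrics series are duplicated (three identical series per node in this issue), summing multiplies the real capacity by the duplicate count, while collapsing duplicates per node first keeps the value correct. A sketch, assuming the standard kube_node_status_capacity metric from kube-state-metrics; the exact metric names and matchers Lens uses live in helm-provider.injectable.ts.

// Hypothetical query strings, not the exact ones from helm-provider.injectable.ts.

// Duplicated series: sum() over three copies reports 3x the real CPU capacity.
const cpuCapacitySummed = `sum(kube_node_status_capacity{resource="cpu"})`;

// Collapsing duplicates per node before summing keeps the true capacity.
const cpuCapacityDeduped = `sum(avg by (node) (kube_node_status_capacity{resource="cpu"}))`;

Note that avg() only masks the duplication when all copies report identical values; removing the duplicate scrape, as the Kubecost document above recommends, is the more robust fix.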