bmgante opened this issue 1 year ago
I guess the problem is the endpoints, which were empty because kube-scheduler and kube-controller-manager are not pods. I then tried to specify the IPs of the EKS instances, but Prometheus scraping was failing. I also tried changing the kube-scheduler endpoint to the lease holder 10.0.105.9, but the scrape fails as well with "Get "https://10.0.105.9:10259/metrics": context deadline exceeded".
# kubectl get endpoints -n kube-system
....
prometheus-kube-prometheus-kube-controller-manager <none> 30d
prometheus-kube-prometheus-kube-etcd <none> 30d
prometheus-kube-prometheus-kube-scheduler 10.0.105.9:10259 9m9s
...
When setting the endpoints to the IPs of the EKS worker nodes, the error is Get "https://x.x.x.x:10259/metrics": dial tcp 172.27.172.254:10259: connect: connection refused.
Any idea how to address this, or is it not possible to monitor these services at all, in which case I should just disable them in values.yaml?
@bmgante can you access the scheduler metrics endpoint from a container in the cluster (create a container in any namespace and try a curl)?
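For anyone wanting to run that check, something like the following should work (the pod name and image are arbitrary choices; the IP/port are the lease-holder address quoted above):

```shell
# Launch a throwaway pod and probe the scheduler's metrics endpoint.
# -k skips TLS verification (the scheduler serves a self-signed cert);
# --max-time keeps the probe from hanging on a filtered port.
kubectl run curl-test --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -k --max-time 5 https://10.0.105.9:10259/metrics
```

A connection refused or timeout here confirms the control-plane port is not reachable from inside the cluster.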
Managed Kubernetes services do not generally make the control plane's metrics endpoints accessible to customers, except for kube-apiserver. This is also true for EKS (to provide at least some important scheduler metrics, EKS planned to make them available through CloudWatch).
Ok, thanks. I've just disabled that monitoring in values.yaml to avoid having alerts.
(In reply to zeritti's comment above, sent by email on Thu, 25 May 2023 at 22:11.)
@bmgante Could you share the update that you had to do to the values.yaml to achieve the disabling of those 2 alerts? I tried using this:
defaultRules:
  disabled:
    Watchdog: true
    KubeControllerManagerDown: true
    KubeSchedulerDown: true
but it failed with this when I tried to apply that update:
Error: error validating "": error validating data: ValidationError(PrometheusRule.spec.groups[0]): missing required field "rules" in com.coreos.monitoring.v1.PrometheusRule.spec.groups
Thanks!
Hi @diego-ojeda-binbash I think it was just this:
## Component scraping kube scheduler
##
kubeScheduler:
  enabled: false

## Component scraping kube controller manager
##
kubeControllerManager:
  enabled: false
## Create default rules for monitoring the cluster
##
defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8s: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubeControllerManager: false
    kubelet: true
    kubeProxy: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: false
    kubeSchedulerRecording: false
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true
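For completeness, such overrides are applied with a normal chart upgrade; the release name and namespace below are assumptions, so adjust them to your installation:

```shell
# Assumed release "prometheus" in namespace "monitoring".
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring -f values.yaml
```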
I assume the service selector does not match, maybe because of an old Kubernetes version:
selector:
  component: kube-scheduler
while the actual label assigned to the scheduler pod is k8s-app=kube-scheduler.
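If the selector is the culprit, the chart lets you override it per component; a sketch, assuming the `kubeScheduler.service.selector` value is available in your chart version:

```yaml
kubeScheduler:
  enabled: true
  service:
    selector:
      # Match the label the scheduler pods actually carry
      k8s-app: kube-scheduler
```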
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.
This workaround should normally solve the problem if you still want to monitor kube-scheduler and kube-controller-manager: https://github.com/prometheus-community/helm-charts/issues/3368#issuecomment-1563510980
Any idea where the documentation for each of these rules is? I can see they are all being used here https://github.com/prometheus-community/helm-charts/blob/11127a45423d6cf468e476e9ee5a800b7a6c29af/charts/kube-prometheus-stack/hack/sync_prometheus_rules.py but I can't figure out the meaning of some of them.
This should actually be included in the documentation. I had to jump through issues to find this.
My setup with microk8s had the kube-scheduler, kube-controller-manager, and kube-proxy alerts firing. I had to disable them via these Helm chart values:
values:
  kubeControllerManager:
    enabled: false
  kubeScheduler:
    enabled: false
  kubeProxy:
    enabled: false
I tried setting the endpoint values as described in the microk8s docs but it didn't work.
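For anyone still wanting to try the endpoint route, the chart exposes an `endpoints` list per component that populates the generated Service's Endpoints object; a sketch with a placeholder address:

```yaml
kubeScheduler:
  enabled: true
  endpoints:
    - 10.0.105.9   # placeholder: an address actually serving the scheduler on 10259
```

Note this still requires the port to be reachable from inside the cluster, which is typically not the case on managed control planes.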
Hi,
I'm on EKS 1.25 and cannot get metrics from kube-scheduler and kube-controller-manager. Below is my values.yaml for kube-scheduler (similar for kube-controller-manager).
ServiceMonitor created by the Helm chart:
Service created by the Helm chart: