Thank you for posting the fix on the labels for the coredns ServiceMonitor!
For Kubelet, I edited the ServiceMonitor with `kubectl -n kube-system edit servicemonitors/myprm-kubelet` and, after reading https://github.com/coreos/prometheus-operator/issues/926, I altered the config to this to get it working:
```yaml
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 30s
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    honorLabels: true
    interval: 30s
    path: /metrics/cadvisor
    port: https-metrics
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
```
In my case, it seems the EKS workers (v1.11.5) are only serving port 10250 over `scheme: https`. Based on /etc/kubernetes/kubelet/kubelet-config.json, anonymous auth is disabled but the bearer token webhook should be used:
"authentication": {
"anonymous": {
"enabled": false
},
"webhook": {
"cacheTTL": "2m0s",
"enabled": true
}
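With that auth config, the kubelet should accept a service account bearer token over HTTPS. A quick sanity check from inside any pod whose service account is allowed to get `nodes/metrics` (a rough sketch; the node IP is just an example and the RBAC binding is assumed to already exist, e.g. from the prometheus-operator chart):

```bash
# Assumed example worker-node IP; substitute one of your nodes.
NODE_IP=10.0.0.1
# The in-pod service account token that the ServiceMonitor's bearerTokenFile points at.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Anonymous auth is disabled, so omitting the token should give 401/403, while
# sending it lets the webhook authenticate the request and /metrics comes back over HTTPS.
curl -sk "https://${NODE_IP}:10250/metrics" -H "Authorization: Bearer ${TOKEN}" | head
```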
Thank you! I confirm that your fix resolves the kubelet issue on Kubernetes 1.11.5 on EKS:
```bash
JSONPATCH=$(cat <<-END
[
  {"op": "add", "path": "/spec/endpoints/0/bearerTokenFile", "value": "/var/run/secrets/kubernetes.io/serviceaccount/token"},
  {"op": "add", "path": "/spec/endpoints/1/bearerTokenFile", "value": "/var/run/secrets/kubernetes.io/serviceaccount/token"},
  {"op": "add", "path": "/spec/endpoints/0/scheme", "value": "https"},
  {"op": "add", "path": "/spec/endpoints/1/scheme", "value": "https"},
  {"op": "replace", "path": "/spec/endpoints/0/port", "value": "https-metrics"},
  {"op": "replace", "path": "/spec/endpoints/1/port", "value": "https-metrics"},
  {"op": "add", "path": "/spec/endpoints/0/tlsConfig", "value": {"insecureSkipVerify": true}},
  {"op": "add", "path": "/spec/endpoints/1/tlsConfig", "value": {"insecureSkipVerify": true}}
]
END
)
kubectl patch servicemonitor myprm-prometheus-operator-kubelet --type='json' --patch "$JSONPATCH"
```
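To verify, I port-forward to Prometheus and check the targets page (the service name below is an assumption based on the `myprm` release prefix):

```bash
# Forward the Prometheus UI locally, then open http://localhost:9090/targets
# and confirm the kubelet endpoints report "UP".
kubectl port-forward svc/myprm-prometheus-operator-prometheus 9090:9090
```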
Do you have any idea how to get metric scraping working for `kube-controller-manager` and `kube-scheduler`? That is a consistent problem on both Kubernetes 1.11.5 in EKS and Kubernetes 1.13.1 in minikube.
Hey, one alternative way to fix the kubelet issue on DigitalOcean's kube offering was setting `kubelet.serviceMonitor.https` to `true`. Does that work for y'all?
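For reference, a minimal way to flip that on an existing release (the release name `myprm` here just mirrors the earlier example):

```bash
# Re-render the chart with the kubelet ServiceMonitor pointed at the HTTPS port,
# keeping any other values already set on the release.
helm upgrade myprm stable/prometheus-operator --reuse-values \
  --set kubelet.serviceMonitor.https=true
```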
@wirehead, thank you. That works perfectly and it's a much easier fix. There is a similar easier configuration fix for coredns scraping.
I still don't see solutions for kube-controller-manager, kube-etcd, kube-scheduler.
@kurtostfeld Debugging step: can you do a `kubectl get po -n kube-system` and see if there are any controller-manager, etcd, or kube-scheduler pods present? My present theory is that they don't have the labels that prometheus-operator is expecting.
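Something like this should show whether those pods exist at all and which labels they carry (just a sketch; adjust the pattern as needed):

```bash
# Compare the labels on the control-plane pods against the ServiceMonitor selectors.
kubectl -n kube-system get pods --show-labels | grep -E 'kube-controller-manager|kube-scheduler|etcd'
```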
What I think you want your values to look like is this:
```yaml
kubeControllerManager:
  service:
    selector:
      component: kube-controller-manager
      k8s-app: null
kubeScheduler:
  service:
    selector:
      component: kube-scheduler
      k8s-app: null
```
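Once those values are rendered, the quickest check is whether the generated services pick up any endpoints (the service names below are assumptions based on the `myprm-prometheus-operator-*` naming used earlier in this thread):

```bash
# If the selectors match the control-plane pods, these services should now list endpoints.
kubectl -n kube-system get endpoints \
  myprm-prometheus-operator-kube-scheduler \
  myprm-prometheus-operator-kube-controller-manager
```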
The other possible problem is that the controller manager and scheduler aren't listening on a port you can get at, which I don't have a solution for.
I have the same problem on EKS where KubeControllerManagerDown and KubeSchedulerDown are firing. The kube-scheduler, kube-controller-manager services don't have any endpoints. Nor are there any related pods.
My theory is that this is because the scheduler and controller manager run in the control plane, not on the worker nodes, and therefore we do not even need to monitor them. Can someone please confirm this?
On my AWS EKS cluster I have opted to not have prometheus scrape/monitor etcd, kube-scheduler, or kube-controller-manager since they are all managed by AWS. I have also configured the proper label selector for coreDns using the values file. Here is my config in case anyone is interested in doing the same:
```yaml
coreDns:
  service:
    selector:
      k8s-app: kube-dns
# Not monitoring etcd, kube-scheduler, or kube-controller-manager because it is managed by EKS
defaultRules:
  rules:
    etcd: false
    kubeScheduler: false
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
```
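For anyone applying this to an existing release, something like the following should work (the values file name and release name are whatever you used at install time; `myprm` is the example from earlier):

```bash
# Re-render the release from the values file above; the disabled components'
# ServiceMonitors and default rules should be removed on the next sync.
helm upgrade myprm stable/prometheus-operator -f values.yaml
```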
Could be related, but is anyone else on EKS getting the KubeClientCertificateExpiration alert as well?
@jwenz723, that config works perfectly, thank you so much!!
@TarekAS, I am definitely getting the KubeClientCertificateExpiration alert as well on EKS but that is a separate issue. I route those alerts along with CPUThrottlingHigh and Watchdog to a "null" alert receiver, to basically ignore those alerts.
@jwenz723 @kurtostfeld, so does this mean we don't have to monitor those components because they are managed by EKS itself and we don't have access to the control plane? If yes, then it should be working on minikube, but according to @kurtostfeld it is not working on minikube either.
Why is this issue closed if no fix was posted for kube-controller-manager, kube-etcd, kube-scheduler? How did you guys solve it?
I'm happy with the config of simply disabling monitoring on those three:
```yaml
defaultRules:
  rules:
    etcd: false
    kubeScheduler: false
kubeControllerManager:
  enabled: false
```
Sorry to revive this after so long, I didn't realize that I had been tagged in a comment. I would like to point out that when running prometheus-operator in minikube, both the kubelet metrics collection and the etcd collection fail for me. The etcd collection fails with the error `Get http://192.168.64.3:2379/metrics: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x01\x00\x02\x02"`. I am not sure how to resolve this issue.

The kubelet metrics are failing to be collected with a 403 error. I was able to resolve this by setting the helm value `kubelet.serviceMonitor.https` to `false`.
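That `\x15\x03\x01` prefix is what a TLS handshake record looks like when read as plain HTTP, which suggests etcd is serving HTTPS (typically with client certs) while the scrape is going over `http://`. One rough way to confirm that, using the IP from the error above:

```bash
# If this prints a server certificate, etcd on 2379 is TLS-only, so scraping it needs
# scheme: https plus the etcd client cert/key rather than plain HTTP.
openssl s_client -connect 192.168.64.3:2379 </dev/null 2>/dev/null | head -20
```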
Hey, I still have the same issue with EKS and KubeletDown. My values.yaml looks like:
```yaml
coreDns:
  service:
    selector:
      k8s-app: kube-dns
defaultRules:
  rules:
    etcd: false
    kubeScheduler: false
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubelet:
  serviceMonitor:
    https: true
```
Any idea how to troubleshoot it?
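One way to see what Prometheus actually has for the kubelet job is to query the `up` series through its HTTP API (after a `kubectl port-forward` of the Prometheus service to local port 9090; just a sketch):

```bash
# Dump every up series for the kubelet job, with labels, so it can be compared
# against whatever expression the KubeletDown alert is matching on.
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="kubelet"}'
```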
OK, I found the issue. It's related to https://github.com/helm/charts/issues/20224. They added a new alert rule that looks for

```
absent(up{job="kubelet",metrics_path="/metrics"} == 1)
```

but the kubelet job only returns `up` series without a `metrics_path` label, like:

```
up{endpoint="https-metrics",instance="10.30.1.191:10250",job="kubelet",namespace="kube-system",node="ip-10-56-23-156.eu-west-1.compute.internal",service="prometheus-kubelet"}
```
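For anyone stuck on a chart version without the fix, the usual workaround is a relabeling on the kubelet ServiceMonitor that copies `__metrics_path__` into a `metrics_path` label, which is what newer chart versions do by default. A hedged sketch (the ServiceMonitor name `prometheus-kubelet` and the endpoint index are assumptions based on the series above):

```bash
# Expose the scrape path as a "metrics_path" label on the first kubelet endpoint so that
# expressions like up{job="kubelet",metrics_path="/metrics"} have something to match.
kubectl patch servicemonitor prometheus-kubelet --type='json' --patch '[
  {"op": "add", "path": "/spec/endpoints/0/relabelings", "value": [
    {"sourceLabels": ["__metrics_path__"], "targetLabel": "metrics_path"}
  ]}
]'
```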
Is this a request for help?: yes
Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT
Version of Helm and Kubernetes: On both the server and the client I'm running Helm v2.12.1. I've experienced this primarily with kubernetes v1.11.5 (on both server and client) on Amazon EKS, but also with kubernetes v1.13.1 on local Minikube.
Which chart: stable/prometheus-operator
What happened:
Several servicemonitors aren't correctly selecting target pods and aren't doing any scraping as a result.
The four that should be finding endpoints but aren't are the coredns, kube-controller-manager, kube-etcd, and kube-scheduler servicemonitors.
I can fully fix the coredns one with a patch command; a sketch of it follows below. Basically, there is a `servicemonitor` object named `myprm-prometheus-operator-coredns` that references a Kubernetes `service` of the exact same name in namespace `kube-system`. As created, that service's spec has a `k8s-app` selector entry that is wrong: it should be `kube-dns`, not `coredns`, even when using CoreDNS. I see two coredns pods running and they both have the label `k8s-app=kube-dns`. After I run the patch, CoreDNS scraping starts working perfectly.

The other three servicemonitors also have broken selectors, but even after fixing those in a similar fashion to how I fixed the CoreDNS scraping, they seem to have additional problems that prevent metrics scraping from working.
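A minimal sketch of that kind of patch, assuming only the `k8s-app` selector value on the service needs to change:

```bash
# Point the coredns service's selector at the label the coredns pods actually carry.
kubectl -n kube-system patch service myprm-prometheus-operator-coredns --type='json' \
  --patch '[{"op": "replace", "path": "/spec/selector/k8s-app", "value": "kube-dns"}]'
```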
I also notice on http://localhost:9090/targets that servicemonitor `myprm-prometheus-operator-kubelet` finds targets to scrape, but the scraping fails. I'm not sure what's wrong or how to fix that.

On http://localhost:9090/alerts, I can see four alerts that are firing due to the above service scraping not working correctly.

What you expected to happen:
I expect the default set of servicemonitors to work and scrape standard Kubernetes metrics. I expect to see no alerts due to a broken Prometheus setup. If the cluster is running perfectly fine, there should be no active alerts other than the DeadMansSwitch alert.
How to reproduce it (as minimally and precisely as possible):
Get a clean Kubernetes cluster and run a simple install with all default values. I'm using a custom shortened name for simplicity.
```bash
helm install --name myprm stable/prometheus-operator
```
This is very easy and consistent to reproduce. This has been reproduced in both a local minikube environment and on Amazon EKS.
Anything else we need to know:
Not at the moment.