The `BY (cluster) == 0` part is likely unimportant in most setups; it just allows the alert to be used to monitor/alert for more than one Kubernetes cluster (e.g. when using Federation). I'm guessing your alert is triggering because of the absence of the `kube-scheduler`/`kube-controller-manager` jobs.
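For context, the alert in question is roughly of this shape (a hedged sketch written in today's YAML rule format rather than the rule syntax the repository used at the time; the exact expression, job name, and `for` duration are assumptions):

```yaml
groups:
- name: kube-scheduler.rules
  rules:
  - alert: K8SSchedulerDown
    # Fires when the kube-scheduler job is missing entirely, or when some
    # cluster (grouped BY the `cluster` label, hence usable with several
    # federated clusters) has zero healthy scheduler targets.
    expr: >-
      absent(up{job="kube-scheduler"} == 1)
      or count(up{job="kube-scheduler"} == 1) BY (cluster) == 0
    for: 5m
```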
Could you make sure you can find them on the `/targets` page? If they are not there, what is likely happening is that there is no `Endpoints` object listing the `kube-scheduler` and `kube-controller-manager` pods, which means either you don't have the `Service`s created from `manifests/k8s/`, or your `kube-scheduler` and `kube-controller-manager` are not discoverable via those, in which case the output of the following would help:
$ kubectl get pods --all-namespaces
Or, if you cannot disclose all of that, this should give us the applicable information:
$ kubectl -n monitoring get pods
$ kubectl -n kube-system get pods
We typically test the content of this repository with clusters created with bootkube, but it would be great to get a section/guide for kops, as it's pretty widely adopted as well.
Thanks for the quick response. I don't see them on the `/targets` page. Here is the info requested:
```
NAMESPACE NAME READY STATUS RESTARTS AGE
athena-graphql athena-graphql-cmd-3150689734-xqkkq 1/1 Running 0 4d
deis deis-builder-2759337600-en6fr 1/1 Running 1 17d
deis deis-controller-873470834-re8zj 1/1 Running 0 17d
deis deis-database-1712966127-plp30 1/1 Running 0 17d
deis deis-logger-9212198-xjpsj 1/1 Running 3 17d
deis deis-logger-fluentd-0ppwl 1/1 Running 0 17d
deis deis-logger-fluentd-a8zly 1/1 Running 0 17d
deis deis-logger-redis-663064164-3wq80 1/1 Running 0 17d
deis deis-monitor-grafana-432364990-joygg 1/1 Running 0 17d
deis deis-monitor-influxdb-2729526471-7npow 1/1 Running 0 17d
deis deis-monitor-telegraf-2poea 0/1 CrashLoopBackOff 112 17d
deis deis-monitor-telegraf-cy218 1/1 Running 0 17d
deis deis-nsqd-3264449345-c1t0c 1/1 Running 0 17d
deis deis-registry-680832981-64b4s 1/1 Running 0 17d
deis deis-registry-proxy-94w6s 1/1 Running 0 17d
deis deis-registry-proxy-p46bn 1/1 Running 0 17d
deis deis-router-2457652422-sx6c7 1/1 Running 0 17d
deis deis-workflow-manager-2210821749-ggzm3 1/1 Running 0 17d
hades-graphql hades-graphql-cmd-1371319327-6sqb6 1/1 Running 0 8m
kube-system dns-controller-2613152787-l8bj4 1/1 Running 0 17d
kube-system etcd-server-events-ip-10-101-115-158.ec2.internal 1/1 Running 0 17d
kube-system etcd-server-ip-10-101-115-158.ec2.internal 1/1 Running 0 17d
kube-system kube-apiserver-ip-10-101-115-158.ec2.internal 1/1 Running 2 17d
kube-system kube-controller-manager-ip-10-101-115-158.ec2.internal 1/1 Running 0 17d
kube-system kube-dns-v20-3531996453-ban95 3/3 Running 0 17d
kube-system kube-dns-v20-3531996453-v66h9 3/3 Running 0 17d
kube-system kube-proxy-ip-10-101-115-158.ec2.internal 1/1 Running 0 17d
kube-system kube-proxy-ip-10-101-175-18.ec2.internal 1/1 Running 0 17d
kube-system kube-scheduler-ip-10-101-115-158.ec2.internal 1/1 Running 0 17d
monitoring alertmanager-main-0 2/2 Running 0 4d
monitoring alertmanager-main-1 2/2 Running 0 4d
monitoring alertmanager-main-2 2/2 Running 0 4d
monitoring grafana-874468113-0atmz 2/2 Running 0 4d
monitoring kube-state-metrics-3229993571-aqo7z 1/1 Running 0 4d
monitoring node-exporter-5xyxb 1/1 Running 0 4d
monitoring node-exporter-xwgn6 1/1 Running 0 4d
monitoring prometheus-k8s-0 3/3 Running 0 4d
monitoring prometheus-operator-479044303-ris0n 1/1 Running 0 4d
splunkspout k8ssplunkspout-nonprod-2ykwk 1/1 Running 0 17d
splunkspout k8ssplunkspout-nonprod-xdp7d 1/1 Running 0 17d
styleguide styleguide-cmd-685725177-pwl1c 1/1 Running 0 4d
styleguide-staging styleguide-staging-cmd-2993321210-1cgmo 1/1 Running 0 2h
wellbot wellbot-web-3878855632-34s4e 1/1 Running 0 15d
welltok-arch-k8s welltok-arch-k8s-1857575956-ky06l 1/1 Running 0 14d
```
I believe I have seen this before; the problem, I think, is that kops doesn't label the Kubernetes component pods correctly with `k8s-app=<component-name>`. To confirm, can you give me the output of `kubectl -n kube-system get pod kube-controller-manager-ip-10-101-115-158.ec2.internal -o yaml` and `kubectl -n kube-system get pod kube-scheduler-ip-10-101-115-158.ec2.internal -o yaml` (or, in case they got rescheduled, of the pods that now start with `kube-controller-manager` or `kube-scheduler` respectively)?

If what I am guessing is correct, then we should push on the kops side to use upstream manifests like bootkube does.
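For illustration, the discovery in `manifests/k8s/` selects the component pods by that label, so the pod metadata would need to look roughly like this (a hypothetical sketch; only the label matters here):

```yaml
# Hypothetical sketch: pod metadata as kube-prometheus's Services expect it.
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler-ip-10-101-115-158.ec2.internal
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler   # the label the Service selector matches on
```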
This was a similar issue, adding that label to kube-proxy: https://github.com/kubernetes/kops/pull/617
I didn't see that one, thanks for pointing it out! I opened https://github.com/kubernetes/kops/issues/1226 to start a discussion on it, so hopefully we will get those soon. In the meantime, I think you'll have to either ssh onto those servers, change the templates, and restart them (which makes the objects be recreated from the templates, IIRC; disclaimer: not super familiar with kops), or comment out/remove those alerts for now.

(Also remember that changes to single machines will disappear when machines are recreated from the ASG unless you make the changes to the ASG.)
I just noticed etcd is not appearing in my Prometheus targets either.

Oh, and kube-dns. Should we update https://github.com/kubernetes/kops/issues/1226 as well?
It seems like we won't have an answer before the holidays, so I'll keep pushing in the new year. But yes, I will keep pushing for a consistent labelling strategy, and once we have it we'll add the respective manifests here for Prometheus to properly discover the components. I don't mind maintaining a set of manifests for kops, bootkube, etc., as long as each of those labelling strategies makes sense and exists.
So far so good :) Happy holidays!
I added the labels on the master (`/etc/kubernetes/manifests/kube-controller-manager`, `kube-scheduler`) and then ran `kubectl create -f manifests/k8s/self-hosted`. I ran `hack/cluster-monitoring/teardown` and then `hack/cluster-monitoring/deploy`, and that fixed the alerts for kube-scheduler and kube-controller-manager.

kube-dns now has four endpoints listed under `/targets`, and they are all getting the error `getsockopt: connection refused`.

I'd also like to add etcd, but I can't find any explicit instructions on that. It would be great to add a kops default setup in `manifests/k8s/kops/`.
Yep that's the plan as soon as we have consistent labeling in upstream kops.
Labelling added to kops: https://github.com/kubernetes/kops/pull/1314
I don't have a v1.5.x kops cluster handy, but I'll create the manifests with a best effort and then it would be great if you could test them.
With pleasure. Thanks
In fact, I think the manifests from `manifests/k8s/self-hosted` are suitable for kops when using the latest master, am I mistaken? Except `kube-dns`, as that manifest is slightly outdated compared to the upstream one, but there are no alerts for `kube-dns` metrics yet, so that wouldn't be a problem for now. Can you confirm that?

Actually, it seems that the `kube-dns` manifest for v1.5.0 has been updated, so in that case it should appear as well.
I'll create a 1.5.x K8s cluster with the latest kops soon to test, thanks. For now I updated the labels on my 1.4.6 master and it looks good, except:
kube-dns:

| Endpoint | State | Labels | Last Scrape | Error |
| --- | --- | --- | --- | --- |
| http://100.96.1.2:10054/metrics | DOWN | instance="100.96.1.2:10054" | 198ms ago | Get http://100.96.1.2:10054/metrics: dial tcp 100.96.1.2:10054: getsockopt: connection refused |
| http://100.96.1.2:10055/metrics | DOWN | instance="100.96.1.2:10055" | 14.087s ago | Get http://100.96.1.2:10055/metrics: dial tcp 100.96.1.2:10055: getsockopt: connection refused |
| http://100.96.1.3:10054/metrics | DOWN | instance="100.96.1.3:10054" | 9.708s ago | Get http://100.96.1.3:10054/metrics: dial tcp 100.96.1.3:10054: getsockopt: connection refused |
| http://100.96.1.3:10055/metrics | DOWN | instance="100.96.1.3:10055" | 327ms ago | Get http://100.96.1.3:10055/metrics: dial tcp 100.96.1.3:10055: getsockopt: connection refused |

kubernetes:

| Endpoint | State | Labels | Last Scrape | Error |
| --- | --- | --- | --- | --- |
| https://10.101.115.158:443/metrics | DOWN | instance="10.101.115.158:443" | 11.904s ago | Get https://10.101.115.158:443/metrics: x509: certificate is valid for 100.64.0.1, not 10.101.115.158 |
The `kube-dns` failure is likely due to an old `kube-dns` manifest; I can see that the 1.4.x manifest has not been updated to expose metrics in the upstream kops repo.

The `kubernetes` target failure is a bit more tricky. Best would be if the certificate were created with the requested IP in the additional names section (inspect your cert with `openssl x509 -text -in your.crt`). The other option is to "manually" maintain an `Endpoints` object through a headless `Service`, as done for etcd (see `manifests/etcd`); that way the correct IP will be used to perform the request.
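A minimal sketch of that second option, modelled on the etcd approach (the names, labels, port, and IP here are illustrative, not the exact content of `manifests/etcd`):

```yaml
# Hypothetical sketch: a headless Service plus a hand-maintained Endpoints
# object, so Prometheus scrapes exactly the address you specify.
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  clusterIP: None            # headless: the listed endpoints are used directly
  ports:
  - name: api
    port: 2379
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s             # must match the Service name
  namespace: monitoring
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 10.101.115.158       # example: the IP the target actually listens on
  ports:
  - name: api
    port: 2379
```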
@brancz I don't know whether we want to continue this thread for the kops/kube-prometheus work; let me know if there's a better place, or maybe we should open a new issue. I used kops latest (master: Version git-8b21ace) and Kubernetes 1.5.1 to create a new cluster in AWS. Running `hack/cluster-monitoring/deploy`:
--- github/kube-prometheus ‹master› » ./hack/cluster-monitoring/deploy
namespace "monitoring" created
deployment "prometheus-operator" created
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
No resources found.
No resources found.
No resources found.
deployment "kube-state-metrics" created
service "kube-state-metrics" created
daemonset "node-exporter" created
service "node-exporter" created
configmap "grafana-dashboards" created
deployment "grafana" created
service "grafana" created
configmap "prometheus-k8s" created
configmap "prometheus-k8s-rules" created
service "prometheus-k8s" created
prometheus "prometheus-k8s" created
configmap "alertmanager-main" created
service "alertmanager-main" created
alertmanager "alertmanager-main" created
Has this been solved on upstream kops? @dmcnaught
I'm going to start on the kops/kube-prometheus config when kops 1.5 has been released.
Great, thanks for the update! Are you aware of an ETA?
I've heard "soon" - it's currently in alpha4: https://github.com/kubernetes/kops/releases
Great! Looking forward to "soon" 🙂
Me too. I thought it would be "sooner" 😉
Getting close with kops 1.5.0-alpha2 and k8s 1.5.2 ^ Just the api cert issue to go. 😄
Looks like this is also the case with clusters created via acs-engine on Azure. The labels on the controller-manager pod are:
labels:
  component: kube-controller-manager
  tier: control-plane
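For such clusters one could, hypothetically, adapt the discovery by selecting on that label instead of `k8s-app`; a sketch (the port and names are assumptions, not manifests shipped by this repository):

```yaml
# Hypothetical: a headless Service whose selector matches the
# kubeadm/acs-engine-style `component` label, so an Endpoints object
# gets populated and Prometheus can discover the pods.
apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
spec:
  clusterIP: None
  selector:
    component: kube-controller-manager   # instead of k8s-app=kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252                          # controller-manager metrics port (assumed)
```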
Same with a cluster created using kubeadm.
@yann-soubeyrand for kubeadm clusters you need to enable the controller manager and scheduler to listen on all or at least the pod networking interface/ip.
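On kubeadm clusters that typically means editing the static pod manifests on the control-plane node; a hedged sketch (the path and the `--address` flag name are assumptions and vary by Kubernetes version):

```yaml
# Hypothetical excerpt of /etc/kubernetes/manifests/kube-scheduler.yaml:
# let the scheduler listen beyond 127.0.0.1 so Prometheus can scrape it.
spec:
  containers:
  - command:
    - kube-scheduler
    - --address=0.0.0.0   # or a node/pod-network IP; newer versions use --bind-address
```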
@brancz Thanks for the tip on modifying the listening addresses, which saved me some time ;-) However, I was pointing out that the labeling done by kubeadm is like @rocketraman wrote above, and therefore kube-prometheus was not able to discover the controller manager, the scheduler, or etcd.
@brancz Can confirm what @yann-soubeyrand and @rocketraman have said: `kubeadm` and `gke` use `component: kube-scheduler`, not `k8s-app`.
> @yann-soubeyrand for kubeadm clusters you need to enable the controller manager and scheduler to listen on all or at least the pod networking interface/ip.

I changed the bind address of the controller manager and scheduler to `0.0.0.0`, but they are still not up in Prometheus. Also, there is no data in Grafana.
Installed with kops 1.4.1, K8s 1.4.6 on AWS.

It looks to me like the query is set to alert when there is one kube-scheduler (or kube-controller-manager), which I don't understand. I'm pretty new to Prometheus queries and I'm not really sure how the `BY (cluster) == 0` part relates. Any pointers appreciated. Thanks for the great project! --Duncan