prometheus-operator / kube-prometheus

Use Prometheus to monitor Kubernetes and applications running on Kubernetes
https://prometheus-operator.dev/
Apache License 2.0

Two alerts failing out of the box: K8SControllerManagerDown and K8SSchedulerDown #23

dmcnaught closed this issue 7 years ago

dmcnaught commented 7 years ago

Installed with KOPS 1.4.1, K8s 1.4.6 on AWS.

It looks to me like the query is set to alert when there is one kube-scheduler (or kube-controller-manager), which I don't understand.

ALERT K8SSchedulerDown
  IF absent(up{job="kube-scheduler"}) or (count(up{job="kube-scheduler"} == 1) BY (cluster) == 0)

I'm pretty new to Prometheus queries and I'm not really sure how the BY (cluster) == 0 part relates. Any pointers appreciated. Thanks for the great project! --Duncan

brancz commented 7 years ago

The BY (cluster) == 0 part is unlikely to matter in most setups; it just allows the rule to monitor/alert on more than one Kubernetes cluster (e.g. when using federation). I'm guessing your alert is triggering because the kube-scheduler/kube-controller-manager jobs are absent entirely. Could you make sure you can find them on the /targets page? If they are not there, what is likely happening is that there is no Endpoints object listing the kube-scheduler and kube-controller-manager pods. That means you either don't have the Services from manifests/k8s/ created, or your kube-scheduler and kube-controller-manager are not discoverable via those Services, in which case the output of the following would help:

$ kubectl get pods --all-namespaces

Or, if you cannot disclose all of that, the following should give us the applicable information:

$ kubectl -n monitoring get pods
$ kubectl -n kube-system get pods

We typically test the content of this repository with clusters created with bootkube, but it would be great if we could get a section/guide for kops as well, since it's pretty widely adopted.
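For reference, here is a minimal sketch of the kind of discovery Service that manifests/k8s/ provides for kube-scheduler (the name, port, and label are illustrative and assume the pod carries the k8s-app label; kube-controller-manager gets an analogous Service on its own metrics port):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler-prometheus-discovery   # illustrative name
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler
spec:
  clusterIP: None              # headless, so the pod IPs end up in the Endpoints object
  selector:
    k8s-app: kube-scheduler    # only matches pods that actually carry this label
  ports:
  - name: http-metrics
    port: 10251                # default kube-scheduler metrics port in these releases
    targetPort: 10251
```

If no pods match that selector, the Endpoints stay empty and the job never shows up under /targets.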

dmcnaught commented 7 years ago

Thanks for the quick response. I don't see them on the /targets page. Here is the info requested:


```
NAMESPACE            NAME                                                     READY     STATUS             RESTARTS   AGE
athena-graphql       athena-graphql-cmd-3150689734-xqkkq                      1/1       Running            0          4d
deis                 deis-builder-2759337600-en6fr                            1/1       Running            1          17d
deis                 deis-controller-873470834-re8zj                          1/1       Running            0          17d
deis                 deis-database-1712966127-plp30                           1/1       Running            0          17d
deis                 deis-logger-9212198-xjpsj                                1/1       Running            3          17d
deis                 deis-logger-fluentd-0ppwl                                1/1       Running            0          17d
deis                 deis-logger-fluentd-a8zly                                1/1       Running            0          17d
deis                 deis-logger-redis-663064164-3wq80                        1/1       Running            0          17d
deis                 deis-monitor-grafana-432364990-joygg                     1/1       Running            0          17d
deis                 deis-monitor-influxdb-2729526471-7npow                   1/1       Running            0          17d
deis                 deis-monitor-telegraf-2poea                              0/1       CrashLoopBackOff   112        17d
deis                 deis-monitor-telegraf-cy218                              1/1       Running            0          17d
deis                 deis-nsqd-3264449345-c1t0c                               1/1       Running            0          17d
deis                 deis-registry-680832981-64b4s                            1/1       Running            0          17d
deis                 deis-registry-proxy-94w6s                                1/1       Running            0          17d
deis                 deis-registry-proxy-p46bn                                1/1       Running            0          17d
deis                 deis-router-2457652422-sx6c7                             1/1       Running            0          17d
deis                 deis-workflow-manager-2210821749-ggzm3                   1/1       Running            0          17d
hades-graphql        hades-graphql-cmd-1371319327-6sqb6                       1/1       Running            0          8m
kube-system          dns-controller-2613152787-l8bj4                          1/1       Running            0          17d
kube-system          etcd-server-events-ip-10-101-115-158.ec2.internal        1/1       Running            0          17d
kube-system          etcd-server-ip-10-101-115-158.ec2.internal               1/1       Running            0          17d
kube-system          kube-apiserver-ip-10-101-115-158.ec2.internal            1/1       Running            2          17d
kube-system          kube-controller-manager-ip-10-101-115-158.ec2.internal   1/1       Running            0          17d
kube-system          kube-dns-v20-3531996453-ban95                            3/3       Running            0          17d
kube-system          kube-dns-v20-3531996453-v66h9                            3/3       Running            0          17d
kube-system          kube-proxy-ip-10-101-115-158.ec2.internal                1/1       Running            0          17d
kube-system          kube-proxy-ip-10-101-175-18.ec2.internal                 1/1       Running            0          17d
kube-system          kube-scheduler-ip-10-101-115-158.ec2.internal            1/1       Running            0          17d
monitoring           alertmanager-main-0                                      2/2       Running            0          4d
monitoring           alertmanager-main-1                                      2/2       Running            0          4d
monitoring           alertmanager-main-2                                      2/2       Running            0          4d
monitoring           grafana-874468113-0atmz                                  2/2       Running            0          4d
monitoring           kube-state-metrics-3229993571-aqo7z                      1/1       Running            0          4d
monitoring           node-exporter-5xyxb                                      1/1       Running            0          4d
monitoring           node-exporter-xwgn6                                      1/1       Running            0          4d
monitoring           prometheus-k8s-0                                         3/3       Running            0          4d
monitoring           prometheus-operator-479044303-ris0n                      1/1       Running            0          4d
splunkspout          k8ssplunkspout-nonprod-2ykwk                             1/1       Running            0          17d
splunkspout          k8ssplunkspout-nonprod-xdp7d                             1/1       Running            0          17d
styleguide           styleguide-cmd-685725177-pwl1c                           1/1       Running            0          4d
styleguide-staging   styleguide-staging-cmd-2993321210-1cgmo                  1/1       Running            0          2h
wellbot              wellbot-web-3878855632-34s4e                             1/1       Running            0          15d
welltok-arch-k8s     welltok-arch-k8s-1857575956-ky06l                        1/1       Running            0          14d
```

brancz commented 7 years ago

I believe I have seen this before; the problem, I think, is that kops doesn't label the Kubernetes component pods correctly with k8s-app=<component-name>. To confirm, can you give me the output of kubectl -n kube-system get pod kube-controller-manager-ip-10-101-115-158.ec2.internal -oyaml and kubectl -n kube-system get pod kube-scheduler-ip-10-101-115-158.ec2.internal -oyaml (or, in case they got rescheduled, of the pods that now start with kube-controller-manager or kube-scheduler respectively)?

If what I am guessing is correct, then we should push on the kops side to use the upstream manifests like bootkube does.

dmcnaught commented 7 years ago

kube-controller-manager.txt kube-scheduler.txt

dmcnaught commented 7 years ago

This was a similar issue - to add that label to kube-proxy: https://github.com/kubernetes/kops/pull/617

brancz commented 7 years ago

I didn't see that one, thanks for pointing it out! I opened https://github.com/kubernetes/kops/issues/1226 to start a discussion on it. Hopefully we will get those labels soon. In the meantime, I think you'll have to either SSH onto those servers, change the templates, and restart them (which makes the objects be recreated from the templates, IIRC; disclaimer: not super familiar with kops), or comment out/remove those alerts for now.

(also remember that changes to single machines will disappear when recreating machines from the ASG unless you make the changes to the ASG)

dmcnaught commented 7 years ago

I just noticed etcd is not appearing in my prometheus targets either.

dmcnaught commented 7 years ago

Oh, and kube-dns. Should we update https://github.com/kubernetes/kops/issues/1226 ?

brancz commented 7 years ago

It seems like we won't have an answer before the holidays, so I'll keep pushing in the new year. But yes, I will keep pushing for a consistent labelling strategy, and once we have it we'll add the respective manifests here so Prometheus can properly discover the components. I don't mind maintaining a set of manifests for kops, bootkube, etc. as long as each of those labelling strategies makes sense and exists.

So far so good :) happy holidays!

dmcnaught commented 7 years ago

I added the labels on the master (/etc/kubernetes/manifests/kube-controller-manager and kube-scheduler) and then ran kubectl create -f manifests/k8s/self-hosted. After running hack/cluster-monitoring/teardown and then hack/cluster-monitoring/deploy, the alerts for kube-scheduler and kube-controller-manager are fixed. kube-dns now has four endpoints listed under /targets, but they are all getting error: getsockopt: connection refused. I'd also like to add etcd, but I don't find any explicit instructions on that. It would be great to add a kops default setup in manifests/k8s/kops/.
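For anyone hitting the same thing, the change to the static pod manifests was roughly the following. This is only a sketch; the exact file names and surrounding fields on a kops master may differ:

```yaml
# /etc/kubernetes/manifests/kube-scheduler.manifest (excerpt, sketch; file name may differ)
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler   # the label the discovery Services select on
# spec: ... container spec left unchanged ...
```

The same k8s-app label goes onto the kube-controller-manager manifest.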

brancz commented 7 years ago

Yep that's the plan as soon as we have consistent labeling in upstream kops.

dmcnaught commented 7 years ago

Labelling added to kops: https://github.com/kubernetes/kops/pull/1314

brancz commented 7 years ago

I don't have a v1.5.x kops cluster handy, but I'll create the manifests on a best-effort basis, and then it would be great if you could test them.

dmcnaught commented 7 years ago

With pleasure. Thanks

brancz commented 7 years ago

In fact, I think the manifests from manifests/k8s/self-hosted are suitable for kops when using the latest master, am I mistaken? The exception is kube-dns, as that manifest is slightly outdated compared to the upstream one, but there are no alerts for kube-dns metrics yet, so that wouldn't be a problem for now. Can you confirm?

brancz commented 7 years ago

Actually, it seems that the kube-dns manifest for v1.5.0 has been updated, so in that case it should appear as well.

dmcnaught commented 7 years ago

I'll create a 1.5.x K8s cluster with the latest KOPS soon to test, thanks.

Right now I updated the labels on my 1.4.6 master and it looks good, except for the following (screenshot omitted):

brancz commented 7 years ago

The kube-dns failure is likely due to an old kube-dns manifest; I can see that the 1.4.x manifest has not been updated to expose metrics in the upstream kops repo.

The kubernetes target failure is a bit more tricky. The best fix would be for the certificate to be created with the requested IP in the additional names (SAN) section (inspect your cert with openssl x509 -text -in your.crt). The other option is to "manually" maintain an Endpoints object through a headless Service, as done for etcd (see manifests/etcd); that way the correct IP will be used to perform the request.
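As a rough sketch of that second option, along the lines of manifests/etcd (the name, namespace, port, and IP here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s            # illustrative name
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  clusterIP: None           # headless: the manually listed IPs below are used directly
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s            # must match the Service name
  namespace: monitoring
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 10.101.115.158      # replace with your etcd node IP(s)
  ports:
  - name: api
    port: 2379
    protocol: TCP
```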

dmcnaught commented 7 years ago

@brancz I don't know whether we want to continue this thread for the kops/kube-prometheus work; let me know if there's a better place. Maybe we should open a new issue. I used kops latest (master: Version git-8b21ace) and Kubernetes 1.5.1 to create a new cluster in AWS. Running hack/cluster-monitoring/deploy:

```
--- github/kube-prometheus ‹master› » ./hack/cluster-monitoring/deploy
namespace "monitoring" created
deployment "prometheus-operator" created
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
the server doesn't have a resource type "servicemonitor"
No resources found.
No resources found.
No resources found.
deployment "kube-state-metrics" created
service "kube-state-metrics" created
daemonset "node-exporter" created
service "node-exporter" created
configmap "grafana-dashboards" created
deployment "grafana" created
service "grafana" created
configmap "prometheus-k8s" created
configmap "prometheus-k8s-rules" created
service "prometheus-k8s" created
prometheus "prometheus-k8s" created
configmap "alertmanager-main" created
service "alertmanager-main" created
alertmanager "alertmanager-main" created
brancz commented 7 years ago

Has this been solved on upstream kops? @dmcnaught

dmcnaught commented 7 years ago

I'm going to start on the kops kube-prometheus config when kops 1.5 has been released.

brancz commented 7 years ago

Great thanks for the update! Are you aware of an ETA?

dmcnaught commented 7 years ago

I've heard "soon" - it's currently in alpha4: https://github.com/kubernetes/kops/releases

brancz commented 7 years ago

Great! Looking forward to "soon" 🙂

dmcnaught commented 7 years ago

Me too. I thought it would be "sooner" 😉

dmcnaught commented 7 years ago

Getting close with kops 1.5.0-alpha2 and k8s 1.5.2 ^ Just the api cert issue to go. 😄

rocketraman commented 6 years ago

Looks like this is also the case with clusters created via acs-engine on Azure. The labels on the controller-manager pod are:

```yaml
  labels:
    component: kube-controller-manager
    tier: control-plane
```

yann-soubeyrand commented 6 years ago

Same with a cluster created using Kubeadm.

brancz commented 6 years ago

@yann-soubeyrand for kubeadm clusters you need to make the controller manager and scheduler listen on all interfaces, or at least on the pod network interface/IP.
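A rough sketch of what that looks like in the kubeadm static pod manifests under /etc/kubernetes/manifests/ (flag names depend on the release; older versions use --address instead of --bind-address):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt, sketch)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --bind-address=0.0.0.0   # listen beyond 127.0.0.1 so Prometheus can scrape it
    # ... remaining flags unchanged ...
```

The same change applies to kube-scheduler.yaml; the kubelet picks up the edited manifests and restarts the pods.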

yann-soubeyrand commented 6 years ago

@brancz Thanks for the tip on modifying the listening addresses, which saved me some time ;-) However, I was pointing out that the labelling done by kubeadm is like rocketraman wrote above, and therefore kube-prometheus was not able to discover the controller manager, the scheduler, or etcd.

rawkode commented 6 years ago

@brancz Can confirm what @yann-soubeyrand and @rocketraman have said: kubeadm and GKE use component: kube-scheduler, not k8s-app.
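One possible workaround until the labels are aligned is to point the discovery Service's selector at the label those distributions actually set. A sketch (name and port are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler-prometheus-discovery   # illustrative name
  namespace: kube-system
  labels:
    k8s-app: kube-scheduler      # keep the label the monitoring config selects on
spec:
  clusterIP: None
  selector:
    component: kube-scheduler    # the label kubeadm/GKE actually put on the pod
  ports:
  - name: http-metrics
    port: 10251
```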

KeithTt commented 3 years ago

> @yann-soubeyrand for kubeadm clusters you need to enable the controller manager and scheduler to listen on all or at least the pod networking interface/ip.

I changed the bind address of the controller manager and scheduler to 0.0.0.0, but they are still not shown as up in Prometheus.

(screenshot omitted)

Also, there is no data in Grafana.