prometheus-operator / kube-prometheus

Use Prometheus to monitor Kubernetes and applications running on Kubernetes
https://prometheus-operator.dev/
Apache License 2.0

KubeClientCertificateExpiration always alert #881

Closed ne1000 closed 4 months ago

ne1000 commented 6 years ago

What did you do? Downloaded the release tarball (wget https://codeload.github.com/coreos/prometheus-operator/tar.gz/v0.23.1) and installed prometheus-operator with kubectl create -f prometheus-operator-0.23.1/contrib/kube-prometheus/manifests/ || true

What did you expect to see? All components working correctly.

Environment


[2] Firing
Labels
alertname = KubeClientCertificateExpiration
job = apiserver
prometheus = monitoring/k8s
severity = critical
Annotations
message = Kubernetes API certificate is expiring in less than 1 day.
runbook_url = https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration

Labels
alertname = KubeClientCertificateExpiration
job = apiserver
prometheus = monitoring/k8s
severity = warning
Annotations
message = Kubernetes API certificate is expiring in less than 7 days.
runbook_url = https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration
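For context, the alert itself comes from the kubernetes-mixin rules. Its shape is roughly the following (a sketch from memory; the exact rate window and thresholds vary by release):

```yaml
# Approximate shape of the mixin's warning rule (not the exact shipped text):
- alert: KubeClientCertificateExpiration
  expr: |
    apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0
    and
    histogram_quantile(0.01,
      sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))
    ) < 604800
  labels:
    severity: warning
  annotations:
    message: A client certificate used to connect to the apiserver is expiring in less than 7 days.
```

Note that the histogram is populated by the apiserver from the client certificates presented to it, which is why the serving certificates below can all look fine while the alert still fires.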

I used cfssl to generate the PEM certificates and keys:

# openssl x509 -in /etc/kubernetes/ssl/ca.pem -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            6f:b9:70:eb:80:73:e6:73:f9:c8:29:98:99:5e:b5:f2:6d:a3:0e:49
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CN, ST=Shanghai, L=Shanghai, O=k8s, OU=System, CN=kubernetes
        Validity
            Not Before: Aug  8 09:54:00 2018 GMT
            Not After : Aug  7 09:54:00 2023 GMT
        Subject: C=CN, ST=Shanghai, L=Shanghai, O=k8s, OU=System, CN=kubernetes
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:c8:ae:16:d6:0c:5b:30:95:97:a2:5b:16:cf:db:
                    f1:bd:68:8c:c6:0c:84:5b:a4:46:b4:79:0b:2b:c4:
                    b2:c0:5f:ab:e4:4a:33:46:d3:82:a3:33:bf:a7:f7:
                    ec:a3:4e:b3:70:34:e8:15:24:8e:56:b7:4d:68:9b:
                    e0:dc:0a:3a:3c:36:3e:f2:5c:be:d1:5d:fa:fa:e0:
                    7d:5b:2a:5d:e2:fc:94:9f:ea:a9:ce:ca:ad:2f:fd:
                    16:bc:fb:83:f6:45:fd:2f:9a:ac:94:e3:fd:49:90:
                    a1:31:95:cd:f2:30:2b:cd:31:34:69:b1:3a:b8:6a:
                    b8:7a:ef:f1:e9:ee:a2:5d:81:a8:59:80:77:c1:43:
                    85:3c:29:d8:02:fb:24:b9:9a:1f:e4:61:82:ec:8d:
                    49:3d:91:f7:0a:50:25:b1:a4:51:ba:f3:d6:77:07:
                    e2:50:ed:b8:af:30:18:d8:23:d6:e9:17:b1:a0:1c:
                    8c:74:f3:87:56:08:c7:49:86:c0:90:5e:16:a4:1e:
                    07:49:ef:b2:dc:9e:22:4c:b9:9b:7f:38:47:d7:26:
                    17:15:92:79:51:cc:a9:3f:4b:a1:6d:03:94:5b:9c:
                    03:c0:19:7e:d1:4e:c9:77:84:b1:e4:5b:a6:2b:54:
                    95:d0:a3:ef:39:d6:c3:88:77:af:4f:31:cd:ba:f7:
                    cc:3b
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Certificate Sign, CRL Sign
            X509v3 Basic Constraints: critical
                CA:TRUE, pathlen:2
            X509v3 Subject Key Identifier: 
                BC:9F:D1:BD:4C:26:E1:77:C0:7F:CF:04:3E:DF:64:86:BE:23:F3:7F
            X509v3 Authority Key Identifier: 
                keyid:BC:9F:D1:BD:4C:26:E1:77:C0:7F:CF:04:3E:DF:64:86:BE:23:F3:7F

    Signature Algorithm: sha256WithRSAEncryption
         78:b7:65:4d:53:e1:0c:7d:d6:9e:d5:aa:f8:1a:34:e4:1d:c0:
         22:4b:42:72:86:86:e9:73:e2:fd:89:90:e1:10:56:a7:f2:15:
         71:14:79:ce:67:9a:ca:5d:4d:e8:25:3d:70:2a:0a:3b:08:09:
         02:8a:d9:2d:ed:85:cd:10:38:60:75:d7:f5:a7:b2:ee:86:05:
         dd:50:38:04:a4:7a:bc:f5:02:b2:a5:d9:a2:a1:71:7d:e5:ce:
         dd:c8:5a:a7:25:61:de:c3:76:c3:87:3e:5a:4c:eb:36:91:51:
         8b:fc:ef:9d:aa:35:58:3a:ba:fc:2a:3c:4f:b3:54:e8:0d:a5:
         32:25:91:dd:93:75:33:53:2b:94:9e:f1:cb:e9:58:17:a6:dc:
         07:1c:96:5e:93:40:d6:c8:2b:67:49:3b:3f:1f:a8:3a:41:65:
         29:03:f3:18:f9:d3:66:a8:49:14:1e:7f:cb:6b:f6:26:1d:7b:
         6f:46:c6:27:a1:69:fe:62:7f:da:fb:41:7d:fc:ab:12:77:b8:
         b3:4c:92:a5:5c:d2:8c:25:a1:aa:1e:2f:a2:de:38:e5:9a:96:
         2f:b2:bb:3c:32:de:db:7f:80:eb:f0:01:be:2d:ff:00:09:35:
         ea:2b:8d:33:6e:6c:2c:6d:37:a2:c4:b3:c9:eb:ac:3f:ec:e5:
         5d:61:50:66
# openssl x509 -in /etc/kubernetes/ssl/kubernetes.pem -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            76:64:c7:59:95:aa:fb:9b:8c:b2:26:c0:82:24:c5:0a:8d:95:a2:1e
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CN, ST=Shanghai, L=Shanghai, O=k8s, OU=System, CN=kubernetes
        Validity
            Not Before: Aug  8 09:54:00 2018 GMT
            Not After : Aug  5 09:54:00 2028 GMT
        Subject: C=CN, ST=Shanghai, L=Shanghai, O=k8s, OU=System, CN=kubernetes
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:b9:7e:1b:a9:9a:95:21:42:5a:e8:3e:79:94:e6:
                    c1:35:87:93:22:3d:3c:c9:65:be:b6:99:4b:47:25:
                    1a:22:db:4a:a5:b8:59:0d:2d:a0:0d:e5:c6:35:3b:
                    8e:2c:e3:fe:3a:d9:bc:63:9b:a0:98:c2:26:98:4c:
                    be:8b:71:20:37:a3:19:21:34:03:0b:10:d7:cb:7c:
                    b6:d8:68:90:1b:e1:6b:ee:b8:0e:6f:3d:33:2b:3f:
                    87:9a:4f:6c:59:08:f4:22:a6:2a:b6:d5:d6:00:b8:
                    7e:3c:90:aa:99:5c:6e:7c:93:f2:6b:6a:6f:5b:c6:
                    35:60:e0:14:62:5e:91:cc:20:eb:88:ea:cc:7a:10:
                    d7:f1:5f:b3:fb:aa:c4:a7:f5:95:3e:8a:44:ee:09:
                    12:6b:aa:29:05:40:df:1e:54:25:05:e2:8c:cb:d7:
                    32:e8:c5:ff:0c:48:11:27:c9:52:81:f2:53:b0:82:
                    b0:1b:7f:ad:08:fd:cd:b6:c1:4e:43:da:2d:f0:90:
                    90:cb:97:a2:2a:31:bc:65:2c:9f:a9:72:90:dd:b0:
                    5e:3b:7d:1c:37:d6:ca:22:13:2a:da:27:1d:61:94:
                    8f:36:9f:9d:6a:d1:6c:b9:17:58:5d:9c:0d:b1:d8:
                    2a:98:f1:54:d7:87:c6:da:ff:05:c9:a2:c5:91:5a:
                    77:23
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                B3:1C:65:F4:DA:61:57:1F:68:06:05:46:36:31:BC:AF:E1:D5:06:7C
            X509v3 Authority Key Identifier: 
                keyid:BC:9F:D1:BD:4C:26:E1:77:C0:7F:CF:04:3E:DF:64:86:BE:23:F3:7F

            X509v3 Subject Alternative Name: 
                DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster, DNS:kubernetes.default.svc.cluster.local, IP Address:127.0.0.1, IP Address:192.168.2.93, IP Address:10.100.0.1, IP Address:192.168.2.86, IP Address:192.168.2.87, IP Address:192.168.2.88
    Signature Algorithm: sha256WithRSAEncryption
         2d:a6:ee:28:71:0f:ea:69:ff:90:25:d6:04:4e:4c:e1:3d:ff:
         34:f1:64:67:4f:ab:80:ee:f5:d9:16:53:48:0c:c4:fd:9a:f0:
         09:13:71:b1:ba:52:b0:36:38:6b:51:be:ac:cc:14:30:2b:e7:
         a9:87:00:76:fe:1a:58:72:45:27:0a:59:51:74:65:6a:30:ea:
         37:f3:c9:79:59:f0:09:87:e9:94:99:00:11:d7:20:9c:90:5c:
         de:ee:09:ff:53:07:41:06:4c:91:8d:8a:d1:d5:ff:30:06:3b:
         53:32:4c:dd:70:f0:22:7f:7d:e6:02:f2:eb:a6:fd:5a:de:d6:
         0d:fa:b5:e9:f0:95:5a:79:bb:f9:b5:a5:47:01:13:3f:b0:12:
         c6:35:11:45:2f:6b:f3:71:26:92:8f:34:90:0f:42:d8:2a:12:
         0f:ad:96:1f:60:54:5c:27:f3:0f:c3:4e:f5:ef:58:75:51:7a:
         df:8c:f3:b2:d4:b8:70:99:ff:e3:5a:ee:a9:00:69:84:a3:c2:
         df:7e:9b:55:e1:ab:92:bb:55:8b:54:6c:aa:05:c4:ea:29:8e:
         56:72:15:11:c2:6e:49:72:b5:d7:30:06:7b:c4:a2:0a:82:87:
         19:83:b7:1e:3a:86:02:35:f5:21:e8:e6:bf:5e:51:c0:ec:f0:
         c1:3d:15:35
# openssl x509 -in /etc/kubernetes/ssl/admin.pem -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            30:7e:a9:d4:1c:0a:04:d7:3b:2a:38:7a:b3:ca:25:fb:65:e3:e6:72
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CN, ST=Shanghai, L=Shanghai, O=k8s, OU=System, CN=kubernetes
        Validity
            Not Before: Aug  8 09:55:00 2018 GMT
            Not After : Aug  5 09:55:00 2028 GMT
        Subject: C=CN, ST=Shanghai, L=Shanghai, O=system:masters, OU=System, CN=admin
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:cb:10:41:82:61:ec:93:e8:4d:bf:3e:2d:88:45:
                    ce:e8:57:ee:c6:90:8c:a2:e7:7b:16:ae:9e:fc:6e:
                    60:25:5c:f4:26:c2:50:c7:b5:1e:d3:91:d8:54:e9:
                    5b:6f:85:0e:0a:56:2c:e8:4d:69:dc:06:1e:94:92:
                    29:b9:7c:6f:cd:bd:25:13:bf:c9:9b:98:dd:81:f2:
                    0e:df:27:17:75:c9:4f:d8:9a:9c:5c:b0:db:9c:ed:
                    bb:a5:1f:c1:df:85:9a:f9:62:6b:a8:7a:96:69:30:
                    93:2f:e9:e3:16:dc:74:5f:4d:68:5d:e3:05:ae:01:
                    bd:60:72:d0:30:7c:3b:01:7a:13:9f:4c:ef:62:f2:
                    6c:47:6a:25:6f:b4:0c:7a:53:db:78:a4:71:00:c8:
                    6c:a7:c6:39:42:cf:da:e0:20:ce:66:02:36:43:13:
                    5a:56:7d:da:77:ad:01:4f:ab:56:54:6d:b9:27:08:
                    4e:d6:95:8b:cd:90:5f:28:c2:63:de:d8:f9:77:4f:
                    6d:35:02:9b:6c:cf:27:43:8a:47:b0:74:7e:25:c5:
                    6c:2d:7a:4b:e1:49:af:e7:28:d1:e0:3b:2a:21:1d:
                    bd:09:80:f7:4f:ee:a9:23:50:8c:65:55:0b:fd:d8:
                    4b:4b:b3:82:cb:2a:9f:33:c7:d3:88:63:91:ca:f9:
                    e1:a7
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier: 
                DA:B4:8B:36:C7:E9:9C:C0:6E:AC:8D:1F:D6:18:93:76:4D:6E:78:1F
            X509v3 Authority Key Identifier: 
                keyid:BC:9F:D1:BD:4C:26:E1:77:C0:7F:CF:04:3E:DF:64:86:BE:23:F3:7F

    Signature Algorithm: sha256WithRSAEncryption
         2f:69:9c:6f:53:bb:7a:42:e3:4e:8f:b4:17:00:10:90:c3:1c:
         be:68:05:f3:15:6a:aa:0c:53:eb:89:c6:0c:2e:c2:0a:75:14:
         16:09:7e:68:0e:83:5c:c9:79:e0:ab:86:ee:93:d7:de:50:66:
         98:3d:5a:43:e0:7f:dd:dc:8a:b8:83:84:84:d4:0f:a5:c5:a1:
         b2:4a:65:76:15:e7:85:f3:7d:37:ee:e2:50:70:28:85:e8:05:
         05:d1:60:74:40:e2:67:7a:31:32:39:e3:96:e3:5b:fe:5e:eb:
         36:ef:cf:fa:95:37:9c:f1:3a:f5:11:80:e8:80:f9:1c:39:04:
         a0:14:af:e0:e7:ac:ce:6f:ad:4a:f3:e8:24:13:20:72:46:15:
         da:9a:e3:1d:88:c5:3d:93:12:7c:71:d3:77:95:5b:cd:f7:3b:
         b3:33:5d:10:31:7e:d9:ba:0e:ed:c8:61:9a:e7:df:fa:75:f1:
         f4:e5:67:81:be:3b:4a:5d:1e:82:1e:64:f7:16:14:4c:d9:e1:
         09:56:81:f4:64:21:47:79:f2:50:55:bb:e1:28:21:40:22:7d:
         f6:b7:f1:cd:3f:99:e5:96:c9:ee:76:be:03:68:da:7a:94:f5:
         ad:bb:40:66:cc:8c:85:36:91:3d:6a:5e:f6:d8:71:23:9e:f1:
         97:ff:73:ea

My k8s cluster and Prometheus seem fine, but KubeClientCertificateExpiration always fires. How do I fix it?

brancz commented 6 years ago

This is actually about the certificates that clients use to communicate with the Kubernetes API. Check the certificates that the kubelets, scheduler(s) and controller-manager(s) use.
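A quick way to do that check on each machine is to ask openssl whether any certificate expires within the alert's 7-day window (a sketch; the /etc/kubernetes/ssl path matches this cluster's layout, adjust for yours):

```shell
# Warn for any certificate in a directory that expires within 7 days (604800s).
check_cert_expiry() {
  dir=$1
  for cert in "$dir"/*.pem; do
    # Skip files that are not certificates (e.g. private keys).
    openssl x509 -in "$cert" -noout -enddate >/dev/null 2>&1 || continue
    if openssl x509 -in "$cert" -noout -checkend 604800 >/dev/null; then
      echo "OK: $cert"
    else
      echo "WARNING: $cert expires within 7 days"
    fi
  done
}

# Example for this cluster's layout:
# check_cert_expiry /etc/kubernetes/ssl
```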

ne1000 commented 6 years ago

@brancz Yes, I did check, but I couldn't find any issue. Could you please advise?

# cat /etc/kubernetes/controller-manager 
KUBE_CONTROLLER_MANAGER_ARGS="--address=0.0.0.0   \
                              --master=http://192.168.2.86:8080 \
                              --cluster-name=kubernetes \
                              --cluster-signing-cert-file=/etc/kubernetes/ssl/ca.pem \
                              --cluster-signing-key-file=/etc/kubernetes/ssl/ca-key.pem \
                              --service-account-private-key-file=/etc/kubernetes/ssl/ca-key.pem \
                              --root-ca-file=/etc/kubernetes/ssl/ca.pem \
                              --leader-elect=true \
                              --v=0"
# cat /usr/lib/systemd/system/kubelet.service 
[Unit]
Description=Kubernetes API Server
Documentation=https://kubernetes.io/doc
After=docker.service
Requires=docker.service

[Service]
WorkingDirectory=/var/lib/kubelet
ExecStart=/usr/local/bin/kubelet --kubeconfig=/etc/kubernetes/kubelet.kubeconfig --bootstrap-kubeconfig=/etc/kubernetes/bootstrap.kubeconfig --logtostderr=false --log-dir=/var/log/kubernetes --v=0 --cluster-dns=10.100.0.100 --cluster-domain=cluster.local. --resolv-conf=/etc/resolv.conf --authentication-token-webhook=true --authorization-mode=Webhook
Restart=on-failure

[Install]
WantedBy=multi-user.target
# cat /etc/kubernetes/kubelet.kubeconfig 
apiVersion: v1 
clusters:
- cluster:
    certificate-authority: /etc/kubernetes/ssl/ca.pem
    server: https://192.168.2.93:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: admin
  name: kubernetes
current-context: kubernetes
kind: Config
preferences: {}
users:
- name: admin
  user:
    client-certificate: /etc/kubernetes/ssl/admin.pem
    client-key: /etc/kubernetes/ssl/admin-key.pem
# cat /etc/kubernetes/bootstrap.kubeconfig 
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /etc/kubernetes/ssl/ca.pem
    server: https://192.168.2.93:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubelet-bootstrap
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: kubelet-bootstrap
  user:
    token: d2b9e107b99641a01ff18e952cf9ce85
# openssl x509 -in /var/lib/kubelet/pki/kubelet.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 2 (0x2)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=izuf68thdbm0n4j5qywd7sz-ca@1533727394
        Validity
            Not Before: Aug  8 11:23:14 2018 GMT
            Not After : Aug  8 11:23:14 2019 GMT
        Subject: CN=izuf68thdbm0n4j5qywd7sz@1533727394
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:d1:51:90:4a:e1:e5:e0:4c:90:f2:ff:ad:31:20:
                    77:34:d0:1a:7f:ab:c8:f5:87:74:10:4b:df:52:6a:
                    77:d7:01:92:ab:7a:14:4a:78:eb:c3:a7:9a:ed:f2:
                    b4:95:a7:dd:b8:40:25:4f:fb:06:d8:36:ef:4c:4b:
                    a9:13:0f:c9:f0:de:8a:f6:9a:17:1c:7c:07:5f:2f:
                    4a:dd:3c:f7:4e:7f:59:78:7b:0f:10:df:77:cc:bb:
                    1b:7f:02:3b:39:66:56:5c:37:3b:db:ec:c8:84:53:
                    46:ed:7e:26:3d:14:56:2d:f4:82:a3:4b:64:ae:8b:
                    3e:9c:56:c7:15:59:97:01:f7:93:6a:35:88:5d:b5:
                    cd:a5:03:02:0f:55:04:aa:77:6a:65:8e:96:2c:ae:
                    a6:7e:03:de:01:95:30:bc:68:21:52:4e:02:f4:c0:
                    ad:8f:6b:71:db:5b:b9:d3:7c:55:93:b1:ce:df:12:
                    be:1a:7e:95:0f:cb:d9:4b:1f:43:28:0b:19:12:f4:
                    5f:b8:53:49:93:b2:ef:37:61:0a:ec:d1:11:10:e2:
                    40:bc:1b:c3:74:e2:83:a8:24:32:0e:e8:0e:6f:5e:
                    f2:44:6a:27:40:4a:c0:f1:4c:98:ab:e3:52:18:2c:
                    fd:80:ff:23:9b:2f:77:8e:3e:20:a1:ee:df:82:24:
                    a1:f3
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage: 
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Alternative Name: 
                DNS:izuf68thdbm0n4j5qywd7sz
    Signature Algorithm: sha256WithRSAEncryption
         a1:d6:4a:86:13:8e:36:a3:c2:ff:6a:e3:50:ce:48:97:19:a0:
         d8:94:99:47:53:49:75:6f:27:15:ad:4a:b3:4c:50:5c:79:15:
         d3:f7:55:26:f8:58:d2:77:26:c3:6c:8a:2d:46:58:df:5f:70:
         40:54:4d:0a:7e:16:b2:b9:f6:6b:ce:ae:81:94:3f:88:b9:b3:
         56:e5:1c:55:f1:97:7b:50:66:f3:19:c5:48:55:2d:22:60:6d:
         36:0f:4b:99:ef:53:88:2d:3f:6a:47:2d:54:96:a9:35:2b:71:
         7c:18:86:bc:a2:33:2a:b5:b5:ab:19:3b:85:f5:c8:2a:4c:9c:
         54:71:ca:2b:14:00:a3:02:a3:6a:f8:fb:4f:40:d7:a2:59:18:
         9c:7a:93:2b:8d:39:26:1c:42:b1:62:6e:55:dd:c5:48:fe:45:
         cb:81:d1:bb:8a:86:05:80:9d:32:ec:da:cf:9c:83:fa:9b:f3:
         90:70:38:56:c7:1d:7d:e6:69:91:e2:90:77:db:20:50:43:f6:
         8d:5d:7f:52:e7:eb:fc:9d:8e:75:91:f6:63:b6:b9:96:2a:ef:
         0f:1f:99:13:4a:d6:5d:72:d7:1a:a8:71:0f:b6:21:21:a6:81:
         40:e2:74:f4:89:cd:0e:ae:24:0b:e2:c2:07:69:1c:06:0d:ad:
         3b:3f:5e:a2
brancz commented 6 years ago

Can you confirm that you've checked every client that appears when you run this query?

max(apiserver_request_count) by(client)

The values don't matter; it's just an aggregation so we can see all clients making requests against the API.

ne1000 commented 5 years ago

@brancz

if the values don't matter, how to ignore the alert in my case?

brancz commented 5 years ago

You can always just silence an alert in Alertmanager :slightly_smiling_face:. Nonetheless, we should figure out what's going on here.

willtrking commented 5 years ago

Noticed this as well when setting up on EKS.

I BELIEVE that EKS manages the renewal of certificates by itself, so I've removed the rule on my end.

kevtaylor commented 5 years ago

@brancz Was there any update on this? We get spammed a lot on our alerts by this false positive.

brancz commented 5 years ago

You can configure the certificate expiry thresholds: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/c0b31ea63564966021f9e6010090acded475b192/config.libsonnet#L42-L43
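With the jsonnet stack, those thresholds can be overridden via `_config` (a sketch; field names as in the linked config.libsonnet, units are seconds):

```jsonnet
// Sketch: relax the expiry thresholds used by KubeClientCertificateExpiration.
{
  _config+:: {
    certExpirationWarningSeconds: 30 * 24 * 3600,   // warn at 30 days instead of 7
    certExpirationCriticalSeconds: 7 * 24 * 3600,   // critical at 7 days instead of 1
  },
}
```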

If you want to ignore it entirely though you can also just remove the alert, or silence it :slightly_smiling_face: .

kevtaylor commented 5 years ago

Hi. Thanks for this answer but...

We are using the helm chart which has this baked in: https://github.com/coreos/prometheus-operator/blob/master/helm/exporter-kubernetes/templates/kubernetes.rules.yaml

How do we influence that?

And if I do want to programmatically silence that alert, is there a way of doing so in this chart - selective alerting?

brancz commented 5 years ago

I'm not aware of tooling to declaratively silence alerts, I agree that would be neat to have.

@gianrubio maintains the helm charts, the coreos/red hat team just maintains the jsonnet part, as we don't use helm. I can't help you with helm things unfortunately.

kevtaylor commented 5 years ago

@gianrubio Do you have any helm updates to fix this?

joshbranham commented 5 years ago

Yeah, we are also seeing this issue even though all our certificates appear up to date, and the alert specifically names the apiserver. I updated the alert on our side (we are still using version 0.17) to use the histogram as updated in: https://github.com/coreos/prometheus-operator/blob/0bad93292506ace68e344c9a991af6ae76ae1a51/contrib/kube-prometheus/manifests/prometheus-rules.yaml#L752-L759

The strange part is the values come and go. (screenshot from 2018-12-18, 4:24:58 PM)

kevtaylor commented 5 years ago

@brancz Is this project still active? We don't seem to get any responses

leoncard commented 5 years ago

We are observing the same behaviour with our clusters. Does anyone have any news on that? We're finding it quite weird because every certificate seems to be up to date but the metrics show something different. We tried restarting the apiserver docker container on the master node following this comment. The alert stopped for a while but came back hours later.

joshbranham commented 5 years ago

So we figured out our issue (at least it is the only thing that makes sense). The alert was firing sporadically, and only when our single user with certificate-based authentication (Jenkins) was communicating with the API. Those certs did, in fact, expire in line with what the alert was saying, and since rotating them we have not had the alert fire.

shovelend commented 5 years ago

@joshbranham Our histogram is incrementing sporadically as well. We couldn't tie the increments to any client machine yet, but we noticed that on the clusters where these numbers are increasing, they increase at exactly the same time. May I ask how you traced down that single user?

kevtaylor commented 5 years ago

I think also that this PR may be related to this issue: prometheus-operator/prometheus-operator#2058

joshbranham commented 5 years ago

@shovelend The old-fashioned way: the user stopped being able to talk to the API with certificate warnings 😏

brancz commented 5 years ago

It seems that this issue is about expired client certs after all :slightly_smiling_face:. I'll keep this open for now as prometheus-operator/prometheus-operator#2058 is correct: this alert is about client certs, not serving certs. However, that won't stop it from firing, so you will need to check up on your certs and check the apiserver's logs for which clients made these requests with expired certs.

shovelend commented 5 years ago

@brancz that's correct, let's close this after rephrasing the description of the alert. Having checked the apiserver's logs we still don't see the culprit, there are no messages logged regarding expired certificates nor authentication requests that failed. Do you have any other ideas where we could identify the client?

brancz commented 5 years ago

I realize this is drastic, but if I recall correctly, if you bump logging verbosity to --v=10 you will see the user identity of the certificate printed. That's the best I can do; otherwise I'd suggest we add a log line to Kubernetes to log this more explicitly.

shovelend commented 5 years ago

Setting the log verbosity to 10 helped us track down the issue.

We found out that the client certificate data (part of the ~/.kube/config) we generated for developers was expiring after 1 month. When developers were trying to access the kubernetes-dashboard via kubectl proxy from their local machines (with an expired certificate), the metric increased as well. We extended the expiry date of the generated certificates.

This issue was quite difficult to track down, and we found the needle in the haystack when setting the apiserver verbosity to 10 (we had to go through 100,000 lines of log looking for a possible culprit). If we were to suggest a place for improvement, it would be the alert message. If the metric could capture the URL that was being accessed with the expired certificate and/or the client IP address, it would be incredibly helpful and could be displayed within the alert.
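For anyone generating client certs with cfssl, as in the original report: the lifetime comes from the `expiry` fields in the CA signing config. An illustrative ca-config.json (the profile name here is an assumption):

```json
{
  "signing": {
    "default": { "expiry": "8760h" },
    "profiles": {
      "kubernetes": {
        "expiry": "8760h",
        "usages": ["signing", "key encipherment", "server auth", "client auth"]
      }
    }
  }
}
```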

ChristofferNicklassonLenswayGroup commented 5 years ago

@shovelend could you share a logline that would help me find it in my logs :)

daimoniac commented 5 years ago

The metric could tell us which certificate is about to expire.

brancz commented 5 years ago

@daimoniac I don't think that's a good idea, that would allow clients to produce arbitrary amounts of metrics leading to a denial of service attack against a kubernetes API. I think we should add an info log line that says which user/certificate caused the counter to increment. Happy to review your Kubernetes PR if you open this :slightly_smiling_face: .

itaysk commented 5 years ago

I'm running prometheus-operator on EKS and have this issue as well. If I understand correctly (I couldn't find anything on the web about the metric apiserver_client_certificate_expiration_seconds_bucket), this is about API clients (such as kubectl) using soon-to-expire certs. I don't think it's actually kubectl in this case since EKS uses a custom webhook validator. Any pointers to how I can identify the aged certs, given I don't have access to the control plane? If this is the by-design behavior of EKS (see @willtrking's comment), then should this alert even be preconfigured? Consider also the more general case outside EKS of automatically rotating short-lived certs.

willejs commented 5 years ago

@itaysk I have just taken the system-alerts rules out, as this is managed by AWS. I couldn't find a good way to exclude these specific rules. I am more intrigued as to why they are at 0; I suspect the Prometheus client library they use does this by default? It probably shouldn't.

itaysk commented 5 years ago

@willejs - I didn't want to remove the entire kubernetes-system rules set (https://github.com/helm/charts/blob/master/stable/prometheus-operator/templates/alertmanager/rules/kubernetes-system.yaml), so for now I've edited those rules out.

azalio commented 5 years ago

Setting the log verbosity to 10 helped us track down the issue. […]

Can you explain which words you searched?

madAndroid commented 5 years ago

We're seeing the same issue on our clusters and finding it incredibly hard to track down the expiring certificate - can someone please share a log line we can search for when the verbosity is set to 10?

azalio commented 5 years ago

We're seeing the same issue on our clusters and finding it incredibly hard to track down the expiring certificate - can someone please share a log line we can search for when the verbosity is set to 10?

Just try searching for the word "cert".

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

andrewsav-bt commented 4 years ago

@daimoniac I don't think that's a good idea, that would allow clients to produce arbitrary amounts of metrics leading to a denial of service attack against a kubernetes API.

So you are saying that we are producing an alert from a metric that can be caused by any of a couple dozen clients (see screenshot from ne1000 above), that we have no way to identify which client caused it, and that providing this information would enable a denial of service?

On face value that does not sound right. Could you please explain in more detail what the different moving parts here are, and what exactly makes it impossible to provide useful, exact information?

Failing that, how do we identify the culprit client manually, without checking them all?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

AndrewSav commented 4 years ago

/remove stale

dlandtwing commented 4 years ago

Failing that, how to identify the culprit client manually, without checking them all?

Had the same issue on our OpenShift cluster today. I found the responsible clients by capturing network traffic on the apiserver HTTPS port with tcpdump, analyzing the TLS handshakes in Wireshark (filter "tls.handshake.client_cert_vrfy.sig"), and inspecting the client certificates.

It turned out to be expiring kubelet client certificates (CN system:nodes...) causing these issues. By running openssl x509 -in /etc/origin/node/certificates/kubelet-client-current.pem -noout -startdate -enddate on each node, we were able to identify which kubelet certificates were about to expire.

Ultimately the solution was obtaining all pending CSRs and approving them manually:

oc get csr --sort-by='{.metadata.creationTimestamp}'
oc adm certificate approve csr-thgqd
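To avoid eyeballing dates node by node, the per-node check above can be reduced to a days-remaining number (a sketch; assumes GNU date, and the path is the OpenShift one from the comment above):

```shell
# Print the number of whole days until a certificate's notAfter date.
days_until_expiry() {
  enddate=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  echo $(( ( $(date -d "$enddate" +%s) - $(date +%s) ) / 86400 ))
}

# Example:
# days_until_expiry /etc/origin/node/certificates/kubelet-client-current.pem
```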

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in last 60d. Thank you for your contributions.

xavierW commented 3 years ago

Failing that, how to identify the culprit client manually, without checking them all?

Had the same issue on our openshift cluster today. I've found the responsible clients by capturing network traffic on the apiserver https port using tcpdump […]

Is there any easier way to find the expired client?

coderanger commented 3 years ago

Is there any easier way to find the expired client?

Not via metrics. Metrics are for numeric values, not textual data like client names. As mentioned earlier in the thread, apiserver should be logging more verbosely when an almost-expired client connects but that fix would need to be in apiserver itself, not here.

bpinske commented 3 years ago

This is a frustrating problem. It's unclear to me how the creation of this histogram could be fixed without introducing cardinality problems. My first thought was to add a label with the hex representation of the source IP, followed by a few other ideas that technically satisfy the requirement that all Prometheus metric values be numeric, but nothing that would ever be accepted upstream. https://github.com/kubernetes/kubernetes/commit/49a19c6011e05363a8baf8e99c917d11a9496568

Pcap dumping is one way to figure this out, as noted above, and possibly the best. But remember you may have many API servers behind an LB, so you may want to dump all of them.

The way we just found our mystery cert is with the following Splunk query. You may have to adapt this slightly if you use ELK or something similar, but the idea is the same: search audit records for usernames. The built-in cert rotation mechanism of normal k8s components like the kubelets should be auto-fixing anything with the system: prefix, so exclude those. In my case I got lucky that this only showed a small number of username possibilities. We use OIDC for human kubectl auth, rather than x509 certs, so the logs were sparse for me.

index="main" apiVersion="audit.k8s.io/v1" * | spath "user.username" | search "user.username" != "system:*"

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.

drasikhov commented 1 year ago

Hi! I'm new to Kubernetes and I'm sorry if it's a dumb question, but how exactly can the apiserver_client_certificate_expiration_seconds histogram be useful? We can't really tell who the client with the expiring certificate was, so what's the point of it then?

It would be much better if we had metrics (at least the notAfter property) for each certificate used in the cluster, directly in the API (for all components, like etcd/front-proxy/controller-manager etc.), the same way we can get them with kubeadm certs check-expiration. If you have another etcd or some other component on some node, you should be able to retrieve that data through the API as well. And the metric's description should clearly state its purpose.

I understand that with kubeadm you get these at the file level from the PKI directory, but it just looks very strange that you can't get this data natively through the API and that in such cases you have to use 3rd-party solutions, like x509-certificate-exporter.

ykfq commented 1 year ago

I managed to fix this by upgrading Prometheus from 2.37.1 to the current latest version, 2.44.0 (only tested on this version).

aostrovsky commented 1 year ago

I managed to fix this by upgrading Prometheus from 2.37.1 to the current latest version, 2.44.0 (only tested on this version).

Running 2.45.0 in AKS 1.24.10; still getting those alerts.

h0jeZvgoxFepBQ2C commented 1 year ago

Still receiving these alerts

jvstein commented 10 months ago

I ran into a weird setup where I had both a token as well as a client-certificate-data and client-key-data section in a ~/.kube/config file on a client.

From the user perspective everything was working fine, but it must have been attempting requests with the certificate first and then silently falling back to the working token value, not showing any error to the user.

This client was constantly triggering the KubeClientCertificateExpiration alert whenever it was active. The histogram_quantile portion of the alert query (below) was showing a straight line at 0.

histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[2m]))) < 86400

The basic structure of the bad client config:

users:
- name: the-user
  user:
    token: <VALID_TOKEN>
    client-certificate-data: <EXPIRED_CERT>
    client-key-data: <KEY>
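A rough way to spot this kind of ambiguous kubeconfig entry is a plain-text scan for users that define both a token and embedded certificate data (a sketch; a simple heuristic, not a real YAML parse):

```shell
# List kubeconfig user names that carry both a token and client-certificate-data.
find_dual_auth_users() {
  awk '
    /^- name:/                 { user = $3; has_token = has_cert = 0 }
    /token:/                   { has_token = 1 }
    /client-certificate-data:/ { has_cert = 1 }
    has_token && has_cert      { print user; has_token = has_cert = 0 }
  ' "$1"
}

# Example:
# find_dual_auth_users ~/.kube/config
```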
github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had any activity in the last 60 days. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue was closed because it has not had any activity in the last 120 days. Please reopen if you feel this is still valid.