kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Static etcd and control plane pod logs are not rotated in kubeadm cluster #120888

Closed geotransformer closed 1 week ago

geotransformer commented 1 year ago

What happened?

A kubeadm cluster whose kubelet is configured with the following settings:

    containerLogMaxSize: 25Mi
    containerLogMaxFiles: 2
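To compare the configured size limit with the file sizes shown in a directory listing, both can be converted to bytes. The following is a minimal sketch, not kubelet code: `quantity_to_bytes` and `ls_human_to_bytes` are hypothetical helper names, the quantity parser handles only plain integers with the binary suffixes `Ki`, `Mi`, `Gi`, `Ti`, and the `ls -h` parser assumes sizes in powers of 1024 with either `.` or `,` as the decimal separator.

```python
import re

# Binary suffixes only. Kubernetes quantities also allow decimal suffixes
# (k, M, G, ...) and other forms that this simplified sketch does not handle.
_BINARY_SUFFIXES = {"": 1, "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}

# Exponents of 1024 for the single-letter units printed by `ls -h`.
_LS_UNITS = {"": 0, "K": 1, "M": 2, "G": 3, "T": 4}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a value such as '25Mi' to bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * _BINARY_SUFFIXES[suffix or ""]


def ls_human_to_bytes(size: str) -> int:
    """Convert an `ls -h` size such as '147K' or '4,0K' to approximate bytes."""
    match = re.fullmatch(r"(\d+(?:[.,]\d+)?)([KMGT]?)", size.strip())
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    number = float(match.group(1).replace(",", "."))
    return int(number * 1024 ** _LS_UNITS[match.group(2)])


if __name__ == "__main__":
    limit = quantity_to_bytes("25Mi")
    largest = ls_human_to_bytes("147K")
    print(limit, largest, largest < limit)
```

Under these assumptions, `25Mi` is 26214400 bytes, while the largest file in the kube-controller-manager listing of this report (147K) is about 150528 bytes, so none of the listed files has reached the configured size limit.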

Stop the container of the static etcd, kube-apiserver, or kube-controller-manager pod with `crictl stop`. The pod logs are not rotated.

    root@mst1:/var/log/pods/kube-system_kube-controller-manager-mst1_d502ec3b6222a8e5430ef26afe24e26f/kube-controller-manager# ls -alh
    total 236K
    drwxr-x--- 2 root root 4.0K Sep 25 21:47 .
    drwxr-x--- 3 root root 4.0K Sep 24 00:39 ..
    -rw-r----- 1 root root  34K Sep 24 00:39 -------- 0.log
    -rw-r----- 1 root root  40K Sep 24 00:41 -------- 1.log
    -rw-r----- 1 root root 147K Sep 25 21:09 ---------2.log
    -rw-r----- 1 root root 1.2K Sep 25 21:47 ----------3.log

Notes:

1> Log rotation works for other pods, both non-static and static:

    root@mst1:/var/log/pods/default_static-web-mst1_ab30fcbb6481a1c91aa9873ea51974ff/web# ll
    total 24
    drwxr-xr-x 2 root root 4096 Sep 25 21:46 ./
    drwxr-xr-x 3 root root 4096 Sep 25 21:28 ../
    -rw-r----- 1 root root 5650 Sep 25 21:34 ---------- 3.log
    -rw-r----- 1 root root 5649 Sep 25 21:46 ---------- 4.log

What did you expect to happen?

At most 2 pod log files are kept.
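The expectation can be checked mechanically against a directory listing. The sketch below is illustrative only and is not how the kubelet enforces the limit. It assumes the file naming seen in the listings in this thread: `<N>.log`, `<N>.log.<timestamp>` and `<N>.log.<timestamp>.gz`. This thread does not establish whether the limit applies to the whole directory or separately to each numeric prefix `<N>`, so the hypothetical helpers `count_log_files` and `over_limit` report both.

```python
import re
from collections import Counter

# Assumed naming, taken from the listings in this thread: "<N>.log" for the
# active file, "<N>.log.<timestamp>" or "<N>.log.<timestamp>.gz" for rotated ones.
_LOG_NAME = re.compile(r"(\d+)\.log(?:\.\d{8}-\d{6}(?:\.gz)?)?")


def count_log_files(names):
    """Return (total, per_prefix), where per_prefix maps <N> to its file count."""
    per_prefix = Counter()
    for name in names:
        match = _LOG_NAME.fullmatch(name)
        if match:  # ignore anything that does not look like a container log
            per_prefix[int(match.group(1))] += 1
    return sum(per_prefix.values()), dict(per_prefix)


def over_limit(names, max_files):
    """Compare the counts with max_files under both readings of the limit."""
    total, per_prefix = count_log_files(names)
    return {
        "directory_over_limit": total > max_files,
        "prefixes_over_limit": sorted(n for n, c in per_prefix.items() if c > max_files),
    }


if __name__ == "__main__":
    reported = ["0.log", "1.log", "2.log", "3.log"]
    print(count_log_files(reported))
    print(over_limit(reported, max_files=2))
```

For the four files in the report, the directory as a whole holds more than 2 files, while no single numeric prefix does; which of the two readings matches the kubelet's behaviour is exactly what the issue asks.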

How can we reproduce it (as minimally and precisely as possible)?

sudo crictl stop <etcd/apiserver>

Anything else we need to know?

No response

Kubernetes version

```
v1.25.11
```

Cloud provider

baremetal

OS version

ubuntu 20.04.6

Install tools

Container runtime (CRI) and version (if applicable)

containerd 1.7.2

Related plugins (CNI, CSI, ...) and versions (if applicable)

kannon92 commented 1 year ago

/sig node

neolit123 commented 1 year ago

/transfer kubeadm

neolit123 commented 1 year ago

I removed the node label and moved this to k/kubeadm, but this might be a ticket for kubernetes/kubernetes, for SIG Instrumentation.

Moving it back.

neolit123 commented 1 year ago

/transfer kubernetes

kannon92 commented 1 year ago

I think this should be node.

/sig node

Node is responsible for pod lifecycle and they own the log rotation code.

I am not sure if log rotations actually work for static pods. I see some mentions of static pods and log rotation in the docs but not sure.

https://kubernetes.io/docs/concepts/cluster-administration/logging/#system-component-logs

geotransformer commented 1 year ago

> I think this should be node.
>
> /sig node
>
> Node is responsible for pod lifecycle and they own the log rotation code.
>
> I am not sure if log rotations actually work for static pods. I see some mentions of static pods and log rotation in the docs but not sure.
>
> https://kubernetes.io/docs/concepts/cluster-administration/logging/#system-component-logs

Actually, log rotation works fine for an ordinary static pod; see below. Only the etcd and Kubernetes control-plane static pods are affected. So how does the kubelet determine that a container log should be rotated? Does it make any apiserver request for that? If it has to query the apiserver while the apiserver container itself is down or restarted, the kubelet might not be able to clean up the previous etcd/apiserver container logs.

    root@mst1:/var/log/pods/default_static-web-mst1_ab30fcbb6481a1c91aa9873ea51974ff/web# ll
    total 24
    drwxr-xr-x 2 root root 4096 Sep 25 21:46 ./
    drwxr-xr-x 3 root root 4096 Sep 25 21:28 ../
    -rw-r----- 1 root root 5650 Sep 25 21:34 ---------- 3.log
    -rw-r----- 1 root root 5649 Sep 25 21:46 ---------- 4.log

kannon92 commented 1 year ago

What is the pod manifest for api server?

In those docs, it says that mounting to a shared volume for static pods may cause issues with rotation.

Kubelet handles log rotation. Code

geotransformer commented 1 year ago

Code


$ sudo cat etcd.yaml 
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/etcd.advertise-client-urls: https://10.192.1.21:2379
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://10.192.1.21:2379
    - --cert-file=/k8s/kubernetes/pki/etcd/server.crt
    - --cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
    - --client-cert-auth=true
    - --data-dir=/k8s/etcd
    - --experimental-initial-corrupt-check=true
    - --experimental-watch-progress-notify-interval=5s
    - --initial-advertise-peer-urls=https://10.192.1.21:2380
    - --initial-cluster=tb16-pod1-c1-mm1-master1=https://10.192.1.21:2380
    - --key-file=/k8s/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://10.192.1.21:2379
    - --listen-metrics-urls=http://0.0.0.0:2381
    - --listen-peer-urls=https://10.192.1.21:2380
    - --name=tb16-pod1-c1-mm1-master1
    - --peer-cert-file=/k8s/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/k8s/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/k8s/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/k8s/kubernetes/pki/etcd/ca.crt
    image: registry.k8s.io/etcd:3.5.9-0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 0.0.0.0
        path: /health?exclude=NOSPACE&serializable=true
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: etcd
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 0.0.0.0
        path: /health?serializable=false
        port: 2381
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /k8s/etcd
      name: etcd-k8s
    - mountPath: /k8s/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /k8s/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
  - hostPath:
      path: /k8s/etcd
      type: DirectoryOrCreate
    name: etcd-k8s
status: {}

mmiranda96 commented 1 year ago

/triage accepted

k8s-triage-robot commented 3 weeks ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

dgrisonnet commented 2 weeks ago

/remove-sig instrumentation

bart0sh commented 1 week ago

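I couldn't reproduce this with k/k from a fresh master (Tue Oct 8 14:42:22 2024).

Here is what I did:

- increased the verbosity of the API server logs in its manifest (`--v=10`)
- created some load by repeatedly creating 50 short-lived pods and deleting them once they completed
- monitored the kube-apiserver logs directory

Here is how it evolved:

- 1: one compressed log, one uncompressed and one active
- 2: `0.log.20241010-125825` was compressed and one new rotation was observed (`0.log.20241010-130224`)
- 3: the process continued: 130224 was gzipped, 130254 was created and compressed as well, 130334 was rotated but not yet compressed
- 4: after some time, all old logs were deleted by the kubelet, as the ContainerLogMaxFiles config option is 5 by default

So rotation, compression and cleanup seem to work as expected.

/close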

k8s-ci-robot commented 1 week ago

@bart0sh: Closing this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/120888#issuecomment-2404743163):

>I couldn't reproduce this with the k/k from the fresh master(Tue Oct 8 14:42:22 2024).
>
>Here is what I did:
>- increased verbosity of the API server logs in its manifest:
>```
># grep -B2 v=10 /etc/kubernetes/manifests/kube-apiserver.yaml
>    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
>    - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
>    - --v=10
>```
>- created some load: repeated creating 50 short-lived pods and deleted them after they're completed
>- monitored kube-apiserver logs directory
>
>Here how it's evolved:
>- 1: one compressed log, one uncompressed and one active:
>```
># ls -lha /var/log/pods/kube-system_kube-apiserver-devel_b7ee32bd1387aa32d0edfc860e37b46e/kube-apiserver/
>total 24M
>drwxr-xr-x 2 root root 4,0K loka 10 12:58 .
>drwxr-xr-x 3 root root 4,0K loka 10 12:57 ..
>-rw------- 1 root root 6,9M loka 10 13:01 0.log
>-rw-r--r-- 1 root root 6,6M loka 10 12:58 0.log.20241010-125714.gz
>-rw------- 1 root root 11M loka 10 12:58 0.log.20241010-125825
>```
>- 2: 0.log.20241010-125825 from previous list has been compressed, one new rotation observed (0.log.20241010-130224 is a new file created by rotation)
>```
># ls -lha /var/log/pods/kube-system_kube-apiserver-devel_b7ee32bd1387aa32d0edfc860e37b46e/kube-apiserver/
>total 28M
>drwxr-xr-x 2 root root 4,0K loka 10 13:02 .
>drwxr-xr-x 3 root root 4,0K loka 10 12:57 ..
>-rw------- 1 root root 60K loka 10 13:02 0.log
>-rw-r--r-- 1 root root 6,6M loka 10 12:58 0.log.20241010-125714.gz
>-rw-r--r-- 1 root root 505K loka 10 13:02 0.log.20241010-125825.gz
>-rw------- 1 root root 21M loka 10 13:02 0.log.20241010-130224
>```
>- 3: process continues, 130224 has been gzipped, 130254 has been created and compressed as well, 130334 rotated, but not compressed yet
>```
># ls -lha /var/log/pods/kube-system_kube-apiserver-devel_b7ee32bd1387aa32d0edfc860e37b46e/kube-apiserver/
>total 20M
>drwxr-xr-x 2 root root 4,0K loka 10 13:03 .
>drwxr-xr-x 3 root root 4,0K loka 10 12:57 ..
>-rw------- 1 root root 7,0M loka 10 13:04 0.log
>-rw-r--r-- 1 root root 505K loka 10 13:02 0.log.20241010-125825.gz
>-rw-r--r-- 1 root root 1,2M loka 10 13:02 0.log.20241010-130224.gz
>-rw-r--r-- 1 root root 519K loka 10 13:03 0.log.20241010-130254.gz
>-rw------- 1 root root 11M loka 10 13:03 0.log.20241010-130334
>```
>- 4: after some time all old logs has been deleted by kubelet as ContainerLogMaxFiles config option is 5 by default:
>```
># ls -lha /var/log/pods/kube-system_kube-apiserver-devel_b7ee32bd1387aa32d0edfc860e37b46e/kube-apiserver/
>total 17M
>drwxr-xr-x 2 root root 4,0K loka 10 13:15 .
>drwxr-xr-x 3 root root 4,0K loka 10 12:57 ..
>-rw------- 1 root root 4,3M loka 10 13:16 0.log
>-rw-r--r-- 1 root root 595K loka 10 13:12 0.log.20241010-131024.gz
>-rw-r--r-- 1 root root 561K loka 10 13:13 0.log.20241010-131204.gz
>-rw-r--r-- 1 root root 613K loka 10 13:15 0.log.20241010-131344.gz
>-rw------- 1 root root 11M loka 10 13:15 0.log.20241010-131524
>```
>
>So, rotation, compression and clean up seems to work as expected.
>/close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
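The progression described in the quoted comment (active file, rotated file, gzipped file) can be followed from file names alone. This is a rough sketch under the assumption, inferred only from the listings above, that rotated files are named `<N>.log.<YYYYMMDD-HHMMSS>` with an optional `.gz` suffix. `LogFile`, `parse_log_name` and `pending_compression` are hypothetical names and are not part of any Kubernetes tool.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogFile(NamedTuple):
    prefix: int                     # the numeric <N> in "<N>.log"
    rotated_at: Optional[datetime]  # None for the active file
    compressed: bool


# Assumed pattern, inferred from names such as "0.log.20241010-125714.gz".
_PATTERN = re.compile(r"(\d+)\.log(?:\.(\d{8}-\d{6})(\.gz)?)?")


def parse_log_name(name: str) -> Optional[LogFile]:
    """Parse a log file name; return None if it does not match the pattern."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        return None
    prefix, stamp, gz = match.groups()
    rotated_at = datetime.strptime(stamp, "%Y%m%d-%H%M%S") if stamp else None
    return LogFile(int(prefix), rotated_at, gz is not None)


def pending_compression(names):
    """Return rotated files that are not yet gzipped, oldest first."""
    pending = []
    for name in names:
        parsed = parse_log_name(name)
        if parsed and parsed.rotated_at and not parsed.compressed:
            pending.append((parsed.rotated_at, name))
    return [name for _, name in sorted(pending)]


if __name__ == "__main__":
    step_1 = ["0.log", "0.log.20241010-125714.gz", "0.log.20241010-125825"]
    print(pending_compression(step_1))
```

Applied to the file names of step 1, `pending_compression` returns only `0.log.20241010-125825`, the file that step 2 then shows as gzipped.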