litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Litmus Chaos Tests not running on K8s v1.27 #4125

Open · opened by amitpd 1 year ago

amitpd commented 1 year ago

What happened: LitmusChaos tests not running properly on Kubernetes v1.27

What you expected to happen: LitmusChaos tests should run properly on Kubernetes v1.27

Where can this issue be corrected? (optional)

The issue is probably in the source code of litmuschaos/go-runner:2.14.0

How to reproduce it (as minimally and precisely as possible): Note: Followed the instructions as per https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-cpu-hog/.

Deploy litmus operator v2.14.0

kubectl create -f https://litmuschaos.github.io/litmus/litmus-operator-v2.14.0.yaml

Deploy the ChaosExperiment below:

apiVersion: litmuschaos.io/v1alpha1
description:
  message: |
    Injects cpu consumption on pods belonging to an app deployment
kind: ChaosExperiment
metadata:
  labels:
    app.kubernetes.io/component: chaosexperiment
    app.kubernetes.io/part-of: litmus
    app.kubernetes.io/version: 2.14.0
    name: pod-cpu-hog
  name: pod-cpu-hog
  namespace: default
spec:
  definition:
    args:
    - -c
    - ./experiments -name pod-cpu-hog
    command:
    - /bin/bash
    env:
    - name: TOTAL_CHAOS_DURATION
      value: "60"
    - name: CHAOS_INTERVAL
      value: "10"
    - name: CPU_CORES
      value: "1"
    - name: CPU_LOAD
      value: "100"
    - name: PODS_AFFECTED_PERC
      value: ""
    - name: RAMP_TIME
      value: ""
    - name: LIB
      value: litmus
    - name: LIB_IMAGE
      value: litmuschaos/go-runner:2.14.0
    - name: SOCKET_PATH
      value: /var/run/docker.sock
    - name: LIB_IMAGE_PULL_POLICY
      value: IfNotPresent
    - name: TARGET_PODS
      value: ""
    - name: NODE_LABEL
      value: ""
    - name: SEQUENCE
      value: parallel
    image: litmuschaos/go-runner:2.14.0
    imagePullPolicy: IfNotPresent
    labels:
      app.kubernetes.io/component: experiment-job
      app.kubernetes.io/part-of: litmus
      app.kubernetes.io/version: 2.14.0
      name: pod-cpu-hog
    permissions:
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - deletecollection
    - apiGroups:
      - ""
      resources:
      - events
      verbs:
      - create
      - get
      - list
      - patch
      - update
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
    - apiGroups:
      - ""
      resources:
      - pods/log
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - pods/exec
      verbs:
      - get
      - list
      - create
    - apiGroups:
      - apps
      resources:
      - deployments
      - statefulsets
      - replicasets
      - daemonsets
      verbs:
      - list
      - get
    - apiGroups:
      - apps.openshift.io
      resources:
      - deploymentconfigs
      verbs:
      - list
      - get
    - apiGroups:
      - ""
      resources:
      - replicationcontrollers
      verbs:
      - get
      - list
    - apiGroups:
      - argoproj.io
      resources:
      - rollouts
      verbs:
      - list
      - get
    - apiGroups:
      - batch
      resources:
      - jobs
      verbs:
      - create
      - list
      - get
      - delete
      - deletecollection
    - apiGroups:
      - litmuschaos.io
      resources:
      - chaosengines
      - chaosexperiments
      - chaosresults
      verbs:
      - create
      - list
      - get
      - patch
      - update
      - delete
    scope: Namespaced

Create the RBAC below:

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - deletecollection
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - get
  - list
  - patch
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - get
  - list
  - create
- apiGroups:
  - apps
  resources:
  - deployments
  - statefulsets
  - replicasets
  - daemonsets
  verbs:
  - list
  - get
- apiGroups:
  - apps.openshift.io
  resources:
  - deploymentconfigs
  verbs:
  - list
  - get
- apiGroups:
  - ""
  resources:
  - replicationcontrollers
  verbs:
  - get
  - list
- apiGroups:
  - argoproj.io
  resources:
  - rollouts
  verbs:
  - list
  - get
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - list
  - get
  - delete
  - deletecollection
- apiGroups:
  - litmuschaos.io
  resources:
  - chaosengines
  - chaosexperiments
  - chaosresults
  verbs:
  - create
  - list
  - get
  - patch
  - update
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/part-of: litmus
    name: pod-cpu-hog-sa
  name: pod-cpu-hog-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-cpu-hog-sa
subjects:
- kind: ServiceAccount
  name: pod-cpu-hog-sa
  namespace: default
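
The Role above must grant at least the verbs declared in the ChaosExperiment's permissions block, since the experiment runs under this ServiceAccount. A quick sanity check over an abridged subset of the resources (illustrative only; the sets below are copied from the manifests above, not generated from them) can be sketched in Python:

```python
# Illustrative check: the Role's rules must cover every verb the
# ChaosExperiment declares in its `permissions` block (abridged subset).

experiment_permissions = {
    "pods": {"create", "delete", "get", "list", "patch", "update", "deletecollection"},
    "pods/exec": {"get", "list", "create"},
    "jobs": {"create", "list", "get", "delete", "deletecollection"},
}

role_rules = {
    "pods": {"create", "delete", "get", "list", "patch", "update", "deletecollection"},
    "pods/exec": {"get", "list", "create"},
    "jobs": {"create", "list", "get", "delete", "deletecollection"},
}

# Verbs the experiment needs but the Role does not grant, per resource.
missing = {res: verbs - role_rules.get(res, set())
           for res, verbs in experiment_permissions.items()}
missing = {res: verbs for res, verbs in missing.items() if verbs}
print(missing if missing else "Role covers all declared permissions")
```

In the manifests above the two lists are identical, so a non-empty result here would usually mean one of the YAMLs was edited without the other.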

Deploy the ChaosEngine below:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: chaosengine-pod-cpu-hog
  namespace: default
spec:
  annotationCheck: "true"
  appinfo:
    appkind: deployment
    applabel: app=nginx
    appns: default
  chaosServiceAccount: pod-cpu-hog-sa
  components:
    runner:
      image: litmuschaos/chaos-runner:2.14.0
      imagePullPolicy: IfNotPresent
  engineState: active
  experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
        - name: CONTAINER_RUNTIME
          value: containerd
        - name: SOCKET_PATH
          value: /run/containerd/containerd.sock
        - name: TOTAL_CHAOS_DURATION
          value: "30"
        - name: CPU_CORES
          value: "1"
        - name: TARGET_CONTAINER
          value: nginx
  jobCleanUpPolicy: retain
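
Note that the engine overrides SOCKET_PATH to the containerd socket, while the ChaosExperiment's default is the Docker socket; on a containerd node both CONTAINER_RUNTIME and SOCKET_PATH must be overridden together. A minimal sketch of that consistency check (conventional socket paths assumed; this is not Litmus code):

```python
# Illustrative check (not Litmus code): CONTAINER_RUNTIME and SOCKET_PATH
# must agree, because the ChaosExperiment's default SOCKET_PATH is the
# Docker socket while the ChaosEngine here targets containerd.

CONVENTIONAL_SOCKETS = {
    "docker": "/var/run/docker.sock",
    "containerd": "/run/containerd/containerd.sock",
    "crio": "/var/run/crio/crio.sock",
}

def socket_matches(runtime: str, socket_path: str) -> bool:
    """True if socket_path is the conventional socket for the runtime."""
    return CONVENTIONAL_SOCKETS.get(runtime) == socket_path

# The ChaosEngine above overrides both values for a containerd node:
print(socket_matches("containerd", "/run/containerd/containerd.sock"))  # True
# The ChaosExperiment's default would be wrong on that node:
print(socket_matches("containerd", "/var/run/docker.sock"))             # False
```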

Anything else we need to know?: Log of the pod-cpu-hog-vczplk-d5fsw pod created during the experiment:

time="2023-08-14T09:45:53Z" level=info msg="Experiment Name: pod-cpu-hog"
time="2023-08-14T09:45:53Z" level=info msg="[PreReq]: Getting the ENV for the pod-cpu-hog experiment"
time="2023-08-14T09:45:55Z" level=info msg="[PreReq]: Updating the chaos result of pod-cpu-hog experiment (SOT)"
time="2023-08-14T09:45:57Z" level=info msg="The application information is as follows" Namespace=default Label="app=nginx" App Kind=deployment
time="2023-08-14T09:45:57Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The Container status are as follows" container=nginx Pod=nginx-deployment-54bcfc567b-pjddz Readiness=true
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The status of Pods are as follows" Pod=nginx-deployment-54bcfc567b-pjddz Status=Running
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The Container status are as follows" container=nginx Pod=nginx-deployment-54bcfc567b-sm4ql Readiness=true
time="2023-08-14T09:45:57Z" level=info msg="[Status]: The status of Pods are as follows" Pod=nginx-deployment-54bcfc567b-sm4ql Status=Running
time="2023-08-14T09:45:59Z" level=info msg="[Info]: The chaos tunables are:" Sequence=parallel PodsAffectedPerc=0 CPU Core=1 CPU Load Percentage=100
time="2023-08-14T09:45:59Z" level=info msg="[Chaos]:Number of pods targeted: 1"
time="2023-08-14T09:45:59Z" level=info msg="[Info]: Target pods list for chaos, [nginx-deployment-54bcfc567b-pjddz]"
time="2023-08-14T09:45:59Z" level=info msg="[Info]: Details of application under chaos injection" PodName=nginx-deployment-54bcfc567b-pjddz NodeName=amit-vm-2 ContainerName=nginx
time="2023-08-14T09:45:59Z" level=info msg="[Status]: Checking the status of the helper pods"
time="2023-08-14T09:46:04Z" level=info msg="[Wait]: waiting till the completion of the helper pod"
time="2023-08-14T09:49:37Z" level=error msg="[Error]: CPU hog failed, err: helper pod failed, err: Unable to find the pods with matching labels"

Events from the Job that creates pod-cpu-hog-vczplk-d5fsw pod:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  10s   job-controller  Created pod: pod-cpu-hog-vczplk-d5fsw
  Normal  SuccessfulDelete  2s    job-controller  Deleted pod: pod-cpu-hog-helper-xrpvbv

It seems like the helper pod is getting deleted immediately after it is created.

cooldev001 commented 1 year ago

I am also facing the same issue on an Amazon EKS cluster (v1.27); it works correctly on v1.24.

RobinSegura commented 1 year ago

Same here! With Litmus 3.0.0-beta8 (reproduced on 3.0.0-beta7 too) on EKS 1.27; it was working fine on 1.26. Might this be related to the --container-runtime flag, deprecated since 1.24 and removed in 1.27? See the Kubernetes release notes.

RobinSegura commented 1 year ago

Able to reproduce on Minikube + containerd + litmus 3.0.0-beta8

Case 1: all chaos experiments requiring the container runtime work fine (screenshot, 2023-08-30).

Case 2: error in the experiment group when running Kubernetes 1.27 (screenshots, 2023-08-30). The helper pod is instantly killed.

We'll pin our clusters to Kubernetes 1.26.x (below 1.27) on our side for now, but please, Harness/Litmus team, have a look at https://kubernetes.io/blog/2023/03/17/upcoming-changes-in-kubernetes-v1-27/#removal-of-container-runtime-command-line-argument

ksatchit commented 1 year ago

This is fixed in 3.0.0-beta10 via https://github.com/litmuschaos/litmus-go/pull/665

In 2.14.1 via https://github.com/litmuschaos/litmus-go/pull/669

rumstead commented 9 months ago

This is fixed in 3.0.0-beta10 via litmuschaos/litmus-go#665

In 2.14.1 via litmuschaos/litmus-go#669

Based on the PRs, how does deleting labels fix the issue? The release notes mention a kubelet flag, but I don't see how that would affect starting the helper pods via the k8s API?

EDIT: Or is it related to the standard labels that are added to Job pods since 1.27?

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.27.md#api-change-4

Pods owned by a Job now uses the labels batch.kubernetes.io/job-name and batch.kubernetes.io/controller-uid. The legacy labels job-name and controller-uid are still added for compatibility. (#114930, @kannon92)
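
If that reading is right, the failure mode would be roughly: the helper pod inherits the experiment Job's selector labels (controller-uid / job-name), so the Job controller counts it among its own pods and deletes the excess, which would match the job-controller SuccessfulDelete event in the original report. A crude Python model of that suspected mechanism (not Litmus source; the selector value and the controller logic are made up for illustration, only the pod names come from the logs above):

```python
# Crude model of the suspected mechanism (NOT Litmus or Kubernetes source):
# a helper pod that carries the experiment Job's selector labels looks like
# one of the Job's own pods, so the controller culls it as an extra.

def job_controller_sync(job, pods):
    """Return the pods a (simplified) Job controller would delete: pods
    matching the Job's selector beyond the desired completion count."""
    matching = [p for p in pods
                if all(p["labels"].get(k) == v
                       for k, v in job["selector"].items())]
    return matching[job["completions"]:]

experiment_job = {"selector": {"controller-uid": "abc-123"},  # hypothetical uid
                  "completions": 1}

experiment_pod = {"name": "pod-cpu-hog-vczplk-d5fsw",
                  "labels": {"controller-uid": "abc-123",
                             "job-name": "pod-cpu-hog-vczplk"}}

# Pre-fix behaviour: the helper inherits the experiment pod's labels verbatim.
helper_pod = {"name": "pod-cpu-hog-helper-xrpvbv",
              "labels": dict(experiment_pod["labels"])}

doomed = job_controller_sync(experiment_job, [experiment_pod, helper_pod])
print([p["name"] for p in doomed])  # ['pod-cpu-hog-helper-xrpvbv']

# Post-fix behaviour (litmus-go#665 drops the Job labels before reuse):
helper_labels = dict(experiment_pod["labels"])
for key in ("controller-uid", "job-name"):
    helper_labels.pop(key, None)
helper_pod_fixed = {"name": "pod-cpu-hog-helper-xrpvbv",
                    "labels": helper_labels}
print([p["name"] for p in
       job_controller_sync(experiment_job,
                           [experiment_pod, helper_pod_fixed])])  # []
```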

sebay commented 7 months ago

@ksatchit can 2.14.1 be pushed to Docker Hub? The only other solution is moving to 3.x, which is a big change (and I have yet to get it fully working...)