litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a Cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Unable to launch experiments without probes #4232

Closed: smitthakkar96 closed this issue 8 months ago

smitthakkar96 commented 1 year ago

What happened:

When trying to create an experiment without probes, we see the following error.

Probe in fault pod-cpu-hog-yy5 is not attached to a proper reference, please add it to annotations as probeRef

[Screenshots of the error in the UI, taken 2023-10-17]

The docs don't make clear whether probes are mandatory for every fault, and the error message itself is not very friendly.

What you expected to happen:

We should be able to inject faults without configuring probes. Requiring probes raises the bar of adoption for people starting out on their chaos journey. Also, depending on the hypothesis, teams may choose not to have any automated checks and instead monitor the experiment manually by watching their dashboards; if something goes wrong, they can halt the experiment.

How to reproduce it (as minimally and precisely as possible):

Try creating an experiment without any probes in v3.0.0. Here is the manifest for quick reference:

kind: Workflow
apiVersion: argoproj.io/v1alpha1
metadata:
  name: test
  namespace: sre-enablement
spec:
  templates:
    - name: test
      steps:
        - - name: install-chaos-faults
            template: install-chaos-faults
        - - name: pod-cpu-hog-k94
            template: pod-cpu-hog-k94
        - - name: cleanup-chaos-resources
            template: cleanup-chaos-resources
    - name: install-chaos-faults
      inputs:
        artifacts:
          - name: pod-cpu-hog-k94
            path: /tmp/pod-cpu-hog-k94.yaml
            raw:
              data: >
                apiVersion: litmuschaos.io/v1alpha1

                description:
                  message: |
                    Injects CPU consumption on pods belonging to an app deployment
                kind: ChaosExperiment

                metadata:
                  name: pod-cpu-hog
                  labels:
                    name: pod-cpu-hog
                    app.kubernetes.io/part-of: litmus
                    app.kubernetes.io/component: chaosexperiment
                    app.kubernetes.io/version: 3.0.0
                spec:
                  definition:
                    scope: Namespaced
                    permissions:
                      - apiGroups:
                          - ""
                        resources:
                          - pods
                        verbs:
                          - create
                          - delete
                          - get
                          - list
                          - patch
                          - update
                          - deletecollection
                      - apiGroups:
                          - ""
                        resources:
                          - events
                        verbs:
                          - create
                          - get
                          - list
                          - patch
                          - update
                      - apiGroups:
                          - ""
                        resources:
                          - configmaps
                        verbs:
                          - get
                          - list
                      - apiGroups:
                          - ""
                        resources:
                          - pods/log
                        verbs:
                          - get
                          - list
                          - watch
                      - apiGroups:
                          - ""
                        resources:
                          - pods/exec
                        verbs:
                          - get
                          - list
                          - create
                      - apiGroups:
                          - apps
                        resources:
                          - deployments
                          - statefulsets
                          - replicasets
                          - daemonsets
                        verbs:
                          - list
                          - get
                      - apiGroups:
                          - apps.openshift.io
                        resources:
                          - deploymentconfigs
                        verbs:
                          - list
                          - get
                      - apiGroups:
                          - ""
                        resources:
                          - replicationcontrollers
                        verbs:
                          - get
                          - list
                      - apiGroups:
                          - argoproj.io
                        resources:
                          - rollouts
                        verbs:
                          - list
                          - get
                      - apiGroups:
                          - batch
                        resources:
                          - jobs
                        verbs:
                          - create
                          - list
                          - get
                          - delete
                          - deletecollection
                      - apiGroups:
                          - litmuschaos.io
                        resources:
                          - chaosengines
                          - chaosexperiments
                          - chaosresults
                        verbs:
                          - create
                          - list
                          - get
                          - patch
                          - update
                          - delete
                    image: litmuschaos/go-runner:3.0.0
                    imagePullPolicy: Always
                    args:
                      - -c
                      - ./experiments -name pod-cpu-hog
                    command:
                      - /bin/bash
                    env:
                      - name: TOTAL_CHAOS_DURATION
                        value: "60"
                      - name: CPU_CORES
                        value: "1"
                      - name: CPU_LOAD
                        value: "100"
                      - name: PODS_AFFECTED_PERC
                        value: ""
                      - name: RAMP_TIME
                        value: ""
                      - name: LIB_IMAGE
                        value: litmuschaos/go-runner:3.0.0
                      - name: STRESS_IMAGE
                        value: alexeiled/stress-ng:latest-ubuntu
                      - name: CONTAINER_RUNTIME
                        value: containerd
                      - name: SOCKET_PATH
                        value: /run/containerd/containerd.sock
                      - name: TARGET_CONTAINER
                        value: ""
                      - name: TARGET_PODS
                        value: ""
                      - name: DEFAULT_HEALTH_CHECK
                        value: "false"
                      - name: NODE_LABEL
                        value: ""
                      - name: SEQUENCE
                        value: parallel
                    labels:
                      name: pod-cpu-hog
                      app.kubernetes.io/part-of: litmus
                      app.kubernetes.io/component: experiment-job
                      app.kubernetes.io/runtime-api-usage: "true"
                      app.kubernetes.io/version: 3.0.0
      container:
        name: ""
        image: litmuschaos/k8s:2.11.0
        command:
          - sh
          - -c
        args:
          - kubectl apply -f /tmp/ -n {{workflow.parameters.adminModeNamespace}}
            && sleep 30
    - name: cleanup-chaos-resources
      container:
        name: ""
        image: litmuschaos/k8s:2.11.0
        command:
          - sh
          - -c
        args:
          - kubectl delete chaosengine -l workflow_run_id={{workflow.uid}} -n
            {{workflow.parameters.adminModeNamespace}}
    - name: pod-cpu-hog-k94
      inputs:
        artifacts:
          - name: pod-cpu-hog-k94
            path: /tmp/chaosengine-pod-cpu-hog-k94.yaml
            raw:
              data: |
                apiVersion: litmuschaos.io/v1alpha1
                kind: ChaosEngine
                metadata:
                  namespace: "{{workflow.parameters.adminModeNamespace}}"
                  labels:
                    workflow_run_id: "{{ workflow.uid }}"
                  annotations: {}
                  generateName: pod-cpu-hog-k94
                spec:
                  engineState: active
                  appinfo:
                    appns: sre-enablement
                    applabel: app=chaos-exporter
                    appkind: deployment
                  chaosServiceAccount: litmus-admin
                  experiments:
                    - name: pod-cpu-hog
                      spec:
                        components:
                          env:
                            - name: TOTAL_CHAOS_DURATION
                              value: "60"
                            - name: CPU_CORES
                              value: "1"
                            - name: CPU_LOAD
                              value: "100"
                            - name: PODS_AFFECTED_PERC
                              value: ""
                            - name: RAMP_TIME
                              value: ""
                            - name: LIB_IMAGE
                              value: litmuschaos/go-runner:3.0.0
                            - name: STRESS_IMAGE
                              value: alexeiled/stress-ng:latest-ubuntu
                            - name: CONTAINER_RUNTIME
                              value: containerd
                            - name: SOCKET_PATH
                              value: /run/containerd/containerd.sock
                            - name: TARGET_CONTAINER
                              value: ""
                            - name: TARGET_PODS
                              value: ""
                            - name: DEFAULT_HEALTH_CHECK
                              value: "false"
                            - name: NODE_LABEL
                              value: ""
                            - name: SEQUENCE
                              value: parallel
      metadata:
        labels:
          weight: "10"
      container:
        name: ""
        image: docker.io/litmuschaos/litmus-checker:2.11.0
        args:
          - -file=/tmp/chaosengine-pod-cpu-hog-k94.yaml
          - -saveName=/tmp/engine-name
  entrypoint: test
  arguments:
    parameters:
      - name: adminModeNamespace
        value: sre-enablement
  serviceAccountName: argo-chaos
  podGC:
    strategy: OnWorkflowCompletion
  securityContext:
    runAsUser: 1000
    runAsNonRoot: true
Zakiya-Jafrin commented 1 year ago

Hello, I have encountered the same issue.

Nageshbansal commented 1 year ago

Can you try with steps described here: https://kubernetes.slack.com/archives/CNXNB0ZTN/p1697011377377159

vanshBhatia-A4k9 commented 1 year ago

With the release of Litmus 3.0.0, attaching probes to an experiment is a mandatory step. Thanks for reporting the missing information in the docs; we will add it here soon: https://docs.litmuschaos.io/docs/concepts/probes.
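For reference, the error text suggests the engine expects a probeRef annotation pointing at a pre-created probe. A hedged sketch of what that might look like on the ChaosEngine metadata (the probe name "health-check" is hypothetical, and the exact annotation format is an assumption, not confirmed from the docs):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  annotations:
    # assumed format: a JSON array mapping each referenced probe
    # to an execution mode (e.g. SOT, EOT, Edge, Continuous, OnChaos)
    probeRef: '[{"name":"health-check","mode":"SOT"}]'
```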

Thanks!

smitthakkar96 commented 1 year ago

@vanshBhatia-A4k9 There can be valid cases where someone might not want any probes but wants to monitor the experiment manually. What is the rationale behind making them mandatory? Is there an RFC about it?

Also, as an end user I find the error message very unhelpful. Can we improve it? It doesn't clearly communicate that at least one probe is needed.

Zakiya-Jafrin commented 1 year ago

To add: even when I do add a probe, the experiment fails because the fields spec.experiments[0].spec.probe[0].runProperties.probeTimeout and spec.experiments[0].spec.probe[0].runProperties.interval require an integer value, whereas the UI won't let you save those fields without an 's' suffix.

The error message from the pod log is: Error Creating Resource : ChaosEngine.litmuschaos.io "run-chaosab123" is invalid: [spec.experiments[0].spec.probe[0].runProperties.probeTimeout: Invalid value: "string": spec.experiments[0].spec.probe[0].runProperties.probeTimeout in body must be of type integer: "string", spec.experiments[0].spec.probe[0].runProperties.interval: Invalid value: "string": spec.experiments[0].spec.probe[0].runProperties.interval in body must be of type integer: "string"]
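Judging purely from that validation error, the installed CRD schema wants bare integers for these fields. A hedged sketch of a runProperties block that would pass that schema (values are illustrative only):

```yaml
runProperties:
  probeTimeout: 10   # integer; the UI instead emits "10s", which this CRD rejects
  interval: 2        # likewise an integer here
```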

I tried the experiment in a fresh cluster; I mention this to indicate that there were no residual CRDs in the cluster. According to this PR, old CRDs can be a potential cause of this error: https://github.com/litmuschaos/litmus-docs/pull/244. But in my case the cluster was fresh.
However, I manually changed the experiment YAML to set the annotations field to null, i.e. annotations: instead of annotations: {}. I found the solution in @Nageshbansal's Slack comment. The problem with such a workaround is that the annotations field may not always be null; we might need to fill it with other info, in which case we will hit the above-mentioned issue again.
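The workaround described in this thread, as a minimal sketch (metadata values are illustrative, taken from the manifest posted earlier):

```yaml
kind: ChaosEngine
metadata:
  namespace: sre-enablement
  generateName: pod-cpu-hog-k94
  # workaround: leave the annotations field unset (null) ...
  annotations:
  # ... instead of an empty map, which trips the probeRef check:
  # annotations: {}
```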

nwinstel-insight commented 1 year ago

This is a huge blocker for me as well; in beta-8 and below we just defined the probe inline and everything worked.

ksatchit commented 1 year ago

The error message definitely needs improvement; we will take that feedback. cc @Saranya-jena

On why the probe is mandatory:

With 3.0, the resilience score is based purely on probe success/failure. We no longer associate success/failure with the faults and experiments themselves; they only show execution status (queued/running/completed/stopped, etc.). This was based on user feedback, the gist of which is that fault injection/experiment execution is just an action, and resilience should be measured purely on what is "validated".

Since we are looking to project the resilience score (RS) as the main outcome of an experiment (it is the main actionable entity for many users, who decide on next steps based on its value), and the RS in turn depends on the existence of probes, this led us to the current flow, which mandates probes.

Having said that, what we would need in the current circumstances is support for default or "system" probes that are auto-configured for faults, without users having to explicitly create them, thereby ensuring no "additional" action/input is required from users while creating experiments. We can add this to the short-term roadmap.

smitthakkar96 commented 1 year ago

Sounds good, thanks for the explanation @ksatchit.