litmuschaos / litmus-go

Apache License 2.0

CPU Hog Exec - is not terminating appropriately for certain target containers as intended #685

Closed rociomroman closed 7 months ago

rociomroman commented 9 months ago

What happened: We are able to inject the CPU stress; however, when the experiment tries to terminate it, we get the following error log:

time="2024-01-24T15:53:00Z" level=info msg="Target pods list for chaos, [abc-deploy]"

time="2024-01-24T15:53:00Z" level=info msg="[Chaos]: The Target application details" CPU CORE=1 Target Container=app-outer Target Pod=abc-deploy

time="2024-01-24T15:53:00Z" level=info msg="[Chaos]:Waiting for: 60s"

time="2024-01-24T15:54:00Z" level=info msg="[Chaos]: Time is up for experiment: pod-cpu-hog-exec"

/bin/sh: ps: command not found

kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]

time="2024-01-24T15:54:00Z" level=error msg="[Error]: CPU hog failed, err: Unable to kill the stress process in abc-deploy pod, err: command terminated with exit code 2"

What you expected to happen: We expected the pod-cpu-hog-exec experiment to terminate the stress process with the kill command.

How to reproduce it (as minimally and precisely as possible):

YAML:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: abc-cpustress
  namespace: abc
spec:
  appinfo:
    appns: abc
    applabel: app=abc-deploy
    appkind: deployment
  annotationCheck: false
  engineState: active
  chaosServiceAccount: litmus
  jobCleanUpPolicy: delete
  experiments:
    - name: pod-cpu-hog-exec
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: app-outer
            - name: CONTAINER_RUNTIME
              value: crio
            - name: SOCKET_PATH
              value: /var/run/crio/crio.sock
            - name: CHAOS_INJECT_COMMAND
              value: md5sum /dev/zero
            - name: CHAOS_KILL_COMMAND
              value: "kill $(find /proc -name exe -lname '*/md5sum' 2>&1 | grep -v 'Permission denied' | awk -F/ '{print $(NF-1)}')"

Anything else we need to know?: Litmus version 2.7.0

What we've tried: We tried the following kill commands from the pod-cpu-hog-exec documentation (https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-cpu-hog-exec/), along with different variations:

- kill $(find /proc -name exe -lname '*/md5sum' 2>&1 | grep -v 'Permission denied' | awk -F/ '{print $(NF-1)}')
- kill -9 $(ps afx | grep \"[md5sum] /dev/zero\" | awk '{print$1}' | tr '\n' ' ')
- kill -9 $(find /proc -name exe -lname '*/md5sum' 2>&1 | grep -v 'Permission denied' | awk -F/ '{print $(NF-1)}')

Thoughts: When I shelled into the target container in question, I noticed that the ps command was not available. That is probably why this command didn't work: kill -9 $(ps afx | grep \"[md5sum] /dev/zero\" | awk '{print$1}' | tr '\n' ' '). However, kill $(find /proc -name exe -lname '*/md5sum' 2>&1 | grep -v 'Permission denied' | awk -F/ '{print $(NF-1)}') also didn't work, which could be related to the container's permissions. In one of the other containers in the app, CPU stress is injected and terminated successfully. The difference I noticed is that that container is able to run ps.
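A quick way to confirm which of the tools used by these kill commands are absent from an image is to ask the shell to resolve each one. The following is a minimal sketch in POSIX sh; missing_tools is a hypothetical helper name, not something Litmus provides, and it relies only on the standard command -v builtin.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print, one per line, every named
# command that the current shell cannot resolve via PATH or as a builtin.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the name cannot be resolved.
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Pasting the function into a shell inside the target container (for example one opened with kubectl exec -n abc <pod> -c app-outer -- /bin/sh) and running missing_tools ps find grep awk tr should list whichever of those commands the image lacks. An empty result means all of them resolved, in which case the failure is more likely a permissions issue than a missing binary.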

Questions: Does the kill command for pod-cpu-hog-exec depend heavily on the container image and its configuration, that is, on whether ps or find /proc can be used inside the container? Is there a kill command that would work with most container images, or are there alternative commands to try? Any insights are much appreciated. Thank you.

Somewhat similar issue others are having for reference: https://github.com/litmuschaos/litmus/issues/1861

uditgaurav commented 7 months ago

Although it's possible to manually derive the kill command based on the application and use it in the experiment, I recommend using pod-cpu-hog instead of pod-cpu-hog-exec. This approach will help avoid such issues.

Refer: https://litmuschaos.github.io/litmus/experiments/faq/experiments/#whats-the-difference-between-pod-memorycpu-hog-vs-pod-memorycpu-hog-exec
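For anyone who still needs to derive a kill command by hand for an image without ps, here is a hedged sketch that uses only shell builtins and /proc. It assumes a Linux container with /bin/sh and a readable /proc/<pid>/comm for the stress process; pids_by_comm and kill_by_comm are hypothetical helper names, and this has not been validated against Litmus itself. The "kill: usage" line in the log above is what the shell prints when kill receives no PID at all, so the sketch skips the kill call when nothing matches.

```shell
#!/bin/sh
# pids_by_comm (hypothetical helper): print the PIDs whose /proc/<pid>/comm
# equals the given process name. Only builtins are used (globbing, read,
# echo), so ps, find, grep, and awk are not required in the image.
pids_by_comm() {
    target="$1"
    proc_root="${2:-/proc}"   # overridable so the function can be tried on a fake tree
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries that have vanished or are unreadable.
        [ -r "$dir/comm" ] || continue
        name=""
        { read -r name < "$dir/comm"; } 2>/dev/null
        if [ "$name" = "$target" ]; then
            echo "${dir##*/}"
        fi
    done
}

# kill_by_comm (hypothetical helper): send SIGTERM to every match, but only
# when at least one PID was found, so kill is never called without arguments.
kill_by_comm() {
    pids=$(pids_by_comm "$1")
    [ -n "$pids" ] || return 0
    # Word splitting of $pids is intended: one argument per PID.
    kill $pids
}
```

With CHAOS_INJECT_COMMAND set to md5sum /dev/zero, the equivalent of kill_by_comm md5sum would be the piece to collapse into a single-line CHAOS_KILL_COMMAND. Note that the kernel truncates comm to 15 characters, so longer process names need to be matched on their truncated form.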