litmuschaos / litmus

Litmus helps SREs and developers practice chaos engineering in a Cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes is at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
https://litmuschaos.io
Apache License 2.0

Disk fill for ephemeral storage doesn't work properly #4329

Open ash-man opened 10 months ago

ash-man commented 10 months ago

What happened:

This time I'm trying to run the disk-fill experiment. No matter what fill percentage I choose (20% or 80%), I get the same error message:

    time="2023-11-30T14:56:10Z" level=info msg="[Fill]: Filling ephemeral storage, size: 214748KB"
    time="2023-11-30T14:56:10Z" level=info msg="dd: {sudo dd if=/dev/urandom of=/proc/1249333/root/home/diskfill bs=256K count=838}"
    time="2023-11-30T14:56:13Z" level=fatal msg="helper pod failed, err: could not fill ephemeral storage\n --- at /litmus-go/chaoslib/litmus/disk-fill/helper/disk-fill.go:137 (diskFill) ---\nCaused by: {\"source\":\"disk-fill-helper-nsnw2\",\"errorCode\":\"CHAOS_INJECT_ERROR\",\"reason\":\"838+0 records in\n838+0 records out\n\",\"target\":\"{podName: testing-pod-86b47d547d-vnzfb, namespace: test, container: test\"}"
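The numbers in the log are consistent with each other: 214748 KB divided into 256 KB blocks gives 838 whole blocks, which matches the `count=838` passed to dd. A minimal sketch of that arithmetic follows. The helper name `ddCount` is hypothetical, and the use of integer division is an assumption inferred from the log, not taken from the litmus-go source.

```go
package main

import "fmt"

// ddCount is a hypothetical helper: it returns how many whole blocks of
// blockKB kilobytes fit into sizeKB kilobytes (integer division, assumed).
func ddCount(sizeKB, blockKB int) int {
	return sizeKB / blockKB
}

func main() {
	count := ddCount(214748, 256)
	fmt.Println("count:", count)               // count: 838
	fmt.Println("written KB:", count*256)      // written KB: 214528
}
```

So the dd invocation itself looks correctly sized, and the "838+0 records in / 838+0 records out" lines in the error reason indicate that all requested blocks were written.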

What you expected to happen:

Disk fill should end with success.

Where can this issue be corrected? (optional)

This part of the code should be fixed:

https://github.com/litmuschaos/litmus-go/blob/v3.1.x/chaoslib/litmus/disk-fill/helper/disk-fill.go#L342

https://github.com/litmuschaos/litmus-go/blob/v3.1.x/chaoslib/litmus/disk-fill/helper/disk-fill.go#L178

How to reproduce it (as minimally and precisely as possible):

It is easy to reproduce by running the experiment from the Litmus Portal. I also ran the steps manually to find out where the problem might be:

  1. I manually created a helper pod and ran it on my GKE cluster to experiment with disk-fill.

  2. I found the container ID and container PID of the pod whose ephemeral storage I want to fill.

  3. First problem: the size of ephemeral storage USED is calculated incorrectly (at least in Litmus 3.1) because it uses the following command: du := fmt.Sprintf("sudo du /proc/%v/root", t.TargetPID). The path /proc/%v/root is a symlink, so du returns 0 every time. With a trailing slash (/proc/%v/root/), du returns the correct value.

  4. I ran the "dd" command manually from the helper pod:

     bash-5.1# crictl inspect --output yaml ac136572dd3cf| egrep pid
     pid: 1
     pid: 1825267
     - type: pid
     bash-5.1# dd if=/dev/urandom of=/proc/1825267/root/home/diskfill bs=256K count=10485
     10485+0 records in
     10485+0 records out
     bash-5.1# echo $?
     0

  5. The file exists:

     ls -latrh /proc/1825267/root/home/diskfill
     -rw-r--r-- 1 root root 2.6G Dec 8 12:35 /proc/1825267/root/home/diskfill

  6. When I create a bigger file (bigger than the ephemeral storage limit), the pod is evicted, which works as expected.

  7. But when we run it from the Litmus toolkit, it fails with no further messages. I've checked the code, and the error seems to come from this part:

     if t.SizeToFill > 0 {
         if err := fillDisk(t, experimentsDetails.DataBlockSize); err != nil {
             return stacktrace.Propagate(err, "could not fill ephemeral storage")
         }
     }

  8. I think the helper pod treats the normal output of the dd command (for example "10485+0 records in" and "10485+0 records out") as an error, so the entire injection is marked as failed.
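Steps 3 and 8 can be illustrated with a small Go sketch. It is not the litmus-go implementation. The names `targetRootPath` and `runCapture` are hypothetical, and a shell one-liner stands in for dd so the example runs without touching a disk. It shows two things: building the /proc path with a trailing slash, and judging success by the command's exit status rather than by whether anything was written to stderr (dd prints its "records in / records out" summary on stderr even when it succeeds).

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// targetRootPath is a hypothetical helper. It returns the container root as
// seen through /proc, with a trailing slash so that tools such as du operate
// on the directory the symlink points to rather than on the symlink itself.
func targetRootPath(pid int) string {
	return fmt.Sprintf("/proc/%d/root/", pid)
}

// runCapture is a hypothetical helper. It runs a command and returns stdout
// and stderr separately. err is non-nil only if the command could not be
// started or exited with a non-zero status.
func runCapture(name string, args ...string) (stdout, stderr string, err error) {
	var outBuf, errBuf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &outBuf
	cmd.Stderr = &errBuf
	err = cmd.Run()
	return outBuf.String(), errBuf.String(), err
}

func main() {
	fmt.Println("du target:", targetRootPath(1825267))

	// Stand-in for dd: writes a transfer summary to stderr and exits 0.
	script := "echo '10+0 records in' >&2; echo '10+0 records out' >&2"
	_, stderr, err := runCapture("sh", "-c", script)
	if err != nil {
		fmt.Println("command failed:", err)
		return
	}
	fmt.Printf("command succeeded; stderr was informational:\n%s", stderr)
}
```

If the helper currently treats any stderr output as a failure, checking only the returned error, as sketched here, would let a successful dd run pass while still surfacing real failures through a non-zero exit status.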

Anything else we need to know?:


ash-man commented 10 months ago

Any updates on this?

ash-man commented 8 months ago

Can someone from the dev team comment on this?

neelanjan00 commented 7 months ago

Hi @ash-man thanks for raising this issue. We're looking into it.