What happened:
This time I'm trying to run the disk-fill experiment. No matter which fill percentage I choose (20% or 80%), I get the same error message:
time="2023-11-30T14:56:10Z" level=info msg="[Fill]: Filling ephemeral storage, size: 214748KB"
time="2023-11-30T14:56:10Z" level=info msg="dd: {sudo dd if=/dev/urandom of=/proc/1249333/root/home/diskfill bs=256K count=838}"
time="2023-11-30T14:56:13Z" level=fatal msg="helper pod failed, err: could not fill ephemeral storage\n --- at /litmus-go/chaoslib/litmus/disk-fill/helper/disk-fill.go:137 (diskFill) ---\nCaused by: {\"source\":\"disk-fill-helper-nsnw2\",\"errorCode\":\"CHAOS_INJECT_ERROR\",\"reason\":\"838+0 records in\n838+0 records out\n\",\"target\":\"{podName: testing-pod-86b47d547d-vnzfb, namespace: test, container: test\"}"
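The numbers in the log are consistent with each other, which suggests dd wrote everything it was asked to write. A small sketch of the arithmetic, assuming the helper derives count by integer division of the size in KB by the block size in KB (ddCount is a hypothetical name, not taken from litmus-go):

```go
package main

import "fmt"

// ddCount is a hypothetical helper: it returns how many whole blocks of
// blockKB kilobytes fit into sizeKB kilobytes (integer division).
func ddCount(sizeKB, blockKB int) int {
	return sizeKB / blockKB
}

func main() {
	// Values from the log above: size 214748KB, bs=256K.
	fmt.Println(ddCount(214748, 256)) // 838
}
```

The result matches both count=838 in the logged dd command and the "838+0 records in / 838+0 records out" text that ends up in the error reason.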
How to reproduce it (as minimally and precisely as possible):
It can be reproduced easily by running the experiment from the Litmus Portal; however, I also ran the steps manually to find where the problem is:
I was able to create the helper pod manually and run it on my GKE cluster to experiment with disk-fill.
I was able to find the container ID and container PID of the pod whose ephemeral storage I want to fill.
First, the amount of ephemeral storage in use is calculated incorrectly (at least in Litmus 3.1) because the helper builds the du command like this:
du := fmt.Sprintf("sudo du /proc/%v/root", t.TargetPID)
However, /proc/%v/root is a symlink, and du does not follow a symlink given on the command line by default, so it reports 0 every time.
With a trailing slash (/proc/%v/root/), du returns the proper value.
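A minimal sketch of the trailing-slash variant, assuming the helper only needs the total that du prints on its last line. The names duCommand and parseDuTotal are hypothetical and not part of litmus-go, and the sample output in main is illustrative, not captured from a real run:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// duCommand builds the du invocation with a trailing slash so that the
// /proc/<pid>/root symlink is resolved to the container's root directory.
func duCommand(pid int) string {
	return fmt.Sprintf("sudo du /proc/%v/root/", pid)
}

// parseDuTotal returns the size from the last line of du output, which is
// the total for the directory passed on the command line.
func parseDuTotal(duOutput string) (int, error) {
	lines := strings.Split(strings.TrimSpace(duOutput), "\n")
	fields := strings.Fields(lines[len(lines)-1])
	if len(fields) == 0 {
		return 0, fmt.Errorf("empty du output")
	}
	return strconv.Atoi(fields[0])
}

func main() {
	fmt.Println(duCommand(1825267))

	// Illustrative du output (made-up sizes), one line per directory.
	sample := "12\t/proc/1825267/root/home\n2048\t/proc/1825267/root/\n"
	total, err := parseDuTotal(sample)
	fmt.Println(total, err)
}
```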
I then ran the dd command manually from the helper pod:
bash-5.1# crictl inspect --output yaml ac136572dd3cf| egrep pid
pid: 1
pid: 1825267
type: pid
bash-5.1# dd if=/dev/urandom of=/proc/1825267/root/home/diskfill bs=256K count=10485
10485+0 records in
10485+0 records out
bash-5.1# echo $?
0
File exists:
ls -latrh /proc/1825267/root/home/diskfill
-rw-r--r-- 1 root root 2.6G Dec 8 12:35 /proc/1825267/root/home/diskfill
When I create a bigger file (larger than the ephemeral storage limit), the pod is evicted, which is the expected behavior.
But when the experiment is run from the Litmus toolkit, it fails with no further messages. I checked the source, and the error seems to come from this code:
if t.SizeToFill > 0 {
if err := fillDisk(t, experimentsDetails.DataBlockSize); err != nil {
return stacktrace.Propagate(err, "could not fill ephemeral storage")
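I think the helper pod treats the output of the dd command, such as:
10485+0 records in
10485+0 records out
as an error, so the entire injection is marked as failed. dd writes these transfer statistics to stderr even when it succeeds, which matches the exit code 0 in the manual run above.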
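A minimal sketch of the distinction, assuming the helper currently fails the step whenever the command writes anything to stderr (this report suspects that but does not confirm it). The function runShell is hypothetical, and the echo command is only a stand-in for dd:

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runShell runs a shell command and returns whatever it wrote to stderr.
// Only a non-nil error from Run (non-zero exit status or failure to
// start) counts as a failure; text on stderr alone does not.
func runShell(command string) (string, error) {
	var stderr bytes.Buffer
	cmd := exec.Command("sh", "-c", command)
	cmd.Stderr = &stderr
	err := cmd.Run()
	return stderr.String(), err
}

func main() {
	// Stand-in for dd: prints statistics to stderr and exits with status 0.
	out, err := runShell(`echo "10485+0 records in" >&2`)
	fmt.Printf("stderr=%q err=%v\n", out, err)
}
```

With this approach the stderr text can still be logged for debugging, while the success or failure of the injection depends only on the exit status.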
What you expected to happen:
Disk fill should end with success.
Where can this issue be corrected? (optional)
This part of the code should be fixed:
https://github.com/litmuschaos/litmus-go/blob/v3.1.x/chaoslib/litmus/disk-fill/helper/disk-fill.go#L342
https://github.com/litmuschaos/litmus-go/blob/v3.1.x/chaoslib/litmus/disk-fill/helper/disk-fill.go#L178
Anything else we need to know?: