Closed Dodan closed 4 years ago
Thanks for the report @Dodan . Would you be able to try a couple of things to see if we can find some more clues...
1) Before you quit the container where the failure happened, can you run a dmesg inside it to get the kernel logs, to see if there was an OOM kill, for instance? (Adding an exec -ti dmesg command to your test script may do it.)
2) Could you run the failing command under strace, so we can capture a log of the system calls, and then hopefully we can find the failing system call (near the end of the logs), and that might provide clues as to where the error is.
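Something along these lines should capture both (a rough sketch only: the CONT_ID variable and log file names follow the reproduction script and are purely illustrative, and the container image needs strace installed):

```bash
# Illustrative additions to the test script; $CONT_ID is the container ID it already tracks.
crictl exec $CONT_ID dmesg > dmesg_after_update.log     # guest kernel log after a step
crictl exec $CONT_ID apt install -y strace
# -f follows forked children, -o writes the trace to a file inside the container
crictl exec $CONT_ID strace -f -o /tmp/apt.strace apt install -y build-essential
```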
thanks!
adding it to the queue
Hello @grahamwhaley. We gathered dmesg logs after each command given to the container and ran the build-essential install under strace. We put the logs in the attached file, since the markdown formatting was mangling them.
This is the Kata-Firecracker case, which fails. dmesg_strace_kata_firecracker_log.txt
Hi @Dodan, I can't reproduce this issue:
crio version 1.15.0
commit: "485227d727401fa0472a449b5df3b0537e314ebb"
firecracker 0.18.0
kata-runtime : 1.9.1
commit : b909cab6c40eacaca15038ab3f2706a634a50501
OCI specs: 1.0.1-dev
Inside the container
# free -h
total used free shared buff/cache available
Mem: 2.0G 36M 1.9G 48K 37M 1.9G
Swap: 0B 0B 0B
# uname -a
Linux 4e40ad6b0d59 4.19.75 #1 SMP Wed Oct 9 00:11:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
the configuration file that you're sharing in the issue is for QEMU, can you share the configuration used for firecracker?
@devimc This is interesting... On our setup, we managed to reproduce the issue on 2 separate machines running Ubuntu 18.04. Maybe we are doing something wrong.
Anyway, these are the configurations we use.
The script to launch kata-fc (a script we put at /usr/bin/kata-fc):
#!/bin/bash
/usr/bin/kata-runtime --kata-config "/usr/share/defaults/kata-containers/configuration-fc.toml" "$@"
And this is the configuration file for firecracker (/usr/share/defaults/kata-containers/configuration-fc.toml)
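As a sanity check on a setup like this (illustrative only; the kata-env output format varies between Kata 1.x releases), the wrapper can be asked to report the environment it will actually use, confirming it picks up the Firecracker configuration rather than the QEMU one:

```bash
# Confirm the wrapper resolves to kata-runtime and uses the Firecracker hypervisor.
/usr/bin/kata-fc --version
/usr/bin/kata-fc kata-env | grep -A 5 '\[Hypervisor\]'   # Path should point at the firecracker binary
```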
@Dodan I'm using the same Ubuntu version but different kernel, I'm using 4.15
I updated my kernel to 5.3.0-19 and still cannot reproduce this issue
@devimc I'd have a few questions if you don't mind:
./kill_container.sh kata-fc
d3fac27a128019fb65a62a96c62a97975b3495f3498eea797106257d2bb95734
1
FATA[0000] Creating container failed: rpc error: code = Unknown desc = container create failed: [PATCH /drives/{drive_id}][400] patchGuestDriveByIdBadRequest &{FaultMessage:Cannot open block device. Invalid permission/path.}
Are you using a custom / self-compiled version of Firecracker & jailer? We've been using the "stock" ones, available at these links:
https://github.com/firecracker-microvm/firecracker/releases/download/v0.18.0/jailer-v0.18.0
https://github.com/firecracker-microvm/firecracker/releases/download/v0.18.0/firecracker-v0.18.0
We've been using kernel 5.2.2. Do you think this might be a factor? I know there were some issues in the past with vsock on kernels between 4.15 and 5.0, which is why I'm asking.
@Dodan
> Are you using the Firecracker jailer when running the Kata-fc?
no, this is my configuration.toml
> Are you using a custom / self-compiled version of Firecracker & jailer? We've been using the "stock" ones.
I'm using the same binaries
> We've been using kernel 5.2.2. Do you think this might be a factor?
4.15 and 5.3 both work for me, so I don't think so. By the way, I'm not running on bare metal, I'm running in VMs (nested virtualization); can you try it out in an Ubuntu VM?
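If it helps, a quick way to confirm that such a VM can actually run Kata with Firecracker (a rough sketch; kata-check output differs between Kata releases):

```bash
# The guest VM needs hardware virtualization exposed (VMX/SVM) and access to /dev/kvm.
egrep -c '(vmx|svm)' /proc/cpuinfo   # non-zero means nested virtualization is available
ls -l /dev/kvm
# kata-runtime can also verify that the host meets its requirements
kata-runtime kata-check
```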
@devimc Sorry for the late reply.
We tested this in a VM with nested virtualization and, indeed, found that the issue sometimes does not reproduce there.
Still, in a good share of our runs we see the process killed with exit code 137. We noticed that the bug is more likely to occur when the processes are long running, so maybe the script should also install something long running via apt, such as nginx or apache.
It is trickier than we expected to find something that reproduces this 100% of the time. Any ideas you might have would be greatly appreciated.
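For reference, exit code 137 is 128 + 9, i.e. the process was terminated by SIGKILL, which is consistent with either an OOM kill or an explicit kill from the runtime. A minimal illustration:

```bash
# A process terminated by SIGKILL reports exit status 128 + 9 = 137.
sh -c 'kill -KILL $$'
echo $?   # prints 137
```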
@Dodan I tried installing libreoffice + firefox in a loop and I couldn't reproduce it.
Q: Do you have enough space in your dm.directlvm_device?
I'm using a loop device as dm.directlvm_device
$ dd if=/dev/zero of=disk.img bs=10M count=500
$ printf "g\nn\n\n\n\nw\n" | fdisk disk.img
$ sudo losetup --show -Pf disk.img
/dev/loop1
$ sudo mkfs.ext4 /dev/loop1p1
@devimc We have enough space on the LVM. We use an LVM thin pool backed by a physical volume of ~150 GB.
This is what we see when calling lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
thinpool storage twi-a-t--- <146.45g 0.16 1.03
We can try and see if it reproduces with a loop device.
We tried reproducing this bug on a CentOS machine with Kata 1.9.3, CRI-O 1.15.3 and kernel 5.4.2, and, indeed, the bug no longer reproduces.
We don't know which one of these differences solved the issues we were seeing, but we wanted to thank you for your support!
@Dodan you're welcome. I'm going to close this issue; feel free to re-open it if you can reproduce it again.
Description of problem
Hello! My team and I have been running benchmarks on various FaaS technologies using different containerization setups. We have been facing some stability issues when using Kata with Firecracker.
Below we have put together a simple scenario where a series of apt update/install commands, inside a kata-fc container running Ubuntu, leads to the container receiving a SIGKILL.
We noticed that this behaviour also shows up with other long-running processes that we use (e.g. Python HTTP servers, or using the container's shell interactively), but the easiest and most reliable way to replicate it was by running the package manager.
Can you please have a look?
This is our setup:
This is the script we used to recreate the bug:
Show code:
```bash
#!/bin/bash
# kill_container.sh
POD_ID=`crictl runp -r $1 pod.yaml`
READY_STATUS=`crictl pods | grep busybox | grep Ready | wc -l`
while [[ $READY_STATUS -ne 1 ]]; do
    READY_STATUS=`crictl pods | grep busybox | grep Ready | wc -l`
done
echo $POD_ID
echo $READY_STATUS

CONT_ID=`crictl create $POD_ID container.yaml pod.yaml`
READY_STATUS=`crictl ps -a | grep busybox | grep Created | wc -l`
while [[ $READY_STATUS -ne 1 ]]; do
    READY_STATUS=`crictl ps -a | grep busybox | grep Created | wc -l`
done
echo $CONT_ID
echo $READY_STATUS

EXEC_STATUS=`crictl exec -it $CONT_ID apt update`
echo $EXEC_STATUS
EXEC_STATUS=`crictl exec -it $CONT_ID apt install -y htop`
echo $EXEC_STATUS
EXEC_STATUS=`crictl exec -it $CONT_ID apt install -y build-essential`
echo $EXEC_STATUS

RM_STATUS=`crictl stop $CONT_ID && crictl rm $CONT_ID`
echo $RM_STATUS
RMP_STATUS=`crictl stopp $POD_ID && crictl rmp $POD_ID`
echo $RMP_STATUS
```

Expected result
Actual result
Environment
These are the pod.yaml and container.yaml files we used:
These are the CRI-O network configuration file and crio.conf file we used:
This is the output of the kata-collect-data.sh script: