turing-machines / BMC-Firmware

Turing-pi BMC firmware
GNU General Public License v2.0

OOM-killer triggered during consecutive node flash command #162

Open svenrademakers opened 6 months ago

svenrademakers commented 6 months ago

Describe the bug I'm using a local tpi to attempt to flash all 4 nodes of the Turing Pi 2, one by one. It always fails quickly after starting to upload the image to node 2.

To Reproduce The following command was used:


for i in 1 2 3 4
do
        tpi --host 192.168.1.100 --user root --password turing flash --image-path /tmp/raspios.img -n $i
done

Expected behavior message.txt

All 4 nodes should flash successfully.


Versions ?

Additional context The dmesg log is attached.

ncasaux commented 6 months ago

I see exactly the same behaviour, even if I add a 60s sleep after each flash:

for i in 1 2 3 4; 
do
  tpi --host 192.168.1.100 --user root --password turing flash --image-path /tmp/raspios/raspios.img -n $i
  sleep 60s
done

This issue can be consistently reproduced.

ejade commented 4 months ago

Experiencing the same issue. Rebooting the BMC between flashes seems to be a workaround.
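A minimal sketch of that workaround as a script, reusing the flash invocation from the original report. It is untested against real hardware and rests on two assumptions that are not confirmed in this thread: that the installed tpi has a `reboot` subcommand, and that `tpi info` only succeeds once the BMC is reachable again. The function names (`flash_node`, `reboot_bmc`, `wait_for_bmc`, `flash_all`) are made up for illustration.

```shell
#!/bin/sh
# Sketch: flash nodes one at a time, rebooting the BMC between flashes.
# Host, credentials and image path default to the values from the report.
BMC_HOST="${BMC_HOST:-192.168.1.100}"
BMC_USER="${BMC_USER:-root}"
BMC_PASS="${BMC_PASS:-turing}"
IMAGE="${IMAGE:-/tmp/raspios.img}"

# Same flash invocation as in the original report.
flash_node() {
    tpi --host "$BMC_HOST" --user "$BMC_USER" --password "$BMC_PASS" \
        flash --image-path "$IMAGE" -n "$1"
}

# ASSUMPTION: this tpi build provides a `reboot` subcommand.
# If it does not, replace the body, e.g. with: ssh "$BMC_USER@$BMC_HOST" reboot
reboot_bmc() {
    tpi --host "$BMC_HOST" --user "$BMC_USER" --password "$BMC_PASS" reboot
}

# ASSUMPTION: `tpi info` fails while the BMC is down and succeeds once it is up.
# Waits a little for the reboot to start, then polls for up to about 5 minutes.
wait_for_bmc() {
    sleep 15
    tries=0
    while [ "$tries" -lt 60 ]; do
        if tpi --host "$BMC_HOST" --user "$BMC_USER" --password "$BMC_PASS" \
            info >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep 5
    done
    return 1
}

# Flash every node given as an argument; reboot the BMC before each flash
# except the first, and stop at the first failure.
flash_all() {
    first=1
    for node in "$@"; do
        if [ "$first" -eq 0 ]; then
            reboot_bmc || return 1
            wait_for_bmc || { echo "BMC did not come back" >&2; return 1; }
        fi
        first=0
        flash_node "$node" || { echo "flash failed on node $node" >&2; return 1; }
    done
}
```

After sourcing the file, `flash_all 1 2 3 4` would run the whole sequence. Note that rebooting the BMC may also cut power to the nodes, so this only sidesteps the memory problem rather than fixing it.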

rtreffer commented 4 months ago

Looks like a memory leak to me. I think there are a few issues here:

  1. bmcd is not supervised. I would like to run it under runit (or systemd)
  2. memory is rather tight
  3. bmcd seems to be leaking

Issue 1 could be easily fixed. I am considering a small PR for runit. That way the flash would fail, but at least the bmcd would be up again afterward :see_no_evil:

For (2), there is also a small potential gain from running zswap/zram with a very small memory setting, e.g. 16MB. This would give a tiny bit more headroom, but it would require a kernel config change.
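As a rough illustration of the zram variant only (zswap works differently and is not shown), the usual sequence for a small compressed swap device looks like the sketch below. It assumes a kernel built with `CONFIG_ZRAM` and `CONFIG_SWAP`, which the current BMC kernel apparently lacks, and that `modprobe`, `mkswap` and `swapon` exist on the BMC. The `run` and `setup_zram_swap` helpers are hypothetical names; `DRY_RUN=1` prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: create a small zram-backed swap device (run as root on the BMC).
# Requires a kernel with CONFIG_ZRAM and CONFIG_SWAP enabled.

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: setup_zram_swap [SIZE] [DEVICE]   (defaults: 16M, zram0)
setup_zram_swap() {
    size="${1:-16M}"
    dev="${2:-zram0}"
    # Loading the module may fail harmlessly if zram is built into the kernel.
    run modprobe zram || true
    # Set the uncompressed capacity of the device via sysfs.
    run sh -c "echo $size > /sys/block/$dev/disksize" || return 1
    # Format the device as swap and enable it.
    run mkswap "/dev/$dev" || return 1
    run swapon "/dev/$dev"
}
```

Running `DRY_RUN=1 setup_zram_swap` first is a cheap way to review the commands before touching the device.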

The crashes seem to happen at ~70-80MB of memory usage for bmcd. I can't imagine that happening without a memory leak.
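One way to check whether bmcd's memory really grows across flashes is to log its resident set size while the flash loop runs. The sketch below reads the `VmRSS` field from `/proc/<pid>/status`; it assumes `pidof` and `awk` are available on the BMC (typical for BusyBox, but not verified here), and the helper names `rss_kb` and `watch_bmcd_rss` are invented for this example.

```shell
#!/bin/sh
# Print the VmRSS value (in kB) from a /proc/<pid>/status-style file.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# Log "<unix time> <RSS in kB>" for bmcd every INTERVAL seconds (default 5)
# until the process disappears, e.g. because the OOM-killer ended it.
watch_bmcd_rss() {
    interval="${1:-5}"
    pid="$(pidof bmcd)" || { echo "bmcd is not running" >&2; return 1; }
    pid="${pid%% *}"    # keep the first PID if pidof printed several
    while [ -r "/proc/$pid/status" ]; do
        echo "$(date +%s) $(rss_kb "/proc/$pid/status")"
        sleep "$interval"
    done
}
```

Running `watch_bmcd_rss 2 > /tmp/bmcd-rss.log` on the BMC during the flash loop would show whether RSS climbs steadily between nodes or jumps only while an image is being uploaded.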