nakato / nixos-bpir3-example

MIT License

No space left on a device #8

Open ghostbuster91 opened 1 year ago

ghostbuster91 commented 1 year ago

Hi, today I noticed that my router became very sluggish.

After a quick inspection, it turned out that during the night it had tried to run a NixOS upgrade but didn't have enough space on the /boot partition.

Aug 17 04:17:07 surfer nixos-upgrade-start[2995908]: cat: write error: No space left on device
Aug 17 04:17:07 surfer nixos-upgrade-start[2993981]: warning: error(s) occurred while switching to the new configuration
Aug 17 04:17:07 surfer systemd[1]: nixos-upgrade.service: Main process exited, code=exited, status=1/FAILURE
Aug 17 04:17:07 surfer systemd[1]: nixos-upgrade.service: Failed with result 'exit-code'.
Aug 17 04:17:07 surfer systemd[1]: Failed to start NixOS Upgrade.
Aug 17 04:17:07 surfer systemd[1]: nixos-upgrade.service: Consumed 5min 34.070s CPU time, received 39.6M IP traffic, sent 338.6K IP traffic.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/mmcblk0p6   59G   26G   30G  47% /
/dev/nvme0n1p3   37G  440K   35G   1% /var
/dev/nvme0n1p1   19G  286M   17G   2% /var/log
/dev/mmcblk0p5  511M  511M     0 100% /boot
/dev/nvme0n1p2   37G   84K   35G   1% /tmp
tmpfs           196M     0  196M   0% /run/user/1000
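To see what is actually occupying /boot, a per-entry size listing can help. This is a minimal sketch, assuming the generic-extlinux-compatible layout where kernels, initrds and `*-dtbs` directories are copied into /boot/nixos (not verified for this exact image); `largest_boot_files` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the largest entries in a boot directory.
# Assumption: kernels, initrds and *-dtbs directories live in /boot/nixos
# (generic-extlinux-compatible layout); adjust the path if yours differs.

# largest_boot_files [DIR] [N]
# Prints "<KiB><TAB><path>" for the N largest entries in DIR, biggest first.
# Directories (e.g. *-dtbs) are reported as a single total.
largest_boot_files() {
  local dir=${1:-/boot/nixos} n=${2:-10}
  # du -sk gives one total per entry; sed (unlike head) reads all of its
  # input, so sort never gets SIGPIPE when the list is longer than N.
  du -sk -- "$dir"/* | sort -rn | sed -n "1,${n}p"
}
```

For example, `largest_boot_files /boot/nixos 5` should list the five biggest kernels, initrds or dtbs directories, which usually makes it obvious how many old generations are still sitting on the partition.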
  1. What can be done in such a case? A quick search shows that this was a known issue for the RPi (https://github.com/NixOS/nixpkgs/issues/23926#issuecomment-837142197), where it was solved by moving the boot partition onto the main rootfs, but this is not possible in our case.
  2. What should we do to prevent this from happening? I did have automatic GC configured to run weekly: https://github.com/ghostbuster91/nixos-router/blob/main/nixos/configuration.nix#L66

I tried running garbage collection manually; it removed some old generations, but the boot partition was still full.
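One way to see why garbage collection alone did not free anything: the generations disappear from the Nix profile, but the boot menu keeps listing them until the bootloader configuration is rewritten. A minimal sketch, assuming generations are `system-<N>-link` symlinks in /nix/var/nix/profiles and that /boot/extlinux/extlinux.conf has one `LABEL` line per menu entry; both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers comparing live system generations with boot menu entries.
# Assumptions: generations are /nix/var/nix/profiles/system-<N>-link symlinks and
# extlinux.conf has exactly one line starting with "LABEL " per menu entry.

# count_system_generations [PROFILEDIR]
count_system_generations() {
  local dir=${1:-/nix/var/nix/profiles}
  # -maxdepth 1: only the profile directory itself; symlinks are not followed.
  find "$dir" -maxdepth 1 -name 'system-*-link' | wc -l
}

# count_boot_entries [EXTLINUX_CONF]
count_boot_entries() {
  local conf=${1:-/boot/extlinux/extlinux.conf}
  # grep -c prints 0 but exits 1 when nothing matches; "|| true" keeps
  # callers running under "set -e".
  grep -c '^LABEL ' -- "$conf" || true
}
```

If `count_boot_entries` reports noticeably more entries than `count_system_generations` right after a GC, /boot is still holding files for generations that no longer exist.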

nakato commented 1 year ago

The files in /boot are only managed during switch-to-configuration, which runs as part of nixos-rebuild switch or nixos-rebuild boot. The easiest way out of this situation, after deleting old generations, is to delete some of the old kernel/initramfs files from /boot and re-run the command, which will then clean up the files left over from the deleted generations. I think it will also restore deleted kernel/initramfs files that are still referenced by a live generation, but I've never actually verified this.
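The recovery steps described above can be sketched as a script that only prints what it would do unless told otherwise. This is not verified on a BPI-R3 and rests on several assumptions: boot files live in /boot/nixos with names ending in `-Image` (arm64 kernel) or `-initrd`, the oldest files by modification time belong to the oldest generations, and `run`/`recover_boot` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch of the recovery steps, NOT verified on a BPI-R3.
# Assumptions: boot files live in /boot/nixos with names ending in "-Image"
# or "-initrd"; the oldest files by mtime belong to the oldest generations.
# DRY_RUN defaults to 1, which only prints the commands.

run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# recover_boot BOOTDIR N: free space by removing the N oldest kernels/initrds,
# then let switch-to-configuration rewrite /boot for the live generations.
recover_boot() {
  local bootdir=$1 n=$2 f
  # 1. Delete old generations of all profiles (run as root for the system profile).
  run nix-collect-garbage -d
  # 2. Remove the N oldest kernel/initrd files to make room for the copy step.
  while IFS= read -r f; do
    run rm -f -- "$bootdir/$f"
  done < <(ls -1tr -- "$bootdir" | grep -E -- '-(Image|initrd)$' | head -n "$n")
  # 3. Reinstall the bootloader config; this should drop entries for deleted
  #    generations (and, unverified, re-copy files a live generation still needs).
  run /run/current-system/bin/switch-to-configuration boot
}
```

`recover_boot /boot/nixos 4` previews the commands; `DRY_RUN=0 recover_boot /boot/nixos 4` (as root) would actually run them. Picking files by age is a heuristic: a kernel shared with the current generation may be old too, which is why re-running switch-to-configuration afterwards matters.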

U-Boot can read ext4 filesystems, so one long-term fix would be to rebuild without a /boot partition, so /boot would be on /.

There has to be enough free space in /boot for the new files to be copied in first, because cleanup during the switch only happens after the new files are in place.
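Since the new files are copied before the old ones are removed, a pre-flight free-space check before an upgrade is one possible safeguard. A minimal sketch; `free_mb` and `has_room` are hypothetical helpers, and the required headroom is something you would have to measure on your own system.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: new boot files are copied before old ones are
# cleaned up, so the boot mount needs headroom for a full extra kernel+initrd set.

# free_mb MOUNTPOINT -> available space in MiB (POSIX df output, 1024-byte blocks)
free_mb() {
  df -Pk -- "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# has_room MOUNTPOINT REQUIRED_MB -> exit status 0 if at least REQUIRED_MB is free
has_room() {
  local avail
  avail=$(free_mb "$1")
  [ "$avail" -ge "$2" ]
}
```

Usage, with 100 MiB as a placeholder rather than a measured value: `has_room /boot 100 || { echo "not enough free space on /boot" >&2; exit 1; }`. A realistic threshold is the combined size of one kernel, initrd and dtbs set in /boot/nixos.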

ghostbuster91 commented 1 year ago

> U-boot can read ext4 filesystems, so one long-term fix would be to rebuild without a /boot partition, so /boot would be on /.

I agree, that seems to be the way to go. Is it as easy as changing this:

        # I'm not sure if this is what MT means by "kernel" but I'm going to assume so as
        # this should be well into the uboot process now.
        bootSizeBlocks=$((bootPartSizeMB * 1024 * 1024 / 512))
        bootPartStart=$((fipEnd + 1))
        bootPartEnd=$((bootPartStart + bootSizeBlocks - 1))

        rootSizeBlocks=$(du -B 512 --apparent-size $root_fs | awk '{ print $1 }')
        rootPartStart=$((bootPartEnd + 1))
        rootPartEnd=$((rootPartStart + rootSizeBlocks - 1))

and

        # Create a new GPT data structure
        sgdisk -o \
        --set-alignment=2 \
        -n 1:$bl2Start:$bl2End -c 1:bl2 -A 1:set:2:1 \
        -n 2:$envStart:$envEnd -c 2:u-boot-env \
        -n 3:$factoryStart:$factoryEnd -c 3:factory \
        -n 4:$fipStart:$fipEnd -c 4:fip \
        -n 5:$bootPartStart:$bootPartEnd -c 5:boot -t 5:C12A7328-F81F-11D2-BA4B-00A0C93EC93B \
        -n 6:$rootPartStart:$rootPartEnd -c 6:root \
        $img

in https://github.com/nakato/nixos-bpir3-example/blob/main/lib/sd-image-mt7986.nix ?
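For reference, the partition arithmetic with the boot partition dropped reduces to starting root right after fip. This is a hypothetical sketch only: it covers the arithmetic, not the rest of the image builder (populating the former boot partition, sizing the root image so it also holds /boot, or where U-Boot looks for its config), none of which has been checked here. `root_layout` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the layout arithmetic in lib/sd-image-mt7986.nix with
# the boot partition removed: root simply starts right after fip.
# Not verified: other parts of the builder may also need changes.

# root_layout FIP_END ROOT_SIZE_BLOCKS -> prints "START END" in 512-byte sectors
root_layout() {
  local fipEnd=$1 rootSizeBlocks=$2
  local rootPartStart=$((fipEnd + 1))
  local rootPartEnd=$((rootPartStart + rootSizeBlocks - 1))
  echo "$rootPartStart $rootPartEnd"
}

# The sgdisk call would then lose its "-n 5:...:boot" line. Keeping root as
# partition 6 is one option (GPT allows gaps in numbering), so references to
# /dev/mmcblk0p6 would stay valid:
#   sgdisk -o --set-alignment=2 \
#     -n 1:$bl2Start:$bl2End -c 1:bl2 -A 1:set:2:1 \
#     -n 2:$envStart:$envEnd -c 2:u-boot-env \
#     -n 3:$factoryStart:$factoryEnd -c 3:factory \
#     -n 4:$fipStart:$fipEnd -c 4:fip \
#     -n 6:$rootPartStart:$rootPartEnd -c 6:root \
#     $img
```

For example, with a placeholder `fipEnd` of 20479 and a 4096-sector root image, `root_layout 20479 4096` prints `20480 24575`.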

> The easiest way to get out of this situation, after deleting old generations, is to delete some of the old boot kernels/initramfs files and re-run the command, which will remove the remaining deleted generations. I think it'll restore deleted kernels/initramfs files if they're still referenced by an alive generation as well, but I've never actually verified this.

Interesting, I will try this next time it happens.

nakato commented 8 months ago

I've finally started pulling out and iterating on the SBC bits in my private flake, and created a new repo, nakato/nixos-sbc. It puts the boot files on the root partition.

It's designed to be used as an input to other flakes, in contrast to this one, which pulled just enough from my personal flake to get a board booting and was more of a "copy it" repo.

It should be far enough along now to replace this repo, and it's meant to support more than just the bpir3 in the long run.

ghostbuster91 commented 8 months ago

This is great, thanks for sharing. Together with @steveej, we started working on something similar but tailored only to the bpir3, as we have no interest in other SBCs (at least at the moment). I feel there is some overlap between your repository and https://github.com/steveej-forks/nixos-bpir3. Maybe we should all collaborate?