ubuntu / zsys

ZSys daemon and client for zfs systems
GNU General Public License v3.0

out of space on bpool #231

Closed fubar-1 closed 2 years ago

fubar-1 commented 2 years ago

Describe the bug

So it's happened.

Right now I'm facing an issue on this system where a regular Ubuntu update failed with the following message: Package failed to install: Error while installing package: cannot copy extracted data for './boot/vmlinuz-5.15.0-37-generic' to '/boot/vmlinuz-5.15.0-37-generic.dpkg-new'

Apparently zsys creation / garbage-collection doesn't consider free space when making snapshot creation & removal decisions on limited-space devices / partitions. Ideally it would detect low free space situations on system state creation and do some garbage collection to avoid this situation. But it does not.
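To make the missing check concrete, here is a minimal sketch of the kind of free-space test I have in mind. It is not how zsys works internally; the function names `pool_is_low` and `check_bpool` and the 20% default threshold are assumptions made up for illustration. The only real command used is `zpool list` with `-H` (no header), `-p` (exact byte values) and `-o free,size`.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 when free space is below a minimum percentage.
# Arguments: free bytes, total bytes, minimum free percentage (integer).
pool_is_low() {
    free_bytes=$1
    size_bytes=$2
    min_pct=$3
    # Integer form of: free/size < min_pct/100
    [ $((free_bytes * 100)) -lt $((size_bytes * min_pct)) ]
}

# Illustrative wrapper (assumes the boot pool is named bpool, as on this system).
# Returns 1 when the pool is low, 2 when zpool itself fails, 0 otherwise.
check_bpool() {
    threshold=${1:-20}
    out=$(zpool list -Hp -o free,size bpool) || return 2
    # Word splitting of the tab-separated output is intentional here.
    set -- $out
    if pool_is_low "$1" "$2" "$threshold"; then
        echo "bpool has less than ${threshold}% free; prune old states before updating"
        return 1
    fi
}
```

Running `check_bpool 20` before a kernel update would at least warn before dpkg runs out of room, which is the behaviour I expected from the tooling.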

My (limited) understanding of the situation is that bpool is full of snapshots and can't be updated correctly. Having recovered from a failed update using zsys's GRUB recovery on this system in the past, I half-expect that using zsysctl state remove autozsys_XXXXXX to free up space, so I can complete the system update, will trigger bug #218 and leave the system in an unbootable state.

Right now I'm not sure what to do. Just try to reboot and see what happens? Manually delete the snapshots listed by zfs list -t snap bpool/BOOT -r? Or roll the dice and use zsysctl state remove?
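Before deleting anything by hand, it helps to see which snapshots actually hold space. The sketch below only reads information; `rank_snapshots` and `list_boot_snapshots` are hypothetical names, and the `zfs list` flags used are `-H` (no header, tab-separated), `-p` (exact byte values), `-t snapshot`, `-o name,used` and `-r`. Keep in mind that a snapshot's USED value counts only the space unique to that snapshot, so removing several adjacent snapshots can free more than the sum shown.

```shell
#!/bin/sh
# Hypothetical helper: reads "name<TAB>used_bytes" lines on stdin and
# prints them sorted by used bytes, largest first.
rank_snapshots() {
    sort -t "$(printf '\t')" -k2,2nr
}

# Illustrative read-only use against the boot pool (assumes bpool/BOOT).
list_boot_snapshots() {
    zfs list -Hp -t snapshot -o name,used -r bpool/BOOT | rank_snapshots
}
```

Calling `list_boot_snapshots | head -n 5` would show the five snapshots with the most unique data, which are the first candidates to look at.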

Having been in similar boot-failure recovery situations before, if the system fails to boot, to my knowledge there is no clear system recovery path other than following the instructions here and trying to understand how to fix the issue. But I expect I'll have to do what I've had to in the past: get zfs working in a live-boot session, copy all my data off, reformat the disk and reinstall from scratch. Which is sad, since it's likely just a zfs attribute set incorrectly somewhere. But the last time, I spent a week trying to manually recover (the OpenZFS mail list [here](https://zfsonlinux.topicbox.com/groups/zfs-discuss) was a waste of time) and it took a month and a week of lost vacation time to make up for the lost time.

Finally, I'm not sure I'm seeing the value of Ubuntu boot on ZFS anymore. Simply snapshotting things doesn't seem sufficient without a tool like zsys to keep things in order. Recovery without zsys (which is brilliant when it works) seems to be a mess that's left to gurus. It's far too complex for me to figure out what's going wrong on my own.

It's really sad - such an ideal system, 98% complete, just to be left in the dust with critical flaws even enthusiasts can't tolerate now that Microsoft pays the bills (to kill franchise-threatening projects such as this) to move developers onto supporting Active Directory.

This may be my last post for a while since I'm looking at probably 20-40 hours of recovery. I think my best option is to use zsysctl state remove on the oldest autozsys snapshot and cross my fingers.

Apologies if this comes off as a rant. Maybe I'm just tired, it's been a long day. But having my system likely about to time-bomb, for really bad reasons, isn't a terrible excuse.

To Reproduce

  1. Install Ubuntu with ZFS.
  2. ZSYS is installed by default in Ubuntu 20; in Ubuntu 21, use 'apt install zsys'.
  3. Use Ubuntu for a while.
  4. Get comfortable. Spend a bunch of time getting the system just how you like it.
  5. Pile up lots of valuable data.
  6. Install recommended updates when notified.
  7. At some point, use GRUB to recover the system to an earlier snapshot.
  8. Sooner or later BOOM Out of space on bpool. Or some other fail.
  9. Learn how to manually recover the system, or at least try to rescue your data before wiping it and starting over.

Expected behavior

Simple, robust operation. Usable by non-zfs experts. Not crashing to an unbootable, unrecoverable state.

For ubuntu users, please run and copy the following:

  1. ubuntu-bug zsys --save=/tmp/report
  2. Copy paste below /tmp/report content:
    (it's just a very typical Ubuntu system on ZFS filesystem and zsys installed)

Additional context

In my humble opinion:

  1. ZFS itself is great.
  2. ZFS Root on Linux in its current form isn't very good. The dependence on multiple pools (bpool, rpool) should be unnecessary. If GRUB needs to be forked and patched to improve rpool compatibility, then that should be done instead of this bpool hack. (bpool running out of space is causing my current issue.)
  3. Requiring all of these zfs pool & filesystem attributes to be set perfectly to have a bootable system is asking for failure. A root pool should contain an attribute 'bootfs=' that points to the filesystem to boot. GRUB should be able to boot from any filesystem. No fancy forking, cloning, etc. should be necessary (though they should remain available as options).
  4. All of these numerous separate filesystems mounted in various places to make up a bootable system aren't necessary, so eliminate them in the base case. Sysadmins can make these separate filesystems if they so choose. rpool/ROOT/ubuntu_xxxxxx/var/mail? Really?
  5. ZFS bootable system recovery needs to become a thing. Like, something that exists. At least a document on how to rebuild an Ubuntu system from an rpool.
  6. ZSYS, the ideal companion to bootable ZFS, needs just a little more work. Fix the issues (obvious glaring gaps like the lack of a GUI so that non-geniuses can manage it, creating out-of-space issues, wiping people's systems unexpectedly, etc.)

If these issues were addressed, it would make for an ideal system setup. Something that would give Ubuntu the competitive advantage @canonical needs so badly right now. Not active directory integration :-p.

fubar-1 commented 2 years ago

It's going to take at least a day to copy my data off so in the hopes of getting a quick response, here's my situation. @didrocks if you could offer some advice I'd appreciate it...

root@beast:~# zfs list -t snap bpool/BOOT -r -o creation,name,used -s creation
CREATION               NAME                                       USED
Wed May  4 23:23 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_hlmi6g     0B
Wed May  4 23:23 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_vhp0xi     0B
Wed May  4 23:24 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_2vdkez     0B
Thu May  5  0:26 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_kcyplx     0B
Thu May  5 10:10 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_g7fh5z     8K
Thu May  5 15:34 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_j3ygoq     8K
Fri May  6  9:13 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_j5wpmk    64K
Fri May  6  9:24 2022  bpool/BOOT/ubuntu_hpuj4i@working            72K
Sat May  7 21:21 2022  bpool/BOOT/ubuntu_ylut29@autozsys_ql8ps5    64K
Sat May  7 21:22 2022  bpool/BOOT/ubuntu_ylut29@autozsys_00iao6    80K
Sun May  8  1:36 2022  bpool/BOOT/ubuntu_qjn27u@autozsys_z2qqay     8K
Sun May  8 13:33 2022  bpool/BOOT/ubuntu_dwwf56@autozsys_cis3pt     0B
Sun May  8 13:33 2022  bpool/BOOT/ubuntu_dwwf56@autozsys_dvld37     0B
Sun May  8 15:28 2022  bpool/BOOT/ubuntu_dwwf56@autozsys_gtehow     0B
Sun May  8 15:32 2022  bpool/BOOT/ubuntu_dwwf56@autozsys_1gutqv     0B
Mon May  9 10:31 2022  bpool/BOOT/ubuntu_dwwf56@autozsys_9on5ib     8K
Tue May 10 11:21 2022  bpool/BOOT/ubuntu_hpuj4i@autozsys_iupcss   116M
root@beast:~# zsysctl state remove working --dry-run
rpool/USERDATA/root_c0jn3v@working will be detached from system state rpool/ROOT/ubuntu_hpuj4i@working
rpool/USERDATA/root_c0jn3v@working has a dependency linked to some states:
  - rpool/USERDATA/root_9mvv9t (2022-05-10 11:01:26) to remove. Currently linked to rpool/ROOT/ubuntu_ylut29
  - rpool/USERDATA/root_9mvv9t@autozsys_00iao6 (2022-05-07 21:22:14)
  - rpool/USERDATA/root_dij59a (2022-05-10 10:20:34) to remove. Currently linked to rpool/ROOT/ubuntu_dwwf56, rpool/ROOT/ubuntu_qjn27u
  - rpool/USERDATA/root_dij59a@autozsys_9on5ib (2022-05-09 10:31:02)
  - rpool/USERDATA/root_dij59a@autozsys_z2qqay (2022-05-08 01:36:58)
  - rpool/USERDATA/root_dij59a@autozsys_1gutqv (2022-05-08 15:32:34)
  - rpool/USERDATA/root_dij59a@autozsys_gtehow (2022-05-08 15:28:13)
  - rpool/USERDATA/root_dij59a@autozsys_dvld37 (2022-05-08 13:33:56)
  - rpool/USERDATA/root_dij59a@autozsys_cis3pt (2022-05-08 13:33:04)
  - rpool/USERDATA/root_9mvv9t@autozsys_ql8ps5 (2022-05-07 21:21:15)

Would you like to proceed [y/N]? n^C

Will a zsysctl state remove working break this system? What is the recommended course of action here?
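Part of what worries me is that the dry run marks whole datasets, not just snapshots, as "to remove". A small sketch to pull those out of the output follows. It assumes the dry-run text was pasted into a file (the name dry-run.txt is hypothetical) and that the line format matches what is shown above; `datasets_to_remove` is a made-up helper, not a zsysctl feature.

```shell
#!/bin/sh
# Hypothetical helper: reads zsysctl dry-run output on stdin and prints the
# dataset name from every dependency line flagged "to remove.".
# Expected line shape (taken from the output above):
#   - rpool/USERDATA/root_xxxxxx (DATE TIME) to remove. Currently linked to ...
datasets_to_remove() {
    awk '/\) to remove\./ { print $2 }'
}
```

With the output above saved to a file, `datasets_to_remove < dry-run.txt` would print rpool/USERDATA/root_9mvv9t and rpool/USERDATA/root_dij59a, both of which are reported as currently linked to other system states.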

(A list of my rpool snaps, see attached) rpool-snaps.txt