ubuntu / zsys

ZSys daemon and client for zfs systems
GNU General Public License v3.0

Unmanaged datasets destroyed at boot time #196

Open azazar opened 3 years ago

azazar commented 3 years ago

Describe the bug

When custom datasets are present in the system, they are not recognised by zsys and get scheduled for destruction. This is probably a new incarnation of bug #103, reported earlier.

Before the filesystem was actually deleted, there were two failed attempts:

...
Mar 12 14:19:45 hp zed: eid=132 class=history_event pool_guid=0x974173CCDE607995
Mar 12 14:19:48 hp zed: eid=133 class=history_event pool_guid=0x974173CCDE607995
Mar 12 14:19:50 hp zed: eid=134 class=history_event pool_guid=0x974173CCDE607995
Mar 12 14:20:18 hp systemd[1]: zsysd.service: Succeeded.
Mar 12 14:21:01 hp systemd[1]: Starting Clean up old snapshots to free space...
Mar 12 14:21:01 hp systemd[1]: Starting ZSYS daemon service...
Mar 12 14:21:02 hp systemd[1]: Started ZSYS daemon service.
Mar 12 14:21:03 hp zsysd[14234]: level=warning msg="[[f0d315a4:620bdb6a]] Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy \"rpool/USERDATA/m_enc\" and its children: cannot destroy dataset \"rpool/USERDATA/m_enc\": dataset is busy"
Mar 12 14:21:03 hp zsysctl[14228]: #033[33mWARNING#033[0m Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy "rpool/USERDATA/m_enc" and its children: cannot destroy dataset "rpool/USERDATA/m_enc": dataset is busy
Mar 12 14:21:05 hp systemd[1]: zsys-gc.service: Succeeded.
Mar 12 14:21:05 hp systemd[1]: Finished Clean up old snapshots to free space.
Mar 12 14:22:05 hp systemd[1]: zsysd.service: Succeeded.
...
Mar 12 21:28:30 hp systemd[1]: Starting Clean up old snapshots to free space...
Mar 12 21:28:31 hp zed: eid=3635 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:35 hp systemd-resolved[1254]: Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP.
Mar 12 21:28:35 hp zed: eid=3636 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:38 hp zed: eid=3637 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:38 hp zsysd[476502]: level=warning msg="[[489b0020:c2f4566c]] Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy \"rpool/USERDATA/m_enc\" and its children: cannot destroy dataset \"rpool/USERDATA/m_enc\": dataset is busy"
Mar 12 21:28:38 hp zsysctl[485834]: #033[33mWARNING#033[0m Couldn't destroy user dataset rpool/USERDATA/m_enc (due to rpool/USERDATA/m_enc): couldn't destroy "rpool/USERDATA/m_enc" and its children: cannot destroy dataset "rpool/USERDATA/m_enc": dataset is busy
Mar 12 21:28:41 hp zed: eid=3638 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:43 hp zed: eid=3639 class=history_event pool_guid=0x974173CCDE607995
Mar 12 21:28:45 hp systemd[1]: zsys-gc.service: Succeeded.
Mar 12 21:28:45 hp systemd[1]: Finished Clean up old snapshots to free space.
Mar 12 21:28:46 hp zed: eid=3640 class=history_event pool_guid=0x974173CCDE607995

From zpool history -il rpool output:

2021-03-13.10:38:39 [txg:73086] destroy rpool/USERDATA/root_3vm1uh@autozsys_rb901c (481)  [on hp]
2021-03-13.10:38:44 ioctl destroy_snaps
    input:
        snaps:
            rpool/USERDATA/root_3vm1uh@autozsys_rb901c
 [user 0 (root) on hp:linux]
2021-03-13.10:39:00 [txg:73115] destroy rpool/USERDATA/m_enc (1331)  [on hp]
2021-03-13.10:50:39 [txg:73753] open pool version 5000; software version unknown; uts hp 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 [on hp]
2021-03-13.10:50:39 [txg:73755] import pool version 5000; software version unknown; uts hp 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 [on hp]
2021-03-13.10:50:39 zpool import -N rpool [user 0 (root) on hp:linux]
2021-03-13.10:51:31 [txg:73799] set rpool/ROOT/ubuntu_7k8at6 (90) com.ubuntu.zsys:last-used=1615621890 [on hp]
2021-03-13.10:51:31 [txg:73800] set rpool/USERDATA/o_envgoi (5181) com.ubuntu.zsys:last-used=1615621890 [on hp]
2021-03-13.10:51:31 [txg:73802] set rpool/USERDATA/root_3vm1uh (288) com.ubuntu.zsys:last-used=1615621890 [on hp]
2021-03-13.10:51:31 [txg:73804] set rpool/ROOT/ubuntu_7k8at6 (90) com.ubuntu.zsys:last-booted-kernel=vmlinuz-5.8.0-44-generic [on hp]

The filesystem got deleted silently, without any notice:

Mar 13 10:38:10 hp zed: eid=77 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:13 hp zed: eid=78 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:16 hp zed: eid=79 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:18 hp zed: eid=80 class=history_event pool_guid=0xC787AB1273593DF8
Mar 13 10:38:22 hp zed: eid=81 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:25 hp zed: eid=82 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:29 hp zed: eid=83 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:33 hp zed: eid=84 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:36 hp zed: eid=85 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:39 hp zed: eid=86 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:42 hp zed: eid=87 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:45 hp zed: eid=88 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:47 hp dbus-daemon[2744]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.109' (uid=1001 pid=22803 comm="exo-desktop-item-edit -t Link -c --xid=0x18d /home" label="unconfined")
Mar 13 10:38:48 hp systemd[1]: Starting Hostname Service...
Mar 13 10:38:48 hp dbus-daemon[2744]: [system] Successfully activated service 'org.freedesktop.hostname1'
Mar 13 10:38:48 hp systemd[1]: Started Hostname Service.
Mar 13 10:38:49 hp zed: eid=89 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:52 hp zed: eid=90 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:55 hp zed: eid=91 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:38:58 hp zed: eid=92 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:01 hp zed: eid=93 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:01 hp CRON[23349]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Mar 13 10:39:05 hp zed: eid=94 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:07 hp zed: eid=95 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:10 hp zed: eid=96 class=history_event pool_guid=0x974173CCDE607995
Mar 13 10:39:12 hp systemd[1]: zsys-gc.service: Succeeded.
Mar 13 10:39:12 hp systemd[1]: Finished Clean up old snapshots to free space.

To Reproduce

  1. Create encrypted home filesystem as written here: https://talldanestale.dk/2020/04/06/zfs-and-homedir-encryption/
  2. Populate it with data to trigger gc
  3. Create zsys.conf with nonzero general.minfreepoolspace
  4. Reboot
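For step 3, a minimal /etc/zsys.conf could look like the sketch below. Only the general.minfreepoolspace key is named in the reproducer; the surrounding structure and the chosen value are assumptions, not taken from the reporter's attached config.

```yaml
# /etc/zsys.conf -- minimal sketch (structure assumed, not from the report).
# minfreepoolspace is a percentage of free pool space; when the pool drops
# below it, zsys garbage collection becomes more aggressive.
general:
  minfreepoolspace: 20
```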

Installed versions:

azazar commented 3 years ago

Not sure if it's needed, but let it be here: zsys.conf.gz

azazar commented 3 years ago

zsys shouldn't implicitly manage filesystems and snapshots that it didn't create.

jvcdk commented 3 years ago

I agree with @azazar and would go further and say it shouldn't destroy filesystems at all (only auto-created snapshots).

lathiat commented 3 years ago

I think I hit this today too, all 3 homedirs including all snapshots are gone on my system (Ubuntu Hirsute). Similar looking logs to the reporter. I will try to gather more evidence. Happened on Friday for me.

didrocks commented 3 years ago

First, sorry for your destroyed datasets. We initially treated USERDATA (which was never used in any ZFS systems we monitor) as a reserved ZSys namespace and took ownership of the datasets in it. Note that any dataset you create there without the appropriate zsys metadata won't be handled, and you lose benefits such as having user datasets reverted automatically. In general, I think USERDATA shouldn't be used manually.

We need to delete datasets there: after a revert, a dataset without any zsys tag (because of an unsuccessful revert, or a dataset that expired due to garbage collection) would otherwise remain forever. Filesystem datasets there act as a kind of snapshot, and we need to delete them to avoid cluttering the system.

However, we added some mitigations, as you noted on bug #103, and there has been no change since, which is why I'm a little puzzled about why it only triggers now (could this be related to your particular encryption setup?). Thanks for the reproducer; I'll start my investigation from there and keep you posted.

didrocks commented 3 years ago

Unfortunately, we couldn't reproduce it with the steps described. I really wonder what differed in your case. (Note: I saw you mentioned that GC only runs once the disk is 80% full; this is not the case, GC is time-based.)

For your information, here are the datasets that are up for deletion in /USERDATA (once the GC limit is reached):

In addition to the reproducer, we tried the following in the USERDATA namespace (every time, we advanced the date and forced GC to keep 0 datasets):

All those cases pass on encrypted and unencrypted datasets as we expect (we did find an issue on hirsute, due to ZFS packaging and not related to ZSys itself, which makes some datasets not mount at boot; we are fixing it). Any idea what's different in your configuration (if you can come up with a full reproducer, that would be awesome)? The only explanation I can see is that rpool/USERDATA/m_enc was a clone of the unencrypted dataset, never tagged with ZSys, which isn't what the how-to does (it creates its own dataset and tags it with ZSys).

azazar commented 3 years ago

If by tagging you mean setting the com.ubuntu.zsys:bootfs-datasets fs option, then maybe that was the cause of the problem? When I followed the guide on home fs encryption, I set it to -.

didrocks commented 3 years ago

Yeah, tagging is about adding that tag, but the manual doesn't tell you to set it to -; it tells you to set it to the system dataset it's associated with:

VAL=$(zfs get com.ubuntu.zsys:bootfs-datasets rpool/USERDATA/jvc_tdssc -H -ovalue)
sudo zfs set com.ubuntu.zsys:bootfs-datasets=$VAL rpool/USERDATA/jvc_enc

(ofc, you need to change the dataset names there)
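To double-check that the tag actually landed, the property can be read back afterwards. A sketch, reusing the hypothetical dataset name from the commands above:

```shell
# Read back the tag on the encrypted dataset; it should print the associated
# system dataset (e.g. rpool/ROOT/ubuntu_xxxxxx), not "-" or an empty value.
zfs get -H -o value com.ubuntu.zsys:bootfs-datasets rpool/USERDATA/jvc_enc
```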

jdavidberger commented 2 years ago

I think this bug hit me today. Normally I wouldn't bother reporting behavior with this sparse of information but the bug deleted my home directory and after a few hours attempting recovery I'm pretty sure it's gone.

This was on ubuntu 21.04, zfs 2.02, Linux 5.11.0-7620 with an encrypted home directory.

Admittedly vague notes:

From that point forward the home directory dataset was gone. Zpool history did not show the deletion but it was there with the -i flag. I have a log file with that in it that I'll post when I have a new system up and running.

I think there is a chance you can exhibit this bug by logging in to a fresh install via recovery mode, unmounting the home filesystem and rebooting but can't be sure other interactions don't play a part.

I can't help but think there should be a tag/flag you can affix to certain datasets that marks them ineligible for destruction except via a very manual CLI "zfs destroy -f NAME".
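A minimal sketch of what such a protection scheme could look like, using a ZFS user property. The property name com.example:protect is made up for illustration; ZFS itself does not enforce it, so cleanup tooling would have to check it before destroying anything:

```shell
# Hypothetical: mark a dataset as protected via a custom user property.
zfs set com.example:protect=on rpool/USERDATA/m_enc

# A cleanup script could then refuse to touch protected datasets:
if [ "$(zfs get -H -o value com.example:protect rpool/USERDATA/m_enc)" = "on" ]; then
    echo "refusing to destroy protected dataset rpool/USERDATA/m_enc"
fi
```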

almereyda commented 2 years ago

While you propose that involuntary destruction of datasets should be opt-out, I would rather vote for making voluntary destruction of datasets opt-in. This means we would only allow destroying datasets that have a certain property set, and not the other way round.

But as of #213, I fear we should rather consider ZSys abandoned, with ZFS support leaving experimental status on Ubuntu nowadays. I am unable to figure out how this can go well together, but the world is contradictory at times.

darkbasic commented 1 year ago

While I understand zsys is no longer maintained, this issue is scaring the crap out of a lot of people and puts the project in a bad light. I myself have moved to zrepl, despite always tagging every dataset. If Ubuntu is not funding this, maybe you could try a crowdfunding platform instead (GitHub Sponsors, Patreon, GoFundMe, whatever)? Development would be slow (I don't see too many people interested, albeit there are surely some), but at least it wouldn't be completely unmaintained with huge bugs eating your data.

runejuhl commented 1 year ago

I have to chime in here, even if it's a bit +1-ish.

Though there hasn't been a formal announcement from Canonical about the status of zsys it seems that the project is dead. That's life for software -- sometimes it lives on, sometimes it dies, and sometimes it gets resurrected by someone with an itch and continues in a new incarnation.

What's problematic here is that this is (was?) a supported installation method, and because of this bug there's a very real risk of data loss -- just ask @jdavidberger. Because zsys is included in official releases it'll continue to affect users until this is properly fixed. I just had a look at the installer for Ubuntu 22.04, and there's no mention of an installation with ZFS being any less supported than a regular installation:

(two screenshots of the Ubuntu 22.04 installer's ZFS option)

Even if Canonical sees no future in zsys, having such a bug around in Ubuntu reflects extremely poorly on Canonical and Ubuntu, and I hope you can be convinced to fix this issue before putting zsys in the grave for good.

almereyda commented 1 year ago

AFAIK Ubuntu 22.04 will install ZFS and set up the datasets through their Ubiquity installer, but it won't install zsys anymore, which is probably good. 🤣

Please see this line, where the installation of zsys is commented out (permalink to latest current LTS ref):

darkbasic commented 1 year ago

but it won't install zsys anymore, which is probably good. rofl

I wouldn't rejoice over zsys being dropped. It's buggy, of course, but it's a really nice and convenient piece of software. In its current state I wouldn't risk using it without replication to another machine, but I'm currently evaluating letting zsys handle snapshotting of BOOT, ROOT and USERDATA, while zrepl snapshots everything else and handles replication. Basically, zrepl creates bookmarks on top of zsys snapshots and replicates them to another machine, and it also manages snapshotting itself for all the other datasets. That way I would still be able to conveniently revert from grub using zsys.

The only problem is that zsys is more broken than I suspected, and I cannot even revert without breaking the system: https://github.com/ubuntu/zsys/issues/236. I would love to write a detailed guide on how to use zsys in conjunction with zrepl replication, but I'm already in the process of upgrading several servers, so either I find a solution to the broken reverts in the short term, or I abandon zsys forever :( If you have any idea why reverting breaks the system, I'm all ears.

a0c commented 1 year ago

This might be helpful in case of issues with backups.