ubuntu / zsys

ZSys daemon and client for zfs systems
GNU General Public License v3.0

USERDATA datasets removed #218

Open struthio opened 2 years ago

struthio commented 2 years ago

Describe the bug
I have a RAIDZ1 setup created by following this guide: https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2020.04%20Root%20on%20ZFS.html

ZSys decided to delete all datasets from USERDATA, losing all the data.

zpool history -i | grep USERDATA
(...)
2021-10-21.20:24:14 [txg:1232311] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_ugsife (1483) rpool/USERDATA/struthio_wxxb0z@autozsys_ugsife
2021-10-21.20:24:15 [txg:1232320] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_moou2o (1293) rpool/USERDATA/struthio_wxxb0z@autozsys_moou2o
2021-10-21.20:24:16 [txg:1232453] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_dxz0dd (1421) rpool/USERDATA/struthio_wxxb0z@autozsys_dxz0dd
2021-10-21.20:24:17 [txg:1232462] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_fs1km6 (1541) rpool/USERDATA/struthio_wxxb0z@autozsys_fs1km6
2021-10-21.20:24:18 [txg:1232463] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_xo9itu (5424) rpool/USERDATA/struthio_wxxb0z@autozsys_xo9itu
2021-10-21.20:24:18 [txg:1232471] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_lsdu5q (1085) rpool/USERDATA/struthio_wxxb0z@autozsys_lsdu5q
2021-10-21.20:24:19 [txg:1232472] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_3ansly (938) rpool/USERDATA/struthio_wxxb0z@autozsys_3ansly
2021-10-21.20:24:20 [txg:1232473] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_7f2s0s (1651) rpool/USERDATA/struthio_wxxb0z@autozsys_7f2s0s
2021-10-21.20:24:21 [txg:1232474] destroy rpool/USERDATA/struthio_wxxb0z@autozsys_7m0nok (5495) rpool/USERDATA/struthio_wxxb0z@autozsys_7m0nok
2021-11-16.19:39:38 [txg:1275178] destroy rpool/USERDATA/struthio_wxxb0z (1688) (bptree, mintxg=1)

cat /var/log/syslog | grep -i zsys
Nov 15 18:29:12 titan systemd[1]: Starting ZSYS daemon service...
Nov 15 18:29:13 titan systemd[1]: Started ZSYS daemon service.
Nov 15 18:29:13 titan zsysctl[10855]: level=error msg="couldn't save state for user \"struthio\": user \"struthio\" doesn't exist"
Nov 15 18:29:13 titan systemd[9016]: zsys-user-savestate.service: Main process exited, code=exited, status=1/FAILURE
Nov 15 18:29:13 titan systemd[9016]: zsys-user-savestate.service: Failed with result 'exit-code'.
Nov 15 18:30:13 titan systemd[1]: zsysd.service: Deactivated successfully.
Nov 15 18:35:05 titan systemd[1]: Starting ZSYS daemon service...
Nov 15 18:35:06 titan systemd[1]: Started ZSYS daemon service.
Nov 15 18:35:07 titan zsysd[12945]: level=warning msg="[[0795e4d6:9ec78de4]] Couldn't destroy user dataset rpool/USERDATA/struthio_wxxb0z (due to rpool/USERDATA/struthio_wxxb0z): couldn't destroy \"rpool/USERDATA/struthio_wxxb0z\" and its children: cannot destroy dataset \"rpool/USERDATA/struthio_wxxb0z\": dataset is busy"
Nov 15 18:35:07 titan zsysctl[12939]: WARNING Couldn't destroy user dataset rpool/USERDATA/struthio_wxxb0z (due to rpool/USERDATA/struthio_wxxb0z): couldn't destroy "rpool/USERDATA/struthio_wxxb0z" and its children: cannot destroy dataset "rpool/USERDATA/struthio_wxxb0z": dataset is busy
Nov 15 18:35:08 titan systemd[1]: zsys-gc.service: Deactivated successfully.

If I read this correctly, zsys was already trying to remove the dataset yesterday (but failed since I was working on it), so it deleted it today when I booted the PC for the first time.

To Reproduce
Not sure how to reproduce, since I didn't do anything special today.

Expected behavior
Not deleting user datasets.

taegge commented 2 years ago

I discovered a similar problem, but luckily I caught it before zsys nuked my data. Check the result of:

zfs get com.ubuntu.zsys:bootfs-datasets

I bet it contains a typo or some similar goofiness. More useful info in https://github.com/ubuntu/zsys/issues/81

I ended up just clearing the property, and I think I'm safe now. I may go back and fix this properly, but I need to see if I can rebuild some trust with zsys first. Auto-deleting datasets is not cool. Why not spit out a warning that you have orphaned datasets, and let the administrator confirm to delete them with something like a 'zsysctl purge' command?
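
For anyone checking the same thing, a minimal sketch of that inspection and of clearing the property (the dataset name in the last command is a placeholder for your own, and "clearing" here is assumed to mean dropping the locally set value):

# Show which root dataset each user dataset is tied to (locally set values only):
zfs get -r -s local com.ubuntu.zsys:bootfs-datasets rpool/USERDATA

# Clear the property so zsys no longer treats the dataset as one of its own:
zfs inherit com.ubuntu.zsys:bootfs-datasets rpool/USERDATA/<user>_<id>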

struthio commented 2 years ago

I think I know the root cause of my issue.

Some time before the 'purge' I had a problem updating the system (I ran apt upgrade and after a reboot the system no longer started correctly), so I decided to restore a previous snapshot from GRUB.

The snapshot was restored (and I was happy about 'how great this worked'), BUT after restoring ROOT from the snapshot, the root dataset received a completely different ID - and that was probably the problem: the home volumes were still linked to the old ROOT (which was no longer used because of the broken update). Because the 'old ROOT' was unused and later cleaned up by zsys, all user datasets became orphaned and were 'cleaned' a few days later.

So I think the 'restore' executed from GRUB worked incorrectly.
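
If that diagnosis is right, re-linking the user datasets to the new root dataset right after a GRUB revert should stop zsys from seeing them as orphans. A rough sketch, with placeholder names (ubuntu_NEWID, <user>_<id>) instead of real ones:

# Which root dataset is actually booted?
findmnt -n -o SOURCE /        # e.g. rpool/ROOT/ubuntu_NEWID

# Re-point the home dataset at that root so it is no longer considered orphaned:
zfs set com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_NEWID rpool/USERDATA/<user>_<id>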

lckarssen commented 2 years ago

Earlier today I lost all snapshots of my user's home dataset (including those not created by zsys). I only found out because my hourly syncoid backups started failing because there were no matching snapshots between source (my PC) and destination (my home server). In the system logs I found messages like this:

apr 12 09:44:35 barabas systemd[10810]: Starting Save current user state periodically...
apr 12 09:44:35 barabas systemd[1]: Starting ZSYS daemon service...
apr 12 09:44:36 barabas systemd[1]: Started ZSYS daemon service.
apr 12 09:44:36 barabas zsysctl[1460577]: level=error msg="couldn't save state for user \"lennart\": user \"lennart\" doesn't exist"
apr 12 09:44:36 barabas systemd[10810]: zsys-user-savestate.service: Main process exited, code=exited, status=1/FAILURE
apr 12 09:44:36 barabas systemd[10810]: zsys-user-savestate.service: Failed with result 'exit-code'.
apr 12 09:44:36 barabas systemd[10810]: Failed to start Save current user state periodically.

Followed a bit later by :open_mouth: (luckily I am logged in and the dataset is mounted):

apr 12 09:52:27 barabas zsysd[1468432]: level=warning msg="[[865a9b21:f683580c]] Couldn't destroy user dataset rpool/USERDATA/lennart_5v645e (due to rpool/USERDATA/lennart_5v645e): couldn't destroy \"rpool/USERDATA/lennart_5v645e\" and its children: cannot destroy dataset \"rpool/USERDATA/lennart_5v645e\": dataset is busy"
apr 12 09:52:27 barabas zsysctl[1468426]: WARNING Couldn't destroy user dataset rpool/USERDATA/lennart_5v645e (due to rpool/USERDATA/lennart_5v645e): couldn't destroy "rpool/USERDATA/lennart_5v645e" and its children: cannot destroy dataset "rpool/USERDATA/lennart_5v645e": dataset is busy
apr 12 09:52:28 barabas systemd[1]: zsys-gc.service: Deactivated successfully.
apr 12 09:52:28 barabas systemd[1]: Finished Clean up old snapshots to free space.

I don't see any related zfs destroy actions in the output of zpool history. I guess that's because zsys isn't calling the command line tools directly, but uses the Go bindings.

The output of zfs get com.ubuntu.zsys:bootfs-datasets shows that the associated root dataset is rpool/ROOT/ubuntu_kzqh42, whereas the actual root dataset is rpool/ROOT/ubuntu_5x6e9l. And, as @struthio remarked, I did indeed recover from a broken upgrade via GRUB several months ago.

To me, automatic removal of snapshots not created by zsys is a no-go, not to mention that it shouldn't even try to destroy any dataset without first prompting the user.

Unfortunately, this is not the first issue I've had with Zsys. For example, on a low-spec system to which I send ZFS snapshots with syncoid, I'm bitten by #204. And for the system I'm talking about here, I'm starting to wonder whether Zsys is a blessing or a disaster: on the one hand, I could recover from a failed upgrade from Ubuntu 21.04 to 21.10; on the other hand, doing so involved a lot of manual work using a rescue image because somehow the cloning process had gone wrong.

For now (see also #213), the balance points to: uninstall Zsys :disappointed:. I think the concept of Zsys is great, but it should work and definitely not destroy a full dataset without asking.

fubar-1 commented 2 years ago

Just to get some clarity on this issue, a few questions from a novice zsys user:

  1. Is just disabling zsys garbage collection a safe solution to work around this problem? I'm willing to manually clean up system states every so often in exchange for data safety.
  2. My critical datasets are separate and not affected by zsys. They're safe, right?

didrocks commented 2 years ago

  1. Yes, disabling the gc service will avoid automated dataset cleanup. You will then need to clean up by hand with zsysctl remove (see the sketch below).
  2. Your critical datasets should be part of the unmanaged datasets. Check that they are listed as such; if so, zsys will never interact with them (no snapshots, nothing).
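
A rough sketch of what disabling the gc could look like in practice (zsys-gc.service appears in the logs above; the zsys-gc.timer unit name and the exact zsysctl subcommand are assumptions, so verify with systemctl list-timers and zsysctl help):

# Stop the periodic garbage collection:
sudo systemctl disable --now zsys-gc.timer
sudo systemctl mask zsys-gc.service

# Remove old states by hand when needed (referred to above as "zsysctl remove"):
zsysctl state remove <state-id>
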
randolf-scholz commented 1 year ago

This just happened to me. USERDATA got destroyed exactly 2 weeks after it was created.

root@ubuntu:/home/ubuntu# zpool history -i | grep USERDATA
2023-03-24.13:19:47 [txg:163] create rpool/USERDATA (274)  
2023-03-24.13:19:47 [txg:164] set rpool/USERDATA (274) canmount=0
2023-03-24.13:19:47 [txg:164] set rpool/USERDATA (274) mountpoint=/
2023-03-24.13:19:47 zfs create rpool/USERDATA -o canmount=off -o mountpoint=/
2023-03-24.13:32:25 [txg:2360] create rpool/USERDATA/rscholz_7r5brd (2444)  
2023-03-24.13:32:25 [txg:2361] set rpool/USERDATA/rscholz_7r5brd (2444) canmount=1
2023-03-24.13:32:25 [txg:2361] set rpool/USERDATA/rscholz_7r5brd (2444) mountpoint=/home/rscholz
2023-03-24.13:32:25 zfs create rpool/USERDATA/rscholz_7r5brd -o canmount=on -o mountpoint=/home/rscholz
2023-03-24.13:32:25 [txg:2362] set rpool/USERDATA/rscholz_7r5brd (2444) com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_vhqea7
2023-03-24.13:32:25 zfs set com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_vhqea7 rpool/USERDATA/rscholz_7r5brd
2023-03-24.13:32:25 [txg:2363] create rpool/USERDATA/root_7r5brd (1001)  
2023-03-24.13:32:25 [txg:2364] set rpool/USERDATA/root_7r5brd (1001) canmount=1
2023-03-24.13:32:25 [txg:2364] set rpool/USERDATA/root_7r5brd (1001) mountpoint=/root
2023-03-24.13:32:25 zfs create rpool/USERDATA/root_7r5brd -o canmount=on -o mountpoint=/root
2023-03-24.13:32:25 [txg:2369] set rpool/USERDATA/root_7r5brd (1001) com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_vhqea7
2023-03-24.13:32:25 zfs set com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_vhqea7 rpool/USERDATA/root_7r5brd
2023-04-05.13:04:13 [txg:224396] destroy rpool/USERDATA/rscholz_7r5brd (2444) (bptree, mintxg=1)
2023-04-05.13:04:15 [txg:224398] destroy rpool/USERDATA/root_7r5brd (1001) (bptree, mintxg=1)

Could it have played a role that I still had a second, older Ubuntu installation on a secondary SSD? Did it confuse the two rpools?

a0c commented 1 year ago

Could it have played a role that I still had a second, older Ubuntu installation on a secondary SSD? Did it confuse the two rpools?

@randolf-scholz It is much worse than just confusing the two rpools. It's about destroying ANY pool visible to Ubuntu.

I've spent a week trying to understand why all my USB backups kept getting destroyed. At first I blamed syncoid, but it turned out it was zsys that had been doing it. I can only conclude that zsys has an undocumented dataset naming convention that wreaks havoc on ZFS users on Ubuntu. @didrocks does mention in his article that USERDATA is somewhat special, but it is never explicitly said that Ubuntu users should NEVER EVER use the word USERDATA anywhere in their dataset structure, or such datasets will be destroyed by zsys.

So it's not the particular rpool/USERDATA structure that makes this dataset and its children special to zsys. It is the presence of the word USERDATA in the dataset name! So you CAN'T have a dataset named USERDATA2 or MYUSERDATA, because zsys uses a %USERDATA%-style pattern to recognize the datasets it manages.

This is really hard to believe, because users kinda make backups, don't they? And they run something like this to back up all the datasets/volumes/snapshots recursively:

syncoid --recursive --no-sync-snap --sendoptions="raw p" --recvoptions=u rpool backup_usb/unity_21_04/rpool

This will obviously give you the following structure on the target backup_usb pool:

backup_usb/unity_21_04/rpool/USERDATA/a0c_blabla
backup_usb/unity_21_04/rpool/USERDATA/a0c_blabla@snap1
backup_usb/unity_21_04/rpool/USERDATA/a0c_blabla@snap2

Now guess what zsys will do with ALL these backups as soon as it finds that the snapshots match its %USERDATA% pattern. Right, it will just destroy them immediately. It's as simple as: connect your USB drive, run zpool import backup_usb - and all the backups are gone. Even the Live USB of the latest Ubuntu LTS (22.04) does this!! Despite zsys being excluded from the installation, the Live USB still has zsys installed and so wreaks havoc on your backup disks as soon as you import their zpools.
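
If you want to see what zsys thinks it manages before leaving an imported pool around, something like the following should help (zsysctl show is the status command; the --full flag and the property sources listed are assumptions that may differ per version):

# List the machines, states and user datasets zsys currently tracks:
zsysctl show --full

# Check whether anything on the freshly imported pool carries the zsys
# association property (backup_usb is the pool name from the example above):
zfs get -r -s local,received com.ubuntu.zsys:bootfs-datasets backup_usb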

In practice this means that instead of just two recursive commands to back up bpool and rpool, you need to back up each zsys-special dataset separately into a target dataset named differently (i.e. without USERDATA in its name):

syncoid --recursive --no-sync-snap --sendoptions="raw p" --recvoptions=u rpool/USERDATA backup_usb/unity_21_04/U

so as to get the following structure, with USERDATA replaced by U - just to prevent zsys from auto-managing it:

backup_usb/unity_21_04/U/a0c_blabla
backup_usb/unity_21_04/U/a0c_blabla@snap1
backup_usb/unity_21_04/U/a0c_blabla@snap2

UPD: syncoid seems to have an --exclude=REGEX option that should allow us to keep enjoying the --recursive switch while having USERDATA/BOOT/ROOT excluded. Otherwise (with --recursive unavailable) each persistent dataset would have to be backed up with a separate command, because we can't allow the --recursive switch to copy the USERDATA dataset, and it would be very easy to forget to back up some dataset (especially new ones).

UPD: --exclude works perfectly. zsys no longer destroys the backups.

# First run to init bpool:
syncoid --recursive --exclude=bpool/BOOT --no-sync-snap --sendoptions="raw p" --recvoptions=u bpool backup_usb/unity_21_04/bpool

# All subsequent runs to sync backups:
syncoid --recursive --no-sync-snap --sendoptions="raw p" --recvoptions=u bpool/BOOT backup_usb/unity_21_04/bpool/B
# sync persistent datasets recursively, skip ROOT/USERDATA to then sync them into R/U datasets to avoid zsys auto-managing them:
syncoid --recursive --exclude=rpool/ROOT --exclude=rpool/USERDATA --no-sync-snap --sendoptions="raw p" --recvoptions=u rpool backup_usb/unity_21_04/rpool
syncoid --recursive --no-sync-snap --sendoptions="raw p" --recvoptions=u rpool/ROOT backup_usb/unity_21_04/rpool/R
syncoid --recursive --no-sync-snap --sendoptions="raw p" --recvoptions=u rpool/USERDATA backup_usb/unity_21_04/rpool/U

UPD: Alternatively, importing zpool as readonly also prevents zsys from interfering with it:

zpool import -o readonly=on -N -R /backup backup_usb

But this won't allow you to sync new snapshots to it - so it's not quite a solution.

a0c commented 1 year ago

@randolf-scholz Here's another related bug in zsys. Ubuntu is configured to take hourly snapshots of the active users' home directories. One day I kept the USB backup disk connected to the laptop for several hours, and during that time zsys created several hourly snapshots on the backup USB and NOT on the laptop. :facepalm: BTW, the backup pool on the USB had been imported with -N, so nothing was mounted; only block-level operations were possible (snapshots, their transfer/removal), no file-level operations.

It was particularly funny because those few snapshots were the only snapshots left on the USB backup disk: the rest of the user homedir snapshots had previously been carefully and silently destroyed by zsys on that same disk. Can it get worse than that? :smile:

JavaScriptDude commented 6 months ago

For the record, I had my USERDATA datasets and snapshots destroyed last week. I was playing around with the vm.nr_hugepages setting in /etc/sysctl.conf, as one does, and I accidentally set too high a value, so my system would no longer boot due to an out-of-memory issue. After I managed to boot a live CD, access my zpool, fix sysctl.conf and reboot, I noticed that the ZFS history showed the USERDATA snapshots and datasets being explicitly deleted, all at the same timestamp, and that timestamp exactly matched the first time I booted the machine with the rogue sysctl setting and the memory crash.

I checked /var/log... and unfortunately those logs were nuked too. If there are other systemd logs available, I may still have the db files to do an extract, but that's outside my wheelhouse.
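
In case it helps later, a minimal sketch of pulling zsys entries out of surviving journal files, assuming the binary journals from the old /var/log/journal were copied somewhere reachable (the mount path below is illustrative):

# Read the copied journal directory and filter for the zsys units:
journalctl --directory=/mnt/oldroot/var/log/journal -u zsysd.service -u zsys-gc.service --no-pager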

This boot-time memory issue may be a usable test case in the future to pin down one possible failure mode of zsys.

I recall something similar happening several years back but it was early in a new install and did not impact any data. I did not make any detailed notes at the time.

My recently retired system had zsys because it was originally built on Ubuntu 20.04 and upgraded to 22.04. I really miss zsys and I'll be glad to do anything to help if this project gets some attention again.

darkbasic commented 6 months ago

In its current state zsys should not be used, period. Such a waste of good software :(