ubuntu / zsys

ZSys daemon and client for zfs systems
GNU General Public License v3.0

Zsys stopped taking snapshots after upgrade to 20.10 #172

Closed: odror closed this issue 3 years ago

odror commented 4 years ago

systemctl status zsysd shows a running process with no issues.

Also, when installing new packages, I get the following error:

ERROR Service took too long to respond. Disconnecting client.

When restarting zsysd, I am able to install new packages with no error and zsys works for a while.

When the computer is restarted, the issue comes back.

The issue can be reproduced by installing any package, for example:

sudo apt install --reinstall xterm

I also found out that when I run sudo /sbin/zsysctl state save

I get the following error: ERROR couldn't save state for user "root": user "root" doesn't exist

I then found out that the root pool was not mounted. I mounted it manually, but I still get the error. Because of that, zsys has not saved states since the upgrade to 20.10. I have another machine with 20.10 that does not have this issue.
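
For reference, checking the mount status looked roughly like this (a sketch with stock zfs commands; the dataset names are placeholders, not the exact ones on this machine):

# Check whether the system and root-user datasets are actually mounted
# (placeholder names; substitute the datasets reported by `zfs list` on your pool)
zfs get -o name,property,value mounted,mountpoint rpool/ROOT/ubuntu_XXXXXX rpool/USERDATA/root_XXXXXX

# Mount anything that reports mounted=no, then retry the state save
sudo zfs mount rpool/USERDATA/root_XXXXXX
sudo /sbin/zsysctl state save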

didrocks commented 4 years ago

Hey,

The usual cause of this is having too many datasets. Please report a bug as described in the template so that we have the information to confirm/debug it.

dominicdoty commented 4 years ago

I'll throw my hat in the ring, I'm experiencing the same bug.

Describe the bug Auto snapshot on apt actions fails with ERROR Service took too long to respond. Disconnecting client.
zsysctl service status fails with ERROR couldn't connect to zsys daemon: timed out waiting for server handshake.
I also attempted zsysctl service reload, which did not affect the above errors.
systemctl status zsysd reports a healthy service.
Doing a systemctl restart zsysd fixes it.

To Reproduce Unclear, except that it showed up after the upgrade to Ubuntu 20.10

Expected behavior Snapshots should work.

For ubuntu users, please run and copy the following: running ubuntu-bug zsys --save=/tmp/report reports No pending crash reports. Try --help for more information.

Screenshots N/A

Installed versions:

Additional context df -h output:

Filesystem                                        Size  Used Avail Use% Mounted on
tmpfs                                             7.1G  7.3M  7.1G   1% /run
rpool/ROOT/ubuntu_ke713d                           24G   13G   11G  55% /
rpool/ROOT/ubuntu_ke713d/home                      13G  2.3G   11G  18% /home
rpool/ROOT/ubuntu_ke713d/home/ddoty                18G  6.8G   11G  39% /home/ddoty
tmpfs                                              36G   20K   36G   1% /dev/shm
tmpfs                                             5.0M     0  5.0M   0% /run/lock
tmpfs                                             4.0M     0  4.0M   0% /sys/fs/cgroup
tmpfs                                              36G  4.0K   36G   1% /tmp
bpool/BOOT/ubuntu_ke713d                          932M  195M  738M  21% /boot
rpool/USERDATA/root_ke713d                         11G  4.9M   11G   1% /root
rpool/ROOT/ubuntu_ke713d/srv                       11G   53M   11G   1% /srv
rpool/ROOT/ubuntu_ke713d/var/games                 11G  128K   11G   1% /var/games
rpool/ROOT/ubuntu_ke713d/var/lib                   16G  5.4G   11G  34% /var/lib
rpool/ROOT/ubuntu_ke713d/usr/local                 11G   16M   11G   1% /usr/local
rpool/ROOT/ubuntu_ke713d/var/spool                 11G  2.5M   11G   1% /var/spool
rpool/ROOT/ubuntu_ke713d/var/snap                  11G  128K   11G   1% /var/snap
rpool/ROOT/ubuntu_ke713d/var/log                   12G  762M   11G   7% /var/log
rpool/ROOT/ubuntu_ke713d/var/www                   11G  128K   11G   1% /var/www
rpool/ROOT/ubuntu_ke713d/var/mail                  11G  640K   11G   1% /var/mail
/dev/sda1                                         505M   17M  488M   4% /boot/efi
rpool/ROOT/ubuntu_ke713d/var/lib/AccountsService   11G  128K   11G   1% /var/lib/AccountsService
rpool/ROOT/ubuntu_ke713d/var/lib/NetworkManager    11G  128K   11G   1% /var/lib/NetworkManager
/dev/zd0p1                                         30G  6.7G   22G  24% /var/lib/docker
rpool/ROOT/ubuntu_ke713d/var/lib/apt               11G   44M   11G   1% /var/lib/apt
rpool/ROOT/ubuntu_ke713d/var/lib/dpkg              11G   37M   11G   1% /var/lib/dpkg
zfs_pool                                          5.1T  4.6T  502G  91% /zfs_pool
tmpfs                                             7.1G     0  7.1G   0% /run/user/1000
didrocks commented 4 years ago

Hey @dominicdoty, please look at my previous comment asking to follow the bug template, collecting and reporting all dataset information; without that, it’s hard to debug the issue. I see that ubuntu-bug zsys --save=/tmp/report doesn’t collect the data for you. Have you changed anything from the default installation? You should have a bunch of packages like apport, apport-gtk and so on installed.

As I said, most of the time the number of datasets is too high because they were created by another tool. Please check your number of ZFS datasets.
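
Something like this gives a rough count (a sketch using stock zfs commands; adjust the @autozsys pattern if your snapshots are named differently):

# Count filesystem datasets and snapshots across all pools
zfs list -H -t filesystem -o name | wc -l
zfs list -H -t snapshot -o name | wc -l

# Split the snapshot count into zsys-created vs. everything else
zfs list -H -t snapshot -o name | grep -c '@autozsys'
zfs list -H -t snapshot -o name | grep -vc '@autozsys'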

Btw, it seems your system isn’t a ZSys one; you have:

rpool/ROOT/ubuntu_ke713d/home        13G  2.3G   11G  18% /home
rpool/ROOT/ubuntu_ke713d/home/ddoty  18G  6.8G   11G  39% /home/ddoty

which defeats the purpose of separating user data and system data (on a ZSys layout, home directories live under rpool/USERDATA).

dominicdoty commented 4 years ago

As far as I can tell this is following the repo's bug template that comes up when you create a new issue, from here. I added everything requested in the template so I'm not sure what else you were looking for.

Yes there are some irregularities in my install because I migrated a running system over to zfs.

When you say there are too many datasets do you mean too many snapshots? My system does have a ton of snapshots from the default snapshot policy. I was under the impression it automatically cleaned out old ones that weren't in use but maybe that isn't the case.

didrocks commented 4 years ago

The issue doesn’t have an ubuntu-bug output (due to the error you are seeing), which is what is most useful for us to debug the issue. I think something has diverged from a regular installation to lead us to this; if you can get the ubuntu-bug output fixed and attach it here, that would be nice!

On the number of datasets (snapshots are datasets): I suspect you have a lot of them, and this is what can make the daemon time out (due to the go-libzfs binding being slow), but it would be great to confirm that. Can you list them all (snapshots and filesystem datasets)?

The system will indeed purge automatically taken snapshots.
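
For the record, listing both kinds looks roughly like this (stock zfs commands; sorting by creation just makes the newest snapshots easy to spot):

# All filesystem datasets
zfs list -t filesystem -o name,used,mountpoint

# All snapshots, oldest first
zfs list -t snapshot -o name,creation,used -s creation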

dominicdoty commented 4 years ago

I rebooted my system; unclear why, but now ubuntu-bug worked. Attached: report.txt

didrocks commented 4 years ago

Yeah, your issue is coming from the number of datasets you have. Most of them are not created by ZSys but by a snapshot tool that you have (or a script you wrote?). You have 4300+ snapshots under the name "zfs-auto-snap-*", and you hit the go-libzfs limit in terms of performance.

As I think you don’t really need them, since ZSys is performing the same kind of work, I suggest that you delete them all (and uninstall/remove the tool that creates them). Then ZSys should be performing correctly again. Please keep us posted if that worked out for you :)
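
Roughly, the cleanup could look like this (a sketch; destructive once the dry-run flag is dropped, so double-check that the pattern only matches the zfs-auto-snap snapshots first):

# Preview which snapshots would be destroyed (-n = dry run, -v = verbose)
zfs list -H -t snapshot -o name | grep '@zfs-auto-snap' | while read -r snap; do
    sudo zfs destroy -nv "$snap"
done

# When the list looks right, drop -n to actually destroy them, and remove the
# tool that creates them (zfs-auto-snapshot, if that is what is installed):
sudo apt remove zfs-auto-snapshot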

odror commented 4 years ago

I have reverted my system back to 20.04. It was working fine for a while, then the same issue came back. You indicated that I have too many datasets; can you be more specific? Now "systemctl restart zsysd" does not fix the issue.

zfs list -r rpool:

NAME                                                  USED  AVAIL     REFER  MOUNTPOINT
rpool                                                 897G   901G       96K  /
rpool/ROOT                                            764G   901G       96K  none
rpool/ROOT/oz_ubuntu_yfogsz                          7.31G   901G     6.33G  /ubuntu_yfogsz
rpool/ROOT/oz_ubuntu_yfogsz/srv                        96K   901G       96K  /ubuntu_yfogsz/srv
rpool/ROOT/oz_ubuntu_yfogsz/usr                       256K   901G       96K  /ubuntu_yfogsz/usr
rpool/ROOT/oz_ubuntu_yfogsz/usr/local                 160K   901G      160K  /ubuntu_yfogsz/usr/local
rpool/ROOT/oz_ubuntu_yfogsz/var                      1007M   901G       96K  /ubuntu_yfogsz/var
rpool/ROOT/oz_ubuntu_yfogsz/var/games                  96K   901G       96K  /ubuntu_yfogsz/var/games
rpool/ROOT/oz_ubuntu_yfogsz/var/lib                   988M   901G      734M  /ubuntu_yfogsz/var/lib
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/AccountsService   120K   901G      104K  /ubuntu_yfogsz/var/lib/AccountsService
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/NetworkManager    160K   901G      144K  /ubuntu_yfogsz/var/lib/NetworkManager
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/apt               196M   901G      196M  /ubuntu_yfogsz/var/lib/apt
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/dpkg             57.9M   901G     57.9M  /ubuntu_yfogsz/var/lib/dpkg
rpool/ROOT/oz_ubuntu_yfogsz/var/log                  18.4M   901G     18.4M  /ubuntu_yfogsz/var/log
rpool/ROOT/oz_ubuntu_yfogsz/var/mail                   96K   901G       96K  /ubuntu_yfogsz/var/mail
rpool/ROOT/oz_ubuntu_yfogsz/var/snap                  152K   901G      152K  /ubuntu_yfogsz/var/snap
rpool/ROOT/oz_ubuntu_yfogsz/var/spool                 136K   901G      120K  /ubuntu_yfogsz/var/spool
rpool/ROOT/oz_ubuntu_yfogsz/var/www                    96K   901G       96K  /ubuntu_yfogsz/var/www
rpool/ROOT/ubuntu_mj0kfs                              757G   901G     5.88G  /
rpool/ROOT/ubuntu_mj0kfs/srv                          320K   901G       96K  /srv
rpool/ROOT/ubuntu_mj0kfs/usr                          533M   901G       96K  none
rpool/ROOT/ubuntu_mj0kfs/usr/local                    533M   901G      506M  /usr/local
rpool/ROOT/ubuntu_mj0kfs/var                          734G   901G     47.8M  none
rpool/ROOT/ubuntu_mj0kfs/var/games                    320K   901G       96K  /var/games
rpool/ROOT/ubuntu_mj0kfs/var/lib                      531G   901G      279G  /var/lib
rpool/ROOT/ubuntu_mj0kfs/var/lib/AccountsService      584K   901G      104K  /var/lib/AccountsService
rpool/ROOT/ubuntu_mj0kfs/var/lib/NetworkManager      8.03M   901G      184K  /var/lib/NetworkManager
rpool/ROOT/ubuntu_mj0kfs/var/lib/apt                 3.37G   901G      494M  /var/lib/apt
rpool/ROOT/ubuntu_mj0kfs/var/lib/dpkg                 631M   901G     54.3M  /var/lib/dpkg
rpool/ROOT/ubuntu_mj0kfs/var/log                     10.5G   901G     1.06G  /var/log
rpool/ROOT/ubuntu_mj0kfs/var/mail                     320K   901G       96K  /var/mail
rpool/ROOT/ubuntu_mj0kfs/var/snap                     181G   901G      160G  /var/snap
rpool/ROOT/ubuntu_mj0kfs/var/snap_oz_yjgrj5          11.3G   901G     11.3G  none
rpool/ROOT/ubuntu_mj0kfs/var/spool                   5.08M   901G      160K  /var/spool
rpool/ROOT/ubuntu_mj0kfs/var/www                       96K   901G       96K  /var/www
rpool/ROOT/ubuntu_mj0kfs/var/www_autozsys_ly64kj       96K   901G       96K  none
rpool/USERDATA                                        133G   901G       96K  none
rpool/USERDATA/dror_z9i12l                            119M   901G      119M  /home/dror_z9i12l
rpool/USERDATA/dror_zo0whd                            133G   901G      131G  /home/dror
rpool/USERDATA/root_z9i12l                            220K   901G      220K  /root

I have 1168 old zsys snapshots and a total of 1612 snapshots. Are these too many? All of them are from the time when the system used to work; I have no new ones. 1256 of the snapshots belong to rpool, only 9 to bpool.

dominicdoty commented 4 years ago

Ah, looks like I somehow have zfs-auto-snapshot installed alongside zsys. I don't know how, as I don't remember doing this. Thanks for your help figuring it out though!

This might be helpful for someone else someday: my little bash one-liner to parse the ubuntu-bug report, which produces a list of datasets and the number of each snapshot type present for each dataset.

for dataset in $(grep -oP "Name\": \"\K[^@]+(?=@)" report.txt | sort | uniq); do echo "$dataset"; grep -oP "$dataset@zfs-auto-snap.\K[a-z]+" report.txt | sort | uniq -c; done

e.g.

rpool/USERDATA/root_ke713d
    310 daily
     40 frequent
    240 hourly
     20 monthly
     70 weekly

didrocks commented 4 years ago

@dominicdoty: YW, glad that fixed it :+1:

@odror: you didn’t report the bug as per the template (see my first request) and thus I don’t have the list of all datasets and snapshots. My bet is that you have a lot of non-ZSys snapshots or datasets; we have seen many external ZFS tools installed alongside and interfering with the system by creating lots of clones and snapshots. The real bug, then, is the go-libzfs performance one, which impacts ZSys and which we should indeed fix.

odror commented 4 years ago

Attached full bug report

odror commented 4 years ago

Attached full bug report

To reproduce

zsysctl save -vv

DEBUG /zsys.Zsys/SaveUserState() call logged as [e89363bb:d7eadc45] 
DEBUG Check if grpc request peer is authorized     
DEBUG Authorized as being administrator            
INFO Requesting to save state for user "root"     
ERROR couldn't save state for user "root": Current machine isn't Zsys, nothing to create 

This was possibly related to a prior installation of syncoid, which is by now disabled. Also, all the syncoid snapshots have been removed.

Expected behavior See above

For ubuntu users, please run and copy the following:

  1. ubuntu-bug zsys --save=/tmp/report

  2. Copy paste below /tmp/report content: report.txt

Screenshots Not applicable

Installed versions:

Additional context Available upon request

didrocks commented 4 years ago

@odror: Thanks for the logs! You have a similar situation to dominicdoty's: a lot of datasets (outside ZSys), which hits the limits of go-libzfs in terms of performance.

Those are all under the lxd/ pool, which also seems to have some autosnapshotting (a separate tool?) going on under it. They total more than 800 datasets. If you can, I would suggest destroying the lxd pool, if you don’t care about your lxd containers, until we have a fix for the performance issue.

Keep me posted!

odror commented 4 years ago

I have cut the lxd datasets down from 800 to 176. I still have the same problem. See attachment report1.txt

Also, out of the 1292 snapshots on my system, 1168 were generated by autozsys:

# zfs list -H -t snapshot -o name,creation,used -S creation  | wc
1292
# zfs list -H -t snapshot -o name,creation,used -S creation  | grep @autozsys | wc
1168
didrocks commented 4 years ago

Sorry, I got the issues mixed up: you don’t have a timeout, but your machine is reported as not being a ZSys one. Are you sure the whole migration was complete? This is coming from a missing marker on rpool/ROOT/oz_ubuntu_yfogsz. If you run zfs get all rpool/ROOT/oz_ubuntu_yfogsz, I’m pretty sure you are missing the com.ubuntu.zsys:bootfs marker. Please try to reset it. As you have already done a manual migration, I think you know these already, but just in case, you will find the set of user properties under "ZFS user properties on system datasets" at https://didrocks.fr/2020/06/19/zfs-focus-on-ubuntu-20.04-lts-zsys-properties-on-zfs-datasets/.

Then ensure that zsysd is stopped and restart it.
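
Concretely, checking and resetting that marker would look something like this (a sketch; the dataset name is the one discussed above, and com.ubuntu.zsys:bootfs=yes marks the dataset ZSys should treat as a bootable root, per the blog post):

# Inspect the zsys markers on the system dataset
zfs get -o property,value com.ubuntu.zsys:bootfs,com.ubuntu.zsys:last-used rpool/ROOT/oz_ubuntu_yfogsz

# If the bootfs marker is missing or wrong, set it on the dataset that is actually booted
sudo zfs set com.ubuntu.zsys:bootfs=yes rpool/ROOT/oz_ubuntu_yfogsz

# Then restart the daemon so it re-reads the layout
sudo systemctl restart zsysd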

odror commented 4 years ago

Initially I did the migration manually, but it did not create a GPT partition and it was also non-UEFI. Then, to save time, I decided to do a fresh install of Ubuntu 20.04. The root dataset was rpool/ROOT/ubuntu_yfogsz. Once I had a working system, I renamed the root dataset to rpool/ROOT/oz_ubuntu_yfogsz, mounted on /ubuntu_yfogsz. Then I put my restored Ubuntu installation (from backup) in place as the root dataset rpool/ROOT/ubuntu_mj0kfs, mounted on /. So actually I need to set the marker on rpool/ROOT/ubuntu_mj0kfs. In bpool I just cloned bpool/BOOT/ubuntu_mj0kfs to bpool/BOOT/ubuntu_yfogsz.

zfs list -r rpool

NAME                                                  USED  AVAIL     REFER  MOUNTPOINT
rpool                                                 692G  1.08T       96K  /
rpool/ROOT                                            560G  1.08T       96K  none
rpool/ROOT/oz_ubuntu_yfogsz                          7.31G  1.08T     6.33G  /ubuntu_yfogsz
rpool/ROOT/oz_ubuntu_yfogsz/srv                        96K  1.08T       96K  /ubuntu_yfogsz/srv
rpool/ROOT/oz_ubuntu_yfogsz/usr                       256K  1.08T       96K  /ubuntu_yfogsz/usr
rpool/ROOT/oz_ubuntu_yfogsz/usr/local                 160K  1.08T      160K  /ubuntu_yfogsz/usr/local
rpool/ROOT/oz_ubuntu_yfogsz/var                      1007M  1.08T       96K  /ubuntu_yfogsz/var
rpool/ROOT/oz_ubuntu_yfogsz/var/games                  96K  1.08T       96K  /ubuntu_yfogsz/var/games
rpool/ROOT/oz_ubuntu_yfogsz/var/lib                   988M  1.08T      734M  /ubuntu_yfogsz/var/lib
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/AccountsService   112K  1.08T      104K  /ubuntu_yfogsz/var/lib/AccountsService
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/NetworkManager    152K  1.08T      144K  /ubuntu_yfogsz/var/lib/NetworkManager
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/apt               196M  1.08T      196M  /ubuntu_yfogsz/var/lib/apt
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/dpkg             57.9M  1.08T     57.9M  /ubuntu_yfogsz/var/lib/dpkg
rpool/ROOT/oz_ubuntu_yfogsz/var/log                  18.4M  1.08T     18.4M  /ubuntu_yfogsz/var/log
rpool/ROOT/oz_ubuntu_yfogsz/var/mail                   96K  1.08T       96K  /ubuntu_yfogsz/var/mail
rpool/ROOT/oz_ubuntu_yfogsz/var/snap                  152K  1.08T      152K  /ubuntu_yfogsz/var/snap
rpool/ROOT/oz_ubuntu_yfogsz/var/spool                 128K  1.08T      120K  /ubuntu_yfogsz/var/spool
rpool/ROOT/oz_ubuntu_yfogsz/var/www                    96K  1.08T       96K  /ubuntu_yfogsz/var/www
rpool/ROOT/ubuntu_mj0kfs                              553G  1.08T     5.43G  /
rpool/ROOT/ubuntu_mj0kfs/srv                          152K  1.08T       96K  /srv
rpool/ROOT/ubuntu_mj0kfs/usr                          508M  1.08T       96K  none
rpool/ROOT/ubuntu_mj0kfs/usr/local                    508M  1.08T      506M  /usr/local
rpool/ROOT/ubuntu_mj0kfs/var                          541G  1.08T     47.8M  none
rpool/ROOT/ubuntu_mj0kfs/var/games                    152K  1.08T       96K  /var/games
rpool/ROOT/ubuntu_mj0kfs/var/lib                      328G  1.08T      279G  /var/lib
rpool/ROOT/ubuntu_mj0kfs/var/lib/AccountsService      368K  1.08T      104K  /var/lib/AccountsService
rpool/ROOT/ubuntu_mj0kfs/var/lib/NetworkManager      1.41M  1.08T      200K  /var/lib/NetworkManager
rpool/ROOT/ubuntu_mj0kfs/var/lib/apt                 1.35G  1.08T      495M  /var/lib/apt
rpool/ROOT/ubuntu_mj0kfs/var/lib/dpkg                 147M  1.08T     51.4M  /var/lib/dpkg
rpool/ROOT/ubuntu_mj0kfs/var/log                     1.36G  1.08T     1.06G  /var/log
rpool/ROOT/ubuntu_mj0kfs/var/mail                     152K  1.08T       96K  /var/mail
rpool/ROOT/ubuntu_mj0kfs/var/snap                     200G  1.08T      159G  /var/snap
rpool/ROOT/ubuntu_mj0kfs/var/snap_oz_yjgrj5          11.3G  1.08T     11.3G  none
rpool/ROOT/ubuntu_mj0kfs/var/spool                    868K  1.08T      160K  /var/spool
rpool/ROOT/ubuntu_mj0kfs/var/www                       96K  1.08T       96K  /var/www
rpool/ROOT/ubuntu_mj0kfs/var/www_autozsys_ly64kj       96K  1.08T       96K  none
rpool/USERDATA                                        132G  1.08T       96K  none
rpool/USERDATA/dror_z9i12l                            119M  1.08T      119M  /home/dror_z9i12l
rpool/USERDATA/dror_zo0whd                            132G  1.08T      131G  /home/dror
rpool/USERDATA/root_z9i12l                            372K  1.08T      372K  /root
didrocks commented 4 years ago

Ok, and I imagine things stopped working when you did that migration and those clones; it’s starting to make sense :) The issue is with this manual migration.

Please read the blog post I pointed to above and tag both the system and user datasets appropriately.

odror commented 4 years ago

Thank you. Problem solved. I fixed the properties of rpool and its subtree exactly as stated in the installation guide "Ubuntu 20.04 Root on ZFS".

The problem started when I upgraded from 20.04 to 20.10. The issue was some kind of conflict with syncoid. That issue still needs to be resolved.
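
For anyone following along, the zsys-related properties from that guide boil down to something like the following (a sketch using the dataset names from the listing above; see the guide and the blog post for the authoritative list):

# System dataset: mark the active root as the zsys bootable dataset
sudo zfs set com.ubuntu.zsys:bootfs=yes rpool/ROOT/ubuntu_mj0kfs
sudo zfs set com.ubuntu.zsys:last-used=$(date +%s) rpool/ROOT/ubuntu_mj0kfs

# Child system datasets are not boot filesystems themselves
sudo zfs set com.ubuntu.zsys:bootfs=no rpool/ROOT/ubuntu_mj0kfs/srv
sudo zfs set com.ubuntu.zsys:bootfs=no rpool/ROOT/ubuntu_mj0kfs/usr
sudo zfs set com.ubuntu.zsys:bootfs=no rpool/ROOT/ubuntu_mj0kfs/var

# User datasets: tie them to the system dataset they belong to
sudo zfs set com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_mj0kfs rpool/USERDATA/dror_zo0whd
sudo zfs set com.ubuntu.zsys:bootfs-datasets=rpool/ROOT/ubuntu_mj0kfs rpool/USERDATA/root_z9i12l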

didrocks commented 3 years ago

Great to hear! Ok, so it was a setup issue. I will let you explore the syncoid issue. The performance issue is a separate one that I will keep open, but I am closing this one to avoid conflating them.

andylize commented 3 years ago

@dominicdoty, I made the same mistake on my system. Once I uninstalled zfs-auto-snapshot and deleted all of the auto snapshots, the zsys services started working again.

I deleted all of the auto-snaps with the command:

for dataset in $(zfs list -t snapshot -o name | grep zfs-auto-snap); do zfs destroy "$dataset"; done