Closed: odror closed this issue 3 years ago
Hey,
The usual cause of this is having too many datasets. Please report a bug as described in the template so that we have the info to confirm/debug it.
I'll throw my hat in the ring, I'm experiencing the same bug.
Describe the bug
Auto snapshot on apt actions fails with `ERROR Service took too long to respond. Disconnecting client`.
`zsysctl service status` fails with `ERROR couldn't connect to zsys daemon: timed out waiting for server handshake`.
Also attempted `zsysctl service reload`, which did not affect the above errors.
`systemctl status zsysd` reports a healthy service.
Doing a `systemctl restart zsysd` fixes it.
To Reproduce Unclear, except that it showed up after the upgrade to Ubuntu 20.10
Expected behavior Snapshots should work.
For ubuntu users, please run and copy the following:
`ubuntu-bug zsys --save=/tmp/report` reports `No pending crash reports. Try --help for more information.`
Screenshots N/A
Installed versions:
Additional context
df -h
output:
Filesystem Size Used Avail Use% Mounted on
tmpfs 7.1G 7.3M 7.1G 1% /run
rpool/ROOT/ubuntu_ke713d 24G 13G 11G 55% /
rpool/ROOT/ubuntu_ke713d/home 13G 2.3G 11G 18% /home
rpool/ROOT/ubuntu_ke713d/home/ddoty 18G 6.8G 11G 39% /home/ddoty
tmpfs 36G 20K 36G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
tmpfs 36G 4.0K 36G 1% /tmp
bpool/BOOT/ubuntu_ke713d 932M 195M 738M 21% /boot
rpool/USERDATA/root_ke713d 11G 4.9M 11G 1% /root
rpool/ROOT/ubuntu_ke713d/srv 11G 53M 11G 1% /srv
rpool/ROOT/ubuntu_ke713d/var/games 11G 128K 11G 1% /var/games
rpool/ROOT/ubuntu_ke713d/var/lib 16G 5.4G 11G 34% /var/lib
rpool/ROOT/ubuntu_ke713d/usr/local 11G 16M 11G 1% /usr/local
rpool/ROOT/ubuntu_ke713d/var/spool 11G 2.5M 11G 1% /var/spool
rpool/ROOT/ubuntu_ke713d/var/snap 11G 128K 11G 1% /var/snap
rpool/ROOT/ubuntu_ke713d/var/log 12G 762M 11G 7% /var/log
rpool/ROOT/ubuntu_ke713d/var/www 11G 128K 11G 1% /var/www
rpool/ROOT/ubuntu_ke713d/var/mail 11G 640K 11G 1% /var/mail
/dev/sda1 505M 17M 488M 4% /boot/efi
rpool/ROOT/ubuntu_ke713d/var/lib/AccountsService 11G 128K 11G 1% /var/lib/AccountsService
rpool/ROOT/ubuntu_ke713d/var/lib/NetworkManager 11G 128K 11G 1% /var/lib/NetworkManager
/dev/zd0p1 30G 6.7G 22G 24% /var/lib/docker
rpool/ROOT/ubuntu_ke713d/var/lib/apt 11G 44M 11G 1% /var/lib/apt
rpool/ROOT/ubuntu_ke713d/var/lib/dpkg 11G 37M 11G 1% /var/lib/dpkg
zfs_pool 5.1T 4.6T 502G 91% /zfs_pool
tmpfs 7.1G 0 7.1G 0% /run/user/1000
Hey @dominicdoty, please look at my previous comment asking you to follow the bug template, collecting and reporting all dataset information; without that, it's hard to debug the issue. I see that `ubuntu-bug zsys --save=/tmp/report` doesn't collect the data for you; have you changed anything from the default installation? You should have a bunch of packages like apport, apport-gtk and so on installed.
As mentioned, most of the time the number of datasets is too high because they were created by another tool. Please check your number of ZFS datasets.
By the way, it seems your system isn't a ZSys one; you have:
rpool/ROOT/ubuntu_ke713d/home 13G 2.3G 11G 18% /home
rpool/ROOT/ubuntu_ke713d/home/ddoty 18G 6.8G 11G 39% /home/ddoty
which defeats the purpose of separating user data and system data.
As far as I can tell this is following the repo's bug template that comes up when you create a new issue, from here. I added everything requested in the template so I'm not sure what else you were looking for.
Yes there are some irregularities in my install because I migrated a running system over to zfs.
When you say there are too many datasets do you mean too many snapshots? My system does have a ton of snapshots from the default snapshot policy. I was under the impression it automatically cleaned out old ones that weren't in use but maybe that isn't the case.
The issue doesn't have an ubuntu-bug output (due to the error you are seeing), which is what is most useful for us to debug the issue. I think something has diverged from a regular installation to lead us to this; if you can get the ubuntu-bug output fixed and attach it here, that would be nice!
On the number of datasets (snapshots are datasets): I suspect you have a lot of them, and this is what can make the daemon time out (due to the go-libzfs binding being slow). It would be great to confirm that; can you list them all (snapshots and filesystem datasets)?
The system will indeed purge automatically taken snapshots.
I rebooted my system; unclear why, but now ubuntu-bug worked. Attached: report.txt
Yeah, your issue is coming from the number of datasets you have. Most of them are not created by ZSys but by a snapshot tool you have installed (or a script you wrote?). You have 4300+ snapshots under the name "zfs-auto-snap-*", and you hit the go-libzfs limit in terms of performance.
As I think you don't really need them (ZSys is performing the same kind of work), I suggest that you delete them all and uninstall/remove the tool that creates them. Then ZSys should perform correctly again. Please keep us posted if that works out for you :)
I have reverted my system back to 20.04. It was working fine for a while, then the same issue came back. You indicated that I have too many datasets; can you be more specific? Now `systemctl restart zsysd` does not fix the issue.
`zfs list -r rpool`:
NAME USED AVAIL REFER MOUNTPOINT
rpool 897G 901G 96K /
rpool/ROOT 764G 901G 96K none
rpool/ROOT/oz_ubuntu_yfogsz 7.31G 901G 6.33G /ubuntu_yfogsz
rpool/ROOT/oz_ubuntu_yfogsz/srv 96K 901G 96K /ubuntu_yfogsz/srv
rpool/ROOT/oz_ubuntu_yfogsz/usr 256K 901G 96K /ubuntu_yfogsz/usr
rpool/ROOT/oz_ubuntu_yfogsz/usr/local 160K 901G 160K /ubuntu_yfogsz/usr/local
rpool/ROOT/oz_ubuntu_yfogsz/var 1007M 901G 96K /ubuntu_yfogsz/var
rpool/ROOT/oz_ubuntu_yfogsz/var/games 96K 901G 96K /ubuntu_yfogsz/var/games
rpool/ROOT/oz_ubuntu_yfogsz/var/lib 988M 901G 734M /ubuntu_yfogsz/var/lib
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/AccountsService 120K 901G 104K /ubuntu_yfogsz/var/lib/AccountsService
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/NetworkManager 160K 901G 144K /ubuntu_yfogsz/var/lib/NetworkManager
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/apt 196M 901G 196M /ubuntu_yfogsz/var/lib/apt
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/dpkg 57.9M 901G 57.9M /ubuntu_yfogsz/var/lib/dpkg
rpool/ROOT/oz_ubuntu_yfogsz/var/log 18.4M 901G 18.4M /ubuntu_yfogsz/var/log
rpool/ROOT/oz_ubuntu_yfogsz/var/mail 96K 901G 96K /ubuntu_yfogsz/var/mail
rpool/ROOT/oz_ubuntu_yfogsz/var/snap 152K 901G 152K /ubuntu_yfogsz/var/snap
rpool/ROOT/oz_ubuntu_yfogsz/var/spool 136K 901G 120K /ubuntu_yfogsz/var/spool
rpool/ROOT/oz_ubuntu_yfogsz/var/www 96K 901G 96K /ubuntu_yfogsz/var/www
rpool/ROOT/ubuntu_mj0kfs 757G 901G 5.88G /
rpool/ROOT/ubuntu_mj0kfs/srv 320K 901G 96K /srv
rpool/ROOT/ubuntu_mj0kfs/usr 533M 901G 96K none
rpool/ROOT/ubuntu_mj0kfs/usr/local 533M 901G 506M /usr/local
rpool/ROOT/ubuntu_mj0kfs/var 734G 901G 47.8M none
rpool/ROOT/ubuntu_mj0kfs/var/games 320K 901G 96K /var/games
rpool/ROOT/ubuntu_mj0kfs/var/lib 531G 901G 279G /var/lib
rpool/ROOT/ubuntu_mj0kfs/var/lib/AccountsService 584K 901G 104K /var/lib/AccountsService
rpool/ROOT/ubuntu_mj0kfs/var/lib/NetworkManager 8.03M 901G 184K /var/lib/NetworkManager
rpool/ROOT/ubuntu_mj0kfs/var/lib/apt 3.37G 901G 494M /var/lib/apt
rpool/ROOT/ubuntu_mj0kfs/var/lib/dpkg 631M 901G 54.3M /var/lib/dpkg
rpool/ROOT/ubuntu_mj0kfs/var/log 10.5G 901G 1.06G /var/log
rpool/ROOT/ubuntu_mj0kfs/var/mail 320K 901G 96K /var/mail
rpool/ROOT/ubuntu_mj0kfs/var/snap 181G 901G 160G /var/snap
rpool/ROOT/ubuntu_mj0kfs/var/snap_oz_yjgrj5 11.3G 901G 11.3G none
rpool/ROOT/ubuntu_mj0kfs/var/spool 5.08M 901G 160K /var/spool
rpool/ROOT/ubuntu_mj0kfs/var/www 96K 901G 96K /var/www
rpool/ROOT/ubuntu_mj0kfs/var/www_autozsys_ly64kj 96K 901G 96K none
rpool/USERDATA 133G 901G 96K none
rpool/USERDATA/dror_z9i12l 119M 901G 119M /home/dror_z9i12l
rpool/USERDATA/dror_zo0whd 133G 901G 131G /home/dror
rpool/USERDATA/root_z9i12l 220K 901G 220K /root
I have 1168 old zsys snapshots and 1612 snapshots in total. Are these too many? All of them are from the time when the system used to work; I have no new ones. 1256 of the snapshots belong to rpool, only 9 to bpool.
Ah, looks like I somehow have `zfs-auto-snapshot` installed alongside `zsys`. I don't know how, as I don't remember doing this. Thanks for your help figuring it out though!
Might be helpful for someone else someday, my little bash work to parse the ubuntu-bug report, which produces a list of datasets and the number of each snapshot type present for that dataset.
```
for dataset in $(grep -oP "Name\": \"\K[^@]+(?=@)" report.txt | sort | uniq); do
    echo "$dataset"
    grep -oP "$dataset@zfs-auto-snap.\K[a-z]+" report.txt | sort | uniq -c
done
```
e.g.
rpool/USERDATA/root_ke713d
310 daily
40 frequent
240 hourly
20 monthly
70 weekly
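A hypothetical companion tally, grouping by the tool prefix after the `@` instead of per dataset, which helps spot which tool created most snapshots. The sample file below is made up and stands in for a real report.txt; the dataset and snapshot names are illustrative only:

```shell
# Hypothetical sample lines standing in for report.txt content:
cat > /tmp/sample_report.txt <<'EOF'
"Name": "rpool/USERDATA/root_ke713d@zfs-auto-snap_hourly-2020-11-01-0017"
"Name": "rpool/USERDATA/root_ke713d@zfs-auto-snap_daily-2020-11-01-0025"
"Name": "rpool/USERDATA/root_ke713d@autozsys_abc123"
EOF

# Tally snapshots by the tool prefix after '@' (e.g. zfs-auto-snap vs autozsys).
# \K keeps only the text after '@'; [a-z-]+ stops at the first '_' or digit.
grep -oP '@\K[a-z-]+' /tmp/sample_report.txt | sort | uniq -c
```

Run against a real report by swapping in its path; a large count next to a non-autozsys prefix points at the interfering tool.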
@dominicdoty: YW, glad that fixed it :+1:
@odror: you didn't report the bug as per the template (see my first request), and thus I don't have the list of all datasets and snapshots. My bet is that you have a lot of non-ZSys snapshots or datasets; we have seen many external ZFS tools installed alongside and interfering with the system by creating a lot of clones and snapshots. The real bug is the go-libzfs performance one, which impacts ZSys and which we should fix, indeed.
Attached full bug report
To reproduce
```
DEBUG /zsys.Zsys/SaveUserState() call logged as [e89363bb:d7eadc45]
DEBUG Check if grpc request peer is authorized
DEBUG Authorized as being administrator
INFO Requesting to save state for user "root"
ERROR couldn't save state for user "root": Current machine isn't Zsys, nothing to create
```
This was possibly related to a prior installation of syncoid, which is by now disabled. Also, all the syncoid snapshots have been removed.
Expected behavior See above
For ubuntu users, please run and copy the following:
ubuntu-bug zsys --save=/tmp/report
Copy paste below /tmp/report content:
report.txt
Screenshots Not applicable
Installed versions:
zsysctl 0.4.8
zsysd 0.4.8
Additional context Available upon request
@odror: Thanks for the logs! You have a situation similar to dominicdoty's: a lot of datasets (outside ZSys) which hit the limits of go-libzfs in terms of performance.
Those are all under the lxd/ pool, which also has some autosnapshotting (a separate tool?) happening under it? They total more than 800 datasets. If you can, I would suggest destroying the lxd pool, if you don't care about your lxd containers, until we have a fix for the performance issue.
Keep me posted!
I have cut the lxd datasets down from 800 to 176. I still have the same problem. See attachment report1.txt
Also, out of the 1292 snapshots in my system, 1168 were generated by autozsys:
# zfs list -H -t snapshot -o name,creation,used -S creation | wc -l
1292
# zfs list -H -t snapshot -o name,creation,used -S creation | grep @autozsys | wc -l
1168
Sorry, I got the issues mixed up: you don't have a timeout, but your machine is reported as not being a ZSys one. Are you sure the whole migration was complete? This is coming from a missing marker on rpool/ROOT/oz_ubuntu_yfogsz.
If you run `zfs get all rpool/ROOT/oz_ubuntu_yfogsz`, I'm pretty sure you are missing the com.ubuntu.zsys:bootfs marker. Please try to reset them. As you have already done a manual migration, I think you know these already, but just in case, you will find the set of user properties at https://didrocks.fr/2020/06/19/zfs-focus-on-ubuntu-20.04-lts-zsys-properties-on-zfs-datasets/ under "ZFS user properties on system datasets".
Ensure then that zsysd is stopped, and restart it.
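The property tagging described above might look like the following sketch. This is an assumption based on the linked blog post rather than commands taken from this thread; the dataset names are from the poster's layout, but verify the property names and which datasets need them against your own system before running anything:

```shell
# Sketch only: mark the system root dataset so ZSys recognizes the machine.
# Property name per the blog post linked above; dataset names are this
# poster's, adapt them to yours.
sudo zfs set com.ubuntu.zsys:bootfs=yes rpool/ROOT/ubuntu_mj0kfs

# Child system datasets carry bootfs=no so only the root is treated as bootfs:
sudo zfs set com.ubuntu.zsys:bootfs=no rpool/ROOT/ubuntu_mj0kfs/var

# Restart the daemon so it re-reads the properties:
sudo systemctl restart zsysd
```

Check the result with `zfs get com.ubuntu.zsys:bootfs rpool/ROOT/ubuntu_mj0kfs` before relying on it.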
Initially I did the migration manually, but it did not create a GPT partition, and it was also non-UEFI. Then, to save time, I decided to do a fresh install of Ubuntu 20.04. The root dataset was rpool/ROOT/ubuntu_yfogsz. Once I had a working system, I renamed the root dataset to rpool/ROOT/oz_ubuntu_yfogsz, mounted on /ubuntu_yfogsz. Then I put my restored Ubuntu installation (from backup) as the root dataset rpool/ROOT/ubuntu_mj0kfs, mounted on /. So actually I need to change the marker on rpool/ROOT/ubuntu_mj0kfs. In bpool I just cloned bpool/BOOT/ubuntu_mj0kfs to bpool/BOOT/ubuntu_yfogsz.
zfs list -r rpool
NAME USED AVAIL REFER MOUNTPOINT
rpool 692G 1.08T 96K /
rpool/ROOT 560G 1.08T 96K none
rpool/ROOT/oz_ubuntu_yfogsz 7.31G 1.08T 6.33G /ubuntu_yfogsz
rpool/ROOT/oz_ubuntu_yfogsz/srv 96K 1.08T 96K /ubuntu_yfogsz/srv
rpool/ROOT/oz_ubuntu_yfogsz/usr 256K 1.08T 96K /ubuntu_yfogsz/usr
rpool/ROOT/oz_ubuntu_yfogsz/usr/local 160K 1.08T 160K /ubuntu_yfogsz/usr/local
rpool/ROOT/oz_ubuntu_yfogsz/var 1007M 1.08T 96K /ubuntu_yfogsz/var
rpool/ROOT/oz_ubuntu_yfogsz/var/games 96K 1.08T 96K /ubuntu_yfogsz/var/games
rpool/ROOT/oz_ubuntu_yfogsz/var/lib 988M 1.08T 734M /ubuntu_yfogsz/var/lib
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/AccountsService 112K 1.08T 104K /ubuntu_yfogsz/var/lib/AccountsService
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/NetworkManager 152K 1.08T 144K /ubuntu_yfogsz/var/lib/NetworkManager
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/apt 196M 1.08T 196M /ubuntu_yfogsz/var/lib/apt
rpool/ROOT/oz_ubuntu_yfogsz/var/lib/dpkg 57.9M 1.08T 57.9M /ubuntu_yfogsz/var/lib/dpkg
rpool/ROOT/oz_ubuntu_yfogsz/var/log 18.4M 1.08T 18.4M /ubuntu_yfogsz/var/log
rpool/ROOT/oz_ubuntu_yfogsz/var/mail 96K 1.08T 96K /ubuntu_yfogsz/var/mail
rpool/ROOT/oz_ubuntu_yfogsz/var/snap 152K 1.08T 152K /ubuntu_yfogsz/var/snap
rpool/ROOT/oz_ubuntu_yfogsz/var/spool 128K 1.08T 120K /ubuntu_yfogsz/var/spool
rpool/ROOT/oz_ubuntu_yfogsz/var/www 96K 1.08T 96K /ubuntu_yfogsz/var/www
rpool/ROOT/ubuntu_mj0kfs 553G 1.08T 5.43G /
rpool/ROOT/ubuntu_mj0kfs/srv 152K 1.08T 96K /srv
rpool/ROOT/ubuntu_mj0kfs/usr 508M 1.08T 96K none
rpool/ROOT/ubuntu_mj0kfs/usr/local 508M 1.08T 506M /usr/local
rpool/ROOT/ubuntu_mj0kfs/var 541G 1.08T 47.8M none
rpool/ROOT/ubuntu_mj0kfs/var/games 152K 1.08T 96K /var/games
rpool/ROOT/ubuntu_mj0kfs/var/lib 328G 1.08T 279G /var/lib
rpool/ROOT/ubuntu_mj0kfs/var/lib/AccountsService 368K 1.08T 104K /var/lib/AccountsService
rpool/ROOT/ubuntu_mj0kfs/var/lib/NetworkManager 1.41M 1.08T 200K /var/lib/NetworkManager
rpool/ROOT/ubuntu_mj0kfs/var/lib/apt 1.35G 1.08T 495M /var/lib/apt
rpool/ROOT/ubuntu_mj0kfs/var/lib/dpkg 147M 1.08T 51.4M /var/lib/dpkg
rpool/ROOT/ubuntu_mj0kfs/var/log 1.36G 1.08T 1.06G /var/log
rpool/ROOT/ubuntu_mj0kfs/var/mail 152K 1.08T 96K /var/mail
rpool/ROOT/ubuntu_mj0kfs/var/snap 200G 1.08T 159G /var/snap
rpool/ROOT/ubuntu_mj0kfs/var/snap_oz_yjgrj5 11.3G 1.08T 11.3G none
rpool/ROOT/ubuntu_mj0kfs/var/spool 868K 1.08T 160K /var/spool
rpool/ROOT/ubuntu_mj0kfs/var/www 96K 1.08T 96K /var/www
rpool/ROOT/ubuntu_mj0kfs/var/www_autozsys_ly64kj 96K 1.08T 96K none
rpool/USERDATA 132G 1.08T 96K none
rpool/USERDATA/dror_z9i12l 119M 1.08T 119M /home/dror_z9i12l
rpool/USERDATA/dror_zo0whd 132G 1.08T 131G /home/dror
rpool/USERDATA/root_z9i12l 372K 1.08T 372K /root
Ok, and I imagine things stopped working when you did that migration and those clones; this is starting to make sense :) The issue is in this manual migration.
Please read the blog post I pointed to above and tag both system and user datasets appropriately.
Thank you, problem solved. I fixed the properties of rpool and its subtree exactly as stated in the installation guide "Ubuntu 20.04 Root on ZFS".
The problem started when I upgraded from 20.04 to 20.10. The issue was some kind of conflict with syncoid, which still needs to be resolved.
Great to hear! Ok, so it was a setup issue. I will let you explore the syncoid issue. The performance issue is a separate one that I keep open, but I'm closing this one to avoid conflating the two.
@dominicdoty, I made the same mistake on my system. Once I uninstalled `zfs-auto-snapshot` and deleted all of the auto snapshots, `zsys` services started working again.
I deleted all of the auto-snaps with the command:
for dataset in $(zfs list -H -t snapshot -o name | grep zfs-auto-snap); do zfs destroy "$dataset"; done
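As a more cautious variant (not from the thread), you might preview the destroy commands before running them. The sample list below is hypothetical, standing in for live `zfs list -H -t snapshot -o name` output:

```shell
# Hypothetical sample standing in for: zfs list -H -t snapshot -o name
snapshots='rpool/USERDATA/root@zfs-auto-snap_hourly-2020-11-01-0017
rpool/USERDATA/root@autozsys_abc123
rpool/ROOT/ubuntu@zfs-auto-snap_daily-2020-11-01-0025'

# Dry run: print each destroy command instead of executing it.
# Drop the echo (and feed real zfs list output) to actually destroy.
printf '%s\n' "$snapshots" | grep '@zfs-auto-snap' | while read -r snap; do
    echo "zfs destroy $snap"
done
```

Note that only the zfs-auto-snap snapshots are matched; the autozsys ones (ZSys's own) are left alone.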
`systemctl status zsysd` shows a running process with no issues.
Also, when installing new packages I get the following error:
ERROR Service took too long to respond. Disconnecting client.
After restarting zsysd I am able to install new packages with no error, and zsys works for a while.
When restarting the computer, the issue comes back.
This can be tested by installing any package, for example:
sudo apt install --reinstall xterm
I also found out that when I do `sudo /sbin/zsysctl state save` I get the following error: ERROR couldn't save state for user "root": user "root" doesn't exist
Then I found out that the root pool was not mounted. I manually mounted it, but I still get the error. Because of that, zsys has not saved states since the upgrade to 20.10. I have another machine with 20.10 that does not have this issue.