This is going to (of necessity) be a ZFS issue, not a syncoid issue—ZFS is what's throwing that "dataset busy" error, syncoid is merely passing it along to you.
To figure out why the dataset is busy after that initial full replication (which succeeded, from looking at your report) you'll have to investigate on the target side. You might check lsof to see if anything's going on with filesystem locks inside the newly created dataset, for example.
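For example, something along these lines (the mountpoint here is an assumption; substitute wherever the received dataset actually mounts):

```
# find where the received dataset is mounted
zfs get -H -o value mountpoint tank/backups/homes/home

# list any processes holding files open under that mountpoint
lsof +D /tank/backups/homes/home
```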
If you can't figure out anything obvious there (locked files inside the dataset, etc.), you'll probably need/want to open a bug with OpenZFS itself. If the project gets shirty about pointing fingers back at syncoid, you can simplify things by stripping the ZFS replication commands out of the syncoid output and reproducing them manually, e.g.:
zfs send -I 'homes/home'@'autosnap_2021-03-25_22:27:28_monthly' 'homes/home'@'syncoid_hestia_2021-03-26:10:45:49' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -s 19133424 | zfs receive -s -F 'tank/backups/homes/home'
You can just run that exact command directly from the command line. You can also simplify it considerably, again for easier problem reporting upstream (and bisecting for potential problem points):
zfs send -I 'homes/home'@'autosnap_2021-03-25_22:27:28_monthly' 'homes/home'@'syncoid_hestia_2021-03-26:10:45:49' | pv -s 19133424 | zfs receive -s -F 'tank/backups/homes/home'
You can simplify even further by stripping out the pv command as well, but things may get painful if you do, due to the lack of a progress bar letting you know where you sit in a potentially very long-running operation.
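Fully stripped down, that would just be:

```
zfs send -I 'homes/home'@'autosnap_2021-03-25_22:27:28_monthly' 'homes/home'@'syncoid_hestia_2021-03-26:10:45:49' | zfs receive -s -F 'tank/backups/homes/home'
```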
It looks to me like there are some issues with replicating encrypted datasets at the ZFS level. This isn't something I've personally done any testing on yet, but like you said it's a similar issue to #598, and you're both working with encrypted datasets and replication.
Hey Jim – thanks for your detailed reply. Today I noticed my machine reporting severe memory issues with memtester, so I'm going to fix that first, to make sure that isn't what was causing my ZFS issues.
@marceldegraaf just tested an initial send/receive from a non-encrypted to an encrypted pool on Debian 10 with syncoid, and it works just fine, so it seems the problem is with your system.
@phreaker0 how large was that initial send/receive? My suspicion is that ZFS is doing some work related to encryption after the initial send, which makes the dataset report as busy for large initial receives. My initial snapshot is ~191 GB in size.
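One way I can think of to check whether the target is still settling after the initial receive is to poll the dataset's state right afterwards, e.g. (the exact properties to watch are just my guess):

```
# check whether the freshly received dataset is mounted yet and whether its encryption key is loaded
zfs get -H -o name,property,value mounted,keystatus tank/backups/homes/home
```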
I've been able to reproduce my issue with zfs send/receive, so without using syncoid. See below.
I'm OK with keeping this issue closed, as the issue seems to be related to ZFS, not syncoid.
First I sent the initial snapshot to the encrypted destination:
zfs send 'homes/home@autosnap_2021-03-25_22:27:28_monthly' | pv -s 205354192152 | zfs receive -s -F 'tank/backups/homes/home'
And then, after a few minutes (because I had to look up the exact commands and send size for pv), an incremental send to the latest snapshot:
zfs send -I 'homes/home@autosnap_2021-03-25_22:27:28_monthly' 'homes/home@autosnap_2021-03-31_18:00:01_hourly' | pv -s 1659827952 | zfs receive -s -F 'tank/backups/homes/home'
This worked without issue. However, I then removed the tank/backups/homes/home dataset (and all its snapshots) to start with a clean slate, and simulated what syncoid does by immediately sending the incremental snapshots after the initial send is complete:
zfs send 'homes/home@autosnap_2021-03-25_22:27:28_monthly' | pv -s 205354192152 | zfs receive -s -F 'tank/backups/homes/home' && zfs send -I 'homes/home@autosnap_2021-03-25_22:27:28_monthly' 'homes/home@autosnap_2021-03-31_18:00:01_hourly' | pv -s 1659827952 | zfs receive -s -F 'tank/backups/homes/home'
Lo and behold: the same "dataset is busy" error that syncoid threw:

cannot receive incremental stream: dataset is busy
It seems ZFS is doing some work related to encryption after the initial send. To test this I created a new, non-encrypted dataset on tank/backups, and tried simulating syncoid again. That worked without issue.
Finally I tried running syncoid again, this time with the non-encrypted dataset as the destination: syncoid --debug homes/home tank/backups/temp/home. That also worked without issue.
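For reference, the control setup was roughly this (the zfs create line is my reconstruction; only the syncoid command is verbatim from above):

```
# hypothetical reconstruction: a plain, non-encrypted dataset as the control destination
zfs create tank/backups/temp
syncoid --debug homes/home tank/backups/temp/home
```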
@marceldegraaf ~19GB
Interesting, but as you said it's most likely a ZFS issue then.
Yes, we've at least ruled out syncoid as the culprit. Thanks for your help anyway!
Isn't this a syncoid issue in the sense that syncoid should wait before sending the next dataset until the target pool is no longer busy? Or retry after a short while?
I suspect this happens mostly with slower storage/CPU targets.
In my case I'm sending an unencrypted dataset to an encrypted target (encrypting at the target).
For me it looks like the issue appears when syncoid is trying to send the next dataset while zfs on the target is stuck with mounting the previously sent dataset. I only manage to get this error on initial/full syncs, which makes sense as mounting of the synced dataset only happens on initial sync, right?
Do we have any ZFS issue opened about this? I didn't find anything that looked like what's experienced here or by me, except some issues with raw sends. Or maybe this: https://github.com/openzfs/zfs/issues/6504?
@MrRinkana syncoid can't properly detect if a ZFS dataset is busy. It will attempt to do what it was instructed, and ZFS will throw a busy error if some other operation is already going on.
> For me it looks like the issue appears when syncoid is trying to send the next dataset while zfs on the target is stuck with mounting the previously sent dataset. I only manage to get this error on initial/full syncs, which makes sense as mounting of the synced dataset only happens on initial sync, right?
AFAIK, for ZFS, mounting a dataset and receiving a dataset at the same time works.
Hmm, okay, thanks! But what operations could trigger "dataset is busy" then?
Would it still make sense to retry the last command with some minor delay if it exited with "dataset is busy"?
Currently I'm trying with "sleep 60".
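A rough sketch of what I mean, assuming the failing step is the incremental receive (the snapshot names here are placeholders):

```
# retry the incremental send/receive a few times if ZFS reports "dataset is busy";
# the pipeline's exit status is that of zfs receive, the command that fails
for attempt in 1 2 3; do
    if zfs send -I 'homes/home@snapA' 'homes/home@snapB' \
        | zfs receive -s -F 'tank/backups/homes/home'; then
        break
    fi
    echo "receive failed (attempt $attempt), sleeping 60s..."
    sleep 60
done
```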
TL;DR – latest stable syncoid fails on Debian 10 when performing the initial send/receive from a non-encrypted to an encrypted pool. The error seems to be:

cannot receive incremental stream: dataset is busy

Issue #598 seems similar, although in that case a sendoption flag seems to have been the culprit. I'm not using those in my syncoid command, and AFAIK my ZFS setup is pretty vanilla.
Using sanoid/syncoid v2.0.3 (latest stable tag) via the latest Debian instructions.
According to those instructions I set up a systemd timer for sanoid, so it creates hourly snapshots. Here's a list of the snapshots created on my homes zpool:

The homes/home dataset is not encrypted:

I want to send all snapshots of homes/home to my tank/backups/homes dataset. This dataset is encrypted:

I'm using the following command to send/receive these snapshots, assuming ZFS will create a child dataset in tank/backups/homes/home that inherits encryption from the parent tank/backups/homes:

However, this command fails consistently with the following output:
Issue #598 seems similar, although in that case a sendoption flag seems to have been the culprit. I'm not using those in my syncoid command – it's as vanilla as possible.

This post on the FreeBSD forum seems similar to my issue, in that I also see a clone on the received snapshot:
Is this what triggers the error, or am I doing something wrong? Happy to provide additional output if needed!
Here's the version output for syncoid:

I'm running on Debian 10 (Buster):

With this ZFS version: