ElvishJerricco opened 1 year ago
I've just noticed via `iotop` that the receiving machine (whether it's the same as the sender or remote) has `[receive_writer]` and `[txg_sync]` nearly constantly at 99.99% IO. It dips to ~80% every couple of seconds, but it's mostly at 99.99%.
```
    TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO>     COMMAND
3141680  be/7  root  0.00 B/s   0.00 B/s    0.00 %  99.99 %  [receive_writer]
   1565  be/4  root  0.00 B/s   0.00 B/s    0.00 %  99.99 %  [txg_sync]
```
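For reference, per-thread figures like the ones above can be captured non-interactively with iotop's batch mode; this is a sketch (requires root), filtering for the two ZFS kernel threads in question:

```shell
# -b: batch (non-interactive) mode, -o: only show threads actually doing IO,
# -d 2: sample every 2 seconds; requires root.
iotop -b -o -d 2 | grep -E 'receive_writer|txg_sync'
```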
I tried using a VM with ZFS 0.8 as the receiver and got something interesting and frightening. This VM's storage is on NVMe. I had noticed before that trying to receive on NVMe would get slightly further before hanging, but I never gave it more than a few minutes, since I assumed it was the same as my HDD-based pools, which would hang for hours before I tried to kill them. After letting the VM sit there for a while, it eventually received another ~100MiB. And over the course of 3 hours, it randomly alternated between bursts of high intake speed and 30-45min periods of semi-hanging (where it would receive a few hundred KiB and then stall for 30+ seconds).
The frightening part was when I decided to run a scrub on the VM's pool mid-receive. The scrub ran nice and quickly, as I would expect from a VM on NVMe, but it detected 6 CKSUM errors in files in the `@originalsnap` that it already had. Note that this VM's storage is itself stored on ZFS, which reported no errors, so it's highly unlikely that the host corrupted the data. So receiving this incremental stream caused it to corrupt pre-existing blocks!
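The mid-receive scrub check amounts to something like the following sketch (the pool name `tank` is a placeholder, not from the report):

```shell
# While `zfs recv` is running in another shell, scrub the pool
# and then inspect its status; the CKSUM column and the "errors:"
# section list any files with detected checksum errors.
zpool scrub tank
zpool status -v tank
```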
The receive did finish after three and a half hours, but the CKSUM errors remain. And if I generate the send stream from the VM, now that it has `@newsnap`, it triggers the same hang when my HDD pools try to receive it.
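A way to examine the problematic stream itself, independent of any receiver, is to capture it to a file and dump its records; this is a sketch using the `zstream` utility shipped with OpenZFS 2.x (`zstreamdump` in older releases), with the snapshot names from the report:

```shell
# Capture the incremental stream to a file, then print a verbose
# record-by-record dump of its contents for inspection.
zfs send newpool/homedir@newsnap -i @originalsnap > /tmp/broken.stream
zstream dump -v /tmp/broken.stream | head -n 50
```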
### System information

### Describe the problem you're observing
When sending an incremental snapshot of my home directory to either of my backup pools, the receive process hangs after only ~60MiB has been transferred, according to `pv`. When I `ctrl+c` it, it takes over 15min before it finally dies. The `send` process is fine; I can pipe it to a file somewhere and it completes without issue, producing as large a file as you would expect. `zpool scrub` on any of the pools I mention has been error-free.

In May, I decided to reformat my SSD and reinstall my OS from scratch. My previous home directory had been backed up automatically with `send`/`recv`, with a long snapshot history on the backups. I reinstalled with NixOS `22.05.20220511.41ff747`, kernel `5.15.37`, and ZFS `2.1.4`, this time with ZFS native encryption. Then I used `send`/`recv` to restore the backup of my home directory to the new install. I didn't get around to re-setting up automatic snapshots and backups, so it was some months before I decided to do an incremental backup manually. That's when I noticed the bug: when I made a new snapshot on the new pool and sent the incremental stream back to the backup pool I had restored from, the receive process froze.

### Describe how to reproduce the problem
I've tried many variations on this to narrow down any sort of link, but the only thing I can conclude is that, very specifically, the stream produced by `zfs send newpool/homedir@newsnap -i @originalsnap` is simply not receivable. I've tried:

- Receiving onto datasets that already have `@originalsnap`. It doesn't matter what pool it's on; the receive process hangs.
- Between any two datasets that have `@originalsnap` (obviously excluding `newpool/homedir`, as I'm not willing to roll back that dataset and lose all the new stuff), changes made to one can successfully be incrementally sent to the other.
- If I recreate `@originalsnap` via a send stream generated from the new pool, like `zfs send newpool/homedir@originalsnap`, the previous observations are identical. It can't receive the broken stream, but it can send and receive functioning incremental streams. Again, it does not matter what pool this dataset is on.
- The same goes for a fresh copy of `newpool/homedir@originalsnap`. It can send and receive new incremental streams, but it cannot receive that broken stream.
- On the new pool, I can snapshot past `@newsnap` and incrementally send `@even-newer-snap`. In these cases, there are periods where the transfer slows to a terrible crawl for a minute or two at a time, but the majority of the transfer goes as expected.

I have absolutely no idea how to put oneself in a position where their dataset is only capable of producing broken incremental streams like this. Note that, as I said, `newpool/homedir` is using ZFS native encryption, but I have not done any testing whatsoever with `--raw`. I also have not done any sending with flags like `-L`, `-e`, `-c`, `-p`, or `-R`. Finally, I have also done most of the same tests with my other backup server as the receiver, which is running ZFS 2.0.4, and have not observed any differences.

I would really like to fix this, as I'd greatly prefer to keep my long snapshot history on my backup pools without having to keep a whole other duplicate of my home directory on them. Though, to be honest, I'm not exactly feeling confident in the safety of continuing to use `newpool/homedir`, given that it's creating these problems.

### Include any warning/errors/backtraces from the system logs
I have not been able to observe any related logs in `dmesg` or `journalctl`.
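Even when nothing is logged on its own, stack traces for a hung `receive_writer` or `txg_sync` thread can often be forced into the kernel log; this is a sketch, assuming the magic sysrq interface is enabled and run as root:

```shell
# 'w' dumps the stacks of all tasks in uninterruptible (blocked) state
# to the kernel log; then search the log for the ZFS threads in question.
echo w > /proc/sysrq-trigger
dmesg | grep -A 20 -E 'receive_writer|txg_sync'
```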