ElvishJerricco opened 1 year ago
I've just noticed via `iotop` that the receiving machine (whether it's the same as the sender or remote) has `[receive_writer]` and `[txg_sync]` nearly constantly at 99.99% IO. It dips to ~80% every couple of seconds, but it's mostly at 99.99%.
```
    TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN   IO>     COMMAND
3141680  be/7  root  0.00 B/s   0.00 B/s    0.00 %  99.99 %  [receive_writer]
   1565  be/4  root  0.00 B/s   0.00 B/s    0.00 %  99.99 %  [txg_sync]
```
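For reference, per-thread figures like the ones above can be captured non-interactively with iotop's batch mode; this is a sketch (requires root), filtering for the two ZFS kernel threads in question:

```shell
# -b: batch (non-interactive) mode, -o: only show threads actually doing IO,
# -d 2: sample every 2 seconds; requires root.
iotop -b -o -d 2 | grep -E 'receive_writer|txg_sync'
```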
I tried using a VM with ZFS 0.8 as the receiver and got something interesting and frightening. This VM's storage is on NVMe. I had noticed before that trying to receive on NVMe would get slightly further before hanging, but I never gave it more than a few minutes, since I assumed it was the same as my HDD-based pools, which would hang for hours before I tried to kill them. After letting the VM sit there for a while, it eventually received another ~100MiB. And over the course of 3 hours, it randomly alternated between bursts of high intake speed and 30-45min periods of semi-hanging (where it would receive a few hundred KiB and then stall for 30+ seconds).
The frightening part was when I decided to run a scrub on the VM's pool mid-receive. The scrub ran nice and quickly, as I would expect from a VM on NVMe, but it detected 6 CKSUM errors in files in the `@originalsnap` that it already had. Note that this VM's storage is itself stored on ZFS, which reported no errors, so it's highly unlikely that the host corrupted the data. So receiving this incremental stream caused it to corrupt pre-existing blocks!
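The mid-receive scrub check amounts to something like the following sketch (the pool name `tank` is a placeholder, not from the report):

```shell
# While `zfs recv` is running in another shell, scrub the pool
# and then inspect its status; the CKSUM column and the "errors:"
# section list any files with detected checksum errors.
zpool scrub tank
zpool status -v tank
```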
The receive did finish after three and a half hours, but the CKSUM errors remain. And if I generate the send stream from the VM, now that it has `@newsnap`, it triggers the same hang when my HDD pools try to receive it.
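A way to examine the problematic stream itself, independent of any receiver, is to capture it to a file and dump its records; this is a sketch using the `zstream` utility shipped with OpenZFS 2.x (`zstreamdump` in older releases), with the snapshot names from the report:

```shell
# Capture the incremental stream to a file, then print a verbose
# record-by-record dump of its contents for inspection.
zfs send newpool/homedir@newsnap -i @originalsnap > /tmp/broken.stream
zstream dump -v /tmp/broken.stream | head -n 50
```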
### System information

### Describe the problem you're observing
When sending an incremental snapshot of my home directory to either of my backup pools, the receive process hangs after only ~60MiB has been transferred, according to `pv`. When I `ctrl+c` it, it takes over 15min before it finally dies. The `send` process is fine; I can pipe it to a file somewhere and it completes without issue, producing as large a file as you would expect. `zpool scrub` on any of the pools I mention has been error-free.

In May, I decided to reformat my SSD and reinstall my OS from scratch. My previous home directory had been backed up automatically with `send`/`recv`, with a long snapshot history on the backups. I reinstalled with NixOS `22.05.20220511.41ff747`, kernel `5.15.37`, and ZFS `2.1.4`, this time with ZFS native encryption. Then I used `send`/`recv` to restore the backup of my home directory to the new install. I didn't get around to re-setting up automatic snapshots and backups, so it was some months before I decided to do an incremental backup manually. That's when I noticed the bug: when I made a new snapshot on the new pool and sent the incremental stream back to the backup pool I had restored from, the receive process froze.

### Describe how to reproduce the problem
I've tried many variations on this to narrow down any sort of link, but the only thing I can conclude is that, very specifically, the stream produced by `zfs send newpool/homedir@newsnap -i @originalsnap` is simply not receivable. I've tried:

- Receiving onto datasets that already have `@originalsnap`. It doesn't matter what pool it's on; the receive process hangs.
- Between any two datasets that have `@originalsnap` (obviously excluding `newpool/homedir`, as I'm not willing to roll back that dataset and lose all the new stuff), changes made to one can successfully be incrementally sent to the other.
- If I recreate `@originalsnap` via a send stream generated from the new pool, like `zfs send newpool/homedir@originalsnap`, the previous observations are identical. It can't receive the broken stream, but it can send and receive functioning incremental streams. Again, it does not matter what pool this dataset is on.
- The same goes for a fresh copy of `newpool/homedir@originalsnap`. It can send and receive new incremental streams, but it cannot receive that broken stream.
- On the new pool, I can snapshot past `@newsnap` and incrementally send `@even-newer-snap`. In these cases, there are periods where the transfer slows to a terrible crawl for a minute or two at a time, but the majority of the transfer goes as expected.

I have absolutely no idea how to put oneself in a position where their dataset is only capable of producing broken incremental streams like this. Note that, as I said, `newpool/homedir` is using ZFS native encryption, but I have not done any testing whatsoever with `--raw`. I also have not done any sending with flags like `-L`, `-e`, `-c`, `-p`, or `-R`. Finally, I have also done most of the same tests with my other backup server as the receiver, which is running ZFS 2.0.4, and have not observed any differences.

I would really like to fix this, as I'd greatly prefer to keep my long snapshot history on my backup pools without having to keep a whole other duplicate of my home directory on them. Though, to be honest, I'm not exactly feeling confident in the safety of continuing to use `newpool/homedir`, given that it's creating these problems.

### Include any warning/errors/backtraces from the system logs
I have not been able to observe any related logs in `dmesg` or `journalctl`.
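Even when nothing is logged on its own, stack traces for a hung `receive_writer` or `txg_sync` thread can often be forced into the kernel log; this is a sketch, assuming the magic sysrq interface is enabled and run as root:

```shell
# 'w' dumps the stacks of all tasks in uninterruptible (blocked) state
# to the kernel log; then search the log for the ZFS threads in question.
echo w > /proc/sysrq-trigger
dmesg | grep -A 20 -E 'receive_writer|txg_sync'
```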