openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

crash doing zfs send encrypted #13703

Open clhedrick opened 2 years ago

clhedrick commented 2 years ago

System information

Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 20.03
Kernel Version | 5.4.0
Architecture | Intel 64
OpenZFS Version | 0.8.3

Describe the problem you're observing

We had ZFS hang while doing a zfs send. I reported it in #11679, but was asked to report it separately.

You asked for a backtrace of our send-side failure. Here it is. The next line in kern.log is from a reboot an hour later. There are no further backtraces. From that point on, our problems were with ZFS at a user level. I don't have a full narrative of everything we saw. I apologize. The system has 512 GB of memory, a bunch of disks in RAIDZ1 vdevs, a mirror of SSDs as a special vdev for metadata, a mirror of SSDs as SLOG, and an L2ARC (SSD mirror, as an experiment; it wasn't worth it). Quotas are in use, but I don't know whether the file system that was being sent used quotas.

Recovery options were limited because a scrub takes a few days, and downtime of the main file systems on this machine is a real problem. So I chose a mode of recovery I was confident would work: rebuilding the whole thing from backup, with the most commonly used file systems first, and no encryption. The backup system was also encrypted (that's where the send was going), but the restore was unencrypted. I believe at the time the system was around 500 TB, with 200 TB in use.

I believe the file system was being sent and simultaneously used by NFS.

This is Ubuntu 20, with the ZFS that comes with it.

zfs.bug.txt

Describe how to reproduce the problem

Can't reproduce.

Include any warning/errors/backtraces from the system logs

behlendorf commented 2 years ago

According to the stack trace, there's some unexplained damage to one of the ZFS block pointers, which is causing the crash. Specifically, there's no valid checksum algorithm set: `PANIC: blkptr at 00000000f9099df0 has invalid CHECKSUM 0`. I can't explain how that would have happened, but as of ZFS 2.0.7 this kind of damage will be handled gracefully and an error returned rather than a system crash.
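To illustrate the difference between panicking and returning an error, here is a minimal sketch in C. It is not the OpenZFS source: the `sketch_` names, the simplified struct, the enum values, and the choice of `EIO` as the returned code are all assumptions made for illustration. The idea is only that a block pointer whose checksum field names no usable algorithm (such as the 0 in the panic message) is rejected with an error the caller can propagate, so that one read fails instead of the whole system.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical checksum identifiers; NOT the real OpenZFS enum. */
enum sketch_checksum {
    SKETCH_CHECKSUM_UNSET = 0,   /* no algorithm set, as in the damaged pointer */
    SKETCH_CHECKSUM_FLETCHER4,
    SKETCH_CHECKSUM_SHA256,
    SKETCH_CHECKSUM_COUNT        /* one past the last valid value */
};

/* Hypothetical, heavily simplified stand-in for a block pointer. */
struct sketch_blkptr {
    uint64_t checksum;
};

/*
 * Returns 0 when the checksum field names a usable algorithm.
 * Returns EIO (an assumption for this sketch) when it does not,
 * after logging a message, rather than halting the system.
 */
int sketch_blkptr_verify(const struct sketch_blkptr *bp)
{
    assert(bp != NULL);

    if (bp->checksum == SKETCH_CHECKSUM_UNSET ||
        bp->checksum >= SKETCH_CHECKSUM_COUNT) {
        fprintf(stderr, "blkptr at %p has invalid CHECKSUM %llu\n",
            (const void *)bp, (unsigned long long)bp->checksum);
        return EIO;
    }
    return 0;
}
```

A caller in this sketch would check the return value before issuing the read and pass a nonzero result up the stack, which at the user level would surface as a failed read or a failed send rather than a hang.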

clhedrick commented 2 years ago

That's good news. In the current version how much would we lose? A file? The file system?

behlendorf commented 2 years ago

It would depend on exactly which block is damaged and if it's the only one. If it's just this block pointer then most likely a file, or a portion of the file.

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.