Open rincebrain opened 2 years ago
Starting to think that we need to have all of that consistent, modular, and internal to zfs on all OS along with crypto and any other bits impacting what is written (vs the io pipelines for how it's written).
@rincebrain: would you be able to confirm that the same block of 0xdeadbeef
repeating or whatnot appears as non-duplicate in the DDTs when written from BSD and then Linux (since their sums shouldn't match)?
I could, but why check? If the checksums don't match, it's not going to go into the DDT as a duplicate. (If it does, a lot more things are quite broken...)
Agreed, reason for asking is to have a record of commercially marketed features being broken such that the powers that be turn their focus to compression and encryption architecture (vs fixing this instance of a bigger problem).
@sempervictus I've seen you mentioning encryption in this context a couple of times. If I understand the problem correctly encryption isn't affected by this problem. For a given plaintext any implementation of AES will produce exactly the same ciphertext and same holds for the MACs produced by the modes (GCM or CCM).
@sempervictus I've seen you mentioning encryption in this context a couple of times. If I understand the problem correctly encryption isn't affected by this problem. For a given plaintext any implementation of AES will produce exactly the same ciphertext and same holds for the MACs produced by the modes (GCM or CCM).
Encryption doesn't produce inconsistent results on disk across platforms, no, it just sometimes panics your system or mangles your encryption key on recv, if you're unlucky. NBD.
@rincebrain First reaction :-), but really :-( :-(. Yeah, there's still a bit work left regarding encryption. I've never seen panics but I could reproduce a couple of key corruption issues with send/recv. I've a couple of comments and partial ideas how to fix it, but TBD (sorry for the pun). Well this gets off topic...
I've got a nice patch that just adds a usleep in dbuf_read and then it's really easy to reproduce the panic...
...fixing it is less obvious to me, unfortunately.
Interesting, could you post the patch? Just curious, not deep in that code either.
@AttilaFueloep: concur, I mention crypto only because it is another low level dmu operation needing guards/checks x-platform to ensure safety.
Interesting, could you post the patch? Just curious, not deep in that code either.
https://github.com/rincebrain/zfs/tree/randdelay is the tree I was using; I found zfs_arc_wild_delay=20 or 50 or so on most modern x86 would let you reproduce the NULL dereference by running zfs_receive_raw in a loop a few dozen times, compared to ~never on a lot of modern hardware. My thoughts on the problem it illustrates are written down here; I found trying to just mutex the problem away fruitless because it appears dbuf_dest gets called on the dbuf anyway at which point everything is extremely NULL. Still adding hooks to find out who's responsible for that, since I think I expect all those callers to check refcounts first...
(It's also not guaranteed to always be the same panic, since the problem is "in the middle of a bunch of code using a number of buffers, something sets them all to NULL")
@sempervictus All right, I see.
@rincebrain Thanks for the explanation and pointers. I'll have a look when time permits.
This issue is a really interesting find, and I certainly enjoy any evidence that non-deterministic-compression is not a disaster in the wild.
Since the Linux kernel also has lz4 and zstd (and probably gzip) implementations of its own, I wonder if we should consider defering to those on the Linux side too. I think the downsides would be annoying - like, we'd have to internalize our own implementation anyway in case the kernel is missing it, so it's not like it particularly reduces maintenance burden, just multiplies the testing area... (makes me wonder what the rationale is for doing this on FreeBSD, is that articulated somewhere?)
There's just one vague upside to deferring to the in-kernel implementations: The in-kernel compressors can use SSE2/AVX (zstd has some SSE2 optimizations) with little overhead (the overhead of getting SSE2 robust for ZFS' internalized implementations may utterly negate any benefits). My gut feeling is it's not worth it.
Oh my, I've somehow thought that ZoL used it's own gzip implementation, but it's really not.
Besides - it's an interesting question if QAT gzip acceleration will produce same results.
Link to my thoughts on a topic https://github.com/openzfs/zfs/issues/12840#issuecomment-991867595
If memory serves from when I went digging before opening this, there was a time when ZFS carried its own implementation, but that got ejected a Long time ago. (I would not be surprised if it yielded a third set of results.)
Besides - it's an interesting question if QAT gzip acceleration will produce same results.
Looks like QAT will produce different results, see #7896
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Ever since I learned that the gzip compression option uses the OS's zlib implementation, I'd been slightly worried that perhaps various OS's zlibs might not produce the same outputs.
I finally get around to testing it, and wouldn't you know it...
(Same file, all on the same single "disk" pool, in separate datasets)
Linux (OpenZFS 2.0.6, Debian 11):
FreeBSD (13-RELEASE):
illumos (git from 20211221):
Notably:
I believe this is likely because Linux, way back in 2006 or so, did something like what I'm proposing in #12805 - they updated the zlib decompressor to 1.2.3, but left the compressor at 1.1.3 (because it performed marginally less well sometimes)...and haven't refreshed it from upstream since, while FreeBSD and illumos are both shipping 1.2.x. (I haven't yet tried swapping out OpenZFS on Linux's calls to call into a 1.2.x implementation, but that's what I'd like to try next.) (The alternate explanation is just that they made some subtly inconsistent changes at the time or since, but I find that less probable.)
Describe how to reproduce the problem
Include any warning/errors/backtraces from the system logs
N/A