openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.69k stars 1.76k forks source link

vdev_disk: try harder to ensure IO alignment rules #16687

Closed robn closed 4 weeks ago

robn commented 1 month ago

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

Ongoing IO alignment issues on Linux with dm-crypt.

Closes: #16631, #15646, #15533, #14533. Maybe also: #10094

Description

The key insight from out of #16631 is that blocks data must not cross physical page boundaries. So, this updates the alignment check function, then changes the fallback allocation method to ensure page alignment.

I'm not very happy with abd_alloc_scatter(). It feels like a big call to just create it and assert that its always page-aligned. But it's probaby right, and the ABD interface as a whole is kinda of a mess these days, and there really isn't an easier allocator to reach for that will do page alignment (even using alloc_pages() needs a bunch of ceremony that abd_alloc_chunks() already does. So it doesn't seem too bad.

Update: @amotin educated me on why the existing abd_alloc_for_io() was actually what I wanted. So this has been changed for that, and abd_alloc_scatter() no longer included.

I did notice through this that we don't honour the "minimum IO" device limit at all. Most things don't care, but as usual, dm-crypt does, because the block size is part of the crypto. So creating an ashift=9 pool over a 4K dm-crypt device just throws IO errors. I think it always did though (I can't see how it couldn't have), so I'm not going to worry too much for now. It does show that we need to work to minimum limits, which will mean eg not doing <4K accesses, but that seems like it's gonna be work all over the stack (labels, gang headers, etc)... some other time!

Update: Turns out this was a long-standing bug with ZFS, in that it should never be possible to use an ashift smaller than LBS. So now we do check for this case, and #16690 will make it impossible to get into that situation.

How Has This Been Tested?

The test case in #16631 doesn't trip the bug anymore. If I set zfs_vdev_disk_classic=1 it blows up pretty quick. My test cases from #15588 (forcing ganging) continue to work correctly.

Full ZTS run in progress, but it's all looking pretty good so far.

Update: ZTS ran to completion.

Update: I've now run the #16631 test case with various combinations of --sector-size= to cryptsetup luksCreate and -o ashift= to zpool create (always with ashift larger than sector size). Nothing tripped.

Types of changes

Checklist:

robn commented 1 month ago

Last push removed the TODO comment, added a check to ensure that the data len within the page is a multiple of LBS, and adjusted the page_alignment test to match.