util-linux / util-linux


Last partition created by fdisk overlaps with GPT secondary header? #2862

Open ngrigoriev opened 5 months ago

ngrigoriev commented 5 months ago

Hi,

Sorry for what is probably a silly question; I am not very familiar with the GPT structure. I will share my observations.

I am using the new cryptsetup feature for configuring hardware encryption (OPAL) on an NVMe disk. I have noticed that when the range is locked, the kernel complains about errors reading some sectors. This happens in a minimal install environment and even at boot time. My conclusion so far is that this happens when the kernel reads the partition table (GPT). The disk was partitioned with fdisk, and for the end of the last partition I used the default value (use the rest of the disk). Here is the current partition table:

Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 488378646 sectors
Disk model: CT2000T500SSD8
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1E1A1574-2754-4C9D-A7D9-25D2F9C8DD2C

Device          Start       End   Sectors  Size Type
/dev/nvme0n1p1    256     65791     65536  256M Linux filesystem
/dev/nvme0n1p2  65792    590079    524288    2G Linux filesystem
/dev/nvme0n1p3 590080 488378623 487788544  1.8T Linux filesystem

With OPAL, the drive locks a range of 487784448 sectors starting at sector 594176 (the partition start plus the 16 MiB LUKS2 header).

Locking Range Configuration for /dev/nvme0n1
LR0 Begin 0 for 0
            RLKEna = N  WLKEna = N  RLocked = N  WLocked = N
LR1 Begin 0 for 0
            RLKEna = N  WLKEna = N  RLocked = N  WLocked = N
LR2 Begin 0 for 0
            RLKEna = N  WLKEna = N  RLocked = N  WLocked = N
LR3 Begin 594176 for 487784448
...

This means that sectors 594176 through 488378623 (inclusive) are locked by OPAL, which is everything in partition 3 after the LUKS2 header.
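As a quick sanity check of these numbers, here is a small Python sketch. It assumes 4096-byte logical sectors and a 16 MiB LUKS2 header, as stated above; the variable names are only illustrative. It shows that the locking range starts exactly 16 MiB into partition 3 and ends on the partition's last sector, with 488378624 being the exclusive end.

```python
SECTOR = 4096                                 # logical sector size in bytes (from fdisk output)
MIB = 1024 * 1024

p3_start, p3_end = 590080, 488378623          # partition 3, inclusive bounds
luks2_header_sectors = 16 * MIB // SECTOR     # 16 MiB LUKS2 header -> 4096 sectors

lr_start = p3_start + luks2_header_sectors    # where the OPAL locking range begins
lr_length = 487784448                         # LR3 length reported by the drive
lr_end_exclusive = lr_start + lr_length       # first sector after the locked range

print(lr_start)                               # 594176
print(lr_end_exclusive)                       # 488378624
print(lr_end_exclusive - 1 == p3_end)         # True: the lock covers p3 up to its last sector
```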

Then, when the system boots and the range is not yet unlocked, I observe the following kernel errors:

[ 1069.190489]  nvme0n1: p1 p2 p3
[ 1069.194658] nvme0n1: Read(0x2) @ LBA 488378613, 1 blocks, Access Denied (sct 0x2 / sc 0x86) DNR
[ 1069.196989] critical medium error, dev nvme0n1, sector 3907028904 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[ 1069.199271] nvme0n1: Read(0x2) @ LBA 488378613, 1 blocks, Access Denied (sct 0x2 / sc 0x86) DNR
[ 1069.201175] critical medium error, dev nvme0n1, sector 3907028904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 1069.202796] Buffer I/O error on dev nvme0n1, logical block 488378613, async page read
[ 1069.207209] nvme0n1: Read(0x2) @ LBA 488378614, 1 blocks, Access Denied (sct 0x2 / sc 0x86) DNR
[ 1069.209798] critical medium error, dev nvme0n1, sector 3907028912 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
[ 1069.211272] nvme0n1: Read(0x2) @ LBA 488378614, 1 blocks, Access Denied (sct 0x2 / sc 0x86) DNR
[ 1069.212625] critical medium error, dev nvme0n1, sector 3907028912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[ 1069.215141] Buffer I/O error on dev nvme0n1, logical block 488378614, async page read

As you can see, the kernel is reading sectors 488378613 and 488378614, which fall into the locked range. But they also fall within the partition's own range (590080 to 488378623), since 488378613 < 488378623.
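Each pair of kernel messages above refers to the same location in two different units. The NVMe driver prints the device LBA (4096-byte blocks on this drive), while the block layer's "sector" is always counted in 512-byte units, so 488378613 × 8 = 3907028904. A minimal Python sketch of the conversion, with a membership check against the ranges from this thread (the helper name `kernel_sector_to_lba` is hypothetical):

```python
KERNEL_SECTOR = 512     # the block layer always counts "sector" in 512-byte units
LBA_SIZE = 4096         # this drive's logical block size

def kernel_sector_to_lba(sector: int) -> int:
    """Convert a block-layer sector number to a 4K device LBA."""
    return sector * KERNEL_SECTOR // LBA_SIZE

locked = range(594176, 488378624)     # OPAL LR3, end exclusive
p3 = range(590080, 488378623 + 1)     # partition 3, end made exclusive

for sector in (3907028904, 3907028912):
    lba = kernel_sector_to_lba(sector)
    # prints the LBA and whether it lies in the locked range and in partition 3
    print(sector, "->", lba, lba in locked, lba in p3)
```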

So much for the symptoms. Now for the bigger picture.

Looking at https://en.wikipedia.org/wiki/GUID_Partition_Table, I see that GPT keeps a "secondary header" area at the end of the disk, from LBA -33 down to LBA -1. With my partition table, 488378646 - 33 = 488378613, which seems to match exactly the place where the kernel hits the problem.

So, my question: is it correct for fdisk to allow creating a partition that appears to overlap with that last part of the LBA range?

Thanks!

oldium commented 5 months ago

Try using gdisk. It has not helped me, though: I created one partition (gdisk synchronizes the second copy of the GPT automatically) and protected it with OPAL (cryptsetup --hw-opal-only luksFormat /dev/sdd1), but the read errors near the end of the partition are still there...

Update: My read errors are caused by Linux RAID detection in the blkid tool, which is called from udev rules. Fixed here: https://github.com/util-linux/util-linux/pull/2882

karelzak commented 5 months ago

Please try fdisk /dev/nvme0n1 --list-details. It provides more details about the location of the headers and partitioned areas.

ngrigoriev commented 5 months ago

Sorry for the delay. I was trying to troubleshoot this a bit more myself.

Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 488378646 sectors
Disk model: CT2000T500SSD8                          
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 738C1233-675B-47BC-95B4-4B1D645398DE
First usable LBA: 256
Last usable LBA: 488378640
Alternative LBA: 488378645
Partition entries starting LBA: 2
Allocated partition entries: 128
Partition entries ending LBA: 5

Device          Start       End   Sectors Type-UUID                            UUID                                 Name Attrs
/dev/nvme0n1p1    256     65791     65536 0FC63DAF-8483-4772-8E79-3D69D8477DE4 795FDE31-F112-4608-8526-048D4FAB7E8F      
/dev/nvme0n1p2  65792    590079    524288 0FC63DAF-8483-4772-8E79-3D69D8477DE4 219850EF-5989-46DE-A2A7-320B8CE019D2      
/dev/nvme0n1p3 590080 488378623 487788544 0FC63DAF-8483-4772-8E79-3D69D8477DE4 7BC6C861-C285-45E2-801F-51E4069378C3      

So, basically, my question: is it a problem that the partition ends 23 sectors (4K sectors) before the end of the disk?

Now, about the hardware encryption and the trouble that prompted me to open this issue. I was probably wrong about the reason for the errors. I found that partition probing appears to work differently in the kernel and in the partprobe tool. But both seem to have one thing in common: they read not only the partition table, but also (at least) the beginning of each partition. At least, this is what I observed when running partprobe under strace. I also found that partprobe was trying, for some unknown reason, to read the LAST sector of the partition protected by OPAL. Of course, this results in I/O errors.

So far I have concluded that alignment to 1 MiB comes into play at some point: the errors go away only when I create the encrypted partition so that its end is aligned to a 1 MiB boundary. Anyway, this seems like a separate issue to be reported against cryptsetup. For this issue, I will stick to my original question about "LBA-33" as per Wikipedia.

Thanks!

mr-bronson commented 3 weeks ago

my original question about "LBA-33" as per Wikipedia.

Wikipedia is a great source for secondary information, but if you want to go deeper, you have to go to the specs, e.g. here. As you may have noticed in your original post:

Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

So there is no 512-byte emulation going on. But the partition entries don't grow with the sector size. Each entry can be any power of two >= 128 bytes long (128, 256, 512, etc.); the entry size is specified in the GPT header, but it doesn't need to be bigger than 128 bytes. As long as at least 16 KiB is reserved for the entries, you're fine. And since 32 512-byte sectors hold the same 16 KiB as 4 4K sectors, you only need 5 sectors at the end of the device (4 for the entries plus the backup header) and 6 at the beginning (protective MBR, primary header and 4 for the entries) to store the GPT data.
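To make the sector counts concrete, here is a minimal Python sketch that computes the GPT footprint for a given logical sector size. The function name `gpt_layout` is hypothetical and not part of fdisk or libfdisk. It assumes the usual 128 entries of 128 bytes and an entry array placed right next to each header, which matches the `--list-details` output earlier in the thread (entries at LBA 2 to 5).

```python
def gpt_layout(total_sectors: int, sector_size: int,
               num_entries: int = 128, entry_size: int = 128) -> dict:
    """Sketch of the GPT metadata footprint for a given logical sector size.

    Assumes the entry array sits directly after the primary header and
    directly before the backup header.
    """
    array_bytes = max(num_entries * entry_size, 16 * 1024)   # at least 16 KiB reserved
    array_sectors = -(-array_bytes // sector_size)           # ceiling division
    last_lba = total_sectors - 1                             # zero-based indexing
    return {
        "sectors_at_start": 2 + array_sectors,     # protective MBR + header + entries
        "sectors_at_end": array_sectors + 1,       # backup entries + backup header
        "last_usable_max": last_lba - array_sectors - 1,
        "alternate_lba": last_lba,                 # backup header location
    }

disk = gpt_layout(488378646, 4096)
print(disk["sectors_at_end"], disk["last_usable_max"], disk["alternate_lba"])  # 5 488378640 488378645
print(gpt_layout(488378646 * 8, 512)["sectors_at_end"])                      # 33
```

With 4096-byte sectors the backup area is 5 sectors, which gives the same last usable LBA (488378640) that fdisk reports. With 512-byte sectors the same 16 KiB array takes 32 sectors plus the header, which is where Wikipedia's LBA -33 comes from.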

So, basically, my question: is it a problem that the partition ends 23 sectors (4K sectors) before the end of the disk?

22 sectors, actually. 488378646 is the number of sectors, not the last LBA. With zero-based indexing, the last LBA is obviously one less than that: Alternative LBA: 488378645

But, yes, it's fine. As long as your last usable LBA (Last usable LBA: 488378640) is >= the last sector of your last partition in disk order (488378623), you're fine. You actually have 17 sectors of unallocated space after the last partition, but that's good, because your last partition is aligned on 1 MiB boundaries and is an integer number of MiB in size.
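The arithmetic in this reply can be checked with a few lines of Python, using the values from the `--list-details` output. The variable names are illustrative only, and 1 MiB is taken as 256 sectors because the disk uses 4096-byte sectors.

```python
MIB_SECTORS = 1024 * 1024 // 4096      # 256 sectors of 4096 bytes per MiB

total_sectors = 488378646
last_lba = total_sectors - 1           # 488378645, the "Alternative LBA"
last_usable = 488378640                # from fdisk --list-details
p3_start, p3_end = 590080, 488378623   # last partition, inclusive bounds

print(last_lba - p3_end)               # 22 sectors between p3's end and the disk's last LBA
print(last_usable - p3_end)            # 17 unallocated but usable sectors
print(last_usable >= p3_end)           # True: p3 stays clear of the backup GPT
print(p3_start % MIB_SECTORS == 0,                  # starts on a 1 MiB boundary
      (p3_end - p3_start + 1) % MIB_SECTORS == 0)   # and is a whole number of MiB
```

The "23" in the question comes from subtracting the partition end from the sector count (488378646) instead of from the last LBA (488378645).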

Hope that clears things up.