openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Pool broken after zpool online -e; Can't import device unavailable; disk invalid label #13817

Closed alpiua closed 2 years ago

alpiua commented 2 years ago

System information

Type Version/Name
Distribution Name Proxmox VE
Distribution Version 7.2
Kernel Version 5.15.30-2-pve
Architecture x64
OpenZFS Version zfs-2.1.3-pve1

Describe the problem you're observing

After trying to enlarge a disk in the pool, the pool broke completely. I can't import it from any system I have:

zpool import
   rpool              UNAVAIL  insufficient replicas
     mirror-0         UNAVAIL  insufficient replicas
       nvme0n1p3      UNAVAIL  invalid label
       nvme1n1p3      UNAVAIL  invalid label

No variant of the import works:

zpool import -f rpool rpool_bkp 
Cannot import: one or more devices is currently unavailable.

zpool import rpool -f -d /dev/disk/by-id/usb-Realtek_RTL9210_NVME_012345678904-0\:0-part3
cannot import 'rpool': one or more devices is currently unavailable

zpool import <pool_id> -f new_pool_name
cannot import 'rpool': one or more devices is currently unavailable
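For context (not shown in the report): the usual escalation when a plain import fails is a read-only import and then a rewind import, but neither helps when the vdev labels themselves fail their checksums, which is what turned out to be the case here. A sketch:

# Typical recovery attempts; they did not apply here because the labels were damaged.
zpool import -o readonly=on -f rpool
zpool import -F -n rpool     # dry-run rewind to an earlier txg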

Describe how to reproduce the problem

I have a recent setup of Proxmox on an NVMe disk. I replaced a disk in the mirror pool with a larger one; after that, I added the second disk to the pool, it resilvered successfully, and I ran zpool online -e rpool nvme0n1p3 (out of ignorance; I was not aware that I should expand the partition before trying to expand ZFS). After running zpool online -e the system hung, and after a reboot I was no longer able to boot. Suddenly the entire pool was UNAVAIL, and the boot partitions were broken.
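For reference, the step that was skipped looks roughly like this: the GPT partition has to be grown first, and only then can zpool online -e expand the vdev into the new space. A minimal sketch, using the device names from this report (exact partition numbers and tooling vary per setup):

# Grow partition 3 into the free space, reread the partition table,
# then ask ZFS to expand the vdev.
parted /dev/nvme0n1 resizepart 3 100%
partprobe /dev/nvme0n1
zpool online -e rpool nvme0n1p3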

Include any warning/errors/backtraces from the system logs

The pool label looks fine; zdb -l /dev/nvme0n1p3 shows:

version: 5000
name: 'rpool'
state: 0
txg: 6132885
pool_guid: 11333899401152683132
errata: 0
hostid: 3866964428
hostname: 'pve.lan'
top_guid: 15187296562327313504
guid: 9707396299858239856
vdev_children: 1
vdev_tree:
    type: 'mirror'
    id: 0
    guid: 15187296562327313504
    whole_disk: 0
    metaslab_array: 256
    metaslab_shift: 31
    ashift: 12
    asize: 249516523520
    is_log: 0
    create_txg: 4
    children[0]:
        type: 'disk'
        id: 0
        guid: 9707396299858239856
        path: '/dev/nvme0n1p3'
        whole_disk: 0
        DTL: 2689
        create_txg: 4
    children[1]:
        type: 'disk'
        id: 1
        guid: 11271255887110782831
        path: '/dev/nvme1n1p3'
        whole_disk: 0
        DTL: 104
        create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data
labels = 0 1 2 3 
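For reference (background not stated in the report): each pool member keeps four copies of this label, two at the front of the partition and two at the end, and zdb can be asked to report on each copy individually rather than just the merged result. A sketch:

# Repeating -l increases verbosity; recent zdb versions then report per-label status.
zdb -lll /dev/nvme0n1p3
zdb -lll /dev/nvme1n1p3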

parted print free output:

Model: Realtek RTL9210 NVME (scsi)
Disk /dev/sde: 512GB
Sector size (logical/physical): 512B/16384B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name  Flags
 1      17.4kB  1049kB  1031kB                     bios_grub
 2      1049kB  538MB   537MB   fat32              boot, esp
 3      538MB   250GB   250GB
        250GB   512GB   262GB   Free Space

fsck of the EFI partition shows:

Cluster 125940 out of range (79166174 > 130812). Setting to EOF.
Cluster 125941 out of range (193575330 > 130812). Setting to EOF.
Cluster 125942 out of range (52952923 > 130812). Setting to EOF.
Cluster 125943 out of range (20067544 > 130812). Setting to EOF.
Cluster 125944 out of range (146243563 > 130812). Setting to EOF.
Cluster 125945 out of range (108471559 > 130812). Setting to EOF.
Cluster 125946 out of range (18968573 > 130812). Setting to EOF.
Cluster 125947 out of range (251158597 > 130812). Setting to EOF.
Cluster 125948 out of range (203795967 > 130812). Setting to EOF.
Cluster 125949 out of range (107445111 > 130812). Setting to EOF.
/EFI/proxmox
Expected a valid '.' entry in the first slot, found free entry.
Drop parent
[12?q]? 1
/EFI/proxmox/Sø\
_▒¢.j┬
/EFI/proxmox
Has a large number of bad entries. (64/64)
/EFI/proxmox/Sø\
_▒¢.j┬
Bad short file name (Sø\
_▒¢.j┬).
[1234?q]?

Is it possible to restore a pool somehow? Or at least restore some VM data?

ryao commented 2 years ago

@alpiua Contact me on the OpenZFS Slack server after 2pm EST tomorrow, on the 31st:

https://openzfs.slack.com/

I am willing to try to help (for free). Keep in mind that this would be the first time I have tried to recover a pool from this state, so I might need to try multiple approaches (like recovering it from the disk that was replaced, although that is my backup plan). I will also be reading the ZFS source code to try to understand what happened so that I can reverse it, so expect me to spend a fair amount of time on that too.

alpiua commented 2 years ago

@ryao thank you! I'll arrange some access to the server by that time, as it is now offline. Unfortunately, I already used the old disk.

ryao commented 2 years ago

The vdev label checksums were corrupt. I was able to get zdb to start reading the pool with this patch, which simply skips checksum verification on label reads (ZIO_CHECKSUM_OFF instead of ZIO_CHECKSUM_LABEL):

diff --git a/module/zfs/vdev_label.c b/module/zfs/vdev_label.c
index 6e47c8cb6..56641243f 100644
--- a/module/zfs/vdev_label.c
+++ b/module/zfs/vdev_label.c
@@ -194,7 +194,7 @@ vdev_label_read(zio_t *zio, vdev_t *vd, int l, abd_t *buf, uint64_t offset,

        zio_nowait(zio_read_phys(zio, vd,
            vdev_label_offset(vd->vdev_psize, l, offset),
-           size, buf, ZIO_CHECKSUM_LABEL, done, private,
+           size, buf, ZIO_CHECKSUM_OFF, done, private,
            ZIO_PRIORITY_SYNC_READ, flags, B_TRUE));
 }

Then I noticed the change made to zhack in #12686 to allow repairing the disk labels. The rear labels were gone, but the front labels were present. After running it on his vdevs, we were able to import the pool readonly (I suspect a rw import would have worked, but I do readonly imports when attempting to recover data from damaged pools). I advised him to copy his data off the pool.
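For reference, the repair step described above looks roughly like this (a sketch; the exact zhack syntax depends on the ZFS build, and the device paths are the ones from this report):

# Rewrite the damaged label checksums on each mirror member, then attempt a
# read-only import so nothing further is written to the damaged pool.
zhack label repair /dev/nvme0n1p3
zhack label repair /dev/nvme1n1p3
zpool import -o readonly=on -f rpool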

I was not able to find the root cause of the corruption during the time I was on the system, but at least I was able to restore access to his data.

ryao commented 2 years ago

@alpiua contacted me afterward to tell me that he was having trouble copying the data off, so I looked again. There is extensive corruption on his pool, with several hundred thousand checksum errors (and equal numbers on both mirror members). This could not have been caused by the zpool online command. Some of the errors were detected by zdb, so I tried disabling checksum verification (such that all checksums would return true) in the ZFS source code, but this led to other errors appearing in zdb.

His machine does not have ECC memory, so I suspect that memory corruption damaged his kernel's in-core state in a way that caused it to start writing out bad data (with bad checksums), and that this had already been in progress by the time he ran the zpool online command, which caused the corruption to finally take down the system. It is not a perfect explanation, but I could not find an alternative explanation for how there could have been so much corruption.

alpiua commented 2 years ago

Well, I finally got the whole picture of how exactly that happened. I tested the RAM and found no issues, so I decided to test the disks for errors. Only then did I realize that on the day I lost my pool I had tried to benchmark my drives with fio. Obviously, I missed the part of the article saying to replace the path to the drive with a path to a file. So the only reason my pool broke was poor reading of the documentation due to lack of sleep.
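The article is not linked, so the following is only a sketch of that kind of mistake, with hypothetical fio parameters and paths:

# DESTRUCTIVE: a write benchmark aimed at the raw device overwrites ZFS labels and data.
fio --name=bench --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
    --size=4G --ioengine=libaio --direct=1

# Safe variant: point fio at a scratch file on a mounted filesystem instead.
fio --name=bench --filename=/rpool/data/fio-testfile --rw=randwrite --bs=4k \
    --size=4G --ioengine=libaio --direct=1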

Many thanks @ryao for your immense support and help with restoring some of the data.

ryao commented 2 years ago

> Well, I finally got the whole picture of how exactly that happened. I tested the RAM and found no issues, so I decided to test the disks for errors. Only then did I realize that on the day I lost my pool I had tried to benchmark my drives with fio. Obviously, I missed the part of the article saying to replace the path to the drive with a path to a file. So the only reason my pool broke was poor reading of the documentation due to lack of sleep.
>
> Many thanks @ryao for your immense support and help with restoring some of the data.

@alpiua Thanks for the update. That makes much more sense than my theory.