openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

enclosure path not updated when it changes between reboots #11950

Closed ZNikke closed 3 years ago

ZNikke commented 3 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04.2 LTS
Linux Kernel 5.4.0-72-generic
Architecture x86_64
ZFS Version 0.8.3-1ubuntu12.8 and 2.1.0-rc4
SPL Version 0.8.3-1ubuntu12.8 and 2.1.0-rc4

Describe the problem you're observing

If the enclosure path changes between reboots, the old enclosure path is left in the configuration even though, as I understand it, it should be updated on import. Even after an explicit zpool export followed by a zpool import of the pool, the old/wrong enclosure path is still present. It should be noted that the config does update other properties, like path (there seems to be some regression with our vdev_id.conf, see #11951, so the disk numbers are different with 2.1.0-rc4).

# zdb -C blob3|tail -n 15
                children[11]:
                    type: 'disk'
                    id: 11
                    guid: 15843257147329148638
                    path: '/dev/disk/by-vdev/enc10d11-part1'
                    devid: 'scsi-35000cca23b221f98-part1'
                    phys_path: 'pci-0000:20:00.0-sas-exp0x5001438018375120-phy11-lun-0'
                    vdev_enc_sysfs_path: '/sys/class/enclosure/2:0:7:0/11'
                    whole_disk: 1
                    DTL: 512
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 141
# ls -la /dev/disk/by-vdev/enc10d11
lrwxrwxrwx 1 root root 10 Apr 27 10:59 /dev/disk/by-vdev/enc10d11 -> ../../sdai
# ls -la /sys/block/sdai/device/enclosure_device*
lrwxrwxrwx 1 root root 0 Apr 27 11:01 /sys/block/sdai/device/enclosure_device:11 -> ../../../../port-2:0:12/end_device-2:0:12/target2:0:16/2:0:16:0/enclosure/2:0:16:0/11/
# ls /sys/class/enclosure/
1:0:12:0@  2:0:16:0@

As can be seen above, enclosure 2:0:7:0 isn't present anymore, but 2:0:16:0 now exists.

Describe how to reproduce the problem

Our enclosures are primarily HP D2600 and home-built ones based on HP DL180G6.

The enclosure device number seems to change depending on how many disks are inserted when the enclosure is plugged in. For example, if you plug in an enclosure holding 3 disks, then fill it with disks and reboot the machine, you will find the enclosure under a different ID.

Include any warning/errors/backtraces from the system logs

behlendorf commented 3 years ago

That's right, the device paths and vdev_enc_sysfs_path should be updated as part of the import to reference the right locations (or remove the entries if the sysfs paths don't exist). Looking at the source code and your debugging it's not clear to me why this wouldn't be happening. If you don't mind doing a little debugging you could try adding a few printf's to the update_vdev_config_dev_strs() function to see what's happening. This function should be called for each disk vdev during import and is responsible for updating the vdev_enc_sysfs_path.
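For example, a temporary print after the existing upath/spath lookups in that function would show what gets resolved on import (a sketch; exact placement in lib/libzutil/os/linux/zutil_import_os.c is approximate):

        /* Sketch: after the existing upath/spath lookups in
         * update_vdev_config_dev_strs(), print what was resolved. */
        (void) fprintf(stderr, "uvcds: path=%s upath=%s spath=%s\n",
            path ? path : "N/A", upath ? upath : "N/A",
            spath ? spath : "N/A");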

@tonyhutter's also familiar with this area of the code and might have some ideas.

ZNikke commented 3 years ago

Uh. Something's even funkier with 2.1.0-rc4: a pool created with this version has no vdev_enc_sysfs_path at all:

# zdb -C zblob | tail -n 14
                children[11]:
                    type: 'disk'
                    id: 11
                    guid: 8192447068451071697
                    path: '/dev/disk/by-vdev/enc9d24-part1'
                    devid: 'scsi-35000c500ca14a2db-part1'
                    phys_path: 'pci-0000:1b:00.0-sas-exp0x50014380262ca7bf-phy11-lun-0'
                    whole_disk: 1
                    DTL: 643
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 141
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
# ls -l /dev/disk/by-vdev/enc9d24
lrwxrwxrwx 1 root root 9 Apr 27 15:43 /dev/disk/by-vdev/enc9d24 -> ../../sdo
# ls -l /sys/block/sdo/device/enclosure_device*
lrwxrwxrwx 1 root root 0 Apr 27 14:29 /sys/block/sdo/device/enclosure_device:11 -> ../../../../port-1:0:12/end_device-1:0:12/target1:0:12/1:0:12:0/enclosure/1:0:12:0/11/

I can have a stab at printf-debugging this when time allows. I'm not too familiar with tinkering in the OpenZFS code, but since I reckon this is userland code in the zpool binary, I might not even have to install the resulting debug binary?

behlendorf commented 3 years ago

That's right. In fact, you can build and run everything directly from the OpenZFS source tree; there are some directions for how to do that here. I just talked with @tonyhutter and he's going to take a look at this too. Hopefully he'll be able to reproduce it, but you never know, so any additional debugging would be helpful.

a pool created with this version has no vdev_enc_sysfs_path at all

That's a useful clue. The vdev_enc_sysfs_path entry will only be created when the code can resolve the vdev path into a valid /sys/class/enclosure/ entry. Something may have changed under the sysfs paths which is causing the problem.
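Conceptually, the lookup follows the disk's enclosure_device:* symlink under sysfs. A small standalone sketch (illustrative only, not the library implementation) that shows what a given disk resolves to:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <glob.h>

/* Print the enclosure sysfs directory a disk resolves to, roughly the
 * lookup zfs_get_enclosure_sysfs_path() performs (illustrative sketch,
 * not the library code). Usage: ./encpath sdo */
int
main(int argc, char **argv)
{
	char pattern[PATH_MAX], target[PATH_MAX];
	glob_t g;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <disk, e.g. sdo>\n", argv[0]);
		return (1);
	}

	/* A disk sitting in an enclosure has an enclosure_device:NN
	 * symlink under its sysfs device directory. */
	(void) snprintf(pattern, sizeof (pattern),
	    "/sys/block/%s/device/enclosure_device:*", argv[1]);

	if (glob(pattern, 0, NULL, &g) != 0 || g.gl_pathc == 0) {
		printf("%s: no enclosure_device link found\n", argv[1]);
		return (0);
	}

	/* Resolve the symlink; the canonical target ends in the
	 * enclosure's .../enclosure/<scsi-id>/<slot> directory. */
	if (realpath(g.gl_pathv[0], target) != NULL)
		printf("%s -> %s\n", argv[1], target);

	globfree(&g);
	return (0);
}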

ZNikke commented 3 years ago

What confounds me is that looking at update_vdev_config_dev_strs() (https://github.com/openzfs/zfs/blob/master/lib/libzutil/os/linux/zutil_import_os.c#L856-L860) it either sets the enclosure path if found/valid, or deletes it. This would suggest to me that update_vdev_config_dev_strs() isn't even called on import of my old pool? But how did the drive path get updated then?
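In other words, the linked lines implement a set-or-remove pattern roughly like this (paraphrased, so it may drift from the exact source):

        /* Resolve the enclosure path and either store it in the vdev's
         * nvlist or purge any stale entry. */
        upath = zfs_get_underlying_path(path);
        spath = zfs_get_enclosure_sysfs_path(upath);
        if (spath != NULL)
                nvlist_add_string(nv, ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH,
                    spath);
        else
                nvlist_remove_all(nv, ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH);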

ZNikke commented 3 years ago

@behlendorf So things look OK in update_vdev_config_dev_strs() when doing zpool import pool on the pool created with 2.1.0rc4:

nikkedebug: uvcds called
nikkedebug: uvcds: encode_device_strings
nikkedebug: uvcds: devphys: pci-0000:1b:00.0-sas-exp0x50014380262ca7bf-phy11-lun-0
nikkedebug: uvcds: upath: /dev/sdo
nikkedebug: uvcds: spath: /sys/class/enclosure/1:0:12:0/11
nikkedebug: uvcds return: end

So update_vdev_config_dev_strs() is indeed called.

But no trace of an enclosure variable in the config:

                    type: 'disk'
                    id: 11
                    guid: 8192447068451071697
                    path: '/dev/disk/by-id/scsi-35000c500ca14a2db-part1'
                    devid: 'scsi-35000c500ca14a2db-part1'
                    phys_path: 'pci-0000:1b:00.0-sas-exp0x50014380262ca7bf-phy11-lun-0'
                    whole_disk: 1
                    DTL: 643
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 141

Any hint on where I should look next?

ZNikke commented 3 years ago

I managed to figure out that zutil_import.c was where the interesting stuff happened, and when I discovered dump_nvlist() this got much easier.

After a lot of dumping nvlists I have concluded that https://github.com/openzfs/zfs/blob/master/lib/libzutil/zutil_import.c#L804-L811 is where the enclosure paths disappear.

Given this printf-debugging of zutil_import.c:

        printf("nikkedebug: get_configs(): dump config 1:\n---\n");
        dump_nvlist(config, 0);
        printf("---\n");

                if ((nvl = zutil_refresh_config(hdl, config)) == NULL) {
                        printf("nikkedebug: get_configs: zutil_refresh_config(hdl, config) failed\n");  
                        nvlist_free(config);
                        config = NULL;
                        continue;
                }

                nvlist_free(config);
                config = nvl;

        printf("nikkedebug: get_configs(): dump config 2:\n---\n");
        dump_nvlist(config, 0);
        printf("---\n");

The first dump has the enclosure paths, but in the second dump they are gone.

Then I ran into a wall trying to figure out what the refresh code is really trying to do, but this should narrow down where to investigate further. @behlendorf @tonyhutter

tonyhutter commented 3 years ago

@ZNikke thanks for getting it this far. The remaining callpath looks like this:

zutil_refresh_config()
    hdl->lpc_ops->pco_refresh_config() [refresh_config()]
            zfs_ioctl(hdl, ZFS_IOC_POOL_TRYIMPORT ...)
                zfs_ioc_pool_tryimport()
                    spa_tryimport()
                        spa_config_generate()
                            vdev_config_generate()
                                    if (vd->vdev_enc_sysfs_path != NULL)
                                        fnvlist_add_string(nv, ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH,
                                            vd->vdev_enc_sysfs_path);

You can troubleshoot the module code by inserting zfs_dbgmsg() or vdev_dbgmsg_*() prints. Those get logged to /proc/spl/kstat/zfs/dbgmsg, but you have to enable the log first:

$ echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable 
$
$ cat /proc/spl/kstat/zfs/dbgmsg
8 0 0x01 -1 0 67233570093 3901526317654529
timestamp    message 
1620342361   txg.c:592:txg_quiesce_thread(): txg=139735 quiesce_txg=139736 sync_txg=139720
1620342361   txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 139735
1620342361   txg.c:542:txg_sync_thread(): txg=139735 quiesce_txg=139736 sync_txg=139720
1620342361   txg.c:509:txg_sync_thread(): waiting; tx_synced=139735 waiting=139720 dp=ffff8a86892c5000

Unfortunately, I don't have a good reproducer on my end, so it may be up to you to bisect further to where it's failing.
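For example, a one-line print in vdev_config_generate() (module/zfs/vdev_label.c) would confirm whether the in-core vdev still carries an enclosure path when the config is regenerated (a sketch; placement approximate):

        /* Sketch: temporary print in vdev_config_generate(). */
        zfs_dbgmsg("vdev %llu: enc_sysfs_path=%s",
            (u_longlong_t)vd->vdev_guid,
            vd->vdev_enc_sysfs_path ? vd->vdev_enc_sysfs_path : "N/A");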

ZNikke commented 3 years ago

OK, I'll give it a try when time allows.

ZNikke commented 3 years ago

I feel that I'm not really grasping where the vdev information passed to spa_tryimport() is supposed to be carried over into the final vdev info that gets used. From what I see, it mostly behaves as if it ignores the passed-in config and instead digs up old info from what's stored on disk and uses that.

I've improved vdev_dbgmsg_print_tree() to also print the enclosure:

diff --git a/module/zfs/vdev.c b/module/zfs/vdev.c
index d00782d93..0f980fc8c 100644
--- a/module/zfs/vdev.c
+++ b/module/zfs/vdev.c
@@ -199,11 +199,13 @@ vdev_dbgmsg_print_tree(vdev_t *vd, int indent)
                    (uint_t)vd->vdev_state);
        }

-       zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, %s", indent,
+       zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, enc: %s, %s", indent,
            "", (int)vd->vdev_id, vd->vdev_ops->vdev_op_type,
            vd->vdev_islog ? " (log)" : "",
            (u_longlong_t)vd->vdev_guid,
-           vd->vdev_path ? vd->vdev_path : "N/A", state);
+           vd->vdev_path ? vd->vdev_path : "N/A",
+           vd->vdev_enc_sysfs_path ? vd->vdev_enc_sysfs_path : "N/A",
+           state);

        for (uint64_t i = 0; i < vd->vdev_children; i++)
                vdev_dbgmsg_print_tree(vd->vdev_child[i], indent + 2);

I also set the zfs module option spa_load_print_vdev_tree=1 (via /sys/module/zfs/parameters/spa_load_print_vdev_tree) so the tree is printed on successful operations as well.

The disk enc10d1 is part of a pool created with zfs-0.8, it has enclosure data on disk, although it's now incorrect (the initial reason for this bug report).

The disk enc9d1 is part of a pool created with zfs-2.1.0rc4, it has no enclosure data.

Running zpool import and then egrep 'note|enc9d1-part1|enc10d1-part1' /proc/spl/kstat/zfs/dbgmsg gives this partial output:

1620486689   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1620486689   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:16:0/0, healthy
1620486689   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1620486689   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, closed
1620486689   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1620486689   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, healthy
1620486690   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1620486690   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: /sys/class/enclosure/1:0:12:0/0, healthy
1620486690   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1620486690   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, closed
1620486690   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1620486690   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, healthy

Note that in both of these cases the provided vdev tree has the correct enclosure info, but both the MOS and the final vdev tree have the obsolete info that seems to originate from the on-disk data?

My best guess is that this has been broken for a while, but vdev paths have special handling in the kernel that makes things work anyway when vdevs move, thus hiding the fact that the supplied config from userspace is mostly ignored? Or am I completely misunderstanding what's going on?

To get any further with my side of the debugging, I need a hint on where to look (or a print-relevant-debug-info patch), as it currently feels like a vital piece of code has gone missing somewhere...

nscfreny commented 3 years ago

I have experienced the same problem on an HPE Apollo 4510 (smartpqi controllers), where the enclosure SCSI ID is displaced by adding/removing drives and rebooting. On other systems with LSI HBAs and IBM DCS3700 enclosures (NetApp E5400), the enclosure's SCSI ID is usually lower than the drives' IDs and thus not affected by adding or removing drives.

Implicit export/import solves the problem on the Apollo 4510.

Creating a new pool on the Apollo 4510 does not seem to set vdev_enc_sysfs_path on zfs-2.1.0-rc4; it does work on zfs-2.0.4 and zfs-2.1.99-159_g2babd20.

Edit: Creating a new pool with LSI HBAs and IBM DCS3700 enclosures did set vdev_enc_sysfs_path correctly with zfs-2.1.0-rc4.

(CentOS 7.9 with 3.10.0-1160.2.1.el7_lustre.x86_64/zfs-0.7.13, 3.10.0-1160.25.1.el7.x86_64 for zfs 0.8+)

ZNikke commented 3 years ago

@tonyhutter I think debugging this doesn't really require an enclosure, just a way to override the enclosure path detection in zpool so that an arbitrary enclosure path can be set.

Given this patch:

diff --git a/lib/libzutil/os/linux/zutil_import_os.c b/lib/libzutil/os/linux/zutil_import_os.c
index 61c42cf2e..2a2e8ed71 100644
--- a/lib/libzutil/os/linux/zutil_import_os.c
+++ b/lib/libzutil/os/linux/zutil_import_os.c
@@ -850,9 +850,17 @@ update_vdev_config_dev_strs(nvlist_t *nv)
                            vds.vds_devphys);
                }

-               /* Add enclosure sysfs path (if disk is in an enclosure). */
-               upath = zfs_get_underlying_path(path);
-               spath = zfs_get_enclosure_sysfs_path(upath);
+               /* Allow overriding enclosure path for debug or test */
+               spath = getenv("ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE");
+               if(spath) {
+                       upath = NULL;
+                       spath = strdup(spath);
+               }
+               else {
+                       /* Add enclosure sysfs path (if disk is in an enclosure). */
+                       upath = zfs_get_underlying_path(path);
+                       spath = zfs_get_enclosure_sysfs_path(upath);
+               }
                if (spath)
                        nvlist_add_string(nv, ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH,
                            spath);

Using the same debug printing as in https://github.com/openzfs/zfs/issues/11950#issuecomment-835411608, but instead running env ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE=/bogus/test/encpath ./zpool import, shows the override path being propagated to kernel space but ignored in the same way as the legitimate path:

1621085832   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1621085832   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /bogus/test/encpath, healthy
1621085832   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1621085832   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, closed
1621085832   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1621085832   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, healthy
1621085832   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1621085832   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: /bogus/test/encpath, healthy
1621085832   spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1621085832   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, closed
1621085832   spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1621085832   vdev.c:202:vdev_dbgmsg_print_tree():       vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, healthy

Something along the lines of this patch could be used both to debug the issue and to add a test that verifies enclosure path propagation/refresh works once this issue is resolved.

tonyhutter commented 3 years ago

@ZNikke thanks that's really helpful. I'll give your ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE code a test on my enclosures and will investigate further.

tonyhutter commented 3 years ago

@ZNikke I'm seeing what you're seeing with your ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE code, where it's not including the ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH passed in the nvlist from userspace. I'll keep digging...

tonyhutter commented 3 years ago

@ZNikke can you try https://github.com/openzfs/zfs/pull/12095 and let me know if it fixes the issue for you?

Looks like on import we load the config from the MOS (from disk) and then later fix it up by resetting the vdev path to what it really is. We need to do the same for the enc_sysfs_path, which is what the PR does.
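For the record, the shape of that fixup is roughly the following (an illustrative sketch, not the actual PR #12095 code): copy the freshly detected enclosure path from the provided config's vdev (svd) onto the vdev loaded from the MOS (dvd), next to the existing vdev_path fixup.

/* Sketch of the described fixup: carry the enclosure sysfs path from
 * the provided config's vdev over to the in-core vdev loaded from the
 * MOS. Function name and placement are illustrative. */
static void
vdev_update_enc_sysfs_path(vdev_t *svd, vdev_t *dvd)
{
	if (svd->vdev_enc_sysfs_path == NULL) {
		/* No enclosure detected; drop any stale entry. */
		if (dvd->vdev_enc_sysfs_path != NULL) {
			spa_strfree(dvd->vdev_enc_sysfs_path);
			dvd->vdev_enc_sysfs_path = NULL;
		}
	} else if (dvd->vdev_enc_sysfs_path == NULL ||
	    strcmp(dvd->vdev_enc_sysfs_path,
	    svd->vdev_enc_sysfs_path) != 0) {
		/* Replace the stale MOS copy with the new path. */
		if (dvd->vdev_enc_sysfs_path != NULL)
			spa_strfree(dvd->vdev_enc_sysfs_path);
		dvd->vdev_enc_sysfs_path =
		    spa_strdup(svd->vdev_enc_sysfs_path);
	}
}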