ZNikke closed this issue 3 years ago.
That's right, the device paths and `vdev_enc_sysfs_path` should be updated as part of the import to reference the right locations (or the entries removed if the sysfs paths don't exist). Looking at the source code and your debugging, it's not clear to me why this isn't happening. If you don't mind doing a little debugging, you could try adding a few printf's to the `update_vdev_config_dev_strs()` function to see what's happening. This function should be called for each disk vdev during import and is responsible for updating `vdev_enc_sysfs_path`.

@tonyhutter is also familiar with this area of the code and might have some ideas.
Uh. Something's even funkier with 2.1.0-rc4: a pool created with this version has no `vdev_enc_sysfs_path` at all:
```
# zdb -C zblob | tail -n 14
        children[11]:
            type: 'disk'
            id: 11
            guid: 8192447068451071697
            path: '/dev/disk/by-vdev/enc9d24-part1'
            devid: 'scsi-35000c500ca14a2db-part1'
            phys_path: 'pci-0000:1b:00.0-sas-exp0x50014380262ca7bf-phy11-lun-0'
            whole_disk: 1
            DTL: 643
            create_txg: 4
            com.delphix:vdev_zap_leaf: 141
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
# ls -l /dev/disk/by-vdev/enc9d24
lrwxrwxrwx 1 root root 9 Apr 27 15:43 /dev/disk/by-vdev/enc9d24 -> ../../sdo
# ls -l /sys/block/sdo/device/enclosure_device*
lrwxrwxrwx 1 root root 0 Apr 27 14:29 /sys/block/sdo/device/enclosure_device:11 -> ../../../../port-1:0:12/end_device-1:0:12/target1:0:12/1:0:12:0/enclosure/1:0:12:0/11/
```
I can have a stab at printf-debugging this when time allows. I'm not too familiar with tinkering in the OpenZFS code, but since I reckon this is userland, as in the `zpool` binary, I might not even have to install the resulting debug binary?
That's right. In fact, you can build and run everything directly from the openzfs source tree. There are some directions for how to do that here. I just talked with @tonyhutter and he's going to take a look at this too, hopefully he'll be able to reproduce it but you never know so any additional debugging would be helpful.
> a pool created with this version has no `vdev_enc_sysfs_path` at all
That's a useful clue. The `vdev_enc_sysfs_path` entry will only be created when the code can resolve the vdev path into a valid `/sys/class/enclosure/` entry. Something may have changed under the sysfs paths which is causing the problem.
What confounds me is that looking at `update_vdev_config_dev_strs()` (https://github.com/openzfs/zfs/blob/master/lib/libzutil/os/linux/zutil_import_os.c#L856-L860), it either sets the enclosure path if found/valid, or deletes it. This would suggest to me that `update_vdev_config_dev_strs()` isn't even called on import of my old pool? But how did the drive path get updated then?
@behlendorf So things look OK in `update_vdev_config_dev_strs()` when doing `zpool import pool` on the pool created with 2.1.0-rc4:
```
nikkedebug: uvcds called
nikkedebug: uvcds: encode_device_strings
nikkedebug: uvcds: devphys: pci-0000:1b:00.0-sas-exp0x50014380262ca7bf-phy11-lun-0
nikkedebug: uvcds: upath: /dev/sdo
nikkedebug: uvcds: spath: /sys/class/enclosure/1:0:12:0/11
nikkedebug: uvcds return: end
```
So `update_vdev_config_dev_strs()` is indeed called. But there's no trace of an enclosure variable in the config:
```
            type: 'disk'
            id: 11
            guid: 8192447068451071697
            path: '/dev/disk/by-id/scsi-35000c500ca14a2db-part1'
            devid: 'scsi-35000c500ca14a2db-part1'
            phys_path: 'pci-0000:1b:00.0-sas-exp0x50014380262ca7bf-phy11-lun-0'
            whole_disk: 1
            DTL: 643
            create_txg: 4
            com.delphix:vdev_zap_leaf: 141
```
Any hint on where I should look next?
I managed to figure out that `zutil_import.c` was where the interesting stuff happened, and when I discovered `dump_nvlist()` this got much easier.

After a lot of dumping nvlists, I have concluded that https://github.com/openzfs/zfs/blob/master/lib/libzutil/zutil_import.c#L804-L811 is where the enclosure paths disappear. Given this printf-debugging of `zutil_import.c`:
```c
printf("nikkedebug: get_configs(): dump config 1:\n---\n");
dump_nvlist(config, 0);
printf("---\n");
if ((nvl = zutil_refresh_config(hdl, config)) == NULL) {
	printf("nikkedebug: get_configs: zutil_refresh_config(hdl, config) failed\n");
	nvlist_free(config);
	config = NULL;
	continue;
}
nvlist_free(config);
config = nvl;
printf("nikkedebug: get_configs(): dump config 2:\n---\n");
dump_nvlist(config, 0);
printf("---\n");
```
The first dump has the enclosure paths, but in the second dump they are gone.

Then I ran into a wall trying to figure out what the refresh machinery is really trying to do, but this should have narrowed down where to investigate further. @behlendorf @tonyhutter
@ZNikke thanks for getting it this far. The remaining callpath looks like this:

```
zutil_refresh_config()
  hdl->lpc_ops->pco_refresh_config()  [refresh_config()]
    zfs_ioctl(hdl, ZFS_IOC_POOL_TRYIMPORT ...)
      zfs_ioc_pool_tryimport()
        spa_tryimport()
          spa_config_generate()
            vdev_config_generate()
              if (vd->vdev_enc_sysfs_path != NULL)
                  fnvlist_add_string(nv, ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH,
                      vd->vdev_enc_sysfs_path);
```
You can troubleshoot the module code by inserting `zfs_dbgmsg()` or `vdev_dbgmsg_*()` prints. Those get logged to `/proc/spl/kstat/zfs/dbgmsg`, but you have to enable the log first:
```
$ echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
$
$ cat /proc/spl/kstat/zfs/dbgmsg
8 0 0x01 -1 0 67233570093 3901526317654529
timestamp    message
1620342361   txg.c:592:txg_quiesce_thread(): txg=139735 quiesce_txg=139736 sync_txg=139720
1620342361   txg.c:600:txg_quiesce_thread(): quiesce done, handing off txg 139735
1620342361   txg.c:542:txg_sync_thread(): txg=139735 quiesce_txg=139736 sync_txg=139720
1620342361   txg.c:509:txg_sync_thread(): waiting; tx_synced=139735 waiting=139720 dp=ffff8a86892c5000
```
Unfortunately, I don't have a good reproducer on my end, so it may be up to you to further bisect it to where it's failing.
OK, I'll give it a try when time allows.
I feel that I'm not really grasping where the vdev information passed to `spa_tryimport()` is supposed to be carried over into the final vdev info used. From what I see, it behaves as if it mostly ignores the passed config and instead digs up old info from what's stored on disk and uses that.
I've improved `vdev_dbgmsg_print_tree()` to also print the enclosure:
```diff
diff --git a/module/zfs/vdev.c b/module/zfs/vdev.c
index d00782d93..0f980fc8c 100644
--- a/module/zfs/vdev.c
+++ b/module/zfs/vdev.c
@@ -199,11 +199,13 @@ vdev_dbgmsg_print_tree(vdev_t *vd, int indent)
 		    (uint_t)vd->vdev_state);
 	}
 
-	zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, %s", indent,
+	zfs_dbgmsg("%*svdev %u: %s%s, guid: %llu, path: %s, enc: %s, %s", indent,
 	    "", (int)vd->vdev_id, vd->vdev_ops->vdev_op_type,
 	    vd->vdev_islog ? " (log)" : "",
 	    (u_longlong_t)vd->vdev_guid,
-	    vd->vdev_path ? vd->vdev_path : "N/A", state);
+	    vd->vdev_path ? vd->vdev_path : "N/A",
+	    vd->vdev_enc_sysfs_path ? vd->vdev_enc_sysfs_path : "N/A",
+	    state);
 
 	for (uint64_t i = 0; i < vd->vdev_children; i++)
 		vdev_dbgmsg_print_tree(vd->vdev_child[i], indent + 2);
```
And setting the zfs module option `spa_load_print_vdev_tree=1` enables printing the tree on successful operations as well.
The disk `enc10d1` is part of a pool created with zfs-0.8; it has enclosure data on disk, although the data is now incorrect (the initial reason for this bug report). The disk `enc9d1` is part of a pool created with zfs-2.1.0-rc4; it has no enclosure data.
Running `zpool import` and then `egrep 'note|enc9d1-part1|enc10d1-part1' /proc/spl/kstat/zfs/dbgmsg` gives this partial output:
```
1620486689 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1620486689 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:16:0/0, healthy
1620486689 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1620486689 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, closed
1620486689 spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1620486689 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, healthy
1620486690 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1620486690 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: /sys/class/enclosure/1:0:12:0/0, healthy
1620486690 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1620486690 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, closed
1620486690 spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1620486690 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, healthy
```
Note that in both these cases the provided vdev tree has the correct enclosure info, but both the MOS and the final vdev trees have the obsolete info that seems to originate from the on-disk data?

My best guess is that this has been broken for a while, but vdev paths have special handling in the kernel, making things work anyway when vdevs move and thus hiding the fact that the supplied config from userspace is mostly ignored. Or am I completely misunderstanding what's going on?

To get any further with my side of the debugging, a hint on where to look (or a print-relevant-debug-info patch) is needed, as it currently feels like a vital piece of code has gone missing somewhere...
I have experienced the same problem on HPE Apollo 4510 (smartpqi controllers), where the enclosure scsi id is displaced by adding/removing drives and rebooting. On other systems with LSI HBAs and IBM DCS3700 enclosures (NetApp E5400), the enclosure scsi id is usually lower than the drives' and thus not affected by adding or removing drives.

Implicit export/import solves the problem on the Apollo 4510.

Creating a new pool on the Apollo 4510 does not seem to set `vdev_enc_sysfs_path` on zfs-2.1.0-rc4; it does work on zfs-2.0.4 and zfs-2.1.99-159_g2babd20.

Edit: Creating a new pool with LSI HBAs and IBM DCS3700 enclosures did set `vdev_enc_sysfs_path` correctly with zfs-2.1.0-rc4.

(CentOS 7.9 with 3.10.0-1160.2.1.el7_lustre.x86_64/zfs-0.7.13, 3.10.0-1160.25.1.el7.x86_64 for zfs 0.8+)
@tonyhutter I think that debugging this doesn't really need an enclosure, just the ability to override the enclosure path detection in `zpool` in order to set an enclosure path. Given this patch:
```diff
diff --git a/lib/libzutil/os/linux/zutil_import_os.c b/lib/libzutil/os/linux/zutil_import_os.c
index 61c42cf2e..2a2e8ed71 100644
--- a/lib/libzutil/os/linux/zutil_import_os.c
+++ b/lib/libzutil/os/linux/zutil_import_os.c
@@ -850,9 +850,17 @@ update_vdev_config_dev_strs(nvlist_t *nv)
 		    vds.vds_devphys);
 	}
 
-	/* Add enclosure sysfs path (if disk is in an enclosure). */
-	upath = zfs_get_underlying_path(path);
-	spath = zfs_get_enclosure_sysfs_path(upath);
+	/* Allow overriding enclosure path for debug or test */
+	spath = getenv("ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE");
+	if (spath) {
+		upath = NULL;
+		spath = strdup(spath);
+	}
+	else {
+		/* Add enclosure sysfs path (if disk is in an enclosure). */
+		upath = zfs_get_underlying_path(path);
+		spath = zfs_get_enclosure_sysfs_path(upath);
+	}
 	if (spath)
 		nvlist_add_string(nv, ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH,
 		    spath);
```
Using the same debug printing as in https://github.com/openzfs/zfs/issues/11950#issuecomment-835411608, but instead running:

```
env ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE=/bogus/test/encpath ./zpool import
```

shows the override path being propagated to kernel space but ignored the same way as the legitimate path:
```
1621085832 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1621085832 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /bogus/test/encpath, healthy
1621085832 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1621085832 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, closed
1621085832 spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1621085832 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 822270773398956947, path: /dev/disk/by-vdev/enc10d1-part1, enc: /sys/class/enclosure/2:0:7:0/0, healthy
1621085832 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): provided vdev tree:
1621085832 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: /bogus/test/encpath, healthy
1621085832 spa_misc.c:418:spa_load_note(): spa_load($import, config untrusted): MOS vdev tree:
1621085832 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, closed
1621085832 spa_misc.c:418:spa_load_note(): spa_load($import, config trusted): final vdev tree:
1621085832 vdev.c:202:vdev_dbgmsg_print_tree(): vdev 0: disk, guid: 8784233026325928283, path: /dev/disk/by-vdev/enc9d1-part1, enc: N/A, healthy
```
Something along the lines of this patch could be used both to debug the issue and to add a test that verifies that enclosure path propagation/refresh works after this issue is resolved.
@ZNikke thanks, that's really helpful. I'll give your `ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE` code a test on my enclosures and will investigate further.
@ZNikke I'm seeing what you're seeing with your `ZPOOL_CONFIG_VDEV_ENCPATH_OVERRIDE` code, where it's not including the `ZPOOL_CONFIG_VDEV_ENC_SYSFS_PATH` passed in the nvlist from userspace. I'll keep digging...
@ZNikke can you try https://github.com/openzfs/zfs/pull/12095 and let me know if it fixes the issue for you?
Looks like on import we load the config from the MOS (from disk), and then later fix it up by resetting the vdev path to what it really is. We need to do that for the `enc_sysfs_path` as well, which is what the PR does.
System information
Describe the problem you're observing
If the enclosure path changes between reboots, the old enclosure path is still left in the configuration even though it should be updated on import, AIUI. Even after an explicit `zpool export` followed by a `zpool import` of the pool, the old/wrong enclosure path is still present. It should be noted that the config does update other properties, like path (there seems to be some regression with our vdev_id.conf, see #11951, so the disk numbers are different with 2.1.0-rc4). As can be seen above, enclosure 2:0:7:0 isn't present anymore, but 2:0:16:0 now exists.
Describe how to reproduce the problem
Our enclosures are primarily HP D2600 and home-built ones based on HP DL180G6.
The enclosure device number seems to move depending on how many disks are inserted when plugged in. So for example if you have 3 disks in the enclosure, plug it in, and then fill it with disks, and then reboot the machine, you will find the enclosure on a different ID.
Include any warning/errors/backtraces from the system logs