openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

vdev_id regression: slot remapping doesn't work anymore #11951

Closed ZNikke closed 3 years ago

ZNikke commented 3 years ago

System information

Type                   Version/Name
Distribution Name      Ubuntu
Distribution Version   20.04.2 LTS
Linux Kernel           5.4.0-72-generic
Architecture           x86_64
ZFS Version            2.1.0-rc4
SPL Version            2.1.0-rc4

Describe the problem you're observing

It looks to me that slot remapping has stopped working sometime between zfs-0.8.3 (actually 0.8.3-1ubuntu12.8) and zfs-2.1.0-rc4.

Our disks end up at different IDs depending on the enclosure model, none of them matching the physical 1-12 numbering on the enclosure, so we want to map them all to match the 1-12 numbering.

Our previously working vdev_id.conf still seems to map the channels, but the disks end up with whatever slot the enclosures use. As the documentation and examples for vdev_id.conf suggest this should still work, I think it's a regression.

Describe how to reproduce the problem

Given this vdev_id.conf:

# Settings, specify them even if they are the same as the defaults as the
# vdev_id mapping script has some interesting corner cases if they're missing.
multipath       no
topology        sas_direct
phys_per_port   4
slot            bay

# PCI-Express Slot 1
channel 1b:00.0 0 enc9d
# PCI-Express Slot 2
channel 20:00.0 0 enc10d

## ZFS log
# storcli64 /c0/v1 show all|grep Id
alias c0v1 wwn-0x600507604097171827cf79423d452f1f

## ZFS cache
# PCI-Express Slot 1
channel 1b:00.0 1 c3bay
# PCI-Express Slot 2
channel 20:00.0 1 c4bay

# Slot remapping
#    Linux      Mapped
#    Slot       Slot    Channel

# Drives in internal bays
slot 4           8      c3bay
slot 5           9      c3bay
slot 6          10      c3bay
slot 7          11      c3bay
slot 4          12      c4bay
slot 5          13      c4bay
slot 6          14      c4bay
slot 7          15      c4bay

# Map slot numbering in HP D2600 enclosure to match the 1-12 legend on box
slot 13          1      enc9d
slot 14          2      enc9d
slot 15          3      enc9d
slot 16          4      enc9d
slot 17          5      enc9d
slot 18          6      enc9d
slot 19          7      enc9d
slot 20          8      enc9d
slot 21          9      enc9d
slot 22         10      enc9d
slot 23         11      enc9d
slot 24         12      enc9d

# Default mapping for our DL180G6 "plåtsax" enclosures, they are off-by-one
# compared to the 1-12 numbering as printed on the label.
slot  0          1
slot  1          2
slot  2          3
slot  3          4
slot  4          5
slot  5          6
slot  6          7
slot  7          8
slot  8          9
slot  9         10
slot 10         11
slot 11         12

We get this mapping:

# ls --hide="*-part*" /dev/disk/by-vdev/
c0v1    c3bay7  c4bay7    enc10d11  enc10d5  enc10d9  enc9d16  enc9d20  enc9d24
c3bay4  c4bay4  enc10d0   enc10d2   enc10d6  enc9d13  enc9d17  enc9d21
c3bay5  c4bay5  enc10d1   enc10d3   enc10d7  enc9d14  enc9d18  enc9d22
c3bay6  c4bay6  enc10d10  enc10d4   enc10d8  enc9d15  enc9d19  enc9d23

I.e. no slot remapping was done! The expected result of the slot mapping would be for c3bay/c4bay to have the disks in the 8..15 sequence, and for the enc9/enc10 enclosures to have the disks in the 1..12 sequence.

Things get extra interesting if I remove the slot bay assignment: then I only get c0v1 mapped in by-vdev/. The vdev_id.conf man page says that bay is the default, so it really shouldn't need to be specified. On this subject it's interesting to note that vdev_id.conf.sas_direct.example doesn't mention slot bay, but the example in the vdev_id.conf man page does. I personally think that the double use of the slot keyword is a mistake; it's very confusing...
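
To spell out the two uses of the keyword as I understand them from the man page and our config above (a sketch, not an authoritative grammar):

# Global setting (one line): which element of the SAS identifier supplies the
# slot number; per the man page the value is one of bay|phy|id|lun.
slot bay

# Mapping lines (any number): remap the slot number Linux reports to a custom
# one, optionally restricted to a single channel:
#     slot <linux slot> <mapped slot> [channel]
slot 13          1      enc9d
slot  0          1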

Include any warning/errors/backtraces from the system logs

gdevenyi commented 3 years ago

Substantial rewrite at https://github.com/openzfs/zfs/pull/11526

behlendorf commented 3 years ago

@arshad512 and @AeonJJohnson, it sounds like a regression was accidentally introduced by the recent vdev_id rework. Would you mind taking a look at this?

AeonJJohnson commented 3 years ago

Can you clarify the following?

  1. Model of HBA (specify which HBA is in host that works and which HBA in host where it's not working)
  2. Model of enclosure (same as above, which model works and which model does not)
  3. Last known version of ZFS where enclosure/drive enumeration and representation was correct

The rebuilt vdev_id script does depend on how enclosure firmware and HBA drivers populate sysfs with values. I have seen enclosures with physical labeling (decals, etc.) on slots that differs from what the firmware developers designate the slots as. An example is external labeling that starts at one (Slot 1) while the firmware designates that same slot as 0: a zero start in firmware and a one start in the external visual labeling.
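
As a quick sanity check, something like the loop below (a sketch, assuming the usual scsi_transport_sas sysfs layout) prints the bay number the firmware reports for each SAS end device:

# Sketch: print the bay number the SAS layer reports for every end device
for d in /sys/class/sas_device/end_device-*; do
    printf '%s: %s\n' "$d" "$(cat "$d/bay_identifier" 2>/dev/null)"
done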

If it was working and now is not, something has surely changed.

Can you post the output of the following command, changing the argument to reflect a drive that is not reporting correctly?

sh -x /usr/lib/udev/vdev_id -d sdX (where X equals the sd device instance not correctly reporting)

Be sure to include what you think that drive should be processed by vdev_id as.

gdevenyi commented 3 years ago

This package is ideal for answering these questions: https://pypi.org/project/sasutils/

ZNikke commented 3 years ago

Can you clarify the following?

1. Model of HBA  (specify which HBA is in host that works and which HBA in host where it's not working)

The mapping behaviour switches between working (zfs-0.8.3) and not working (zfs-2.1.0-rc4) on the same host, so I don't think it's hardware dependent at all. These particular HBAs are LSI 9212-4i4e, IIRC.

2. Model of enclosure (same as above, which model works and which model does not)

As above, the behavior changes between ZFS versions (and a reboot) on the same hardware; currently connected are an HP D2600 and an HP DL180G6 converted to act as a disk enclosure.

3. Last known version of ZFS where enclosure/drive enumeration and representation was correct

I have only checked the Ubuntu 20.04 build of zfs 0.8.3 (which works) and 2.1.0-rc4 (which doesn't).

The rebuilt vdev_id script does depend on how enclosure firmware and HBA drivers populate sysfs with values. I have seen enclosures with physical labeling (decals, etc.) on slots that differs from what the firmware developers designate the slots as. An example is external labeling that starts at one (Slot 1) while the firmware designates that same slot as 0: a zero start in firmware and a one start in the external visual labeling.

Yup, enclosures map IDs all over the place, and it gets worse when you have internal HBAs with individual connectors that are not in the same order as the ports... That's why we need the slot mapping, to be able to bring order to the madness.

If it was working and now is not, something has surely changed.

Can you post the output of the following command, changing the argument to reflect a drive that is not reporting correctly?

sh -x /usr/lib/udev/vdev_id -d sdX (where X equals the sd device instance not correctly reporting)

# ls -l /dev/disk/by-vdev/c3bay4
lrwxrwxrwx 1 root root 9 Apr 27 13:46 /dev/disk/by-vdev/c3bay4 -> ../../sds
# sh -x /usr/lib/udev/vdev_id -d sds
+ PATH=/bin:/sbin:/usr/bin:/usr/sbin
+ CONFIG=/etc/zfs/vdev_id.conf
+ PHYS_PER_PORT=
+ DEV=
+ TOPOLOGY=
+ BAY=
+ ENCL_ID=
+ UNIQ_ENCL_ID=
+ getopts c:d:eg:jmp:h OPTION
+ DEV=sds
+ getopts c:d:eg:jmp:h OPTION
+ [ ! -r /etc/zfs/vdev_id.conf ]
+ [ -z sds ]
+ [ -z  ]
+ awk ($1 == "topology") {print $2; exit} /etc/zfs/vdev_id.conf
+ TOPOLOGY=sas_direct
+ [ -z  ]
+ awk ($1 == "slot") {print $2; exit} /etc/zfs/vdev_id.conf
+ BAY=bay
+ TOPOLOGY=sas_direct
+ [  = yes ]
+ alias_handler
+ DM_PART=
+ echo 
+ grep -q -E p[0-9][0-9]*$
+ ID_VDEV=
+ [ -z  ]
+ BAY=bay
+ sas_handler
+ [ -z  ]
+ awk $1 == "phys_per_port" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ PHYS_PER_PORT=4
+ PHYS_PER_PORT=4
+ echo 4
+ grep -q -E ^[0-9]+$
+ [ -z  ]
+ awk $1 == "multipath" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ MULTIPATH_MODE=no
+ [ -z  ]
+ awk $1 == "multijbod" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ MULTIJBOD_MODE=
+ [ no = yes ]
+ echo sds
+ grep -q ^/devices/
+ udevadm info -q path -p /sys/block/sds
+ sys_path=/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4/end_device-1:4/target1:0:16/1:0:16:0/block/sds
+ echo /devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4/end_device-1:4/target1:0:16/1:0:16:0/block/sds
+ tr /  
+ set -- devices pci0000:00 0000:00:03.0 0000:1b:00.0 host1 port-1:4 end_device-1:4 target1:0:16 1:0:16:0 block sds
+ num_dirs=11
+ scsi_host_dir=/sys
+ i=1
+ [ 1 -le 11 ]
+ eval echo ${1}
+ echo devices
+ d=devices
+ scsi_host_dir=/sys/devices
+ echo devices
+ grep -q -E ^host[0-9]+$
+ i=2
+ [ 2 -le 11 ]
+ eval echo ${2}
+ echo pci0000:00
+ d=pci0000:00
+ scsi_host_dir=/sys/devices/pci0000:00
+ echo pci0000:00
+ grep -q -E ^host[0-9]+$
+ i=3
+ [ 3 -le 11 ]
+ eval echo ${3}
+ echo 0000:00:03.0
+ d=0000:00:03.0
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.0
+ echo 0000:00:03.0
+ grep -q -E ^host[0-9]+$
+ i=4
+ [ 4 -le 11 ]
+ eval echo ${4}
+ echo 0000:1b:00.0
+ d=0000:1b:00.0
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0
+ echo 0000:1b:00.0
+ grep -q -E ^host[0-9]+$
+ i=5
+ [ 5 -le 11 ]
+ eval echo ${5}
+ echo host1
+ d=host1
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1
+ echo host1
+ grep -q -E ^host[0-9]+$
+ break
+ echo host1
+ awk -F/ { gsub("host","",$NF); print $NF}
+ HOSTCHAN=1
+ [ 5 = 11 ]
+ eval echo ${4}
+ echo+  0000:1b:00.0
awk -F: {print $2":"$3}
+ PCI_ID=1b:00.0
+ port_dir=/sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1
+ j=6
+ i=6
+ [ 6 -le 6 ]
+ eval echo ${6}
+ echo port-1:4
+ port_dir=/sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4
+ i=7
+ [ 7 -le 6 ]
+ + ls -d /sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4/phy-1:4
head -1
+ awk -F: {print $NF}
+ PHY=4
+ [ -z 4 ]
+ PORT=1
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4
+ [ 7 -lt 11 ]
+ eval echo ${7}
+ echo end_device-1:4
+ d=end_device-1:4
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4/end_device-1:4
+ echo end_device-1:4
+ grep -q ^end_device
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4/end_device-1:4/sas_device/end_device-1:4
+ break
+ SLOT=
+ cat /sys/devices/pci0000:00/0000:00:03.0/0000:1b:00.0/host1/port-1:4/end_device-1:4/sas_device/end_device-1:4/bay_identifier
+ SLOT=4
+ [ -z 4 ]
+ [  = yes ]
+ map_channel 1b:00.0 1
+ MAPPED_CHAN=
+ PCI_ID=1b:00.0
+ PORT=1
+ awk -v pciID=1b:00.0 -v port=1 $1 == "channel" && $2 == pciID && $3 == port \
                        {print $4} /etc/zfs/vdev_id.conf
+ MAPPED_CHAN=c3bay
+ printf %s c3bay
+ CHAN=c3bay
+ map_slot 4 c3bay
+ LINUX_SLOT=4
+ CHANNEL=c3bay
+ awk $1 == "slot" && $2 == "${LINUX_SLOT}" && \
                        $4 ~ /^${CHANNEL}$|^$/ { print $3; exit} /etc/zfs/vdev_id.conf
+ MAPPED_SLOT=
+ [ -z  ]
+ MAPPED_SLOT=4
+ printf %d 4
+ SLOT=4
+ [ -z c3bay ]
+ echo c3bay4
+ ID_VDEV=c3bay4
+ [ -n c3bay4 ]
+ echo ID_VDEV=c3bay4
ID_VDEV=c3bay4
+ echo ID_VDEV_PATH=disk/by-vdev/c3bay4
ID_VDEV_PATH=disk/by-vdev/c3bay4

Be sure to include what you think that drive should be processed by vdev_id as.

The vdev_id.conf line:

slot 4           8      c3bay

should trigger and map this drive as c3bay8 - right now this is obviously not triggered as MAPPED_SLOT is empty above.

From a quick look, it could be that the quoting is too "hard", so the LINUX_SLOT etc. variables aren't expanded as they should be on the awk command line?
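
A minimal illustration of what I mean (a sketch, not the actual script code):

LINUX_SLOT=4
# Inside single quotes the shell never expands ${LINUX_SLOT}, so awk compares
# $2 against the literal string "${LINUX_SLOT}" and no config line ever matches.
awk '$1 == "slot" && $2 == "${LINUX_SLOT}" {print $3; exit}' /etc/zfs/vdev_id.conf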

arshad512 commented 3 years ago

@ZNikke , Thanks for the debug output.

should trigger and map this drive as c3bay8 - right now this is obviously not triggered as MAPPED_SLOT is empty above.

Right

From a quick look, it could be that the quoting is too "hard", so the LINUX_SLOT etc. variables aren't expanded as they should be on the awk command line?

I will update you on this. (However, this should not be the case)

arshad512 commented 3 years ago

@ZNikke

I will update you on this. (However, this should not be the case)

You are correct. When the code was moved from `` (back-tick command substitution) to the $() version, the expansion was not handled. I will also skim through the rest of the code to see if this change is required anywhere else. I will update the patch with the changes.
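
Roughly, the difference looks like this (a sketch, not the exact lines from vdev_id):

LINUX_SLOT=4

# Back-tick era: the awk program sat inside double quotes, so the shell
# expanded ${LINUX_SLOT} before awk ran and the comparison became $2 == "4".
awk "\$1 == \"slot\" && \$2 == \"${LINUX_SLOT}\" {print \$3; exit}" /etc/zfs/vdev_id.conf

# Passing the value in with awk -v sidesteps the shell quoting entirely.
awk -v linux_slot="$LINUX_SLOT" '$1 == "slot" && $2 == linux_slot {print $3; exit}' /etc/zfs/vdev_id.conf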

arshad512 commented 3 years ago

@ZNikke, could you please try the patch in https://github.com/openzfs/zfs/pull/11959 and let us know?

ZNikke commented 3 years ago

Progress: now it processes the per-channel mappings, but still not the default/fallback mapping, i.e. enc10 in our config:

# ls --hide="*-part*" /dev/disk/by-vdev/
c0v1@     c4bay12@  enc10d1@   enc10d4@  enc10d9@  enc9d2@  enc9d7@
c3bay10@  c4bay13@  enc10d10@  enc10d5@  enc9d1@   enc9d3@  enc9d8@
c3bay11@  c4bay14@  enc10d11@  enc10d6@  enc9d10@  enc9d4@  enc9d9@
c3bay8@   c4bay15@  enc10d2@   enc10d7@  enc9d11@  enc9d5@
c3bay9@   enc10d0@  enc10d3@   enc10d8@  enc9d12@  enc9d6@

Debug output on one of the enc10 disks:

# ls -la /dev/disk/by-vdev/enc10d0
lrwxrwxrwx 1 root root 9 Apr 28 10:07 /dev/disk/by-vdev/enc10d0 -> ../../sdx
# sh -x /usr/lib/udev/vdev_id -d sdx
+ PATH=/bin:/sbin:/usr/bin:/usr/sbin
+ CONFIG=/etc/zfs/vdev_id.conf
+ PHYS_PER_PORT=
+ DEV=
+ TOPOLOGY=
+ BAY=
+ ENCL_ID=
+ UNIQ_ENCL_ID=
+ getopts c:d:eg:jmp:h OPTION
+ DEV=sdx
+ getopts c:d:eg:jmp:h OPTION
+ [ ! -r /etc/zfs/vdev_id.conf ]
+ [ -z sdx ]
+ [ -z  ]
+ awk ($1 == "topology") {print $2; exit} /etc/zfs/vdev_id.conf
+ TOPOLOGY=sas_direct
+ [ -z  ]
+ awk ($1 == "slot") {print $2; exit} /etc/zfs/vdev_id.conf
+ BAY=bay
+ TOPOLOGY=sas_direct
+ [  = yes ]
+ alias_handler
+ DM_PART=
+ echo 
+ grep -q -E p[0-9][0-9]*$
+ ID_VDEV=
+ [ -z  ]
+ BAY=bay
+ sas_handler
+ [ -z  ]
+ awk $1 == "phys_per_port" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ PHYS_PER_PORT=4
+ PHYS_PER_PORT=4
+ echo 4
+ grep -q -E ^[0-9]+$
+ [ -z  ]
+ awk $1 == "multipath" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ MULTIPATH_MODE=no
+ [ -z  ]
+ awk $1 == "multijbod" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ MULTIJBOD_MODE=
+ [ no = yes ]
+ echo sdx
+ grep -q ^/devices/
+ udevadm info -q path -p /sys/block/sdx
+ sys_path=/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0/target2:0:4/2:0:4:0/block/sdx
+ echo /devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0/target2:0:4/2:0:4:0/block/sdx
+ tr /  
+ set -- devices pci0000:00 0000:00:03.2 0000:20:00.0 host2 port-2:4 expander-2:0 port-2:0:0 end_device-2:0:0 target2:0:4 2:0:4:0 block sdx
+ num_dirs=13
+ scsi_host_dir=/sys
+ i=1
+ [ 1 -le 13 ]
+ eval echo ${1}
+ echo devices
+ d=devices
+ scsi_host_dir=/sys/devices
+ echo devices
+ grep -q -E ^host[0-9]+$
+ i=2
+ [ 2 -le 13 ]
+ eval echo ${2}
+ echo pci0000:00
+ d=pci0000:00
+ scsi_host_dir=/sys/devices/pci0000:00
+ echo pci0000:00
+ grep -q -E ^host[0-9]+$
+ i=3
+ [ 3 -le 13 ]
+ eval echo ${3}
+ echo 0000:00:03.2
+ d=0000:00:03.2
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.2
+ echo 0000:00:03.2
+ grep -q -E ^host[0-9]+$
+ i=4
+ [ 4 -le 13 ]
+ eval echo ${4}
+ echo 0000:20:00.0
+ d=0000:20:00.0
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0
+ echo 0000:20:00.0
+ grep -q -E ^host[0-9]+$
+ i=5
+ [ 5 -le 13 ]
+ eval echo ${5}
+ echo host2
+ d=host2
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2
+ echo host2
+ grep -q -E ^host[0-9]+$
+ break
+ echo host2
+ awk -F/ { gsub("host","",$NF); print $NF}
+ HOSTCHAN=2
+ [ 5 = 13 ]
+ eval echo ${4}
+ echo 0000:20:00.0
+ awk -F: {print $2":"$3}
+ PCI_ID=20:00.0
+ port_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2
+ j=6
+ i=6
+ [ 6 -le 6 ]
+ eval echo ${6}
+ echo port-2:4
+ port_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4
+ i=7
+ [ 7 -le 6 ]
+ + ls -d /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:0 /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:1head /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:2 -1 /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:3

+ awk -F: {print $NF}
+ PHY=0
+ [ -z 0 ]
+ PORT=0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4
+ [ 7 -lt 13 ]
+ eval echo ${7}
+ echo expander-2:0
+ d=expander-2:0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0
+ echo expander-2:0
+ grep -q ^end_device
+ i=8
+ [ 8 -lt 13 ]
+ eval echo ${8}
+ echo port-2:0:0
+ d=port-2:0:0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0
+ echo port-2:0:0
+ grep -q ^end_device
+ i=9
+ [ 9 -lt 13 ]
+ eval echo ${9}
+ echo end_device-2:0:0
+ d=end_device-2:0:0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0
+ echo end_device-2:0:0
+ grep -q ^end_device
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0/sas_device/end_device-2:0:0
+ break
+ SLOT=
+ cat /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0/sas_device/end_device-2:0:0/bay_identifier
+ SLOT=0
+ [ -z 0 ]
+ [  = yes ]
+ map_channel 20:00.0 0
+ MAPPED_CHAN=
+ PCI_ID=20:00.0
+ PORT=0
+ awk -v pciID=20:00.0 -v port=0 $1 == "channel" && $2 == pciID && $3 == port \
                        {print $4} /etc/zfs/vdev_id.conf
+ MAPPED_CHAN=enc10d
+ printf %s enc10d
+ CHAN=enc10d
+ map_slot 0 enc10d
+ LINUX_SLOT=0
+ CHANNEL=enc10d
+ awk -v linux_slot=0 -v channel=enc10d $1 == "slot" && $2 == linux_slot && \
                        $4 ~ "^"channel"$" { print $3; exit} /etc/zfs/vdev_id.conf
+ MAPPED_SLOT=
+ [ -z  ]
+ MAPPED_SLOT=0
+ printf %d 0
+ SLOT=0
+ [ -z enc10d ]
+ echo enc10d0
+ ID_VDEV=enc10d0
+ [ -n enc10d0 ]
+ echo ID_VDEV=enc10d0
ID_VDEV=enc10d0
+ echo ID_VDEV_PATH=disk/by-vdev/enc10d0
ID_VDEV_PATH=disk/by-vdev/enc10d0

For this drive I would have expected this vdev_id.conf line to trigger:

slot  0          1

and map this drive to enc10d1.
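
Comparing the awk expressions in the two traces, my guess (sketch only; I haven't read the patch in detail) is the channel match: the old pattern also accepted an empty fourth field via /^${CHANNEL}$|^$/, while the patched one anchors on the channel name only, so the default lines with no channel never match:

LINUX_SLOT=0
CHANNEL=enc10d

# Patched pattern: $4 must equal the channel name, so the default line
# "slot 0 1" (empty fourth field) never matches and nothing is printed.
awk -v linux_slot="$LINUX_SLOT" -v channel="$CHANNEL" \
    '$1 == "slot" && $2 == linux_slot && $4 ~ "^"channel"$" {print $3; exit}' /etc/zfs/vdev_id.conf

# Re-adding the empty alternative lets the default line match and prints 1.
awk -v linux_slot="$LINUX_SLOT" -v channel="$CHANNEL" \
    '$1 == "slot" && $2 == linux_slot && $4 ~ "^"channel"$|^$" {print $3; exit}' /etc/zfs/vdev_id.conf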

arshad512 commented 3 years ago

@ZNikke, thanks for trying out the patch. For the fallback mapping, I will update the patch shortly.

arshad512 commented 3 years ago

@ZNikke, thanks for trying out the patch. For the fallback mapping, I will update the patch shortly.

@ZNikke, please try out the revised patch https://github.com/openzfs/zfs/pull/11959 and let us know the result. Thanks!

tonyhutter commented 3 years ago

I'll give this a test on our enclosures

ZNikke commented 3 years ago

With the revised patch the result is as expected:

# ls --hide="*-part*" /dev/disk/by-vdev/
c0v1     c3bay9   c4bay15   enc10d12  enc10d5  enc10d9  enc9d12  enc9d5  enc9d9
c3bay10  c4bay12  enc10d1   enc10d2   enc10d6  enc9d1   enc9d2   enc9d6
c3bay11  c4bay13  enc10d10  enc10d3   enc10d7  enc9d10  enc9d3   enc9d7
c3bay8   c4bay14  enc10d11  enc10d4   enc10d8  enc9d11  enc9d4   enc9d8

However, if I comment out the slot bay line in vdev_id.conf I only get c0v1 mapped in by-vdev/. This should mean that the file vdev_id.conf.sas_direct.example as shipped doesn't actually work. I think the best way forward would be to fix this bug too, as slot bay should be the default according to the vdev_id.conf man page.

With the slot bay line commented out (#slot bay) in vdev_id.conf, a debug run gives:

# sh -x /usr/lib/udev/vdev_id -d sdx
+ PATH=/bin:/sbin:/usr/bin:/usr/sbin
+ CONFIG=/etc/zfs/vdev_id.conf
+ PHYS_PER_PORT=
+ DEV=
+ TOPOLOGY=
+ BAY=
+ ENCL_ID=
+ UNIQ_ENCL_ID=
+ getopts c:d:eg:jmp:h OPTION
+ DEV=sdx
+ getopts c:d:eg:jmp:h OPTION
+ [ ! -r /etc/zfs/vdev_id.conf ]
+ [ -z sdx ]
+ [ -z  ]
+ awk ($1 == "topology") {print $2; exit} /etc/zfs/vdev_id.conf
+ TOPOLOGY=sas_direct
+ [ -z  ]
+ awk ($1 == "slot") {print $2; exit} /etc/zfs/vdev_id.conf
+ BAY=4
+ TOPOLOGY=sas_direct
+ [  = yes ]
+ alias_handler
+ DM_PART=
+ echo 
+ grep -q -E p[0-9][0-9]*$
+ ID_VDEV=
+ [ -z  ]
+ BAY=4
+ sas_handler
+ [ -z  ]
+ awk $1 == "phys_per_port" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ PHYS_PER_PORT=4
+ PHYS_PER_PORT=4
+ echo 4
+ grep -q -E ^[0-9]+$
+ [ -z  ]
+ awk $1 == "multipath" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ MULTIPATH_MODE=no
+ [ -z  ]
+ awk $1 == "multijbod" \
                        {print $2; exit} /etc/zfs/vdev_id.conf
+ MULTIJBOD_MODE=
+ [ no = yes ]
+ echo sdx
+ grep -q ^/devices/
+ udevadm info -q path -p /sys/block/sdx
+ sys_path=/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0/target2:0:4/2:0:4:0/block/sdx
+ echo /devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0/target2:0:4/2:0:4:0/block/sdx
+ tr /  
+ set -- devices pci0000:00 0000:00:03.2 0000:20:00.0 host2 port-2:4 expander-2:0 port-2:0:0 end_device-2:0:0 target2:0:4 2:0:4:0 block sdx
+ num_dirs=13
+ scsi_host_dir=/sys
+ i=1
+ [ 1 -le 13 ]
+ eval echo ${1}
+ echo devices
+ d=devices
+ scsi_host_dir=/sys/devices
+ echo devices
+ grep -q -E ^host[0-9]+$
+ i=2
+ [ 2 -le 13 ]
+ eval echo ${2}
+ echo pci0000:00
+ d=pci0000:00
+ scsi_host_dir=/sys/devices/pci0000:00
+ echo pci0000:00
+ grep -q -E ^host[0-9]+$
+ i=3
+ [ 3 -le 13 ]
+ eval echo ${3}
+ echo 0000:00:03.2
+ d=0000:00:03.2
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.2
+ echo 0000:00:03.2
+ grep -q -E ^host[0-9]+$
+ i=4
+ [ 4 -le 13 ]
+ eval echo ${4}
+ echo 0000:20:00.0
+ d=0000:20:00.0
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0
+ echo 0000:20:00.0
+ grep -q -E ^host[0-9]+$
+ i=5
+ [ 5 -le 13 ]
+ eval echo ${5}
+ echo host2
+ d=host2
+ scsi_host_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2
+ echo host2
+ grep -q -E ^host[0-9]+$
+ break
+ echo host2
+ awk -F/ { gsub("host","",$NF); print $NF}
+ HOSTCHAN=2
+ [ 5 = 13 ]
+ eval echo ${4}
+ echo 0000:20:00.0
+ awk -F: {print $2":"$3}
+ PCI_ID=20:00.0
+ port_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2
+ j=6
+ i=6
+ [ 6 -le 6 ]
+ eval echo ${6}
+ echo port-2:4
+ port_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4
+ i=7
+ [ 7 -le 6 ]
+ ls -d /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:0 /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:1+  /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:2 /sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/phy-2:3
head -1
+ awk -F: {print $NF}
+ PHY=0
+ [ -z 0 ]
+ PORT=0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4
+ [ 7 -lt 13 ]
+ eval echo ${7}
+ echo expander-2:0
+ d=expander-2:0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0
+ echo expander-2:0
+ grep -q ^end_device
+ i=8
+ [ 8 -lt 13 ]
+ eval echo ${8}
+ echo port-2:0:0
+ d=port-2:0:0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0
+ echo port-2:0:0
+ grep -q ^end_device
+ i=9
+ [ 9 -lt 13 ]
+ eval echo ${9}
+ echo end_device-2:0:0
+ d=end_device-2:0:0
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0
+ echo end_device-2:0:0
+ grep -q ^end_device
+ end_device_dir=/sys/devices/pci0000:00/0000:00:03.2/0000:20:00.0/host2/port-2:4/expander-2:0/port-2:0:0/end_device-2:0:0/sas_device/end_device-2:0:0
+ break
+ SLOT=
+ [ -z  ]
+ return
+ ID_VDEV=
+ [ -n  ]

...

arshad512 commented 3 years ago

@ZNikke ,

With the revised patch the result is as expected:

Thank you for confirming.

However, if I comment out the slot bay line in vdev_id.conf I only get c0v1 mapped in by-vdev/

The reason for this is that when you comment out the "slot bay" line in vdev_id.conf, instead of picking up the default (which is 'bay') it picks up the first matching line in the 'mappings' section. From the example debug run you posted...

awk ($1 == "slot") {print $2; exit} /etc/zfs/vdev_id.conf                     
BAY=4   

The '4' comes from the first line matching 'slot' once 'slot bay' is removed.

# Drives in internal bays
slot 4           8      c3bay

This is clearly incorrect, which leads to the wrong values we are getting.

The question is: did it ever run correctly? The code changes under https://github.com/jsai20/zfs/commit/1f9db7f5936b6c4c4a1f3f17bfff3e86789e2033 did not change this part, so it should be failing under the old version as well. Could you please confirm whether this worked in an older version or not? If not, we should of course still fix it, but in a new bug, keeping this one for the original issue you filed, which was a regression.

Second, if we look at the man page, "bay" is optional and taken as the default. However, the key (slot) should be present. IMHO, removing the full "slot bay" line is incorrect usage. I am commenting purely from the code I read. I will let other folks comment on whether this is indeed a bug that needs to be fixed or not. (@AeonJJohnson, @gdevenyi, @tonyhutter, @behlendorf)

slot <bay|phy|id|lun>
    Specifies from which element of a SAS identifier the slot number is taken. The default is bay. 

ZNikke commented 3 years ago

I'm pretty sure this has been broken for quite some time; I have vivid memories of fighting with vdev_id.conf for over a week without understanding how it could ever work, until I restarted from scratch, cut-and-pasting an example with the defaults explicitly listed.

In any case, either the bug needs to be fixed, or the shipped vdev_id.conf.sas_direct.example and the vdev_id.conf man page need to be updated to conform to the script behavior.

I'm guessing fixing the script is less work (should be as easy as only accepting a line which is valid for the slot <bay|phy|id|lun> spec for the assignment).
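
Something along these lines, perhaps (a sketch only):

# Sketch: treat a line as the global setting only when the second field is one
# of the values the man page lists, so mapping lines can never be picked up.
awk '$1 == "slot" && $2 ~ /^(bay|phy|id|lun)$/ {print $2; exit}' /etc/zfs/vdev_id.conf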

Whether it should be done within this issue or in a separate one is another question, I guess.

arshad512 commented 3 years ago

I'm pretty sure this has been broken for quite some time

Understood. Then I would say this is expected behavior (as per the code). It is not part of the regression.

In any case, either the bug needs to be fixed, or the shipped vdev_id.conf.sas_direct.example and the vdev_id.conf man page need to be updated to conform to the script behavior.

As per the code, it expects 'slot' with 'bay' as the default. So the man page should be updated if that is confusing for users.

I'm guessing fixing the script is less work (should be as easy as only accepting a line which is valid for the slot <bay|phy|id|lun> spec for the assignment).

The code assumes 'bay' as default.

if [ -z "$ID_VDEV" ] ; then
        BAY=${BAY:-bay}

If we change the behavior, we might end up breaking the few cases where only 'slot' is mentioned by the user. It would be better to update the man page, IMO.

AeonJJohnson commented 3 years ago

I think we may be getting into creeping-elegance territory here. The root cause of the original complaint was identified and fixed. We should avoid fine-tuning for a single use case.

One of the challenges with this entire subsystem is that the SES enclosure standard is a loose one, and how it is applied by the various platform, hardware, and enclosure manufacturers varies. The slot bay function isn't broken; it just might not work correctly on some hardware. If you take a look at the sysfs enclosure tree on your system and see what enclosure pointers exist under the end_device subdir, you will get an idea of whether bay, phy, id, ses, or mix will work best for your specific hardware. This works fine for the megascale "A-list" hardware like HGST, Quanta, Newisys, and NetApp. Others, depending on their firmware robustness, can be less reliable. Adaptec HBAs, for example, make strange representations in sysfs, while LSI-based HBAs run flawlessly (as long as you don't have their RAID 0/1/1E "IR" firmware loaded).

If you descend into the sysfs tree under /sys/class/sas_host you'll find the enclosure device subdir for each disk (after several subdir levels) that contains the SES pointers the script pulls data from. In the case below (HGST UltraData60 JBOD) you'll see the slot special file that contains the numeric id for the enclosure slot number:

[root@localhost 10:0:25:0]# ls enclosure_device\:SLOT\ 07\,8HJ2WHZH\ \ \ \ \ \ \ \ \ \ \ \ /
active  device  fault  locate  power  power_status  slot  status  type  uevent
[root@localhost 10:0:25:0]# cat enclosure_device\:SLOT\ 07\,8HJ2WHZH\ \ \ \ \ \ \ \ \ \ \ \ /slot
7

Based upon a specific combination of HBA and enclosure, one of the other settings may work better. With this newer version of vdev_id I added the "mix" use case because of a hardware situation I discovered where a "slot compliant" JBOD may be connected to a host that also has directly connected disks inside it: direct cable to slot, with no SAS expander or SES function in the middle. This gave vdev_id the ability to enumerate both disks in a SES JBOD and directly connected ones.
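
If I remember the config syntax right, that mode is selected with the same global keyword (a sketch; check the comments in vdev_id itself for the exact spelling):

slot mix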

Polishing some of what I said above and adding it to the man page may be a better path than tweaking vdev_id for site-specific use cases.

ZNikke commented 3 years ago

I agree that the original issue is fixed.

However, for the "slot bay" thing, my main point is that the shipped example file won't work as is, and the man page is confusing, but I'll readily agree that it's likely best handled in a separate issue :)

Although I have to add that the mapping code isn't really about dealing with the hardware: it runs after that stage, taking whatever the hardware handling produced and mapping the IDs into the scheme the user wants (based on human preferences such as naming conventions and matching IDs to what's printed on the boxes, etc.).

And btw, the slot mix mode is neither mentioned in the vdev_id.conf man page nor in an example file. It is indeed documented in vdev_id itself, but it's not obvious to look there since there is a man page, so getting the man page up to speed sounds like a good idea indeed!

The double-use of the slot keyword is confusing, but that's also another debate...

behlendorf commented 3 years ago

Thanks everybody for the quick fix and all the testing. I'll go ahead and get it merged and the documentation updates can be handled in a new PR.