intel / ledmon

Enclosure LED Utilities
GNU General Public License v2.0
72 stars 44 forks source link

ledctl support for Dell SSD is reporting unsupported #150

Closed tasleson closed 6 months ago

tasleson commented 10 months ago
# ledctl --version
Intel(R) Enclosure LED Control Application 0.97 
Copyright (C) 2009-2022 Intel Corporation.

This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE
# ledctl -L
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:04.0/0000:6a:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:04.0/0000:ef:00.0/0000:f0:14.0/0000:f2:00.0 (Dell SSD)
/sys/devices/pci0000:00/0000:00:17.0 (AHCI)
/sys/devices/pci0000:00/0000:00:11.5 (AHCI)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:04.0/0000:ef:00.0/0000:f0:10.0/0000:f1:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:18.0/0000:6f:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:1c.0/0000:ee:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:14.0/0000:6e:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:1c.0/0000:70:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:18.0/0000:ed:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:04.0/0000:71:00.0/0000:72:1c.0/0000:76:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:10.0/0000:6d:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:14.0/0000:ec:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:08.0/0000:e9:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:04.0/0000:71:00.0/0000:72:18.0/0000:75:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:0c.0/0000:6c:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:04.0/0000:ef:00.0/0000:f0:1c.0/0000:f4:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:00.0/0000:69:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:10.0/0000:eb:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:04.0/0000:e8:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:04.0/0000:71:00.0/0000:72:14.0/0000:74:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:08.0/0000:6b:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:04.0/0000:ef:00.0/0000:f0:18.0/0000:f3:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:0c.0/0000:ea:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0/0000:e6:00.0/0000:e7:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:04.0/0000:71:00.0/0000:72:10.0/0000:73:00.0 (Dell SSD)
# ledctl locate=/dev/nvme0n1
/dev/shm/ledmon.conf: does not exist, using global config file
/etc/ledmon.conf: does not exist, using built-in defaults
ledctl: /dev/nvme0n1: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.

I built latest code which also exhibits the same error. Adding some debug, the issue appears because on this particular dell when we translate from /dev/nvme0n1 to /sys/dev/block/259:7 to /sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1 and compare that with our list of controllers.

We fail to match: /sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0/0000:68:04.0/0000:6a:00.0/nvme/nvme0/nvme0c0n1 to /sys/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1

Not sure what the correction for this is at the moment.

tasleson commented 10 months ago

This is caused by NVMe multipath support. We'll need to add some logic to handle this. /dev/nvme0n1 is a virtual path which can have 1 or more physical controllers. The other block devices for each of the redundant paths are "hidden".

More details here

mtkaczyk commented 10 months ago

This is caused by NVMe multipath support. We'll need to add some logic to handle this. /dev/nvme0n1 is a virtual path which can have 1 or more physical controllers. The other block devices for each of the redundant paths are "hidden".

In this case I would check if we can report controller earlier, to avoid multipath specyfic paths but I don't know if it is possible and I don't have hardware to check:

/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:04.0/0000:ef:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:04.0/0000:71:00.0 (Dell SSD)

Theoretically, if the controller will be reported for real NVME device, then problem should disappear (If I assumed correctly that these paths are NVMEs). We will need to do this anyway to hide duplicates.

I would prefer to not hide controllers but I understand that in this case it could not be possible.

tasleson commented 10 months ago

This is caused by NVMe multipath support. We'll need to add some logic to handle this. /dev/nvme0n1 is a virtual path which can have 1 or more physical controllers. The other block devices for each of the redundant paths are "hidden".

In this case I would check if we can report controller earlier, to avoid multipath specyfic paths but I don't know if it is possible and I don't have hardware to check:

The issue is that the normal /dev/nvme0n1 is the "virtual" one. There is nothing in /dev/ that refers to the physical one. It can be seen in sysfs /sys/block/nvme0c0n1 but to figure that out in code is a little weird IMHO.

/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:04.0/0000:ef:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:00.0/0000:67:00.0 (Dell SSD)
/sys/devices/pci0000:e2/0000:e2:02.0/0000:e3:00.0/0000:e4:00.0/0000:e5:00.0 (Dell SSD)
/sys/devices/pci0000:64/0000:64:02.0/0000:65:00.0/0000:66:04.0/0000:71:00.0 (Dell SSD)

Theoretically, if the controller will be reported for real NVME device, then problem should disappear (If I assumed correctly that these paths are NVMEs). We will need to do this anyway to hide duplicates.

All these paths are real NVMe. There are no duplicates as this system only has 1 path to each of the devices. I'm not sure if you can have multiple paths for directly attached PCI based NVMe devices. I would think this only applies if you're using NVMe-of. In that case I don't believe we can control the LED's anyways.

I would prefer to not hide controllers but I understand that in this case it could not be possible.

I don't think we need to hide anything yet.

mtkaczyk commented 10 months ago

This is caused by NVMe multipath support. We'll need to add some logic to handle this. /dev/nvme0n1 is a virtual path which can have 1 or more physical controllers. The other block devices for each of the redundant paths are "hidden".

In this case I would check if we can report controller earlier, to avoid multipath specyfic paths but I don't know if it is possible and I don't have hardware to check:

The issue is that the normal /dev/nvme0n1 is the "virtual" one. There is nothing in /dev/ that refers to the physical one. It can be seen in sysfs /sys/block/nvme0c0n1 but to figure that out in code is a little weird IMHO.

You right, my bad, I was surprised by a huge list of controllers so first I wanted to make it smaller but it is not possible. Those links are pci devices (physical ones) and there are no nvme-subsystem entries. At first glance, I thought that there are duplicates on controller list (multiple links to the same nvme device ).

Let me review the change you proposed, I need to setup platform with MP drives to see how it will affect ledmon so I need few days.

tasleson commented 10 months ago

Should be addressed with: https://github.com/intel/ledmon/pull/167