intel / ledmon

Enclosure LED Utilities
GNU General Public License v2.0
73 stars 47 forks source link

Lookup minimal mp support #167

Closed mtkaczyk closed 10 months ago

mtkaczyk commented 1 year ago

@tasleson here my proposal.

It works on my setup for nvme and not nvme drives:

[root@gklab-108-180 ledmon]# ./src/ledctl/ledctl --list-slots -c vmd
/dev/shm/ledmon.conf: does not exist, using global config file
/etc/ledmon.conf: does not exist, using built-in defaults
...
slot: 4-1             led state: LOCATE          device: /dev/nvme6c6n1
slot: 2-2             led state: LOCATE          device: /dev/nvme8n1
...
[root@gklab-108-180 ledmon]# ./src/ledctl/ledctl -x failure=/dev/nvme8n1

[root@gklab-108-180 ledmon]# ./src/ledctl/ledctl -x failure=/dev/nvme6n1

[root@gklab-108-180 ledmon]# ./src/ledctl/ledctl --list-slots -c vmd

...
slot: 4-1             led state: FAILURE         device: /dev/nvme6c6n1
slot: 2-2             led state: FAILURE         device: /dev/nvme8n1
...

I works with AHCI too, we don't have slots for SATA, so I checked it manually:

[root@gklab-108-180 ledmon]# cat /sys/devices/pci0000\:00/0000\:00\:17.0/ata7/host8/scsi_host/host8/em_message
400000
[root@gklab-108-180 ledmon]# ./src/ledctl/ledctl locate=/dev/sdd
[root@gklab-108-180 ledmon]# cat /sys/devices/pci0000\:00/0000\:00\:17.0/ata7/host8/scsi_host/host8/em_message
80000
tasleson commented 1 year ago

This change will not pass the updated unit test as:

slot: 4-1             led state: LOCATE          device: /dev/nvme6c6n1

device refers to a device that doesn't exist.

mtkaczyk commented 1 year ago

@tasleson

Yes:

[root@gklab-108-180 tests]# ./lib_unit_test
Running suite(s): ledmon
60%: Checks: 5, Failures: 2, Errors: 0
lib_unit_test.c:108:F:lib_unit_test:test_list_slots:0: stat failed for No such file or directory, errno: (null)
lib_unit_test.c:266:F:lib_unit_test:test_led_by_path:0: led_device_name_lookup 8
[root@gklab-108-180 tests]#

I wanted to left it but after rethinking I see that I can fix it easily, so will do.

mtkaczyk commented 1 year ago

@tasleson done:

[root@gklab-108-180 ledmon]# ./src/ledctl/ledctl --list-slots -c vmd
/dev/shm/ledmon.conf: does not exist, using global config file
/etc/ledmon.conf: does not exist, using built-in defaults
slot: 4-2             led state: LOCATE          device: (empty)
slot: 2-1             led state: NORMAL          device: /dev/nvme4n1
slot: 3               led state: LOCATE          device: (empty)
slot: 3-2             led state: NORMAL          device: /dev/nvme9n1
slot: 1               led state: LOCATE          device: (empty)
slot: 1-1             led state: NORMAL          device: /dev/nvme3n1
slot: 4-1             led state: LOCATE          device: (empty)
slot: 2-2             led state: LOCATE          device: /dev/nvme8n1
slot: 4               led state: LOCATE          device: /dev/nvme2n1
slot: 2               led state: LOCATE          device: /dev/nvme0n1
slot: 3-1             led state: LOCATE          device: /dev/nvme5n1
slot: 1-2             led state: LOCATE          device: (empty)
[root@gklab-108-180 ledmon]# ./test
test-driver  tests/
[root@gklab-108-180 ledmon]# ./tests/
.deps/         .libs/         lib_unit_test  __pycache__/   runtests.sh
[root@gklab-108-180 ledmon]# ./tests/lib_unit_test
Running suite(s): ledmon
100%: Checks: 5, Failures: 0, Errors: 0
[root@gklab-108-180 ledmon]#
tasleson commented 1 year ago

@mtkaczyk

This passes the unit test, but I see the code removes the device block if it doesn't represent an actual device node. It would be better if we could convert back to the virtual device node that the user interacts with, eg: slot: 4-1 -> /dev/nvme6n1

tasleson commented 1 year ago

@mtkaczyk I think we could do something like:

diff --git a/src/lib/libled.c b/src/lib/libled.c
index 0c5083f..d0e5891 100644
--- a/src/lib/libled.c
+++ b/src/lib/libled.c
@@ -159,11 +159,32 @@ void led_flush(struct led_ctx *ctx)
    }
 }

+static void set_slot_device_name(const char *sysfs_path, struct led_slot_list_entry *se)
+{
+   char temp[PATH_MAX];
+   struct stat st;
+
+   snprintf(temp, PATH_MAX, "/dev/%s", basename(sysfs_path));
+   if (stat(temp, &st) < 0) {
+       int nvme_num, ns;
+
+       /* device node not present, check for physical path for virt nvme
+        * eg. nvme12c12n1 -> nvme12n1
+        */
+       if (sscanf(temp, "/dev/nvme%dc%*dn%d", &nvme_num, &ns) != 2)
+           return;
+
+       snprintf(temp, PATH_MAX, "/dev/nvme%dn%d", nvme_num, ns);
+       if (stat(temp, &st) < 0)
+           return;
+   }
+
+   str_cpy(se->device_name, temp, PATH_MAX);
+}
+
 static struct led_slot_list_entry *init_slot(struct slot_property *slot)
 {
    struct led_slot_list_entry *s = NULL;
-   struct stat st;
-   char temp[PATH_MAX];

    if (!slot)
        return NULL;
@@ -180,10 +201,7 @@ static struct led_slot_list_entry *init_slot(struct slot_property *slot)
     * Assumption that "/sys/block/" device has a devnode is no longer true with nvme multipath.
     */
    if (slot->bl_device) {
-       snprintf(temp, PATH_MAX, "/dev/%s", basename(slot->bl_device->sysfs_path));
-       if (stat(temp, &st) < 0)
-           return s;
-       str_cpy(s->device_name, temp, PATH_MAX);
+       set_slot_device_name(slot->bl_device->sysfs_path, s);
    }

    return s;

See: https://github.com/intel/ledmon/pull/171

This is untested as I don't have a way to test it!

mtkaczyk commented 1 year ago

HI @tasleson, Sorry for the delay. I need to focus on other high priority task for a while. I will back to this as soon as possible.

mtkaczyk commented 1 year ago

Hi @tasleson, Please take a look and let me know what do you think about implementing it this way. The change is not complete yet, I need to add some tests, and lock possibility to use locate=/whatever_here/nvme1n1 but this is they way I would like to implement it. All cases handled and works correctly for me, lib tests passed too. This is a draft, I'm looking for approval :)

tasleson commented 1 year ago

Hi @tasleson, Please take a look and let me know what do you think about implementing it this way. The change is not complete yet, I need to add some tests, and lock possibility to use locate=/whatever_here/nvme1n1 but this is they way I would like to implement it. All cases handled and works correctly for me, lib tests passed too. This is a draft, I'm looking for approval :)

My initial simplistic review is that this change seems OK. However, after I fetched your branch and re-based it I tried running it on a couple of different systems and this latest change breaks our Dell system which doesn't support slots API, we simply get:

# ./ledctl locate=/dev/nvme20n1
/dev/shm/ledmon.conf: does not exist, using global config file
/etc/ledmon.conf: does not exist, using built-in defaults
ledctl: /dev/nvme20n1: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.

I'll try to look into why this is broken. Our SES system ran make check fine.

tasleson commented 1 year ago

@mtkaczyk I located the bug in your current proposal:

bool is_virt_nvme(const char * const name)
{
    /* Simple match by name */
    if (strncmp(name, "nvme", 4) == 0 && name[5] == 'c')
        return true;
    return false;
}

This function doesn't work when the number of NVMe devices gets into double digits or more, eg. nvme15c15n1 fails to be recognized as a virtual NVMe device as the c is in index 6, not 5.

mtkaczyk commented 1 year ago

This function doesn't work when the number of NVMe devices gets into double digits or more, eg. nvme15c15n1 fails to be recognized as a virtual NVMe device as the c is in index 6, not 5.

Yes, of course matching by specific position is bad approach and bug prone. Thanks for catching this! I will replace this by strchr()

mtkaczyk commented 10 months ago

FYI all interested, I made a general clean up int ledctl tests #190 . I need to wrote few tests for --set-slot and --get-slot by device and test this with MP drives. WIP