intel / ledmon

Enclosure LED Utilities
GNU General Public License v2.0

ledmon/ledctl error on Dell R540/R530 with PERC H330 #147

Closed napaster closed 6 months ago

napaster commented 11 months ago

Hello. I have a Dell PowerEdge R540 server with an embedded PERC H330 adapter. When I try to use ledmon/ledctl, it reports an error:

[root@ceph-osd7 tmp]# ledctl locate=/dev/sda
ledctl: /dev/sda: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.
[root@ceph-osd7 tmp]#

ledmon version:

[root@ceph-osd7 tmp]# ledmon --version
Intel(R) Enclosure LED Monitor Service 0.97
Copyright (C) 2009-2022 Intel Corporation.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ledmon[204813]: exit status is STATUS_SUCCESS.
[root@ceph-osd7 tmp]#

OS version:

[root@ceph-osd7 ~]# cat /etc/redhat-release
CentOS Stream release 8
[root@ceph-osd7 ~]#

Where should I look, and what should I do?

mtkaczyk commented 11 months ago

Hello. First, please provide the output of ledctl -L.

k0ste commented 10 months ago

@napaster, please provide the output that @mtkaczyk requested.

napaster commented 10 months ago

Sorry

[root@ceph-osd7 ~]# ledctl -L
/sys/devices/pci0000:00/0000:00:17.0 (AHCI)
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 (Dell SSD)
/sys/devices/pci0000:00/0000:00:11.5 (AHCI)
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0 (Dell SSD)
/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 (Dell SSD)
[root@ceph-osd7 ~]#

mtkaczyk commented 10 months ago

OK, thanks. Now, please provide the output of ls -l /sys/block.

Do you have NVMe multipath enabled?

napaster commented 10 months ago

[root@ceph-osd7 ~]# ls -l /sys/block
total 0
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-0 -> ../devices/virtual/block/dm-0
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-1 -> ../devices/virtual/block/dm-1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-10 -> ../devices/virtual/block/dm-10
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-11 -> ../devices/virtual/block/dm-11
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-12 -> ../devices/virtual/block/dm-12
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-13 -> ../devices/virtual/block/dm-13
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-14 -> ../devices/virtual/block/dm-14
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-15 -> ../devices/virtual/block/dm-15
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-16 -> ../devices/virtual/block/dm-16
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-17 -> ../devices/virtual/block/dm-17
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-18 -> ../devices/virtual/block/dm-18
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-19 -> ../devices/virtual/block/dm-19
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-2 -> ../devices/virtual/block/dm-2
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-20 -> ../devices/virtual/block/dm-20
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-21 -> ../devices/virtual/block/dm-21
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-22 -> ../devices/virtual/block/dm-22
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-23 -> ../devices/virtual/block/dm-23
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-3 -> ../devices/virtual/block/dm-3
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-4 -> ../devices/virtual/block/dm-4
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-5 -> ../devices/virtual/block/dm-5
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-6 -> ../devices/virtual/block/dm-6
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-7 -> ../devices/virtual/block/dm-7
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-8 -> ../devices/virtual/block/dm-8
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-9 -> ../devices/virtual/block/dm-9
lrwxrwxrwx 1 root root 0 Jul 31 17:40 nvme0n1 -> ../devices/pci0000:00/0000:00:1c.0/0000:01:00.0/nvme/nvme0/nvme0n1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 nvme1n1 -> ../devices/pci0000:17/0000:17:00.0/0000:18:00.0/nvme/nvme1/nvme1n1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 nvme2n1 -> ../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/nvme/nvme2/nvme2n1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sda -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:0/1:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdb -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:1/1:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdc -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:2/1:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdd -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:3/1:0:3:0/block/sdd
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sde -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:4/1:0:4:0/block/sde
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdf -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:5/1:0:5:0/block/sdf
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdg -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:6/1:0:6:0/block/sdg
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdh -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:7/1:0:7:0/block/sdh
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdi -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:8/1:0:8:0/block/sdi
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdj -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:9/1:0:9:0/block/sdj
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdk -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:10/1:0:10:0/block/sdk
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdl -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:11/1:0:11:0/block/sdl
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdm -> ../devices/pci0000:00/0000:00:14.0/usb1/1-9/1-9:1.0/host0/target0:0:0/0:0:0:0/block/sdm
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdn -> ../devices/pci0000:00/0000:00:14.0/usb1/1-14/1-14.1/1-14.1.3/1-14.1.3:1.0/host16/target16:0:0/16:0:0:1/block/sdn
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sr0 -> ../devices/pci0000:00/0000:00:14.0/usb1/1-14/1-14.1/1-14.1.3/1-14.1.3:1.0/host16/target16:0:0/16:0:0:0/block/sr0
[root@ceph-osd7 ~]#

napaster commented 10 months ago

Do you have nvme multipath enabled?

I don't know. How can I check?

mtkaczyk commented 10 months ago

Do you have nvme multipath enabled?

I don't know how to check?

Oh, sorry, it is SATA, not NVMe, so that is not the case here.

[root@ceph-osd7 tmp]# ledctl locate=/dev/sda
ledctl: /dev/sda: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.

You tried sda. In the current implementation, the controller path is always a subpath of the device path (https://github.com/intel/ledmon/blob/master/src/lib/block.c#L199). Let's compare sda with the controller list:


/sys/devices/pci0000:00/0000:00:17.0 (AHCI)
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 (Dell SSD)
/sys/devices/pci0000:00/0000:00:11.5 (AHCI)
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0 (Dell SSD)
/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 (Dell SSD)

The AHCI controller matches; it is not Dell-specific. But the error `ledctl: /dev/sda: device not supported`

strongly suggests that we failed to match the device with a controller, so the device is not in block_list. This type of device must receive the host and hostN properties (https://github.com/intel/ledmon/blob/v0.97/src/block.c#L261), so I think that is the reason it failed. The path to my device, which works with AHCI, is quite different: ./devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda

It seems that your device is not connected in the expected way. Please try to debug this and determine where the difference is. I expect that the value we read might be returned in an unexpected format.
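The subpath comparison described above can be tried by hand on the paths already posted in this thread. The following is a minimal sketch, assuming that a plain string-prefix test approximates the check; the real logic lives in block.c and may differ. `find_controller` is a hypothetical helper name, not a ledmon function.

```shell
# Hypothetical helper (not part of ledmon): print the first controller path
# that is a prefix of the device's resolved sysfs path; return 1 if none is.
find_controller() {
    dev_path=$1
    shift
    for cntrl in "$@"; do
        case "$dev_path" in
            "$cntrl"/*) printf '%s\n' "$cntrl"; return 0 ;;
        esac
    done
    return 1
}

# Controller paths as printed by `ledctl -L` earlier in this thread.
set -- \
    /sys/devices/pci0000:00/0000:00:17.0 \
    /sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 \
    /sys/devices/pci0000:00/0000:00:11.5 \
    /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0 \
    /sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0

# Resolved path of sda taken from the `ls -l /sys/block` output above.
# On a live system it could be obtained with: readlink -f /sys/block/sda
sda_path=/sys/devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:0/1:0:0:0/block/sda

find_controller "$sda_path" "$@" \
    || echo "no listed controller is a prefix of $sda_path"
```

With the paths as posted, this sketch finds no listed controller above sda, because 0000:17:02.0/0000:19:00.0 does not appear in the `ledctl -L` output. Whether ledmon itself rejects the device for the same reason is what the debugging steps in this thread are meant to confirm.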

mtkaczyk commented 10 months ago

Hello @napaster, did you try to investigate it?

napaster commented 10 months ago

Good day. Sorry for the late reply; I have had a lot of work.

I don't quite understand what needs to be done. I compared the path you expect with the path I get. They differ slightly.

The path you are expecting: ./devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda

The path that I get:

../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:6/1:0:6:0/block/sdg

In short, the difference is that my path has no ataN component in front of hostN. Instead of the ata number, a PCI address appears there.
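The difference between the two paths can be made explicit by printing the component that sits directly above hostN. This is a minimal sketch that only processes the path strings quoted in this thread; `host_parent` is a hypothetical helper name, not part of ledmon.

```shell
# Hypothetical helper: print the path component directly above "hostN".
# Prints nothing if the path has no hostN component.
host_parent() {
    printf '%s\n' "$1" | sed -n 's|^.*/\([^/]*\)/host[0-9][0-9]*/.*$|\1|p'
}

# Path the maintainer sees on a working AHCI setup.
host_parent ./devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda
# prints: ata5

# Path reported on the R540 behind the PERC H330.
host_parent ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:6/1:0:6:0/block/sdg
# prints: 0000:19:00.0
```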

mtkaczyk commented 10 months ago

Please compile ledmon with debug flags, or add some prints to block_device_init, using tag v0.97, to determine why /dev/sdg is not added to the block device list. I need to know why the error is returned. You can also try to manually translate your path to em_message to check whether it is readable. In my case it looks like this: Device: /sys/devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda/

cat /sys/devices/pci0000\:00/0000\:00\:11.5/ata5/host4/scsi_host/host4/em_message
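The manual translation from a device path to its em_message file follows a fixed pattern in the example above: cut the path at hostN, then append scsi_host/hostN/em_message. Here is a minimal sketch of that translation, inferred only from the single example in this comment; `em_message_path` is a hypothetical helper name, and it is not established in this thread that a host behind the PERC H330 exposes an em_message attribute at all.

```shell
# Hypothetical helper: derive the em_message path for a block device from its
# resolved sysfs path. Returns 1 if the path has no hostN component.
em_message_path() {
    dev_path=$1
    # Keep everything up to and including the hostN component.
    host_dir=$(printf '%s\n' "$dev_path" | sed -n 's|^\(.*/host[0-9][0-9]*\)/.*$|\1|p')
    [ -n "$host_dir" ] || return 1
    # The last component of host_dir is the host name itself, e.g. "host4".
    host=${host_dir##*/}
    printf '%s/scsi_host/%s/em_message\n' "$host_dir" "$host"
}

# Example using the maintainer's device path from this comment.
em_message_path /sys/devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda
```

On a live system the result could be passed to cat, for example `cat "$(em_message_path "$(readlink -f /sys/block/sdg)")"`, to see whether the file exists and is readable for the device in question.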

napaster commented 10 months ago

[root@ceph-osd8 host5]# cat /sys/devices/pci0000\:00/0000\:00\:11.5/ata5/host5/scsi_host/host5/em_message
0
[root@ceph-osd8 host5]#

It will be difficult to enable debugging: the machine is in production, and I cannot roll out a custom package there.

mtkaczyk commented 10 months ago

You don't need to install it. Just compile it and run it locally: "./src/ledctl locate=/dev/sdg". That is all you need.

napaster commented 10 months ago

Which flag do I need to pass to ./configure so that the package is built in the necessary debug mode? Our packages are built by GitLab, and on a production host I cannot simply download the source and compile it. I can only build the package in GitLab and distribute it to the hosts.

mtkaczyk commented 10 months ago

You need the -ggdb flag.

I cannot help you if there is nothing that points me to the root cause. I don't have a similar setup to assist you. I need your input to resolve this issue.

napaster commented 9 months ago

I tried to build the package manually on another server (which is not in production). I configured it with ./configure CFLAGS="-ggdb" and got this output:

ledmon 0.97 configuration:
Source code location: .
Preprocessor flags: -D_DEBUG -D_GNU_SOURCE -D_DEFAULT_SOURCE -DDMALLOC_DISABLE -DBUILD_LABEL=\""$(BUILD_LABEL)"\"
C compiler flags: -Wall -I../config -Wformat -Werror=format-security -Werror=format-overflow=2 -Werror=format-truncation=1 -Werror=shift-negative-value -Werror=alloca -Werror=missing-field-initializers -Werror=format-signedness -ggdb
Common install location: /usr
configure parameters: --enable-systemd=no

I took the two binaries, ledmon and ledctl, from the src directory and ran them directly with ./ on the server. This is the output:

[root@ceph-osd8 tmp]# ./ledctl -x -L
/sys/devices/pci0000:00/0000:00:17.0 (AHCI)
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 (Dell SSD)
/sys/devices/pci0000:00/0000:00:11.5 (AHCI)
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0 (Dell SSD)
/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 (Dell SSD)
[root@ceph-osd8 tmp]#

Then I tried locate:

[root@ceph-osd8 tmp]# ./ledctl locate=/dev/sdc
ledctl: /dev/sdc: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.
ledctl: main(): _ibpi_parse() failed (status=STATUS_NOT_SUPPORTED).
[root@ceph-osd8 tmp]#

Apparently this is not enough, or did I build the package incorrectly?

mtkaczyk commented 9 months ago

Good job :)

There are 2 options:

1. I suspect that the device is not added to the block list, so please take a look at block_device_init. You can set a breakpoint there (using gdb) or add messages like log_info("Processing %s\n", path) and log_info("realpath failed for %s\n", path). It is up to you. The goal is to understand why the device is rejected.

2. You can also try to compile the latest upstream ledmon to see whether the issue is fixed.

napaster commented 9 months ago

Hello. Sorry for another late reply; work has been busy. I tried to compile with the flag you indicated, but it does not work. I tried on different systems, including a bare-metal server, and configure fails with an error every time. This is the attempt on CentOS 8 Stream:

[root@mks ledmon]# ./configure CFLAGS="-cgdb"
configure: loading site script /usr/share/config.site
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports the include directive... yes (GNU style)
checking for gcc... gcc
checking whether the C compiler works... no
configure: error: in /root/ledmon:
configure: error: C compiler cannot create executables
See config.log for more details
[root@mks ledmon]#

I am attaching config.log.

mtkaczyk commented 9 months ago

In the log you provided, I can find the following:

gcc: error: unrecognized command line option '-Wwrapv'; did you mean '-fwrapv'?
configure:13343: $? = 1

This option may not be supported by your compiler. What is your gcc version? @pawpiatko, could you please take a deeper look into this problem?

napaster commented 9 months ago

[root@mks ledmon]# gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[root@mks ledmon]# g++ --version
g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[root@mks ledmon]#

napaster commented 9 months ago

@pawpiatko, can you help?

pawpiatko commented 9 months ago

@pawpiatko can you help ?

@napaster please retry with this change: https://github.com/intel/ledmon/pull/181

napaster commented 8 months ago

OK, I'll test it within a couple of days.

mtkaczyk commented 7 months ago

@napaster ping? Is the issue still valid?

napaster commented 7 months ago

Yes, sorry. I have a lot of work and no time right now. I'll try to build it and check this week.

ktanska commented 6 months ago

@napaster, I am closing this bug because there has been no response for 2 months. Feel free to reopen it if you want to work on this again.