Closed napaster closed 6 months ago
Hello,
First, please provide the output of ledctl -L
@napaster please provide the output as @mtkaczyk suggested
Sorry
[root@ceph-osd7 ~]# ledctl -L
/sys/devices/pci0000:00/0000:00:17.0 (AHCI)
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 (Dell SSD)
/sys/devices/pci0000:00/0000:00:11.5 (AHCI)
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0 (Dell SSD)
/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 (Dell SSD)
[root@ceph-osd7 ~]#
Ok, thanks.
Now, please provide output of
#ls -l /sys/block
Do you have nvme multipath enabled?
[root@ceph-osd7 ~]# ls -l /sys/block
total 0
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-0 -> ../devices/virtual/block/dm-0
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-1 -> ../devices/virtual/block/dm-1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-10 -> ../devices/virtual/block/dm-10
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-11 -> ../devices/virtual/block/dm-11
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-12 -> ../devices/virtual/block/dm-12
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-13 -> ../devices/virtual/block/dm-13
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-14 -> ../devices/virtual/block/dm-14
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-15 -> ../devices/virtual/block/dm-15
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-16 -> ../devices/virtual/block/dm-16
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-17 -> ../devices/virtual/block/dm-17
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-18 -> ../devices/virtual/block/dm-18
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-19 -> ../devices/virtual/block/dm-19
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-2 -> ../devices/virtual/block/dm-2
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-20 -> ../devices/virtual/block/dm-20
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-21 -> ../devices/virtual/block/dm-21
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-22 -> ../devices/virtual/block/dm-22
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-23 -> ../devices/virtual/block/dm-23
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-3 -> ../devices/virtual/block/dm-3
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-4 -> ../devices/virtual/block/dm-4
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-5 -> ../devices/virtual/block/dm-5
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-6 -> ../devices/virtual/block/dm-6
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-7 -> ../devices/virtual/block/dm-7
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-8 -> ../devices/virtual/block/dm-8
lrwxrwxrwx 1 root root 0 Jul 31 17:40 dm-9 -> ../devices/virtual/block/dm-9
lrwxrwxrwx 1 root root 0 Jul 31 17:40 nvme0n1 -> ../devices/pci0000:00/0000:00:1c.0/0000:01:00.0/nvme/nvme0/nvme0n1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 nvme1n1 -> ../devices/pci0000:17/0000:17:00.0/0000:18:00.0/nvme/nvme1/nvme1n1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 nvme2n1 -> ../devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0/nvme/nvme2/nvme2n1
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sda -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:0/1:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdb -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:1/1:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdc -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:2/1:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdd -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:3/1:0:3:0/block/sdd
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sde -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:4/1:0:4:0/block/sde
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdf -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:5/1:0:5:0/block/sdf
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdg -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:6/1:0:6:0/block/sdg
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdh -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:7/1:0:7:0/block/sdh
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdi -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:8/1:0:8:0/block/sdi
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdj -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:9/1:0:9:0/block/sdj
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdk -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:10/1:0:10:0/block/sdk
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdl -> ../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:11/1:0:11:0/block/sdl
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdm -> ../devices/pci0000:00/0000:00:14.0/usb1/1-9/1-9:1.0/host0/target0:0:0/0:0:0:0/block/sdm
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sdn -> ../devices/pci0000:00/0000:00:14.0/usb1/1-14/1-14.1/1-14.1.3/1-14.1.3:1.0/host16/target16:0:0/16:0:0:1/block/sdn
lrwxrwxrwx 1 root root 0 Jul 31 17:40 sr0 -> ../devices/pci0000:00/0000:00:14.0/usb1/1-14/1-14.1/1-14.1.3/1-14.1.3:1.0/host16/target16:0:0/16:0:0:0/block/sr0
[root@ceph-osd7 ~]#
Do you have nvme multipath enabled?
I don't know how to check?
Oh, sorry, it is SATA, not NVMe. That is not the case here.
[root@ceph-osd7 tmp]# ledctl locate=/dev/sda
ledctl: /dev/sda: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.
You tried sda. In the current implementation, the controller path is always a subpath of the device path (https://github.com/intel/ledmon/blob/master/src/lib/block.c#L199).
Let's compare sda with the controller list:
/sys/devices/pci0000:00/0000:00:17.0 (AHCI)
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 (Dell SSD)
/sys/devices/pci0000:00/0000:00:11.5 (AHCI)
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0 (Dell SSD)
/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 (Dell SSD)
AHCI controller matches; it is not Dell specific. But the error `ledctl: /dev/sda: device not supported` suggests that we failed to match the device with a controller and that the device is not in block_list. This type of device must receive the host and hostN properties (https://github.com/intel/ledmon/blob/v0.97/src/block.c#L261), so I think that is the reason it failed.
The path to my device, which works with AHCI, is quite different:
./devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda
It seems that your device is not connected in the expected way. Please try to debug and determine where the difference is. I expect that the value we read might be returned in an unexpected format.
Hello @napaster, did you try to investigate it?
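The subpath comparison described above can be tried by hand. The following is a minimal sketch of such a prefix check, not ledmon's actual code: the helper names `resolve_block_link` and `find_controller` are made up for illustration, and the paths are copied from the `ledctl -L` and `ls -l /sys/block` outputs in this thread.

```python
import posixpath

# Controller paths as printed by `ledctl -L` earlier in this thread.
CONTROLLERS = [
    "/sys/devices/pci0000:00/0000:00:17.0",
    "/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0",
    "/sys/devices/pci0000:00/0000:00:11.5",
    "/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0",
    "/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0",
]


def resolve_block_link(target, base="/sys/block"):
    """Turn a relative /sys/block symlink target into an absolute sysfs path."""
    return posixpath.normpath(posixpath.join(base, target))


def find_controller(device_path, controllers=CONTROLLERS):
    """Return the first controller whose path is a leading part of device_path.

    Hypothetical helper: it only illustrates the "controller is a subpath
    of the device" idea and does not reproduce ledmon's own matching.
    """
    for ctrl in controllers:
        if device_path.startswith(ctrl + "/"):
            return ctrl
    return None


# Symlink targets copied from the `ls -l /sys/block` output above.
sda = resolve_block_link(
    "../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:0/1:0:0:0/block/sda"
)
nvme1 = resolve_block_link(
    "../devices/pci0000:17/0000:17:00.0/0000:18:00.0/nvme/nvme1/nvme1n1"
)

print("sda     ->", find_controller(sda))
print("nvme1n1 ->", find_controller(nvme1))
```

With the paths quoted in this thread, the sketch finds a listed controller for nvme1n1 but none for sda. Treat that as an observation from this simplified check, not as a confirmed diagnosis.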
Good day. I'm sorry that it took me so long to answer. Lots of work.
I don't quite understand what needs to be done. I compared your string (the one you expect to receive) with the one I get. They differ slightly.
The string you are expecting
./devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda
The string that I get.
../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:6/1:0:6:0/block/sdg
That is, the difference comes down to the fact that there is no ata element in front of host in my path. Instead of the ata number, its address is indicated.
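To make this comparison repeatable for other disks, here is a small sketch that reports which path element sits directly in front of the hostN element. The function name `element_before_host` is hypothetical, and the two paths are the ones quoted above.

```python
import re

HOST_RE = re.compile(r"^host\d+$")


def element_before_host(sysfs_path):
    """Return the path element directly in front of the first hostN element.

    Hypothetical helper for comparing sysfs paths by hand; returns None
    when the path has no hostN element.
    """
    parts = [p for p in sysfs_path.split("/") if p not in ("", ".", "..")]
    for index, part in enumerate(parts):
        if HOST_RE.match(part):
            return parts[index - 1] if index > 0 else None
    return None


expected = "./devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda"
actual = "../devices/pci0000:17/0000:17:02.0/0000:19:00.0/host1/target1:0:6/1:0:6:0/block/sdg"

print(element_before_host(expected))  # the ataN element
print(element_before_host(actual))    # a PCI address instead
```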
Please compile ledmon with debug flags, or add some prints to block_device_init, using tag v0.97 to determine why /dev/sdg is not added to the block device list. I need to know why the error is returned.
You can also try to manually translate your path to em_message to check if it is readable. In my case it looks like this:
Device:
/sys/devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda/
cat /sys/devices/pci0000\:00/0000\:00\:11.5/ata5/host4/scsi_host/host4/em_message
[root@ceph-osd8 host5]# cat /sys/devices/pci0000\:00/0000\:00\:11.5/ata5/host5/scsi_host/host5/em_message
0
[root@ceph-osd8 host5]#
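The manual translation from a device path to its em_message file can be written down as a short sketch. It follows the layout of the example above (`<path up to hostN>/scsi_host/hostN/em_message`); the function name `em_message_path` is made up for illustration, and whether the file exists for a given controller is an assumption that has to be checked on the machine.

```python
import re


def em_message_path(device_path):
    """Build the em_message path for a sysfs block device path.

    Hypothetical helper: it cuts the path at the first hostN element and
    appends scsi_host/hostN/em_message, mirroring the example in this
    thread. Returns None when the path has no hostN element.
    """
    match = re.search(r"^(.*?/(host\d+))(?:/|$)", device_path)
    if match is None:
        return None
    host_dir, host_name = match.group(1), match.group(2)
    return f"{host_dir}/scsi_host/{host_name}/em_message"


print(em_message_path(
    "/sys/devices/pci0000:00/0000:00:11.5/ata5/host4/target4:0:0/4:0:0:0/block/sda/"
))
```

The printed path can then be passed to cat, as in the command above.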
It will be difficult to enable debugging: the machine is in production, and it will not be possible to roll out a custom package there.
You don't need to install it, just compile and run it locally: "./src/ledctl locate=/dev/sdg". That is all you need.
What flag do I need to pass to ./configure so that the package is built in the necessary debug mode? The thing is that GitLab builds our packages, and on a production host it is not possible to just download and compile the package. We just build the package in GitLab and distribute it to the hosts.
You need the -ggdb flag.
I cannot help you if there is nothing that points me to the root cause. I don't have a similar setup to assist you. I need your input to resolve this issue.
In general, I tried to manually build the package on another server (which is not in production). I built it with ./configure CFLAGS="-ggdb"
and got this output:
ledmon 0.97 configuration:
Source code location: .
Preprocessor flags: -D_DEBUG -D_GNU_SOURCE -D_DEFAULT_SOURCE -DDMALLOC_DISABLE -DBUILD_LABEL=\""$(BUILD_LABEL)"\"
C compiler flags: -Wall -I../config -Wformat -Werror=format-security -Werror=format-overflow=2 -Werror=format-truncation=1 -Werror=shift-negative-value -Werror=alloca -Werror=missing-field-initializers -Werror=format-signedness -ggdb
Common install location: /usr
configure parameters: --enable-systemd=no
I took the two binaries, ledmon and ledctl, from the /src folder and ran them directly with ./ on the server. This is the output:
[root@ceph-osd8 tmp]# ./ledctl -x -L
/sys/devices/pci0000:00/0000:00:17.0 (AHCI)
/sys/devices/pci0000:3a/0000:3a:00.0/0000:3b:00.0 (Dell SSD)
/sys/devices/pci0000:00/0000:00:11.5 (AHCI)
/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0 (Dell SSD)
/sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0 (Dell SSD)
[root@ceph-osd8 tmp]#
Then I tried locate:
[root@ceph-osd8 tmp]# ./ledctl locate=/dev/sdc
ledctl: /dev/sdc: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.
ledctl: main(): _ibpi_parse() failed (status=STATUS_NOT_SUPPORTED).
[root@ceph-osd8 tmp]#
But apparently this is not enough? Or did I build the package incorrectly?
Good job :)
There are 2 options:
Use gdb to debug this (you can use cgdb, it is more friendly); there are many tutorials on the internet.
I suspect that the device is not added to the block list, so please take a look at block_device_init. You can simply set a breakpoint (using gdb) or add messages like:
log_info("Processing %s\n", path);
log_info("realpath failed for %s\n", path);
It is up to you. The goal is to understand why the device is rejected.
You can also try compiling the latest upstream ledmon to see if the issue is fixed.
Hello. Sorry for another late answer. Work. I tried to compile with the flag that you indicated. It does not work. I tried to compile on different systems, including a bare-metal server. It does not work; it fails with an error. This is what I get on CentOS 8 Stream:
[root@mks ledmon]# ./configure CFLAGS="-cgdb"
configure: loading site script /usr/share/config.site
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports the include directive... yes (GNU style)
checking for gcc... gcc
checking whether the C compiler works... no
configure: error: in /root/ledmon:
configure: error: C compiler cannot create executables
See config.log for more details
[root@mks ledmon]#
I am attaching config.log.
In the log you provided I can find the following:
gcc: error: unrecognized command line option '-Wwrapv'; did you mean '-fwrapv'?
configure:13343: $? = 1
This option may not be supported by your compiler. What is your gcc version? @pawpiatko could you please take a deeper look into this problem?
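When config.log is long, the rejected compiler options can be pulled out mechanically. The sketch below is only an illustration: the function name `unrecognized_options` is hypothetical, the pattern is written against the gcc message quoted above, and other compilers may word the error differently.

```python
import re

# Matches messages such as:
#   gcc: error: unrecognized command line option '-Wwrapv'; did you mean ...
# Both plain and typographic quotes are accepted.
UNRECOGNIZED = re.compile(
    r"unrecognized command[- ]line option ['\u2018]([^'\u2019]+)['\u2019]"
)


def unrecognized_options(log_text):
    """Return the rejected compiler options found in log_text, in order."""
    seen = []
    for option in UNRECOGNIZED.findall(log_text):
        if option not in seen:
            seen.append(option)
    return seen


# Sample lines copied from the config.log excerpt in this thread.
sample = (
    "gcc: error: unrecognized command line option '-Wwrapv'; "
    "did you mean '-fwrapv'?\n"
    "configure:13343: $? = 1\n"
)
print(unrecognized_options(sample))
```

For a real log, read the file first, for example with `open("config.log").read()`, and pass the text to the function.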
[root@mks ledmon]# gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[root@mks ledmon]#
[root@mks ledmon]# g++ --version
g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[root@mks ledmon]#
@pawpiatko can you help ?
@napaster please retry with this change: https://github.com/intel/ledmon/pull/181
Ok, I’ll test it in just a couple of days
@napaster ping? Is the issue still valid?
Yes, sorry, there is a lot of work and I have not had time. I will try to build and check it this week.
@napaster I am closing this bug because there has been no response for 2 months. Feel free to reopen it if you want to work on it again.
Good day. There is a Dell PowerEdge R540 server with a PERC H330 Adapter (Embedded) installed. When trying to use ledmon/ledctl, it throws an error:
ledmon version:
OS version:
Where should I look, and what should I do?