intel / ledmon

Enclosure LED Utilities
GNU General Public License v2.0

Status of AMD Support #200

Closed minorsatellite closed 3 months ago

minorsatellite commented 9 months ago

Not sure if it's accurate to classify this inquiry as a bug report rather than a general discussion, but a few years back I migrated from Intel to AMD and have not been able to use ledctl since, which I sorely miss. I get that it's an Intel project, but nevertheless...

Back in April 2020 I filed a bug report, but I don't believe the issue was ever resolved. If ledctl can't run on AMD, I want to remove it because it's choking my system logs.

ledmon[2374]: Unsupported AMD interface #65

Jan  2 18:03:35 myhost ledmon[6514]: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
Jan  2 18:03:35 myhost ledmon[6514]: ledmon[6514]: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan  2 18:03:35 myhost ledmon[6514]: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan  2 18:03:40 myhost ledmon[6514]: ledmon[6514]: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
Jan  2 18:03:40 myhost ledmon[6514]: ledmon[6514]: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan  2 18:03:40 myhost ledmon[6514]: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Here is the version that I am running:

ledctl -v
Intel(R) Enclosure LED Control Application 0.94 
Copyright (C) 2009-2019 Intel Corporation.

This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ledctl
ledctl: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
ledctl: missing operand(s)... run ledctl --help for details.
ledctl: main(): _ibpi_parse() failed (status=STATUS_IBPI_DETERMINE_ERROR).

ktanska commented 9 months ago

Could you try ledctl v0.95 or newer, please? It looks like the commit that fixed that issue was not included in the ledctl version you are using.

minorsatellite commented 9 months ago

Hi, thanks for the suggestion. It looks like I am already running the latest.

ledmon is already the newest version (0.95-2).

However, when I run the following command, I see that it's running an earlier version.

ledmon -v
Intel(R) Enclosure LED Monitor Service 0.94
Copyright (C) 2009-2019 Intel Corporation.

Purging the package apparently does not remove ledctl, for some reason. I had to remove all of the executables manually, since the purge sub-command left behind a bunch of detritus. Even the ledmon service kept running, although the systemd unit files appear to have been removed.

minorsatellite commented 9 months ago

@ktanska I managed to reinstall, however it appears that the issue remains.

sudo ledmon -v
Intel(R) Enclosure LED Monitor Service 0.95 
Copyright (C) 2009-2021 Intel Corporation.

This is free software; see the source for copying conditions. There is NO warranty;
not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ledmon[13244]: exit status is STATUS_SUCCESS.
sudo ledctl
ledctl: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
ledctl: missing operand(s)... run ledctl --help for details.
ledctl: main(): _ibpi_parse() failed (status=STATUS_IBPI_DETERMINE_ERROR).

minorsatellite commented 9 months ago

So my only conclusion is that a four-year-old bug still has not been fixed and there is no real incentive to fix it.

ktanska commented 9 months ago

Looks like the previous bug was closed as completed but not all of the cases were fixed. @nfont could you look at this, please?

nfont commented 9 months ago

Looking at the ledmon output provided I see two things

sudo ledctl
ledctl: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.

This first part indicates that enclosure management is not supported on your system; ledmon needs enclosure management support in order to work.

ledctl: missing operand(s)... run ledctl --help for details.
ledctl: main(): _ibpi_parse() failed (status=STATUS_IBPI_DETERMINE_ERROR).

This second part seems odd; I'm not sure why you would get a missing operands error message. A quick look at the ledmon source indicates this is a bug. From what I see, the library routines that end up discovering controller devices are void-return functions. This may be fine, since you could have multiple controllers on a system. The end result is that the call to led_scan() doesn't return an error if no controllers are found. The 'missing operands' output then occurs because ledctl continues on, and _cmdline_ibpi_parse() appears to expect block devices to be found and generates the error message.
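
To make that control flow concrete, here is a rough Python model of it. This is only an illustration of the pattern, not ledmon's code (ledmon is written in C), and every name in it (discover_controllers, parse_request, run_unchecked, run_checked, DiscoveryError) is hypothetical.

```python
# Illustrative model only: ledmon itself is written in C, and none of these
# names come from its source.

class DiscoveryError(Exception):
    """Hypothetical error for 'no usable controller was found'."""


def discover_controllers(candidates):
    """Return only the controllers that report enclosure management support."""
    return [c for c in candidates if c["em_supported"]]


def parse_request(block_devices, operands):
    """Stand-in for the later parsing stage, which assumes devices exist."""
    if not block_devices or not operands:
        raise ValueError("missing operand(s)")
    return list(zip(operands, block_devices))


def run_unchecked(candidates, operands):
    """The discovery result is never checked, so the parser reports the failure."""
    controllers = discover_controllers(candidates)
    block_devices = [d for c in controllers for d in c["devices"]]
    return parse_request(block_devices, operands)


def run_checked(candidates, operands):
    """Fail early with an error that names the real cause."""
    controllers = discover_controllers(candidates)
    if not controllers:
        raise DiscoveryError("no controller with enclosure management support")
    block_devices = [d for c in controllers for d in c["devices"]]
    return parse_request(block_devices, operands)
```

With a single unsupported controller, run_unchecked fails with the generic "missing operand(s)" message, while run_checked fails with an error that points at discovery.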

minorsatellite commented 9 months ago

This first part indicates that enclosure management is not supported on your system; ledmon needs enclosure management support in order to work.

Meaning that the JBOD does not support SES-3? It's an HGST 4U60G2 60-bay unit that definitely does support SES.

As to your second point, yes, the JBOD has dual controllers for multi-pathing, which I am using and require.

So, am I dead in the water at this point, should I just give up any hope of using this solution?

minorsatellite commented 8 months ago

Any further recommendations here before I throw in the towel?

nfont commented 8 months ago

Would it be possible to get the log of the failing command with log level set to debug?

This may give some insight into why the command thinks enclosure management is not supported.

minorsatellite commented 8 months ago

Sure. Here are the contents of my current ledmon.conf file. Ironically debug was already enabled.

It's pretty much the same error repeating itself every 5 seconds, with not a lot of new information:

Jan 08 22:30:21   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:30:26    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:30:26   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:30:31    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:30:31   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:30:36    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:30:36   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:30:41    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:30:41   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:30:47    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:30:47   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:30:52    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:30:52   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:30:57    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:30:57   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.
Jan 08 22:31:02    INFO: SGPIO EM not supported for /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2

Jan 08 22:31:02   ERROR: controller discovery: /sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2 - enclosure management not supported.

minorsatellite commented 8 months ago

@nfont , I don't suppose this is going to be the turning point that you had hoped for? Since this JBOD is EOS, I can't hope to get support from Western Digital.

nfont commented 8 months ago

@minorsatellite you're correct, it's not what I was hoping for. I'm going to dig through the code some this afternoon or tomorrow to see if there is any information it looks for that you could get from the cmdline that would help.

minorsatellite commented 8 months ago

Thank you, @nfont

Ryushin commented 8 months ago

I just encountered this problem this morning. The chassis is a Supermicro 847BE1C-R1K23WB, with a Supermicro H12SSL-CT motherboard and an AMD EPYC 7443P processor. Running Debian Bookworm, with ledmon 0.95 from the Debian package.

I replaced the Intel motherboard in this chassis with an AMD board. It worked fine with the Intel board. With AMD it's not working:

ledctl --all -L
ledctl: IPMI Error: c7
ledctl: Unable to determine Dell Server type
ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.4/0000:02:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.4/0000:02:00.0 - enclosure management not supported.
ledctl: AMD Drive: port 1, ata port 1, drive bay 3, initiator 0
ledctl: IPMI Error: c7
ledctl: Unable to determine Dell Server type
ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.3/0000:01:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.3/0000:01:00.0 - enclosure management not supported.
ledctl: AMD Drive: port 7, ata port 15, drive bay 1, initiator 1
ledctl: (raid_device_init) path: md0, level=2, state=6, degraded=0, disks=2, type=1
ledctl: (_set_block_state): device: sdt, state: NORMAL
ledctl: (_set_block_state): device: sds, state: NORMAL
/sys/devices/pci0000:40/0000:40:08.2/0000:49:00.0 (AMD)
/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0 (SCSI)
/sys/devices/pci0000:40/0000:40:08.3/0000:4a:00.0 (AMD)

Can I provide any other information? We are mostly buying EPYC at this point.

minorsatellite commented 8 months ago

Has there been any further progress on this matter?

mtkaczyk commented 8 months ago

@nfont any update?

minorsatellite commented 7 months ago

Given up @nfont ?

SaulGoodman1337 commented 7 months ago

push @nfont

nfont commented 7 months ago

Hi all, apologies for the delay. I do have some cycles and will start looking at this again.

nfont commented 7 months ago

@Ryushin , looking at the output you provided I do see there are two devices that show as not supported:

ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.4/0000:02:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.4/0000:02:00.0 - enclosure management not supported.

ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.3/0000:01:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.3/0000:01:00.0 - enclosure management not supported.

Then later reports two devices as supported.

/sys/devices/pci0000:40/0000:40:08.2/0000:49:00.0 (AMD)
/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0 (SCSI)
/sys/devices/pci0000:40/0000:40:08.3/0000:4a:00.0 (AMD)

Can you confirm that the devices showing as not supported are ones you expect to be supported?

Ryushin commented 7 months ago

@Ryushin , looking at the output you provided I do see there are two devices that show as not supported:

ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.4/0000:02:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.4/0000:02:00.0 - enclosure management not supported.

ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.3/0000:01:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.3/0000:01:00.0 - enclosure management not supported.

Then later reports two devices as supported.

/sys/devices/pci0000:40/0000:40:08.2/0000:49:00.0 (AMD)
/sys/devices/pci0000:40/0000:40:01.1/0000:41:00.0 (SCSI)
/sys/devices/pci0000:40/0000:40:08.3/0000:4a:00.0 (AMD)

Can you confirm that the devices showing as not supported are ones you expect to be supported?

I would "assume" that those are the ones I expect to be supported.

I have another machine using the same chassis as the AMD board but with an Intel board. Output from ledctl for that:

ledctl --all -L
ledctl: (raid_device_init) path: md0, level=2, state=6, degraded=0, disks=2, type=1
ledctl: (raid_device_init) path: md10, level=2, state=5, degraded=0, disks=2, type=1
ledctl: (_set_block_state): device: sdb, state: NORMAL
ledctl: (_set_block_state): device: sda, state: NORMAL
ledctl: (_set_block_state): device: sdb, state: NORMAL
ledctl: (_set_block_state): device: sda, state: NORMAL
/sys/devices/pci0000:00/0000:00:1f.2 (AHCI)
/sys/devices/pci0000:00/0000:00:02.0/0000:02:00.0 (SCSI)
/sys/devices/pci0000:00/0000:00:11.4 (AHCI)

Both machines are using the same LSI SAS3008 controller, so I don't think that could be it.

nfont commented 7 months ago

I'm not finding anything in the ledmon code that leads me to believe this is a ledmon issue. The reports that the issue occurs after switching to a new board lead me to think the board is not providing enclosure management support, or perhaps has it disabled in the BIOS.

One item people can check is whether any of the enclosure management interfaces are available in sysfs. These would be em_buffer, em_message, em_message_supported, and em_message_type. The ledmon code base defaults to using SGPIO for LED control, and these files provide the interface.
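
If it helps, the sysfs check can be scripted. The following is a minimal sketch, assuming the attributes live somewhere under /sys/devices (as they do in the paths posted in this thread); the function name find_em_attrs is made up for this example and is not part of ledmon.

```python
import os

# Attribute names taken from the comment above.
EM_ATTRS = ("em_buffer", "em_message", "em_message_supported", "em_message_type")


def find_em_attrs(root="/sys/devices"):
    """Map each directory under root to the EM attribute files it contains.

    os.walk() does not follow symlinks by default, which avoids the
    symlink loops that exist in sysfs.
    """
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        present = sorted(set(filenames) & set(EM_ATTRS))
        if present:
            found[dirpath] = present
    return found


if __name__ == "__main__":
    hits = find_em_attrs()
    if not hits:
        print("no enclosure management attributes found")
    for directory, names in sorted(hits.items()):
        print(directory + ": " + ", ".join(names))
```

Each directory in the output corresponds to one SCSI host; a host that lists none of the four files would not be usable through this interface.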

If the files are not present, there is always a chance the board provides access via IPMI. As a quick check, see whether any of the following device files exist, which would suggest IPMI may be supported. If any exist, you would need to rebuild the ledmon code base to make AMD default to IPMI instead of SGPIO. I can provide a patch to do that if anyone would like to try it.

/dev/ipmi0
/dev/ipmidev/0
/dev/ipmidev0
/dev/bmc
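
The same check as a small Python sketch, for anyone who wants to script it. The helper name existing_ipmi_nodes and its exists parameter are assumptions made for this example; only the four device paths come from the list above.

```python
import os

# Device paths taken from the list above.
IPMI_NODES = ("/dev/ipmi0", "/dev/ipmidev/0", "/dev/ipmidev0", "/dev/bmc")


def existing_ipmi_nodes(candidates=IPMI_NODES, exists=os.path.exists):
    """Return the candidate device paths that exist on this system."""
    return [path for path in candidates if exists(path)]


if __name__ == "__main__":
    nodes = existing_ipmi_nodes()
    if nodes:
        print("IPMI device files present: " + ", ".join(nodes))
    else:
        print("no IPMI device files found")
```

A non-empty result only shows that an IPMI device file is present; it does not prove that the board exposes LED control through it.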

minorsatellite commented 7 months ago

sudo find /sys -type f -name em_buffer
[sudo] password for xadmin:
/sys/devices/pci0000:60/0000:60:03.3/0000:62:00.0/ata4/host5/scsi_host/host5/em_buffer
/sys/devices/pci0000:60/0000:60:03.3/0000:62:00.0/ata2/host3/scsi_host/host3/em_buffer
/sys/devices/pci0000:60/0000:60:03.3/0000:62:00.0/ata3/host4/scsi_host/host4/em_buffer
/sys/devices/pci0000:00/0000:00:08.1/0000:05:00.2/ata1/host2/scsi_host/host2/em_buffer

minorsatellite commented 6 months ago

@nfont here are my findings. Thoughts?

minorsatellite commented 5 months ago

@nfont not sure if you saw my latest response but is a patch forthcoming?

mtkaczyk commented 4 months ago

@nfont could you please take a look?

mtkaczyk commented 3 months ago

No response from AMD team for 3 months, sorry @minorsatellite but I'm closing it.

@nfont feel free to reopen this if you are working on it or plan to.