intel / ledmon

Enclosure LED Utilities
GNU General Public License v2.0

Strange behavior with AMD SGPIO controller #199

Closed mscdex closed 1 month ago

mscdex commented 6 months ago

I am attempting to use ledctl with a 12-bay Supermicro server (3 rows of 4 bays each), but only 2 of the bays ever "locate" properly. All 12 bays are populated with working SATA drives, and the activity LEDs work correctly on all of them.

I've also tried other indicators (e.g. failure=/dev/...). They work as intended for the two bays that already respond, but not for the unresponsive bays.

The other strange thing is that when I attempt to list the slots via ledctl it shows nothing for either of the (AMD) controllers.

You can ignore the output below regarding /sys/devices/pci0000:00/0000:00:03.4/0000:01:00.0 as that is an M.2 NVMe slot and not a part of this issue.

Here is the output for various commands (using ledctl v0.97):

# ledctl --all -L
ledctl: AMD Drive: port 7, ata port 15, drive bay 1, initiator 1
ledctl: IPMI Error: c7
ledctl: Unable to determine Dell Server type
ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.4/0000:01:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.4/0000:01:00.0 - enclosure management not supported.
ledctl: AMD Drive: port 1, ata port 1, drive bay 3, initiator 0
/sys/devices/pci0000:40/0000:40:08.3/0000:49:00.0 (AMD)
/sys/devices/pci0000:40/0000:40:08.2/0000:48:00.0 (AMD)
# ledctl --all -P -c AMD
/dev/shm/ledmon.conf: does not exist, using global config file
/etc/ledmon.conf: does not exist, using built-in defaults
ledctl: AMD Drive: port 7, ata port 15, drive bay 1, initiator 1
ledctl: IPMI Error: c7
ledctl: Unable to determine Dell Server type
ledctl: Couldn't find base EM path for /sys/devices/pci0000:00/0000:00:03.4/0000:01:00.0
ledctl: controller discovery: /sys/devices/pci0000:00/0000:00:03.4/0000:01:00.0 - enclosure management not supported.
ledctl: AMD Drive: port 1, ata port 1, drive bay 3, initiator 0

For the locate output, I've snipped the preceding locate_off commands for brevity, but I can post the entire output for any of the devices if necessary.

# ledctl --all locate=/dev/sda
< snip >
ledctl: Setting LOCATE...
ledctl:         device: .../ata3/host2/target2:0:0/2:0:0:0/block/sda
ledctl:         buffer: .../ata1/host0/scsi_host/host0/em_buffer
ledctl: AMD Drive: port 3, ata port 3, drive bay 1, initiator 0
ledctl: AMD SGPIO Header: 00100030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 10
ledctl: AMD SGPIO Request Register: 00c08240 00000001
ledctl:              frame type: 40                 function: 82
ledctl:           register type: c0           register index: 0
ledctl:          register count: 1
ledctl: AMD SGPIO AMD Register: 00000060
ledctl:               initiator: 0                  polarity: 0
ledctl:           bypass enable: 1          return to normal: 1
ledctl: CFG SGPIO Header: 00140030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 14
ledctl: CFG SGPIO Request Register: 00008240 00000002
ledctl:              frame type: 40                 function: 82
ledctl:           register type: 0            register index: 0
ledctl:          register count: 2
ledctl: CFG SGPIO Configuration Register: 00800000 00217700
ledctl:                 version: 0         gp register count: 0
ledctl:      cfg register count: 0              gpio enabled: 1
ledctl:             drive count: 0          blink gen rate A: 7
ledctl:        blink gen rate B: 7        force activity off: 2
ledctl:         max activity on: 1      stretch activity off: 0
ledctl:     stretch activity on: 0
ledctl: TX SGPIO Header: 00100030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 10
ledctl: TX SGPIO Request Register: 00038240 00000001
ledctl:              frame type: 40                 function: 82
ledctl:           register type: 3            register index: 0
ledctl:          register count: 1
ledctl: TX SGPIO TX Register: a0a0c6a0
ledctl:         drive 0: error 0, locate 0, activity 5
ledctl:         drive 1: error 6, locate 0, activity 6
ledctl:         drive 2: error 0, locate 0, activity 5
ledctl:         drive 3: error 0, locate 0, activity 5
# ledctl --all locate=/dev/sdb
< snip >
ledctl: Setting LOCATE...
ledctl:         device: .../ata4/host3/target3:0:0/3:0:0:0/block/sdb
ledctl:         buffer: .../ata1/host0/scsi_host/host0/em_buffer
ledctl: AMD Drive: port 4, ata port 4, drive bay 0, initiator 0
ledctl: AMD SGPIO Header: 00100030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 10
ledctl: AMD SGPIO Request Register: 00c08240 00000001
ledctl:              frame type: 40                 function: 82
ledctl:           register type: c0           register index: 0
ledctl:          register count: 1
ledctl: AMD SGPIO AMD Register: 00000060
ledctl:               initiator: 0                  polarity: 0
ledctl:           bypass enable: 1          return to normal: 1
ledctl: CFG SGPIO Header: 00140030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 14
ledctl: CFG SGPIO Request Register: 00008240 00000002
ledctl:              frame type: 40                 function: 82
ledctl:           register type: 0            register index: 0
ledctl:          register count: 2
ledctl: CFG SGPIO Configuration Register: 00800000 00217700
ledctl:                 version: 0         gp register count: 0
ledctl:      cfg register count: 0              gpio enabled: 1
ledctl:             drive count: 0          blink gen rate A: 7
ledctl:        blink gen rate B: 7        force activity off: 2
ledctl:         max activity on: 1      stretch activity off: 0
ledctl:     stretch activity on: 0
ledctl: TX SGPIO Header: 00100030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 10
ledctl: TX SGPIO Request Register: 00038240 00000001
ledctl:              frame type: 40                 function: 82
ledctl:           register type: 3            register index: 0
ledctl:          register count: 1
ledctl: TX SGPIO TX Register: a0a0a0c6
ledctl:         drive 0: error 6, locate 0, activity 6
ledctl:         drive 1: error 0, locate 0, activity 5
ledctl:         drive 2: error 0, locate 0, activity 5
ledctl:         drive 3: error 0, locate 0, activity 5
# ledctl --all locate=/dev/sdc
< snip >
ledctl: Setting LOCATE...
ledctl:         device: .../ata5/host4/target4:0:0/4:0:0:0/block/sdc
ledctl:         buffer: .../ata1/host0/scsi_host/host0/em_buffer
ledctl: AMD Drive: port 5, ata port 5, drive bay 3, initiator 1
ledctl: AMD SGPIO Header: 00100030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 10
ledctl: AMD SGPIO Request Register: 00c08240 00000001
ledctl:              frame type: 40                 function: 82
ledctl:           register type: c0           register index: 0
ledctl:          register count: 1
ledctl: AMD SGPIO AMD Register: 00000061
ledctl:               initiator: 1                  polarity: 0
ledctl:           bypass enable: 1          return to normal: 1
ledctl: CFG SGPIO Header: 00140030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 14
ledctl: CFG SGPIO Request Register: 00008240 00000002
ledctl:              frame type: 40                 function: 82
ledctl:           register type: 0            register index: 0
ledctl:          register count: 2
ledctl: CFG SGPIO Configuration Register: 00800000 00217700
ledctl:                 version: 0         gp register count: 0
ledctl:      cfg register count: 0              gpio enabled: 1
ledctl:             drive count: 0          blink gen rate A: 7
ledctl:        blink gen rate B: 7        force activity off: 2
ledctl:         max activity on: 1      stretch activity off: 0
ledctl:     stretch activity on: 0
ledctl: TX SGPIO Header: 00100030
ledctl:            message type: 3                 data size: 0
ledctl:            message size: 10
ledctl: TX SGPIO Request Register: 00038240 00000001
ledctl:              frame type: 40                 function: 82
ledctl:           register type: 3            register index: 0
ledctl:          register count: 1
ledctl: TX SGPIO TX Register: c6a0a0a0
ledctl:         drive 0: error 0, locate 0, activity 5
ledctl:         drive 1: error 0, locate 0, activity 5
ledctl:         drive 2: error 0, locate 0, activity 5
ledctl:         drive 3: error 6, locate 0, activity 6
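
For anyone reading the TX register dumps above, here is a minimal Python sketch of how the 32-bit value appears to break down into the per-drive lines that ledctl prints. The field layout (one byte per drive counting from the least significant byte, with error in bits 0-2, locate in bits 3-4, and activity in bits 5-7) is inferred only from the three dumps in this comment, not taken from the ledmon source, and `decode_tx_register` is a hypothetical helper name.

```python
def decode_tx_register(value):
    """Split a 32-bit TX register value into per-drive LED fields.

    Assumed layout (inferred from the log output in this issue only):
    drive n lives in byte n, counting from the least significant byte;
    bits 0-2 = error, bits 3-4 = locate, bits 5-7 = activity.
    """
    drives = []
    for n in range(4):
        byte = (value >> (8 * n)) & 0xFF
        drives.append({
            "drive": n,
            "error": byte & 0x7,
            "locate": (byte >> 3) & 0x3,
            "activity": (byte >> 5) & 0x7,
        })
    return drives


# Value logged for `ledctl --all locate=/dev/sda`
for d in decode_tx_register(0xA0A0C6A0):
    print(f"drive {d['drive']}: error {d['error']}, "
          f"locate {d['locate']}, activity {d['activity']}")
```

Under that assumed layout, the byte 0xc6 is the one that changes between the three requests, and it moves from byte 1 (sda) to byte 0 (sdb) to byte 3 (sdc), matching the "drive bay" values reported for each device.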

mscdex commented 5 months ago

Update: I now think the backplane is wired starting at the third onboard SATA connector, so the status LEDs that change are offset by two.
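
To help compare ledctl's idea of the layout against the physical wiring, here is a small Python sketch of a port-to-bay mapping that reproduces the five "AMD Drive: port ..., drive bay ..., initiator ..." lines in the log output above. The formula is only a fit to those five data points and is an assumption, not ledmon's actual implementation; `guess_bay` is a hypothetical helper, and ports outside 1-8 are not covered.

```python
def guess_bay(port):
    """Guess (drive_bay, initiator) for an AMD SATA port numbered 1-8.

    Assumption: this formula is fitted to the five port/bay/initiator
    triples in the ledctl output in this issue, nothing more.
    """
    if not 1 <= port <= 8:
        raise ValueError("only ports 1-8 are covered by the logged data")
    slot = 8 - port
    if slot < 4:
        return slot, 1      # ports 5-8: initiator 1
    return slot - 4, 0      # ports 1-4: initiator 0


# Ports that appear in the log output
for port in (1, 3, 4, 5, 7):
    bay, initiator = guess_bay(port)
    print(f"port {port}: drive bay {bay}, initiator {initiator}")
```

If the mapping holds, ledctl numbers the bays in reverse port order within each group of four, which is worth keeping in mind when re-cabling to correct the offset.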

However, the problem still remains for the second AMD controller. This particular motherboard (H12SSL-i) has eight onboard SATA ports, two 7/8-pin SGPIO connectors, and one SlimSAS x8 connector.

Four of the onboard SATA ports (and, I'm assuming, one of the SGPIO connectors, since I don't have physical access yet) seemingly work fine, or will once I move the connections around so that the correct status LEDs light up. What I don't understand is how ledctl is supposed to address the sideband portion of the SlimSAS x8 connection to control the status LEDs for the bays attached to it. I would have assumed a second SGPIO cable would be necessary alongside the SlimSAS connection, but no such cabling exists.

So my question is: should Linux/ledctl be able to use the sideband signaling of the SlimSAS connection via the AMD controller, or is that impossible because the signaling may simply not be connected between the SlimSAS connector and the AMD controller?

ktanska commented 5 months ago

Hi @nfont, could you look at this issue, please?

mtkaczyk commented 1 month ago

No feedback from @nfont, closing.