stanford-rc / ibswinfo

Command-line tool to retrieve information and monitor Mellanox un-managed Infiniband switches
GNU General Public License v3.0

SB7790 switch support #6

Open kramanella opened 4 years ago

kramanella commented 4 years ago

ibswinfo supports Mellanox SB7790 unmanaged switches running firmware 11.1100.0072 or greater, with one exception: the '-T' flag is unsupported. All other info and vitals are captured. Thanks! Mark

Sample output: `...

temperature (C) | 50
max temp (C) | 56
QSFP#01 (C) | 0
QSFP#02 (C) | 0
QSFP#03 (C) | 0
QSFP#04 (C) | 0
QSFP#05 (C) | 0
QSFP#06 (C) | 0
QSFP#07 (C) | 0
QSFP#08 (C) | 0
QSFP#09 (C) | 0
QSFP#10 (C) | 0
QSFP#11 (C) | 0
QSFP#12 (C) | 0
QSFP#13 (C) | 0
QSFP#14 (C) | 0
QSFP#15 (C) | 0
QSFP#16 (C) | 0
QSFP#17 (C) | 0
QSFP#18 (C) | 0
QSFP#19 (C) | 0
QSFP#20 (C) | 0
QSFP#21 (C) | 0
QSFP#22 (C) | 0
QSFP#23 (C) | 0
QSFP#24 (C) | 0
QSFP#25 (C) | 0
QSFP#26 (C) | 0
QSFP#27 (C) | 0
QSFP#28 (C) | 0
QSFP#29 (C) | 0
QSFP#30 (C) | 0
QSFP#31 (C) | 0
QSFP#32 (C) | 0
QSFP#33 (C) | 0
QSFP#34 (C) | 0
QSFP#35 (C) | 0
QSFP#36 (C) | 0

...`

kcgthb commented 4 years ago

Hi @kramanella

Ah, interesting!

Would you mind sending me the output of:

# ibswinfo.sh -d <device_id> -o inventory | egrep '^part_number|version'

as well as:

# mlxreg -d <device_id> --reg_name MTMP --get --indexes "sensor_index=0x1"

And of course, you're positive that there are cables plugged in those ports, right?

kramanella commented 4 years ago

Yep, the switch is fully populated :) Here's the info:

part_number : MSB7790-ES2F
fw_version : 11.2007.0300

And the full output as well with the -T flag:

./ibswinfo.sh -d /dev/mst/SW_MT52000_SwitchIB_Mellanox_Technologies_lid-0x000C -T

SwitchIB Mellanox Technologies

part number | MSB7790-ES2F
serial number | MT......
product name | Scorpion IB EDR Unmanaged
revision | AD
ports | 36
PSID | 1_TM108830012
GUID | 0x.....
firmware version | 11.2007.0300

uptime (d-h:m:s) | 8d-07:16:52

PSU0 status | OK
P/N | MTEF-PSF-AC-A
S/N | MT.....
DC power | OK
fan status | OK
power (W) | 70
PSU1 status | OK
P/N | MTEF-PSF-AC-A
S/N | MT.....
DC power | OK
fan status | OK
power (W) | 58

temperature (C) | 50
max temp (C) | 56
QSFP#01 (C) | 0
QSFP#02 (C) | 0
QSFP#03 (C) | 0
QSFP#04 (C) | 0
QSFP#05 (C) | 0
QSFP#06 (C) | 0
QSFP#07 (C) | 0
QSFP#08 (C) | 0
QSFP#09 (C) | 0
QSFP#10 (C) | 0
QSFP#11 (C) | 0
QSFP#12 (C) | 0
QSFP#13 (C) | 0
QSFP#14 (C) | 0
QSFP#15 (C) | 0
QSFP#16 (C) | 0
QSFP#17 (C) | 0
QSFP#18 (C) | 0
QSFP#19 (C) | 0
QSFP#20 (C) | 0
QSFP#21 (C) | 0
QSFP#22 (C) | 0
QSFP#23 (C) | 0
QSFP#24 (C) | 0
QSFP#25 (C) | 0
QSFP#26 (C) | 0
QSFP#27 (C) | 0
QSFP#28 (C) | 0
QSFP#29 (C) | 0
QSFP#30 (C) | 0
QSFP#31 (C) | 0
QSFP#32 (C) | 0
QSFP#33 (C) | 0
QSFP#34 (C) | 0
QSFP#35 (C) | 0
QSFP#36 (C) | 0

fan status | OK
fan#1 (rpm) | 6399
fan#2 (rpm) | 5430
fan#3 (rpm) | 6399
fan#4 (rpm) | 5345
fan#5 (rpm) | 6281
fan#6 (rpm) | 5430
fan#7 (rpm) | 6399
fan#8 (rpm) | 5345

mlxreg -d /dev/mst/SW_MT52000_SwitchIB_Mellanox_Technologies_lid-0x000C --reg_name MTMP --get --indexes "sensor_index=0x1"
Sending access register...

Field Name | Data

sensor_index | 0x00000001
temperature | 0x000000e0
max_temperature | 0x000000f8
mtr | 0x00000000
mte | 0x00000000
temperature_threshold_hi | 0x000004b0
tee | 0x00000000
temperature_threshold_lo | 0x000004b0
sensor_name_hi | 0x00000000
sensor_name_lo | 0x00000000

kcgthb commented 4 years ago

Thanks for the output!

It looks like the registers are correctly showing the temperature, so I'm not 100% sure why the script shows 0.
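For readers following along, the raw MTMP values posted above can be turned into degrees Celsius with a small helper. This is only a sketch, assuming the temperature fields are expressed in units of 0.125 °C (so the raw value is divided by 8); `mtmp_to_celsius` is a hypothetical name, not part of ibswinfo or mlxreg, and it ignores negative readings.

```shell
#!/bin/sh
# Hypothetical helper: convert a raw MTMP temperature field (hex) to
# whole degrees Celsius.
# Assumption: the field is in units of 0.125 C, hence the division by 8.
# Negative (two's complement) readings are not handled in this sketch.
mtmp_to_celsius() {
    echo $(( $1 / 8 ))
}

# Values taken from the sensor_index=0x1 output above.
mtmp_to_celsius 0x000000e0   # temperature               -> 28
mtmp_to_celsius 0x000000f8   # max_temperature           -> 31
mtmp_to_celsius 0x000004b0   # temperature_threshold_hi  -> 150
```

Under that assumption, sensor 0x1 reads 28 °C with a recorded maximum of 31 °C, which is a plausible non-zero temperature.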

Could you please try the version from the SB7790 branch at https://github.com/stanford-rc/ibswinfo/blob/SB7790/ibswinfo.sh and see if that fixes the issue?

kramanella commented 4 years ago

Unfortunately the QSFP ports all report 0 still.

kcgthb commented 4 years ago

Ah sorry, I didn't ask for the right index before. Could you please run these 2 commands instead?

# mlxreg -d <device_id> --reg_name MTMP --get --indexes "sensor_index=0x39"
# mlxreg -d <device_id> --reg_name MTMP --get --indexes "sensor_index=0x40"

kramanella commented 4 years ago

Here you go,

[root@dtn01 ~]# mlxreg -d /dev/mst/SW_MT52000_SwitchIB_Mellanox_Technologies_lid-0x000C --reg_name MTMP --get --indexes "sensor_index=0x39"
Sending access register...

-E- Failed to send access register: Bad parameter

[root@dtn01 ~]# mlxreg -d /dev/mst/SW_MT52000_SwitchIB_Mellanox_Technologies_lid-0x000C --reg_name MTMP --get --indexes "sensor_index=0x40"
Sending access register...

Field Name | Data

sensor_index | 0x00000040
temperature | 0x00000000
max_temperature | 0x00000000
mtr | 0x00000000
mte | 0x00000000
temperature_threshold_hi | 0x00000000
tee | 0x00000000
temperature_threshold_lo | 0x00000000
sensor_name_hi | 0x00000000
sensor_name_lo | 0x00000000

kcgthb commented 4 years ago

Thank you!

So that's the problem: temperature is 0x00000000 for sensor 0x40 (which is the first port of the switch). The fact that 0x39 doesn't exist confirms that the indexes are not shifted or anything.
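To check the remaining ports the same way, the sensor index for each port can be derived from the port number. The sketch below assumes the per-port (QSFP) sensors are numbered contiguously starting at 0x40 for port 1, which this thread only confirms for the first port; `port_to_sensor_index` and `print_mtmp_queries` are hypothetical helpers, and the second one only prints the `mlxreg` command lines (same flags as used above) without running them.

```shell
#!/bin/sh
# Hypothetical helper: MTMP sensor index for a 1-based port number.
# Assumption: port sensors are contiguous, starting at 0x40 for port 1.
port_to_sensor_index() {
    printf '0x%x\n' $(( 0x40 + $1 - 1 ))
}

# Hypothetical helper: print (not run) one mlxreg query per port.
# $1 = device path, $2 = number of ports
print_mtmp_queries() {
    dev=$1
    ports=$2
    p=1
    while [ "$p" -le "$ports" ]; do
        idx=$(port_to_sensor_index "$p")
        echo "mlxreg -d $dev --reg_name MTMP --get --indexes \"sensor_index=$idx\""
        p=$(( p + 1 ))
    done
}

port_to_sensor_index 1    # 0x40
port_to_sensor_index 36   # 0x63
```

Running `print_mtmp_queries <device_id> 36` with the real device path would list the 36 queries, which can then be reviewed and run by hand.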

Unfortunately, not much can be done about that: it would be a firmware limitation on that model. :|

I added a note in the README to mention that limitation, thanks a lot for reporting it!