stanford-rc / ibswinfo

Command-line tool to retrieve information and monitor Mellanox un-managed Infiniband switches
GNU General Public License v3.0

Trying to fix fan speed readout for HDR switches #19

Open frantathefranta opened 1 week ago

frantathefranta commented 1 week ago

As noted in https://github.com/stanford-rc/ibswinfo/issues/17, it is currently not possible to read all fan speeds on HDR and NDR switches. What I found (I think) is that these newer switches report additional fans in the tacho_active_msb field of the MFCR register. I tried a somewhat hacky way of reaching those fans with the existing methods, but I'm not sure the result is correct. It only works on HDR switches, where it enumerates 12 fans (the correct number):

$ ./ibswinfo.sh -d SW_MT54000_Quantum_Mellanox_Technologies
[...]
fan status         | OK
fan#1 (rpm)        | 5906
fan#2 (rpm)        | 5379
fan#3 (rpm)        | 5959
fan#4 (rpm)        | 5209
fan#5 (rpm)        | 6068
fan#6 (rpm)        | 5293
fan#7 (rpm)        | 5803
fan#8 (rpm)        | 5293
fan#9 (rpm)        | 5906
fan#12 (rpm)       | 5293
fan#13 (rpm)       | 7808
fan#14 (rpm)       | 5312
-------------------------------------------------

Running the same change on an NDR switch (14 fans) yields odd results: only 13 entries are listed and the last one reads 0 rpm. I think this is because it needs 17 bits to enumerate all the fans, if I understand the logic correctly:

$ ./ibswinfo.sh -d SW_MT54002_Quantum-2_Mellanox_Technologies
[...]
fan status         | OK
fan#1 (rpm)        | 6754
fan#2 (rpm)        | 5964
fan#3 (rpm)        | 6720
fan#4 (rpm)        | 5884
fan#5 (rpm)        | 6859
fan#6 (rpm)        | 6018
fan#7 (rpm)        | 6824
fan#8 (rpm)        | 5964
fan#9 (rpm)        | 6754
fan#12 (rpm)       | 5884
fan#13 (rpm)       | 6824
fan#14 (rpm)       | 5807
fan#15 (rpm)       | 0
-------------------------------------------------
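
The 17-bit guess can be checked with a little arithmetic. The sketch below is only an illustration, not the ibswinfo code: it assumes the bits of tacho_active_msb are stacked at a fixed offset above tacho_active, and the offset of 12 (`MSB_SHIFT`, a hypothetical name) is picked solely because it reproduces the fan numbers printed for the HDR switch. The mask values are the ones mlxreg_ext reports for these two switches in this thread.

```python
# Hypothetical offset at which tacho_active_msb is stacked above tacho_active.
# 12 is an assumption chosen to match the fan numbers printed for the HDR switch;
# it is not a confirmed register layout.
MSB_SHIFT = 12


def fan_indices(tacho_active, tacho_active_msb, msb_shift=MSB_SHIFT):
    """Return the positions of the set bits in the combined tachometer mask."""
    combined = tacho_active | (tacho_active_msb << msb_shift)
    return [bit for bit in range(combined.bit_length()) if (combined >> bit) & 1]


hdr = fan_indices(0x3FE, 0x07)  # MQM8790 (HDR) register values
ndr = fan_indices(0x3FE, 0x1F)  # MQM9790 (NDR) register values

print("HDR fan indices:", hdr)
print("NDR fan indices:", ndr)
print("bits needed for NDR mask:", max(ndr) + 1)
```

Under that assumption the HDR indices come out as 1 to 9 and 12 to 14, and the NDR mask reaches bit 16, so it would not fit in a 16-bit value. That may be related to the truncated NDR listing, but it is a guess.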

I hope this is useful in some way and not a dead end.

frantathefranta commented 1 week ago

Here is a comparison of the tacho_active_msb field in the MFCR register across switch models. Hopefully it is helpful.

MSB7790 switch

[root@ufm1 ufm_reg_testing]# mlxreg_ext  -d SW_MT52000_SwitchIB_Mellanox_Technologies --reg_name MFCR --get
Sending access register...

Field Name          | Data
=================================
pwm_frequency       | 0x00000044
pwm_active          | 0x00000001
tacho_active        | 0x000001fe
tacho_active_msb    | 0x00000000
=================================

MSB8790

[root@ufm1 ufm_reg_testing]# mlxreg_ext  -d SW_MT53000_SwitchIB_Mellanox_Technologies --reg_name MFCR --get
Sending access register...

Field Name          | Data
=================================
pwm_frequency       | 0x00000044
pwm_active          | 0x00000001
tacho_active        | 0x000001fe
tacho_active_msb    | 0x00000000
=================================

MQM8790 (HDR)

[root@ufm1 ufm_reg_testing]# mlxreg_ext  -d SW_MT54000_Quantum_Mellanox_Technologies --reg_name MFCR --get
Sending access register...

Field Name          | Data
=================================
pwm_frequency       | 0x00000044
pwm_active          | 0x00000001
tacho_active        | 0x000003fe
tacho_active_msb    | 0x00000007
=================================

MQM9790 (NDR)

[root@ufm1 ufm_reg_testing]# mlxreg_ext  -d SW_MT54002_Quantum-2_Mellanox_Technologies --reg_name MFCR --get
Sending access register...

Field Name          | Data
=================================
pwm_frequency       | 0x00000044
pwm_active          | 0x00000001
tacho_active        | 0x000003fe
tacho_active_msb    | 0x0000001f
=================================
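
As a quick sanity check on the dumps above, the set bits in both fields can be counted per model. This is a minimal sketch that assumes each set bit in tacho_active and tacho_active_msb corresponds to one active tachometer; the function name `count_active` is made up for illustration.

```python
# MFCR values copied from the mlxreg_ext dumps in this thread:
# model -> (tacho_active, tacho_active_msb)
MFCR = {
    "MSB7790": (0x1FE, 0x00),
    "MSB8790": (0x1FE, 0x00),
    "MQM8790 (HDR)": (0x3FE, 0x07),
    "MQM9790 (NDR)": (0x3FE, 0x1F),
}


def count_active(tacho_active, tacho_active_msb):
    """Count set bits across both fields (assumed: one bit per tachometer)."""
    return bin(tacho_active).count("1") + bin(tacho_active_msb).count("1")


counts = {model: count_active(*fields) for model, fields in MFCR.items()}
for model, n in counts.items():
    print(f"{model}: {n} active tachometers")
```

The totals for the HDR and NDR switches (12 and 14) match the fan counts mentioned earlier, which supports the idea that the extra fans are flagged in tacho_active_msb.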