sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

fanshow and fan management are broken on Dell N3248TE-ON and current release #16666

Open justindthomas opened 1 year ago

justindthomas commented 1 year ago

Description

I'm new to SONiC and installed it on a Dell N3248TE-ON I received a couple of days ago. On initial boot, the switch actively managed the fans (i.e., the speed was constantly changing, presumably in response to load, but was fairly quiet on average).

After installing SONiC, the fans just run at a high (loud) speed constantly. Commands to show the fan status fail with Python errors.

root@sonic:~# show platform fan
Traceback (most recent call last):
  File "/usr/local/bin/fanshow", line 85, in <module>
    fanShow.show()
  File "/usr/local/bin/fanshow", line 75, in show
    table.append((data_dict[DRAWER_FIELD_NAME], data_dict[LED_STATUS_FIELD_NAME], name, speed, data_dict[DIRECTION_FIELD_NAME], presence, status,
KeyError: 'drawer_name'
root@sonic:~# show environment
  File "/usr/bin/platform_sensors.py", line 20
    print line
          ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(line)?
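
The KeyError above suggests that at least one fan record read by fanshow has no 'drawer_name' field. A minimal sketch of a tolerant lookup follows, assuming the records are plain dicts. Only the 'drawer_name' key comes from the traceback; build_row and the sample record are hypothetical, and this is not the actual fanshow code.

```python
# Hypothetical sketch: substitute a placeholder for a missing field
# instead of raising KeyError. Not the real fanshow implementation.
DRAWER_FIELD_NAME = 'drawer_name'  # key name taken from the KeyError above


def build_row(name, data_dict, missing='N/A'):
    """Return (drawer, fan name), using `missing` when the drawer is absent."""
    drawer = data_dict.get(DRAWER_FIELD_NAME, missing)
    return (drawer, name)


# A record without a drawer field (made-up sample data, not switch output)
row = build_row('PSU1 Fan', {'speed': '15'})
print(row)
```

Whether a placeholder is the right behavior, or the missing field points to a deeper platform problem, is a separate question that this sketch does not settle.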

Steps to reproduce the issue:

  1. Turn it on.
  2. show platform fan
  3. show environment

Describe the results you received:

  1. Loud, constantly high fan speed.
  2. Error message (above) for show platform fan
  3. Error message (above) for show environment

Describe the results you expected:

  1. Managed fan speed like the OEM (Dell) software on the switch.
  2. Something showing me details about the fans.
  3. Something showing me details about the environment.

Output of show version:

root@sonic:~# show ver

SONiC Software Version: SONiC.master.366447-72341a7ee
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: 72341a7ee
Build date: Tue Sep 19 12:43:09 UTC 2023
Built by: AzDevOps@vmss-soni0021EJ

Platform: x86_64-dellemc_n3248te_c3338-r0
HwSKU: DellEMC-N3248TE
ASIC: broadcom
ASIC Count: 1
Serial Number: 4GNXV43
Model Number: 0WNWT9
Hardware Revision: 
Uptime: 00:12:28 up 1 day, 16:09,  1 user,  load average: 0.54, 0.54, 0.60
Date: Sat 23 Sep 2023 00:12:28

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-gbsyncd-broncos        latest                    864fedd2e6a4   349MB
docker-gbsyncd-broncos        master.366447-72341a7ee   864fedd2e6a4   349MB
docker-gbsyncd-credo          latest                    e9b286143afd   319MB
docker-gbsyncd-credo          master.366447-72341a7ee   e9b286143afd   319MB
docker-syncd-brcm             latest                    c77995c75fb4   673MB
docker-syncd-brcm             master.366447-72341a7ee   c77995c75fb4   673MB
docker-macsec                 latest                    123ab862c7ba   325MB
docker-dhcp-relay             latest                    874362975b61   307MB
docker-orchagent              latest                    3a670bf2f74c   335MB
docker-orchagent              master.366447-72341a7ee   3a670bf2f74c   335MB
docker-fpm-frr                latest                    3102e2ac34f7   355MB
docker-fpm-frr                master.366447-72341a7ee   3102e2ac34f7   355MB
docker-eventd                 latest                    445ee5b2391d   299MB
docker-eventd                 master.366447-72341a7ee   445ee5b2391d   299MB
docker-nat                    latest                    fe6b6887a1a9   326MB
docker-nat                    master.366447-72341a7ee   fe6b6887a1a9   326MB
docker-sflow                  latest                    0bee6fa20dea   325MB
docker-sflow                  master.366447-72341a7ee   0bee6fa20dea   325MB
docker-teamd                  latest                    b291ac9beef4   323MB
docker-teamd                  master.366447-72341a7ee   b291ac9beef4   323MB
docker-snmp                   latest                    529de93e25be   338MB
docker-snmp                   master.366447-72341a7ee   529de93e25be   338MB
docker-sonic-telemetry        latest                    ab34c1cab1d0   600MB
docker-sonic-telemetry        master.366447-72341a7ee   ab34c1cab1d0   600MB
docker-platform-monitor       latest                    b908f4ae5295   419MB
docker-platform-monitor       master.366447-72341a7ee   b908f4ae5295   419MB
docker-router-advertiser      latest                    7a55092937dd   299MB
docker-router-advertiser      master.366447-72341a7ee   7a55092937dd   299MB
docker-lldp                   latest                    5a25c9484782   341MB
docker-lldp                   master.366447-72341a7ee   5a25c9484782   341MB
docker-database               latest                    d6b0da49e72c   299MB
docker-database               master.366447-72341a7ee   d6b0da49e72c   299MB
docker-mux                    latest                    7dae38e79e3c   348MB
docker-mux                    master.366447-72341a7ee   7dae38e79e3c   348MB
docker-sonic-mgmt-framework   latest                    3a33b23aba95   416MB
docker-sonic-mgmt-framework   master.366447-72341a7ee   3a33b23aba95   416MB

Output of show techsupport:

techsupport.txt

Additional information you deem important (e.g. issue happens only occasionally):

The dump file is 31MB and GitHub rejects files over 25MB.

judyjoseph commented 1 year ago

@jeff-yin Could you check this platform in master?

justindthomas commented 1 year ago

As I'm becoming more comfortable with SONiC, I wonder if the fact that I don't have a second PSU installed might be playing into this. I've noticed that sometimes small configuration changes can cause problems in modules that have different expectations.

I don't have a second PSU to plug in, but that might be something to investigate.

jeff-yin commented 1 year ago

Usually the fans will go to 100% when a FAN module is removed. I don't think lacking a PSU would trigger this. There may be some missing thermal policy code for this platform. I've asked a couple of people at Dell to look into it. @arunlk-dell @vpsubramaniam

justindthomas commented 1 year ago

I checked last night to see that all 3 fan modules were running and they're all moving air. The LEDs on all of them are off.

justindthomas commented 1 year ago

The speed of the fans does not seem to be dependent on the PSU presence, but the display of the status does. I picked up a second PSU and the command show platform fan now works.

jdt@sonic:~$ sudo show platform fan
  Drawer    LED            FAN    Speed    Direction    Presence    Status          Timestamp
--------  -----  -------------  -------  -----------  ----------  --------  -----------------
FanTray1    N/A  FanTray1-Fan1      57%       intake     Present        OK  20231014 02:51:38
FanTray2    N/A  FanTray2-Fan1      59%       intake     Present        OK  20231014 02:51:38
FanTray3    N/A  FanTray3-Fan1      58%       intake     Present        OK  20231014 02:51:38
     N/A    N/A       PSU1 Fan      15%       intake     Present        OK  20231014 02:51:38
     N/A    N/A       PSU2 Fan      15%       intake     Present        OK  20231014 02:51:39

So they aren't running at 100%, but they are running at a constant higher speed than the default software (Dell OS6, I believe) that came on the switch. Maybe that's normal? It seems like the temperatures could tolerate a less aggressive setting.

jdt@sonic:~$ sudo show platform temperature
                      Sensor    Temperature    High TH    Low TH    Crit High TH    Crit Low TH    Warning          Timestamp
----------------------------  -------------  ---------  --------  --------------  -------------  ---------  -----------------
 Front Panel PHY Temperature         30.687         75         0             N/A            N/A      False  20231014 02:51:39
 Middle Fan Tray Temperature         23.312         75         0             N/A            N/A      False  20231014 02:51:39
Near Front Panel Temperature         29.25          75         0             N/A            N/A      False  20231014 02:51:39
     Switch Near Temperature         29.75          75         0             N/A            N/A      False  20231014 02:51:39
     Switch Rear Temperature         24.5           75         0             N/A            N/A      False  20231014 02:51:39

Also, show environment is still broken as described in the original issue.

arunlk-dell commented 1 year ago

@justindthomas I will raise a pull request by next week to fix the 'show environment' and 'show platform fan' commands. For the fan speed, we will bring in the thermal manager changes soon.

justindthomas commented 1 year ago

That's great, @arunlk-dell - thanks!

justindthomas commented 11 months ago

For the fan speed issues, I was able to tame them by adjusting these values:

/sys/bus/i2c/devices/7-002c/pwm1
/sys/bus/i2c/devices/7-002c/pwm2
/sys/bus/i2c/devices/7-002c/pwm3

By default, these are all set to 255, with /sys/bus/i2c/devices/7-002c/pwm#_enable set to 0, which results in the continuous ~58% speed for all of them. I changed the pwm values to 100 and the speeds dropped to between 20% and 30% and seem more varied, as if the system is properly responding to the changing temperature.

Does that parameter adjust the aggressiveness of the thermal algorithm? I experimented with changing the pwm_enable to 1, 2, and 3, but only 0 and 3 seem to be enabled. And 3 sets the fans to 4% and triggers a fault indicator, so that's clearly not appropriate.
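
The manual adjustment described above could be scripted roughly as follows. This is a sketch, not a supported tool: the directory and the pwm1 to pwm3 file names are the ones quoted in this comment, while set_pwm, get_pwm, and the 0-255 range check are assumptions (the range follows the generic Linux hwmon sysfs convention, which this driver may or may not follow exactly). Writing to the real files needs root, and sysfs writes are not expected to survive a reboot.

```python
import os
import tempfile

# Directory and attribute names are the ones quoted in the comment above.
PWM_DIR = '/sys/bus/i2c/devices/7-002c'


def set_pwm(value, channels=(1, 2, 3), base_dir=PWM_DIR):
    """Write one duty-cycle value (assumed range 0-255) to each pwmN file."""
    if not 0 <= value <= 255:
        raise ValueError('pwm value must be between 0 and 255')
    for ch in channels:
        with open(os.path.join(base_dir, 'pwm%d' % ch), 'w') as f:
            f.write(str(value))


def get_pwm(channels=(1, 2, 3), base_dir=PWM_DIR):
    """Read the current pwmN values back as integers, keyed by channel."""
    values = {}
    for ch in channels:
        with open(os.path.join(base_dir, 'pwm%d' % ch)) as f:
            values[ch] = int(f.read().strip())
    return values


if __name__ == '__main__':
    # Dry run against a temporary directory, so no real hardware is touched.
    with tempfile.TemporaryDirectory() as tmp:
        set_pwm(100, base_dir=tmp)
        print(get_pwm(base_dir=tmp))
```

On the switch itself, calling set_pwm(100) with the default base_dir would write the same value that was applied by hand above.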

justindthomas commented 11 months ago

Here's the corrected platform_sensors.py file to make show environment work properly. I changed the print statements to add parentheses, and I specified the iso-8859-1 encoding for the eeprom output, since that seems to be what the switch generates. I also changed that first check_output at the top to specify text=True.

Should I submit a PR? I assume this belongs in platform-specific code somewhere.
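
For readers unfamiliar with the two non-obvious changes, here is a small self-contained illustration. It is a sketch under stated assumptions: a child Python process stands in for /usr/bin/sensors, and the chip name and byte string are made-up sample data, not real sensor or EEPROM output.

```python
import os
import subprocess
import sys
import tempfile

# 1. text=True makes check_output return str rather than bytes, so the
#    lines from splitlines() can be tested with startswith('...').
#    A child Python process stands in for /usr/bin/sensors here.
out = subprocess.check_output(
    [sys.executable, '-c', 'print("tmp75-example"); print("temp1: +30.0 C")'],
    text=True,
).splitlines()
print(out[0].startswith('tmp75'))

# 2. An EEPROM dump can contain bytes that are not valid UTF-8.
#    ISO-8859-1 maps every byte value to a character, so decoding never fails.
raw = b'\xff\x00FORWARD\xfe\n'  # made-up sample bytes, not a real EEPROM dump
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
try:
    with open(tmp.name, encoding='iso-8859-1') as f:
        first_line = f.readline()
    print('FORWARD' in first_line)
finally:
    os.unlink(tmp.name)
```

Opening the same file with the default UTF-8 encoding would typically raise UnicodeDecodeError on the 0xff byte, which is consistent with the reason given above for specifying iso-8859-1.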

#!/usr/bin/python
# This provides support for the following objects:
#   * Onboard temperature sensors
#   * FAN trays
#   * PSU

import subprocess

output = ""
try:
    rc = 0
    output = subprocess.check_output('/usr/bin/sensors', text=True).splitlines()

    valid = False
    for line in output:
        if line.startswith('acpitz') or line.startswith('coretemp'):
            valid = True
        if valid:
            print(line)
            if line == '': valid = False

    print("Onboard Temperature Sensors:")
    idx = 0
    for line in output:
        if line.startswith('tmp75'):
            print('\t' + output[idx+2].split('(')[0])
        idx += 1

    print("\nFanTrays:")
    idx = 0
    found_emc = False
    for line in output:
        if line.startswith('emc'):
            found_emc = True
            with open('/sys/devices/platform/dell-n3248te-cpld.0/fan0_prs') as f:
                line = f.readline()
            present = int(line, 0)
            if present :
                print('\t' + 'FanTray1:')
                print('\t\t' + 'Fan Speed:' + (output[idx+2].split('(')[0]).split(':')[1])
                with open('/sys/devices/platform/dell-n3248te-cpld.0/fan0_dir') as f:
                    line = f.readline()
                dir = 'Intake' if line[:-1] == 'B2F' else 'Exhaust'
                print('\t\t' + 'Airflow:\t' + dir)
            else : print('\t' + 'FanTray1:\tNot Present')

            with open('/sys/devices/platform/dell-n3248te-cpld.0/fan1_prs') as f:
                line = f.readline()
            present = int(line, 0)
            if present :
                print('\t' + 'FanTray2:')
                print('\t\t' + 'Fan Speed:' + (output[idx+3].split('(')[0]).split(':')[1])
                with open('/sys/devices/platform/dell-n3248te-cpld.0/fan1_dir') as f:
                    line = f.readline()
                dir = 'Intake' if line[:-1] == 'B2F' else 'Exhaust'
                print('\t\t' + 'Airflow:\t' + dir)
            else : print('\t' + 'FanTray2:\tNot Present')

            with open('/sys/devices/platform/dell-n3248te-cpld.0/fan2_prs') as f:
                line = f.readline()
            present = int(line, 0)
            if present :
                print('\t' + 'FanTray3:')
                print('\t\t' + 'Fan Speed:' + (output[idx+4].split('(')[0]).split(':')[1])
                with open('/sys/devices/platform/dell-n3248te-cpld.0/fan2_dir') as f:
                    line = f.readline()
                dir = 'Intake' if line[:-1] == 'B2F' else 'Exhaust'
                print('\t\t' + 'Airflow:\t' + dir)
            else : print('\t' + 'FanTray3:\tNot Present')
        idx += 1
    if not found_emc :
        print('\t' + 'FanTray1:\tNot Present')
        print('\t' + 'FanTray2:\tNot Present')
        print('\t' + 'FanTray3:\tNot Present')

    print('\nPSUs:')
    idx = 0
    with open('/sys/devices/platform/dell-n3248te-cpld.0/psu0_prs') as f:
        line = f.readline()
    found_psu1 = int(line, 0)
    if not found_psu1 :
        print('\tPSU1:\tNot Present')
    with open('/sys/devices/platform/dell-n3248te-cpld.0/psu1_prs') as f:
        line = f.readline()
    found_psu2 = int(line, 0)
    for line in output:
        if line.startswith('dps460-i2c-10'):
            with open('/sys/devices/platform/dell-n3248te-cpld.0/psu0_status') as f:
                line = f.readline()
            status = int(line, 0)
            if not status :
                print('\tPSU1:\tNot OK')
                break
            with open('/sys/bus/i2c/devices/10-0056/eeprom', encoding='iso-8859-1') as f:
                line = f.readline()
            dir = 'Exhaust' if 'FORWARD' in line else 'Intake'
            print('\tPSU1:')
            print('\t\t' + output[idx+2].split('(')[0])
            print('\t\t' + output[idx+4].split('(')[0])
            print('\t\t' + output[idx+6].split('(')[0])
            print('\t\t' + output[idx+7].split('(')[0])
            print('\t\t' + output[idx+9].split('(')[0])
            print('\t\t' + output[idx+11].split('(')[0])
            print('\t\t' + output[idx+12].split('(')[0])
            print('\t\t' + output[idx+14].split('(')[0])
            print('\t\t' + output[idx+15].split('(')[0])
            print('\t\t' + 'Airflow:\t\t  ' + dir)
        if line.startswith('dps460-i2c-11'):
            with open('/sys/devices/platform/dell-n3248te-cpld.0/psu1_status') as f:
                line = f.readline()
            status = int(line, 0)
            if not status :
                print('\tPSU2:\tNot OK')
                break
            print('\tPSU2:')
            with open('/sys/bus/i2c/devices/11-0056/eeprom', encoding='iso-8859-1') as f:
                line = f.readline()
            dir = 'Exhaust' if 'FORWARD' in line else 'Intake'
            print('\t\t' + output[idx+2].split('(')[0])
            print('\t\t' + output[idx+4].split('(')[0])
            print('\t\t' + output[idx+6].split('(')[0])
            print('\t\t' + output[idx+7].split('(')[0])
            print('\t\t' + output[idx+9].split('(')[0])
            print('\t\t' + output[idx+11].split('(')[0])
            print('\t\t' + output[idx+12].split('(')[0])
            print('\t\t' + output[idx+14].split('(')[0])
            print('\t\t' + output[idx+15].split('(')[0])
            print('\t\t' + 'Airflow:\t\t  ' + dir)
        idx += 1
    if not found_psu2 :
        print('\tPSU2:\tNot Present')

except subprocess.CalledProcessError as err:
    print ("Exception when calling get_sonic_error -> %s\n" %(err))
    rc = err.returncode

justindthomas commented 11 months ago

PR submitted here: https://github.com/sonic-net/sonic-buildimage/pull/17508
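
As a possible follow-up to the platform_sensors.py script posted earlier in this thread, the three near-identical fan tray blocks could be collapsed into a loop. The sketch below assumes the same CPLD attribute files (fanN_prs, fanN_dir) and the same 'B2F' to intake mapping that the script uses; read_attr, fan_tray_status, and the 'F2B' sample value are hypothetical, and this is not the code in the linked PR.

```python
import os
import tempfile

# Path used by the script earlier in this thread.
CPLD_DIR = '/sys/devices/platform/dell-n3248te-cpld.0'


def read_attr(base_dir, name):
    """Return the first line of a sysfs-style attribute file, stripped."""
    with open(os.path.join(base_dir, name)) as f:
        return f.readline().strip()


def fan_tray_status(count=3, base_dir=CPLD_DIR):
    """Return one dict per fan tray with its presence and airflow direction."""
    trays = []
    for i in range(count):
        present = bool(int(read_attr(base_dir, 'fan%d_prs' % i), 0))
        airflow = None
        if present:
            # Same mapping as the script: 'B2F' means intake, else exhaust.
            direction = read_attr(base_dir, 'fan%d_dir' % i)
            airflow = 'Intake' if direction == 'B2F' else 'Exhaust'
        trays.append({'name': 'FanTray%d' % (i + 1),
                      'present': present, 'airflow': airflow})
    return trays


if __name__ == '__main__':
    # Dry run with fake attribute files; 'F2B' is a made-up sample value.
    fake = {'fan0_prs': '1', 'fan0_dir': 'B2F', 'fan1_prs': '0',
            'fan2_prs': '1', 'fan2_dir': 'F2B'}
    with tempfile.TemporaryDirectory() as tmp:
        for name, value in fake.items():
            with open(os.path.join(tmp, name), 'w') as f:
                f.write(value + '\n')
        for tray in fan_tray_status(base_dir=tmp):
            print(tray)
```

Fan speed is left out on purpose: the script reads it from fixed offsets in the sensors output, and that layout is not something this sketch can verify.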