sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

config load_minigraph failed with "Job for sflow.service failed" #4173

Open wangxin opened 4 years ago

wangxin commented 4 years ago

Description

After upgrading the switch and running "config load_minigraph -y", the command failed with "Job for sflow.service failed".

Steps to reproduce the issue:

  1. Upgrade the switch
  2. Run "sudo config load_minigraph -y"

Describe the results you received: Config load_minigraph failed with "Job for sflow.service failed"

Describe the results you expected: Config load_minigraph should be successful

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
$ show version

SONiC Software Version: SONiC.HEAD.164-12212467
Distribution: Debian 9.11
Kernel: 4.9.0-9-2-amd64
Build commit: 12212467
Build date: Sun Jan  5 05:53:09 UTC 2020
Built by: johnar@jenkins-worker-7

Platform: x86_64-mlnx_msn2700_simx-r0
HwSKU: ACS-MSN2700
ASIC: mellanox
/usr/bin/decode-syseeprom : ERROR : No syseeprom symlink or cache file found
Serial Number:
Uptime: 06:09:11 up  8:34,  2 users,  load average: 0.28, 0.95, 0.55

Docker images:
REPOSITORY                    TAG                 IMAGE ID            SIZE
docker-syncd-mlnx             HEAD.164-12212467   bb9a5d6982ab        377MB
docker-syncd-mlnx             latest              bb9a5d6982ab        377MB
docker-platform-monitor       HEAD.164-12212467   ee0256482773        569MB
docker-platform-monitor       latest              ee0256482773        569MB
docker-fpm-frr                HEAD.164-12212467   877caa29466a        325MB
docker-fpm-frr                latest              877caa29466a        325MB
docker-sflow                  HEAD.164-12212467   855c1c0629bc        305MB
docker-sflow                  latest              855c1c0629bc        305MB
docker-lldp-sv2               HEAD.164-12212467   bb358f8d389c        303MB
docker-lldp-sv2               latest              bb358f8d389c        303MB
docker-orchagent              HEAD.164-12212467   c63f0ed23e05        323MB
docker-orchagent              latest              c63f0ed23e05        323MB
docker-dhcp-relay             HEAD.164-12212467   ff963b77f067        290MB
docker-dhcp-relay             latest              ff963b77f067        290MB
docker-database               HEAD.164-12212467   579b8e5a9108        282MB
docker-database               latest              579b8e5a9108        282MB
docker-snmp-sv2               HEAD.164-12212467   648de813c3c8        339MB
docker-snmp-sv2               latest              648de813c3c8        339MB
docker-teamd                  HEAD.164-12212467   0e2c02696fe3        305MB
docker-teamd                  latest              0e2c02696fe3        305MB
docker-sonic-mgmt-framework   HEAD.164-12212467   be8fabb24421        330MB
docker-sonic-mgmt-framework   latest              be8fabb24421        330MB
docker-sonic-telemetry        HEAD.164-12212467   b5071cfd0dd8        343MB
docker-sonic-telemetry        latest              b5071cfd0dd8        343MB
docker-router-advertiser      HEAD.164-12212467   cbdd2a63a6f4        282MB
docker-router-advertiser      latest              cbdd2a63a6f4        282MB

        "Stopping service swss ...",
        "Stopping service lldp ...",
        "Stopping service bgp ...",
        "Stopping service hostcfgd ...",
        "Running command: /usr/local/bin/sonic-cfggen -H -m -j /etc/sonic/init_cfg.json --write-to-db",
        "Running command: pfcwd start_default",
        "Running command: config qos reload",
        "Running command: /usr/local/bin/sonic-cfggen -d -t /usr/share/sonic/device/x86_64-mlnx_msn2700_simx-r0/ACS-MSN2700/buffers.json.j2 >/tmp/buffers.json",
        "Running command: /usr/local/bin/sonic-cfggen -d -t /usr/share/sonic/device/x86_64-mlnx_msn2700_simx-r0/ACS-MSN2700/qos.json.j2 -y /etc/sonic/sonic_version.yml >/tmp/qos.json",
        "Running command: /usr/local/bin/sonic-cfggen -j /tmp/buffers.json --write-to-db",
        "Running command: /usr/local/bin/sonic-cfggen -j /tmp/qos.json --write-to-db",
        "",
        "Resetting failed status for service bgp ...",
        "Resetting failed status for service dhcp_relay ...",
        "Resetting failed status for service hostcfgd ...",
        "Resetting failed status for service hostname-config ...",
        "Resetting failed status for service interfaces-config ...",
        "Resetting failed status for service lldp ...",
        "Resetting failed status for service ntp-config ...",
        "Resetting failed status for service pmon ...",
        "Resetting failed status for service radv ...",
        "Resetting failed status for service rsyslog-config ...",
        "Resetting failed status for service snmp ...",
        "Resetting failed status for service swss ...",
        "Resetting failed status for service syncd ...",
        "Resetting failed status for service teamd ...",
        "Restarting service hostname-config ...",
        "Restarting service interfaces-config ...",
        "Restarting service ntp-config ...",
        "Restarting service rsyslog-config ...",
        "Restarting service swss ...",
        "Restarting service bgp ...",
        "Restarting service lldp ...",
        "Restarting service hostcfgd ...",
        "Restarting service sflow ..."

**Attach debug file `sudo generate_dump`:**

sonic_dump_dev-r-vrt-233-010_20200107_060656.tar.gz

abdosi commented 4 years ago

@wangxin Please check the latest master. It should be fixed now.

wangxin commented 4 years ago

I am closing this issue since it was not observed on SONiC-OS-HEAD.39-887ea003 from the 201911 branch.

mykolaf commented 4 years ago

@abdosi @wangxin What was the root cause/fix last time? We are observing it again on:

SONiC Software Version: SONiC.201911.132-a47add53
Distribution: Debian 9.12
Kernel: 4.9.0-11-2-amd64
Build commit: a47add53
Build date: Wed Jul  8 04:10:32 UTC 2020
Built by: johnar@jenkins-worker-8

Full log:

$ sudo config reload -y
Executing stop of service swss...
Executing stop of service lldp...
Executing stop of service bgp...
Executing stop of service hostcfgd...
Executing stop of service nat...
Running command: /usr/local/bin/sonic-cfggen -j /etc/sonic/init_cfg.json -j /etc/sonic/config_db.json --write-to-db
Running command: /usr/bin/db_migrator.py -o migrate
Executing reset-failed of service bgp...
Executing reset-failed of service dhcp_relay...
Executing reset-failed of service hostcfgd...
Executing reset-failed of service hostname-config...
Executing reset-failed of service interfaces-config...
Executing reset-failed of service lldp...
Executing reset-failed of service ntp-config...
Executing reset-failed of service pmon...
Executing reset-failed of service radv...
Executing reset-failed of service rsyslog-config...
Executing reset-failed of service snmp...
Executing reset-failed of service swss...
Executing reset-failed of service syncd...
Executing reset-failed of service teamd...
Executing reset-failed of service nat...
Executing reset-failed of service sflow...
Executing restart of service hostname-config...
Executing restart of service interfaces-config...
Executing restart of service ntp-config...
Executing restart of service rsyslog-config...
Executing restart of service swss...
Executing restart of service bgp...
Executing restart of service lldp...
Executing restart of service hostcfgd...
Executing restart of service nat...
Executing restart of service sflow...
Failed to restart sflow.service: Unit sflow.service is masked.
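
The last line of the log above shows the restart failing because the unit is masked. As a minimal sketch of one way a caller could avoid this, and not the actual SONiC `config` implementation, the snippet below asks `systemctl is-enabled` for the unit file state and skips the restart when the unit is masked. The function names (`should_restart`, `unit_file_state`, `restart_if_allowed`) are hypothetical and exist only for this illustration.

```python
import subprocess

# Unit file states printed by `systemctl is-enabled` for which a restart
# is refused by systemd ("Unit ... is masked").
MASKED_STATES = {"masked", "masked-runtime"}


def should_restart(state: str) -> bool:
    """Return True unless the reported unit file state says the unit is masked."""
    return state.strip() not in MASKED_STATES


def unit_file_state(unit: str) -> str:
    """Query systemd for the unit file state (e.g. 'enabled', 'disabled', 'masked')."""
    # `is-enabled` exits non-zero for masked/disabled units, so do not use check=True.
    result = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def restart_if_allowed(unit: str) -> bool:
    """Restart the unit unless it is masked; return True if a restart was issued."""
    state = unit_file_state(unit)
    if not should_restart(state):
        print(f"Skipping restart of {unit}: unit file state is '{state}'")
        return False
    subprocess.run(["systemctl", "restart", unit], check=True)
    return True
```

On a switch in the state shown here, `restart_if_allowed("sflow.service")` would be expected to skip the restart instead of returning an error, assuming `systemctl is-enabled` reports the unit as masked.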

admin@sonic$ show log sflow
Jul 13 15:31:14.840518 r-tigon-04 ERR monit[853]: 'sflowmgrd' process is not running
Jul 13 15:31:42.140529 r-tigon-04 INFO hostcfgd: Running cmd: 'sudo systemctl stop sflow.service'
Jul 13 15:31:42.191434 r-tigon-04 INFO hostcfgd: Running cmd: 'sudo systemctl disable sflow.service'
Jul 13 15:31:42.322351 r-tigon-04 INFO hostcfgd: Running cmd: 'sudo systemctl mask sflow.service'
Jul 13 15:31:42.437031 r-tigon-04 INFO hostcfgd: Feature 'sflow' is stopped and disabled
Jul 13 15:31:52.773952 r-tigon-04 ERR config: Failed to execute restart of service sflow with error 1
Jul 13 15:32:14.972116 r-tigon-04 ERR monit[853]: 'sflowmgrd' process is not running

prsunny commented 4 years ago

@padmanarayana to provide update

padmanarayana commented 4 years ago

The sflow feature is disabled by default (see "show features"). However, monit does not exclude disabled features, so it still reports 'sflowmgrd' as not running. The issue should not be seen if sflow is enabled in the config. Working on a fix.
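
To illustrate the idea of excluding disabled features from monitoring, here is a rough sketch under stated assumptions; it is not the fix that was eventually implemented. The mapping of process names to features and the feature status dictionary are hypothetical stand-ins for whatever the monit configuration and the feature table actually contain.

```python
from typing import Dict, List


def processes_to_monitor(
    process_feature: Dict[str, str],
    feature_status: Dict[str, str],
) -> List[str]:
    """Return the processes whose owning feature is not marked 'disabled'.

    process_feature: hypothetical map of process name -> feature name.
    feature_status:  hypothetical map of feature name -> 'enabled' / 'disabled'.
    Features missing from feature_status are treated as enabled.
    """
    return [
        process
        for process, feature in process_feature.items()
        if feature_status.get(feature, "enabled") != "disabled"
    ]


# Hypothetical example data: only 'sflowmgrd' and 'sflow' appear in this thread;
# the other entries are placeholders.
process_feature = {"sflowmgrd": "sflow", "lldpmgrd": "lldp"}
feature_status = {"sflow": "disabled", "lldp": "enabled"}

print(processes_to_monitor(process_feature, feature_status))
```

With sflow marked as disabled, `sflowmgrd` is dropped from the list, so a monitor built on this list would not raise the "process is not running" error for it.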

padmanarayana commented 4 years ago

@jleveque and @yozhao101 will be addressing this issue as part of monit infra.

yozhao101 commented 4 years ago

> @jleveque and @yozhao101 will be addressing this issue as part of monit infra.

Yes, I am working on a solution to address this issue.