sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[System ready][ZTP] If system health service starts after ZTP exits in disabled state, sysready status is shown as down. #18814

Closed dgsudharsan closed 3 months ago

dgsudharsan commented 4 months ago

Description

When ZTP is disabled and the system-health service starts after sonic-ztp exits, the system is shown as not ready, with ZTP shown as down.

If system-health starts before ZTP exits, this issue is not seen.

Good state

Apr 27 03:29:21.780881 r-lionfish-16 NOTICE healthd[8081]: Starting up...
Apr 27 03:29:29.479140 sonic INFO sonic-ztp[9050]: ZTP is administratively disabled.
Apr 27 03:30:53.115904 sonic NOTICE healthd: System is ready
redis-cli -n 6 hgetall "ALL_SERVICE_STATUS|ztp"
1) "app_ready_status"
2) "OK"
3) "fail_reason"
4) "-"
5) "service_status"
6) "OK"
7) "update_time"
8) "-"

redis-cli -n 6 hgetall "SYSTEM_READY|SYSTEM_STATE"
1) "Status"
2) "UP"

Issue state

Apr 26 01:25:34.208843 r-tigon-17 INFO sonic-ztp[8798]: ZTP is administratively disabled.
Apr 26 01:25:34.229295 r-tigon-17 NOTICE healthd[9964]: Starting up...

  "ALL_SERVICE_STATUS|ztp": {
    "expireat": 1714084454.4621081,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "app_ready_status": "Down",
      "fail_reason": "Inactive",
      "service_status": "Down",
      "update_time": "-"
    }
  },
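
The two states above differ only in the ztp entry. Below is a minimal Python sketch of how one Down entry can flip the overall result. It assumes, based on the observed output rather than on the healthd source, that the system counts as ready only when every service reports OK for both `service_status` and `app_ready_status`. The function name `is_system_ready` is hypothetical.

```python
def is_system_ready(all_service_status):
    """Return (ready, down_services) for a mapping of service name -> status hash.

    Assumption: a service counts as up only if both fields are "OK".
    """
    down = [
        name
        for name, fields in sorted(all_service_status.items())
        if fields.get("service_status") != "OK"
        or fields.get("app_ready_status") != "OK"
    ]
    return (not down, down)


# Values copied from the "Good state" and "Issue state" dumps in this report.
good_state = {
    "ztp": {"app_ready_status": "OK", "fail_reason": "-",
            "service_status": "OK", "update_time": "-"},
}
issue_state = {
    "ztp": {"app_ready_status": "Down", "fail_reason": "Inactive",
            "service_status": "Down", "update_time": "-"},
}

print(is_system_ready(good_state))
print(is_system_ready(issue_state))
```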

In both scenarios ZTP is disabled

root@r-lionfish-16:~# show ztp status
ZTP Admin Mode : False
ZTP Service    : Inactive
ZTP Status     : Not Started

ZTP Service is not running

root@r-lionfish-16:~#
root@r-lionfish-16:~# service ztp status
● ztp.service - SONiC Zero Touch Provisioning service
     Loaded: loaded (/lib/systemd/system/ztp.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sat 2024-04-27 03:29:29 IDT; 31min ago
   Main PID: 9049 (code=exited, status=0/SUCCESS)

Apr 27 03:30:47 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:47 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:48 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:9: Standard output type syslog is obsolete, automatically updating to journal. Ple>
Apr 27 03:30:51 r-lionfish-16 systemd[1]: /lib/systemd/system/ztp.service:10: Standard output type syslog+console is obsolete, automatically updating to jou>

This issue can be reproduced easily even if ztp starts after healthd: restarting the system-health service results in the problem state.

root@r-lionfish-16:~# show system-health sysready-status
System is ready

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
auditd                  OK                OK                  -
bgp                     OK                OK                  -
caclmgrd                OK                OK                  -
config-chassisdb        OK                OK                  -
config-setup            OK                OK                  -
containerd              OK                OK                  -
cron                    OK                OK                  -
database                OK                OK                  -
determine-reboot-cause  OK                OK                  -
docker                  OK                OK                  -
eventd                  OK                OK                  -
gnmi                    OK                OK                  -
hw-management           OK                OK                  -
hw-management-tc        OK                OK                  -
kdump-tools             OK                OK                  -
lldp                    OK                OK                  -
lm-sensors              OK                OK                  -
mgmt-framework          OK                OK                  -
netfilter-persistent    OK                OK                  -
ntp                     OK                OK                  -
nv-syncd-shared         OK                OK                  -
pmon                    OK                OK                  -
procdockerstatsd        OK                OK                  -
radv                    OK                OK                  -
ras-mc-ctl              OK                OK                  -
rsyslog                 OK                OK                  -
smartmontools           OK                OK                  -
snmp                    OK                OK                  -
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -
root@r-lionfish-16:~#
root@r-lionfish-16:~#
root@r-lionfish-16:~# service system-health restart
root@r-lionfish-16:~#
root@r-lionfish-16:~# show system-health sysready-status
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
auditd                  OK                OK                  -
bgp                     OK                OK                  -
caclmgrd                OK                OK                  -
config-chassisdb        OK                OK                  -
config-setup            OK                OK                  -
containerd              OK                OK                  -
cron                    OK                OK                  -
database                OK                OK                  -
determine-reboot-cause  OK                OK                  -
docker                  OK                OK                  -
eventd                  OK                OK                  -
gnmi                    OK                OK                  -
hw-management           OK                OK                  -
hw-management-tc        OK                OK                  -
kdump-tools             OK                OK                  -
lldp                    OK                OK                  -
lm-sensors              OK                OK                  -
mgmt-framework          OK                OK                  -
netfilter-persistent    OK                OK                  -
ntp                     OK                OK                  -
nv-syncd-shared         OK                OK                  -
pmon                    OK                OK                  -
procdockerstatsd        OK                OK                  -
radv                    OK                OK                  -
ras-mc-ctl              OK                OK                  -
rsyslog                 OK                OK                  -
smartmontools           OK                OK                  -
snmp                    OK                OK                  -
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     Down              Down                Inactive
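
For test automation, the table above can be scanned for rows that are not OK. The following is a rough Python sketch, assuming the whitespace-separated column layout shown in this report; the helper name `down_services` is hypothetical and is not part of the SONiC CLI.

```python
def down_services(table_text):
    """Return names of rows whose Service-Status or App-Ready-Status is not OK."""
    result = []
    in_body = False
    for line in table_text.splitlines():
        parts = line.split()
        if not in_body:
            # The row of dashes under the header marks the start of the data rows.
            in_body = bool(parts) and all(set(p) == {"-"} for p in parts)
            continue
        if len(parts) < 3:
            break  # blank line or shell prompt: end of the table
        name, service_status, app_status = parts[:3]
        if service_status != "OK" or app_status != "OK":
            result.append(name)
    return result


sample = """\
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
swss                    OK                OK                  -
ztp                     Down              Down                Inactive
"""

print(down_services(sample))
```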

Steps to reproduce the issue:

  1. Disable ZTP
  2. Reboot system
  3. Restart system health service

Describe the results you received:

System is shown as not ready

Describe the results you expected:

System should be in ready state as ztp is administratively disabled.

Output of show version:

show version

SONiC Software Version: SONiC.202311_RC.39-c50d88168_Internal_ASAN
SONiC OS Version: 11
Distribution: Debian 11.9
Kernel: 5.10.0-23-2-amd64
Build commit: 1c7a9fb01
Build date: Fri Apr 26 05:36:05 UTC 2024
Built by: sw-r2d2-bot@r-build-sonic-ci03-244

Platform: x86_64-mlnx_msn3420-r0
HwSKU: ACS-MSN3420
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2019X13878
Model Number: MSN3420-CB2FO
Hardware Revision: A1
Uptime: 04:03:48 up 34 min,  1 user,  load average: 0.41, 0.47, 0.46
Date: Sat 27 Apr 2024 04:03:48

Docker images:
REPOSITORY                                         TAG                                    IMAGE ID       SIZE
docker-orchagent                                   202311_RC.39-c50d88168_Internal_ASAN   ea7c8629834a   552MB
docker-orchagent                                   latest                                 ea7c8629834a   552MB
docker-syncd-mlnx                                  202311_RC.39-c50d88168_Internal_ASAN   f662d69a28a0   867MB
docker-syncd-mlnx                                  latest                                 f662d69a28a0   867MB
docker-teamd                                       202311_RC.39-c50d88168_Internal_ASAN   3ede630da1bb   389MB
docker-teamd                                       latest                                 3ede630da1bb   389MB
docker-sflow                                       202311_RC.39-c50d88168_Internal_ASAN   53d7637749d2   390MB
docker-sflow                                       latest                                 53d7637749d2   390MB
docker-platform-monitor                            202311_RC.39-c50d88168_Internal_ASAN   3a97dbe9c972   821MB
docker-platform-monitor                            latest                                 3a97dbe9c972   821MB
docker-fpm-frr                                     202311_RC.39-c50d88168_Internal_ASAN   7757dd696268   420MB
docker-fpm-frr                                     latest                                 7757dd696268   420MB
docker-dhcp-relay                                  latest                                 ef76a9aad7cc   324MB
docker-nat                                         202311_RC.39-c50d88168_Internal_ASAN   250559162cc8   392MB
docker-nat                                         latest                                 250559162cc8   392MB
docker-snmp                                        202311_RC.39-c50d88168_Internal_ASAN   a279906b3fcb   354MB
docker-snmp                                        latest                                 a279906b3fcb   354MB
docker-macsec                                      latest                                 5643e32d9756   391MB
docker-eventd                                      202311_RC.39-c50d88168_Internal_ASAN   ee088c601422   315MB
docker-eventd                                      latest                                 ee088c601422   315MB
docker-lldp                                        202311_RC.39-c50d88168_Internal_ASAN   5a9a70bc2b26   357MB
docker-lldp                                        latest                                 5a9a70bc2b26   357MB
docker-sonic-gnmi                                  202311_RC.39-c50d88168_Internal_ASAN   7686e896871c   403MB
docker-sonic-gnmi                                  latest                                 7686e896871c   403MB
docker-database                                    202311_RC.39-c50d88168_Internal_ASAN   0a98bb5bc3aa   315MB
docker-database                                    latest                                 0a98bb5bc3aa   315MB
docker-mux                                         202311_RC.39-c50d88168_Internal_ASAN   66df1fc03c88   364MB
docker-mux                                         latest                                 66df1fc03c88   364MB
docker-router-advertiser                           202311_RC.39-c50d88168_Internal_ASAN   5700737fe03f   315MB
docker-router-advertiser                           latest                                 5700737fe03f   315MB
docker-sonic-mgmt-framework                        202311_RC.39-c50d88168_Internal_ASAN   0173e0ad3c90   417MB
docker-sonic-mgmt-framework                        latest                                 0173e0ad3c90   417MB

dgsudharsan commented 4 months ago

@adyeung @sg893052 @rajendra-dendukuri FYI. This issue is blocking in some scenarios because sflow depends on system ready; otherwise it waits for 3 minutes. Please refer to https://github.com/sonic-net/SONiC/pull/1627. This results in sflow test failures.

@sflow FYI

sg893052 commented 4 months ago

@dgsudharsan @Junchao-Mellanox We could consider ignoring the ztp service for system ready. Sysmonitor already has logic to skip the services listed in the "services_to_ignore" field of the platform-specific system_health configuration file.

/usr/share/sonic/device/{platform_name}/system_health_monitoring_config.json
{
    "services_to_ignore": ["ztp.service"],   
    "devices_to_ignore": [],
    "user_defined_checkers": [],
    "polling_interval": 60,
    "led_color": {
        "fault": "amber",
        "normal": "green",
        "booting": "orange_blink"
    }
}
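
As an illustration of the effect of such an ignore list, here is a small Python sketch. It assumes exact matching on the unit name, which may not be how sysmonitor actually compares entries; `services_to_monitor` is a hypothetical helper.

```python
import json

# Same content as the configuration example in this comment.
CONFIG_TEXT = """
{
    "services_to_ignore": ["ztp.service"],
    "devices_to_ignore": [],
    "user_defined_checkers": [],
    "polling_interval": 60,
    "led_color": {
        "fault": "amber",
        "normal": "green",
        "booting": "orange_blink"
    }
}
"""


def services_to_monitor(all_services, config_text):
    """Drop every service listed under "services_to_ignore" (exact-name match)."""
    ignored = set(json.loads(config_text).get("services_to_ignore", []))
    return [s for s in all_services if s not in ignored]


print(services_to_monitor(["swss.service", "syncd.service", "ztp.service"], CONFIG_TEXT))
```
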
dgsudharsan commented 4 months ago

@sg893052 This is not a platform-specific issue; it would occur on any platform, since ZTP is a common service. I prefer not to add this to the platform directory. This needs to be handled in the health monitor. For the feature table, we check whether a feature is enabled or disabled and only consider it for system monitoring if it is enabled. The same should be done for ZTP through special handling.
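
A rough Python sketch of the special handling proposed here, not the change made in any actual PR. The function name `services_for_sysready`, the `feature_state` mapping, and the `ztp_admin_enabled` flag are all hypothetical; how the ZTP admin mode is read is left out.

```python
def services_for_sysready(services, feature_state, ztp_admin_enabled):
    """Return the services that should count toward system ready.

    feature_state maps a feature name to "enabled" or "disabled".
    ztp_admin_enabled is the ZTP admin mode (False when ZTP is disabled).
    """
    monitored = []
    for name in services:
        if name == "ztp":
            # Special case: only track ZTP when it is administratively enabled.
            if ztp_admin_enabled:
                monitored.append(name)
            continue
        # Assumption: services without a feature entry are treated as enabled.
        if feature_state.get(name, "enabled") != "disabled":
            monitored.append(name)
    return monitored


print(services_for_sysready(
    ["swss", "sflow", "ztp"],
    {"swss": "enabled", "sflow": "disabled"},
    ztp_admin_enabled=False,
))
```

With ZTP administratively disabled, the ztp entry is left out of the check, so its inactive state cannot mark the system as not ready.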

sg893052 commented 4 months ago

@dgsudharsan @adyeung https://github.com/sonic-net/sonic-buildimage/pull/18911 is the PR raised to address this issue.