sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

Telemetry | ERR systemd[1]: Failed to start Telemetry container #7243

Open TzvikNVDA opened 3 years ago

TzvikNVDA commented 3 years ago

Description

Telemetry container toggling

Steps to reproduce the issue:

  1. Configure the switch.
  2. Run config reload on the switch.
  3. Check the containers.

Describe the results you received:

Telemetry container toggling

Output of show version:

SONiC Software Version: SONiC.SONIC.202012.57-324198e7_Internal
Distribution: Debian 10.9
Kernel: 4.19.0-12-2-amd64
Build commit: 324198e7
Build date: Fri Apr 2 10:05:02 UTC 2021
Built by: sw-r2d2-bot@r-build-sonic-ci03

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700-D48C8
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1811X06319
Uptime: 07:46:18 up 1 day, 23:52, 1 user, load average: 6.14, 7.25, 7.48

Docker images:
REPOSITORY                   TAG                                IMAGE ID      SIZE
docker-syncd-mlnx            SONIC.202012.57-324198e7_Internal  07faff7929ac  686MB
docker-syncd-mlnx            latest                             07faff7929ac  686MB
docker-snmp                  SONIC.202012.57-324198e7_Internal  041b3da6b00c  462MB
docker-snmp                  latest                             041b3da6b00c  462MB
docker-teamd                 SONIC.202012.57-324198e7_Internal  7d141b32b4dc  432MB
docker-teamd                 latest                             7d141b32b4dc  432MB
docker-nat                   SONIC.202012.57-324198e7_Internal  0587932f3fa8  435MB
docker-nat                   latest                             0587932f3fa8  435MB
docker-sonic-mgmt-framework  SONIC.202012.57-324198e7_Internal  5db6fafba003  640MB
docker-sonic-mgmt-framework  latest                             5db6fafba003  640MB
docker-router-advertiser     SONIC.202012.57-324198e7_Internal  5a322eae7e8f  421MB
docker-router-advertiser     latest                             5a322eae7e8f  421MB
docker-platform-monitor      SONIC.202012.57-324198e7_Internal  f77c96e74a07  712MB
docker-platform-monitor      latest                             f77c96e74a07  712MB
docker-lldp                  SONIC.202012.57-324198e7_Internal  503212c33c33  461MB
docker-lldp                  latest                             503212c33c33  461MB
docker-database              SONIC.202012.57-324198e7_Internal  d234e9416fa4  421MB
docker-database              latest                             d234e9416fa4  421MB
docker-orchagent             SONIC.202012.57-324198e7_Internal  68d2a5b4620b  450MB
docker-orchagent             latest                             68d2a5b4620b  450MB
docker-sonic-telemetry       SONIC.202012.57-324198e7_Internal  572a02f08f23  511MB
docker-sonic-telemetry       latest                             572a02f08f23  511MB
docker-fpm-frr               SONIC.202012.57-324198e7_Internal  4a889ea1d538  450MB
docker-fpm-frr               latest                             4a889ea1d538  450MB
docker-dhcp-relay            SONIC.202012.57-324198e7_Internal  7e65064658c2  428MB
docker-dhcp-relay            latest                             7e65064658c2  428MB
docker-sflow                 SONIC.202012.57-324198e7_Internal  ace9637c4a87  433MB
docker-sflow                 latest                             ace9637c4a87  433MB

Output of show techsupport:

sonic_dump_ptr-sonic-n2-t3_20210404_134112.tar.gz
sonic_dump_tgn-sonic-n2-t1_20210404_134107.tar.gz

TzvikNVDA commented 3 years ago

The issue is caused by "status": "disabled" in the FEATURE section of the config:

  "telemetry": {
        "has_per_asic_scope": "False",
        "high_mem_alert": "disabled",
        "auto_restart": "enabled",
        "state": "enabled",
        "has_global_scope": "True",
        "has_timer": "True",
        "state": "enabled",
        **"status": "disabled"**
    },
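
For anyone who wants to check their own config for a similar stray field, here is a minimal Python sketch. It flags any field in a FEATURE entry that falls outside an expected set. The expected set is only an assumption taken from the fields in the entry quoted above, not an official schema, and `unexpected_feature_keys` is a hypothetical helper name. The inline sample leaves out the duplicated "state" line.

```python
import json

# Assumption: the "normal" fields are the ones visible in the entry quoted
# above. This is not taken from any official schema.
EXPECTED_KEYS = {
    "state",
    "auto_restart",
    "high_mem_alert",
    "has_timer",
    "has_global_scope",
    "has_per_asic_scope",
}


def unexpected_feature_keys(config, expected=EXPECTED_KEYS):
    """Map each feature name to the sorted list of fields not in `expected`."""
    report = {}
    for name, fields in config.get("FEATURE", {}).items():
        extra = sorted(set(fields) - set(expected))
        if extra:
            report[name] = extra
    return report


# Inline sample mirroring the telemetry entry from this comment.
sample = json.loads("""
{
  "FEATURE": {
    "telemetry": {
      "has_per_asic_scope": "False",
      "high_mem_alert": "disabled",
      "auto_restart": "enabled",
      "state": "enabled",
      "has_global_scope": "True",
      "has_timer": "True",
      "status": "disabled"
    }
  }
}
""")

print(unexpected_feature_keys(sample))  # {'telemetry': ['status']}
```

On a switch, the same function could be given the parsed contents of the saved config file instead of the inline sample.
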

macikgozwa commented 3 years ago

Hi @TzvikNVDA,

I see the following error message in the provided logs. The telemetry service rejects the value "null" for the -v argument and exits.

Apr  4 13:27:44.404764 ptr-sonic-n2-t3 INFO telemetry#/supervisord: telemetry invalid value "null" for flag -v: strconv.Atoi: parsing "null": invalid syntax
Apr  4 13:27:44.405551 ptr-sonic-n2-t3 INFO telemetry#/supervisord: telemetry Usage of /usr/sbin/telemetry:

This verbosity value comes from the config DB, which is processed by the telemetry start script.

Do you see the verbosity value (log_level) in the config db? It can be checked like this:

sudo redis-cli -n 4 hgetall "TELEMETRY|gnmi"
1) "client_auth"
2) "true"
3) "log_level"
4) "2"
5) "port"
6) "50051"
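
To illustrate why a missing log_level ends up as a startup failure, here is a rough Python sketch of the kind of guard that would keep "null" from reaching the -v flag. It is not the actual start script. `pairs_to_dict` and `safe_log_level` are hypothetical helpers, and the fallback level of 2 is an assumption chosen only because it matches the value shown in the output above.

```python
# Assumption: fall back to level 2 when the field is missing or not numeric.
DEFAULT_LOG_LEVEL = 2


def pairs_to_dict(values):
    """Turn a flat HGETALL-style list [k1, v1, k2, v2, ...] into a dict."""
    return dict(zip(values[0::2], values[1::2]))


def safe_log_level(entry, default=DEFAULT_LOG_LEVEL):
    """Return an integer log level, or `default` if it cannot be parsed."""
    raw = entry.get("log_level")
    try:
        return int(raw)
    except (TypeError, ValueError):
        # TypeError: field missing (None). ValueError: text such as "null".
        return default


healthy = pairs_to_dict(
    ["client_auth", "true", "log_level", "2", "port", "50051"]
)
broken = {"client_auth": "true", "log_level": "null", "port": "50051"}

print(safe_log_level(healthy))  # 2
print(safe_log_level(broken))   # 2 (fallback, instead of passing "null" on)
```

Whether silently falling back is the right behaviour is a design choice; failing loudly with a clear message about the missing field would also be reasonable.
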

maxiestudies commented 3 years ago

I'm having the same problem: the telemetry container is not started at boot, although I was able to restart it manually. The telemetry service is enabled:

root@sonic:~# show feature status
Feature         State           AutoRestart
--------------  --------------  --------------
bgp             enabled         enabled
database        always_enabled  always_enabled
dhcp_relay      enabled         enabled
lldp            enabled         enabled
macsec          disabled        enabled
mgmt-framework  enabled         enabled
nat             disabled        enabled
pmon            enabled         enabled
radv            enabled         enabled
sflow           disabled        enabled
snmp            enabled         enabled
swss            enabled         enabled
syncd           enabled         enabled
teamd           enabled         enabled
telemetry       enabled         enabled

The problem seems to be in the monit config file. From journalctl:

-- A start job for unit monit.service has begun execution.
-- 
-- The job identifier is 3065.
Apr 16 13:16:32 sonic monit[13598]: Starting daemon monitor: monit/etc/monit/conf.d/sonic-host:9: syntax error 'repeat'
Apr 16 13:16:32 sonic monit[13601]: /etc/monit/conf.d/sonic-host:9: syntax error 'repeat'
Apr 16 13:16:32 sonic monit[13598]:  failed!
Apr 16 13:16:32 sonic systemd[1]: monit.service: Control process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- An ExecStart= process belonging to unit monit.service has exited.
-- 
-- The process' exit code is 'exited' and its exit status is 1.
Apr 16 13:16:32 sonic systemd[1]: monit.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- The unit monit.service has entered the 'failed' state with result 'exit-code'.
Apr 16 13:16:32 sonic systemd[1]: Failed to start LSB: service and resource monitoring daemon.
-- Subject: A start job for unit monit.service has failed
-- Defined-By: systemd
-- Support: https://www.debian.org/support
-- 
-- A start job for unit monit.service has finished with a failure.

Also starting the monit process manually gives the same error:

root@sonic:~# monit
/etc/monit/conf.d/sonic-host:9: syntax error 'repeat'

From what I can see, the config file syntax is correct, though.
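
To look at exactly what the parser is complaining about, a small Python sketch like the one below can pull the file path and line number out of the error message and print the surrounding lines. `parse_monit_error` and `context` are hypothetical helper names, and the regular expression only assumes the message shape seen in the output above.

```python
import re

# Assumption: messages look like "<path>:<line>: syntax error '<token>'",
# as in the journalctl and monit output quoted above.
ERROR_RE = re.compile(
    r"(?P<path>/\S+?):(?P<line>\d+): syntax error '(?P<token>[^']*)'"
)


def parse_monit_error(message):
    """Return (path, line_number, token), or None if the text does not match."""
    match = ERROR_RE.search(message)
    if match is None:
        return None
    return match.group("path"), int(match.group("line")), match.group("token")


def context(text, line_no, radius=2):
    """Return (number, line) pairs around `line_no` (1-based) from `text`."""
    lines = text.splitlines()
    start = max(line_no - radius, 1)
    end = min(line_no + radius, len(lines))
    return [(n, lines[n - 1]) for n in range(start, end + 1)]


message = "/etc/monit/conf.d/sonic-host:9: syntax error 'repeat'"
print(parse_monit_error(message))
# ('/etc/monit/conf.d/sonic-host', 9, 'repeat')
```

On the switch, reading the reported file and passing its text to `context(text, 9)` would show lines 7 to 11, which makes it easier to compare the statement around 'repeat' against what the installed monit version accepts.
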

Platform information:

SONiC Software Version: SONiC.master.624-e11397df
Distribution: Debian 10.9
Kernel: 4.19.0-12-2-amd64
Build commit: e11397df
Build date: Wed Mar 31 10:07:31 UTC 2021
Built by: johnar@jenkins-worker-4

Platform: x86_64-accton_as7726_32x-r0
HwSKU: Accton-AS7726-32X
ASIC: broadcom
ASIC Count: 1

yozhao101 commented 3 years ago

@maxiestudies Thanks for your input. Monit is just a monitoring tool that is used to monitor the running status of critical processes and resource usage in SONiC. It is not related to the running status of the telemetry container. From the logs you provided, I think there is a syntax error in the Monit configuration file, and I will check why this error occurred.