sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[Eventd] Sometimes eventd service exits with log "ERR eventd#eventd: :- run_eventd_service: Eventd service exiting" #17350

Status: Closed (closed by dgsudharsan 4 months ago)

dgsudharsan commented 9 months ago

Description

This has been observed a few times: the eventd service exits shortly after bootup. The cause is a hardcoded 1-second wait for a thread to initialize and either set a variable or exit. This arbitrary 1-second wait may not be long enough when all services start together, or on systems with weaker CPUs. The logic should therefore be made more robust and event-driven.
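To illustrate the kind of event-driven wait suggested above, here is a minimal sketch in Python. It is not the eventd code (eventd is a C++ service), and `CaptureWorker`, `init_delay_s`, `fail`, and `timeout_s` are hypothetical names chosen for this example. The idea is that the starting thread blocks until the worker explicitly signals that init has finished, with a generous upper bound, instead of sleeping for a fixed second and then checking a flag.

```python
import threading
import time


class CaptureWorker:
    """Hypothetical stand-in for a service thread that must finish init before use."""

    def __init__(self, init_delay_s, fail=False):
        self._init_delay_s = init_delay_s
        self._fail = fail
        # Set exactly once, when init has either succeeded or failed.
        self._init_done = threading.Event()
        self.init_ok = False
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Simulate slow initialization on a busy or weak CPU.
        time.sleep(self._init_delay_s)
        self.init_ok = not self._fail
        # Wake the waiting thread as soon as the outcome is known.
        self._init_done.set()

    def start(self, timeout_s=30.0):
        """Start the thread and block until it reports its init status.

        Returns True on successful init, False on init failure or timeout.
        """
        self._thread.start()
        if not self._init_done.wait(timeout=timeout_s):
            return False  # init never reported within the upper bound
        return self.init_ok


if __name__ == "__main__":
    # Init takes 1.5 s here, longer than a fixed 1 s sleep would have allowed.
    worker = CaptureWorker(init_delay_s=1.5)
    print("init ok:", worker.start())
```

With this shape the waiter returns as soon as init completes, so a fast system pays no extra delay and a slow one is not cut off early. The timeout only bounds the worst case.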

Nov 28 20:20:29.071830 arc-switch1004 INFO eventd#eventd: :- set_control: Failed to init capture
Nov 28 20:20:29.071830 arc-switch1004 INFO eventd#eventd: :- set_control: last:errno=115
Nov 28 20:20:29.071830 arc-switch1004 INFO eventd#eventd: :- run_eventd_service: Failed to init capture
Nov 28 20:20:29.071830 arc-switch1004 INFO eventd#eventd: :- run_eventd_service: last:errno=115
Nov 28 20:20:29.081346 arc-switch1004 NOTICE eventd#eventd: :- ~RedisPipeline: RedisPipeline dtor is called from another thread, possibly due to exit(), Database: COUNTERS_DB
Nov 28 20:20:29.117357 arc-switch1004 INFO eventd#supervisord 2023-11-28 18:20:29,113 INFO success: eventd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Nov 28 20:20:29.117357 arc-switch1004 INFO eventd#supervisord 2023-11-28 18:20:29,114 INFO success: containercfgd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Nov 28 20:20:29.142106 arc-switch1004 NOTICE admin: Started teamd service...
Nov 28 20:20:29.171574 arc-switch1004 INFO systemd[1]: Started TEAMD container.
Nov 28 20:20:29.179894 arc-switch1004 INFO eventd#eventd: :- run: Stopped xpub/xsub proxy
Nov 28 20:20:29.179894 arc-switch1004 ERR eventd#eventd: :- run_eventd_service: Eventd service exiting
Nov 28 20:20:29.179894 arc-switch1004 INFO eventd#eventd: :- main: The eventd service exited
Nov 28 20:20:29.190561 arc-switch1004 INFO eventd#supervisord 2023-11-28 18:20:29,179 INFO exited: eventd (exit status 0; expected)
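For triage, the `last:errno=115` value in the log above can be decoded. On Linux, errno 115 is `EINPROGRESS` ("Operation now in progress"). The sketch below is a hypothetical helper, not part of SONiC; it pulls the number out of a log line and looks it up with Python's standard `errno` and `os` modules. The mapping is platform-dependent, so it should be run on a Linux host to match the switch.

```python
import errno
import os
import re

# Matches the "last:errno=<number>" fragment seen in the eventd log lines.
ERRNO_RE = re.compile(r"last:errno=(\d+)")


def decode_errno_line(line):
    """Return (code, symbolic_name, message) for a log line, or None if no errno is present."""
    match = ERRNO_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numeric values to names for the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


if __name__ == "__main__":
    sample = "INFO eventd#eventd: :- set_control: last:errno=115"
    print(decode_errno_line(sample))
```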

Steps to reproduce the issue:

  1. Reboot the switch
  2. Observe that the error log appears intermittently

Describe the results you received:

Errors in syslog indicating eventd exited

Describe the results you expected:

No error log in syslog

Output of show version:

SONiC Software Version: SONiC.202305_RC.38-aa4d7bb67_Internal
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: aa4d7bb67
Build date: Tue Nov 28 03:47:56 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci02-241

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700-D48C8
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1805K20439
Model Number: MSN2700-CS2F
Hardware Revision: A2
Uptime: 20:36:34 up 11 min,  2 users,  load average: 1.37, 2.14, 1.70
Date: Tue 28 Nov 2023 20:36:34

Docker images:
REPOSITORY                                         TAG                               IMAGE ID       SIZE
docker-syncd-mlnx                                  202305_RC.38-aa4d7bb67_Internal   c949eea06037   838MB
docker-syncd-mlnx                                  latest                            c949eea06037   838MB
docker-orchagent                                   202305_RC.38-aa4d7bb67_Internal   1b39e14835c9   330MB
docker-orchagent                                   latest                            1b39e14835c9   330MB
docker-fpm-frr                                     202305_RC.38-aa4d7bb67_Internal   e9a8f56b0a82   350MB
docker-fpm-frr                                     latest                            e9a8f56b0a82   350MB
docker-teamd                                       202305_RC.38-aa4d7bb67_Internal   84c9f2f3cb1d   318MB
docker-teamd                                       latest                            84c9f2f3cb1d   318MB
docker-nat                                         202305_RC.38-aa4d7bb67_Internal   1b945a62406d   321MB
docker-nat                                         latest                            1b945a62406d   321MB
docker-sflow                                       202305_RC.38-aa4d7bb67_Internal   2399e3179383   319MB
docker-sflow                                       latest                            2399e3179383   319MB
docker-macsec                                      latest                            beabc6df388c   320MB
docker-platform-monitor                            202305_RC.38-aa4d7bb67_Internal   26e1a250cc68   828MB
docker-platform-monitor                            latest                            26e1a250cc68   828MB
docker-sonic-telemetry                             202305_RC.38-aa4d7bb67_Internal   1afe42655bd2   388MB
docker-sonic-telemetry                             latest                            1afe42655bd2   388MB
docker-dhcp-relay                                  latest                            61d2b01a024d   308MB
docker-eventd                                      202305_RC.38-aa4d7bb67_Internal   80065d33163c   300MB
docker-eventd                                      latest                            80065d33163c   300MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

sysdump_test_qos_reload_ports (1).tar.gz
sysdump_sonic_dump_arc-switch1004_20231128_203605.tar.gz

dgsudharsan commented 9 months ago

@zbud-msft @lguohan @StormLiangMS We observed this issue in eventd in 202305. Can someone please investigate?

zbud-msft commented 9 months ago

@dgsudharsan I will take a look.

zbud-msft commented 7 months ago

ETA for the fix: 3/8

Tracking PR: https://github.com/sonic-net/sonic-buildimage/pull/18138