sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[Telemetry] After ONIE install, the telemetry process inside telemetry container exits but docker stays up #16533

Closed dgsudharsan closed 1 year ago

dgsudharsan commented 1 year ago

Description

After installing through ONIE, the telemetry process inside the telemetry container exits, and sometimes its state is FATAL.

root@r-bulldog-03:~# docker exec -it telemetry bash
root@r-bulldog-03:/# supervisorctl status
containercfgd                    RUNNING   pid 16, uptime 0:10:16
dependent-startup                EXITED    Sep 13 02:33 AM
dialout                          RUNNING   pid 22, uptime 0:10:13
rsyslogd                         RUNNING   pid 11, uptime 0:10:18
start                            EXITED    Sep 13 02:33 AM
supervisor-proc-exit-listener    RUNNING   pid 8, uptime 0:10:19
telemetry                        EXITED    Sep 13 02:33 AM
root@r-anaconda-51:/home/admin# docker exec telemetry supervisorctl status
containercfgd                    RUNNING   pid 16, uptime 0:04:43
dependent-startup                RUNNING   pid 7, uptime 0:04:46
dialout                          STOPPED   Not started
rsyslogd                         RUNNING   pid 11, uptime 0:04:45
start                            EXITED    Sep 07 03:27 PM
supervisor-proc-exit-listener    RUNNING   pid 8, uptime 0:04:46
telemetry                        FATAL     Exited too quickly (process log may have details)

Sep 7 18:27:45.653772 r-anaconda-51 INFO telemetry#supervisord 2023-09-07 15:27:45,652 INFO exited: telemetry (exit status 0; not expected)

CONTAINER ID   IMAGE                                COMMAND                  CREATED         STATUS         PORTS     NAMES
ef27c6674bd1   da0d5011c828                         "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             what-just-happened
88975307ca20   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             telemetry
406b86ebed27   docker-snmp:latest                   "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             snmp
d8928328285d   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             mgmt-framework
eb47aa6353e9   docker-lldp:latest                   "/usr/bin/docker-lld…"   5 minutes ago   Up 5 minutes             lldp
b07e18a80aa7   17676f080268                         "/usr/local/bin/supe…"   5 minutes ago   Up 5 minutes             doai
9ec1e4932275   1c536017f212                         "/usr/bin/docker_ini…"   5 minutes ago   Up 5 minutes             dhcp_relay
3442ac024c5c   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   8 minutes ago   Up 6 minutes             radv
a5fd6055df9f   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   8 minutes ago   Up 6 minutes             pmon
43d29c7f45a6   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   8 minutes ago   Up 6 minutes             syncd
efb481585b52   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   8 minutes ago   Up 7 minutes             bgp
a2a815188981   docker-teamd:latest                  "/usr/local/bin/supe…"   9 minutes ago   Up 7 minutes             teamd
c3d0ab212a1b   docker-orchagent:latest              "/usr/bin/docker-ini…"   9 minutes ago   Up 7 minutes             swss
5c61e39a9fd9   docker-eventd:latest                 "/usr/local/bin/supe…"   9 minutes ago   Up 7 minutes             eventd
fce8c3979214   docker-database:latest               "/usr/local/bin/dock…"   9 minutes ago   Up 7 minutes             database
root@r-bulldog-03:~# sonic-cfggen -d -v TELEMETRY

root@r-bulldog-03:~#

Steps to reproduce the issue:

  1. Perform an ONIE install.
  2. Check the telemetry status.

Describe the results you received:

The telemetry process exits. However, the docker stays up even though telemetry is a critical process.

Describe the results you expected:

The telemetry main process should not exit. If it does exit, the docker should exit as well.
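The expectation above can be checked mechanically. The following is a minimal sketch, assuming the container's /etc/supervisor/critical_processes file lists entries in the form program:<name> and that `supervisorctl status` prints the program name and state as the first two columns. The function names are hypothetical and are not part of SONiC's supervisor-proc-exit-listener.

```python
# Sketch only: parse_status, parse_critical and critical_not_running are
# illustrative names, not SONiC APIs.


def parse_status(text):
    """Map program name -> state from `supervisorctl status` output."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return states


def parse_critical(text):
    """Extract program names from `program:<name>` lines (assumed format)."""
    names = []
    for line in text.splitlines():
        kind, _, name = line.strip().partition(":")
        if kind == "program" and name:
            names.append(name)
    return names


def critical_not_running(status_text, critical_text):
    """Return the critical programs whose state is anything but RUNNING."""
    states = parse_status(status_text)
    return [n for n in parse_critical(critical_text) if states.get(n) != "RUNNING"]


# Excerpt of the status output from this report.
STATUS = """\
containercfgd                    RUNNING   pid 16, uptime 0:10:16
dependent-startup                EXITED    Sep 13 02:33 AM
telemetry                        EXITED    Sep 13 02:33 AM
"""

print(critical_not_running(STATUS, "program:telemetry\n"))  # prints ['telemetry']
```

A non-empty result while the container is still up corresponds to the behavior reported here.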

Output of show version:

show version

SONiC Software Version: SONiC.202305_RC.4-a4fbef8bc_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: a4fbef8bc
Build date: Tue Sep 12 16:31:36 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-243

Platform: x86_64-mlnx_msn2100-r0
HwSKU: ACS-MSN2100
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1752X06330
Model Number: MSN2100-CB2F
Hardware Revision: A1
Uptime: 03:17:46 up 47 min,  1 user,  load average: 0.57, 0.53, 0.65
Date: Wed 13 Sep 2023 03:17:46

Docker images:
REPOSITORY                                         TAG                              IMAGE ID       SIZE
docker-orchagent                                   202305_RC.4-a4fbef8bc_Internal   7acdadbb064c   328MB
docker-orchagent                                   latest                           7acdadbb064c   328MB
docker-fpm-frr                                     202305_RC.4-a4fbef8bc_Internal   26f22a12fc79   348MB
docker-fpm-frr                                     latest                           26f22a12fc79   348MB
docker-nat                                         202305_RC.4-a4fbef8bc_Internal   fa385a23398e   319MB
docker-nat                                         latest                           fa385a23398e   319MB
docker-sflow                                       202305_RC.4-a4fbef8bc_Internal   2ff9bf1e70a9   318MB
docker-sflow                                       latest                           2ff9bf1e70a9   318MB
docker-teamd                                       202305_RC.4-a4fbef8bc_Internal   5d9f9ae038aa   317MB
docker-teamd                                       latest                           5d9f9ae038aa   317MB
docker-macsec                                      latest                           55e56b22516d   319MB
docker-syncd-mlnx                                  202305_RC.4-a4fbef8bc_Internal   616ccd12a441   823MB
docker-syncd-mlnx                                  latest                           616ccd12a441   823MB
docker-dhcp-relay                                  latest                           343d390dae33   306MB
docker-eventd                                      202305_RC.4-a4fbef8bc_Internal   2b7aec4ae7a0   299MB
docker-eventd                                      latest                           2b7aec4ae7a0   299MB
docker-platform-monitor                            202305_RC.4-a4fbef8bc_Internal   3ba68825f54c   815MB
docker-platform-monitor                            latest                           3ba68825f54c   815MB
docker-snmp                                        202305_RC.4-a4fbef8bc_Internal   81ccd0cf706e   338MB
docker-snmp                                        latest                           81ccd0cf706e   338MB
docker-sonic-telemetry                             202305_RC.4-a4fbef8bc_Internal   3fa87969b07c   599MB
docker-sonic-telemetry                             latest                           3fa87969b07c   599MB
docker-lldp                                        202305_RC.4-a4fbef8bc_Internal   9412a37cc891   341MB
docker-lldp                                        latest                           9412a37cc891   341MB
docker-mux                                         202305_RC.4-a4fbef8bc_Internal   49c55dacb4fb   348MB
docker-mux                                         latest                           49c55dacb4fb   348MB
docker-database                                    202305_RC.4-a4fbef8bc_Internal   448e222d1079   299MB
docker-database                                    latest                           448e222d1079   299MB
docker-router-advertiser                           202305_RC.4-a4fbef8bc_Internal   62a30b600998   299MB
docker-router-advertiser                           latest                           62a30b600998   299MB
docker-sonic-mgmt-framework                        202305_RC.4-a4fbef8bc_Internal   2b97e20df004   415MB
docker-sonic-mgmt-framework                        latest                           2b97e20df004   415MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.0.0-202305-2                   07eeec349434   432MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/doai        1.0.0-202305-1                   17676f080268   277MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_r-bulldog-03_20230913_023753.tar.gz sonic_dump_r-anaconda-51_20230907_183233.tar.gz

prgeor commented 1 year ago

@dgsudharsan could you please capture the difference in behavior across the two SONiC versions?

dgsudharsan commented 1 year ago

In 202211, the telemetry process also exits after an ONIE install. However, the telemetry docker exits along with it, since the telemetry process is defined as a critical process. In 202305, the telemetry docker does not exit.

root@r-anaconda-51:/home/admin# docker exec telemetry bash -c '[ -f /etc/supervisor/critical_processes ] && cat /etc/supervisor/critical_processes'
program:telemetry

FengPan-Frank commented 1 year ago

Reproduced the issue locally on version 20230531.03.

After ONIE installation, the telemetry process has indeed exited.

admin@sonic:/var/log$ docker exec telemetry supervisorctl status
containercfgd                    RUNNING   pid 16, uptime 0:34:55
dependent-startup                EXITED    Sep 20 07:45 AM
dialout                          RUNNING   pid 22, uptime 0:34:50
rsyslogd                         RUNNING   pid 11, uptime 0:34:58
start                            EXITED    Sep 20 07:45 AM
supervisor-proc-exit-listener    RUNNING   pid 8, uptime 0:35:03
telemetry                        EXITED    Sep 20 07:46 AM

Snippet from telemetry.log:

Sep 20 07:45:57.973354 sonic INFO telemetry#supervisord: telemetry Traceback (most recent call last):
Sep 20 07:45:57.974320 sonic INFO telemetry#supervisord: telemetry File "/usr/local/bin/sonic-cfggen", line 452, in
Sep 20 07:45:57.975525 sonic INFO telemetry#supervisord: telemetry main()
Sep 20 07:45:57.976782 sonic INFO telemetry#supervisord: telemetry File "/usr/local/bin/sonic-cfggen", line 416, in main
Sep 20 07:45:57.977365 sonic INFO telemetry#supervisord: telemetry template_data = template.render(data)
Sep 20 07:45:57.977883 sonic INFO telemetry#supervisord: telemetry File "/usr/local/lib/python3.9/dist-packages/jinja2/environment.py", line 1301, in render
Sep 20 07:45:57.980311 sonic INFO telemetry#supervisord: telemetry self.environment.handle_exception()
Sep 20 07:45:57.981591 sonic INFO telemetry#supervisord: telemetry File "/usr/local/lib/python3.9/dist-packages/jinja2/environment.py", line 936, in handle_exception
Sep 20 07:45:57.982543 sonic INFO telemetry#supervisord: telemetry raise rewrite_traceback_stack(source=source) share/sonic/templates/telemetry_vars.j2", line 2, in top-level template code
Sep 20 07:45:57.986648 sonic INFO telemetry#supervisord: telemetry "certs": {% if "certs" in TELEMETRY.keys() %}{{ TELEMETRY["certs"] }}{% else %}""{% endif %},
Sep 20 07:45:57.987416 sonic INFO telemetry#supervisord: telemetry File "/usr/local/lib/python3.9/dist-packages/jinja2/environment.py", line 485, in getattr
Sep 20 07:45:57.988104 sonic INFO telemetry#supervisord: telemetry return getattr(obj, attribute)
Sep 20 07:45:57.988721 sonic INFO telemetry#supervisord: telemetry jinja2.exceptions.UndefinedError: 'TELEMETRY' is undefined
Sep 20 07:46:00.291148 sonic INFO telemetry#supervisord: telemetry Incorrect threshold value, expecting positive integers

Investigating a proper fix.
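The traceback comes down to the template calling TELEMETRY.keys() when no TELEMETRY table is passed to the renderer. The following is a minimal sketch that reproduces the error outside of sonic-cfggen, assuming the jinja2 package is installed. The UNGUARDED expression is copied from the log line above. The GUARDED variant is a hypothetical illustration of one way to tolerate a missing table, not the fix that was applied to SONiC.

```python
from jinja2 import Template
from jinja2.exceptions import UndefinedError

# Same expression as line 2 of telemetry_vars.j2 in the traceback.
UNGUARDED = (
    '{% if "certs" in TELEMETRY.keys() %}{{ TELEMETRY["certs"] }}'
    '{% else %}""{% endif %}'
)

# Hypothetical guarded variant (an assumption, not the actual SONiC fix).
GUARDED = (
    '{% if TELEMETRY is defined and "certs" in TELEMETRY.keys() %}'
    '{{ TELEMETRY["certs"] }}{% else %}""{% endif %}'
)


def render(source, data):
    """Render a template string with the given variables."""
    return Template(source).render(data)


def fails_when_missing(source):
    """Return True if rendering without a TELEMETRY table raises UndefinedError."""
    try:
        render(source, {})
    except UndefinedError:
        return True
    return False


print(fails_when_missing(UNGUARDED))  # True: 'TELEMETRY' is undefined
print(fails_when_missing(GUARDED))    # False: falls through to the else branch
```

With the guard, an empty Config DB renders the else branch instead of raising, so the rest of the template can still be evaluated.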

qnos commented 1 year ago

This happens because the telemetry service introduced cert authentication, but there is no TELEMETRY config in Config DB.

127.0.0.1:6379[4]> keys TELEMETRY*
(empty array)
127.0.0.1:6379[4]>

Therefore, we need to manually load the TELEMETRY config into config DB:

telemetry.json

  1. No client auth:
{
    "TELEMETRY": {
        "gnmi": {
            "client_auth": "false",
            "port": "50051",
            "log_level": "2"
        }
    }
}
  2. With client_auth enabled and the cert paths specified. This requires generating the CA cert and the server cert and key first.
{
    "TELEMETRY": {
        "certs": {
            "server_crt": "/etc/sonic/telemetry/streamingtelemetryserver.cer",
            "server_key": "/etc/sonic/telemetry/streamingtelemetryserver.key",
            "ca_crt": "/etc/sonic/telemetry/dsmsroot.cer"
        },
        "gnmi": {
            "client_auth": "true",
            "port": "50051",
            "log_level": "2"
        }
    }
}
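Either variant of telemetry.json can also be generated rather than written by hand. The following is a rough sketch using only the Python standard library. The helper name build_telemetry_config is hypothetical, and the key names and values are taken from the two examples above.

```python
import json


def build_telemetry_config(client_auth=False, certs=None, port="50051", log_level="2"):
    """Build the TELEMETRY table in the shape shown in the examples above.

    Hypothetical helper: `certs` is a dict with server_crt, server_key and
    ca_crt paths, and is required when client_auth is enabled.
    """
    if client_auth and not certs:
        raise ValueError("client_auth requires certificate paths")
    table = {}
    if certs:
        table["certs"] = dict(certs)
    table["gnmi"] = {
        # Config DB values are stored as strings in the examples above.
        "client_auth": "true" if client_auth else "false",
        "port": port,
        "log_level": log_level,
    }
    return {"TELEMETRY": table}


# Variant 1: no client auth. Redirect the output to telemetry.json.
print(json.dumps(build_telemetry_config(), indent=4))
```

Refusing to enable client_auth without cert paths keeps the generated file from pointing the service at certificates that were never supplied.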

Load telemetry config into CONFIG DB:

sudo config load telemetry.json -y

Then start the telemetry process:

docker exec telemetry supervisorctl start telemetry

After that, the telemetry issue above is resolved. A mechanism is still needed to generate a default TELEMETRY config in Config DB.

qnos commented 1 year ago

Loading a customized TELEMETRY config is still recommended. After the fix, if there is no TELEMETRY configuration in the Redis DB, the default TELEMETRY configuration is used.
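The fallback described above can be pictured roughly as follows. This is a sketch under stated assumptions: the function name effective_telemetry is hypothetical, and the default values are simply copied from the "no client auth" example in this thread, so the defaults actually shipped with the fix may differ.

```python
import copy

# Assumed defaults for illustration only; taken from the "no client auth"
# example earlier in this thread, not from the SONiC source.
DEFAULT_TELEMETRY = {
    "gnmi": {"client_auth": "false", "port": "50051", "log_level": "2"}
}


def effective_telemetry(config_db):
    """Return the user's TELEMETRY table if present, else a copy of the defaults."""
    table = config_db.get("TELEMETRY")
    if table:
        return table
    # Copy so that callers cannot mutate the shared defaults.
    return copy.deepcopy(DEFAULT_TELEMETRY)


# Empty Config DB, as on a fresh ONIE install in this report.
print(effective_telemetry({}))
```

A customized table loaded with `config load` takes precedence, which matches the recommendation to keep loading site-specific TELEMETRY settings.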