sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

hostcfgd race condition with config reload #17306

Open stepanblyschak opened 11 months ago

stepanblyschak commented 11 months ago

Description

The issue happens when a feature's docker container is started by systemd and, in the middle of that operation, hostcfgd applies its desired state.

Steps to reproduce the issue:

  1. config feature state <feature> disabled
  2. config save -y # feature is disabled in config_db.json
  3. config feature state <feature> enabled
  4. config reload -y # load from config_db.json (feature is disabled)
  5. Observe that the feature's docker container is up although the desired state in config DB is disabled
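
The mismatch in step 5 can be checked mechanically. The following is a minimal sketch, not part of SONiC: it assumes config_db.json carries a FEATURE table whose entries have a "state" field, and that container names match feature names. The helper `find_state_mismatches` and the sample data are hypothetical.

```python
import json


def find_state_mismatches(config_db, running_containers):
    """Return features that are running although config DB marks them disabled."""
    features = config_db.get("FEATURE", {})
    disabled = {
        name
        for name, attrs in features.items()
        if attrs.get("state") in ("disabled", "always_disabled")
    }
    return sorted(disabled & set(running_containers))


# Hypothetical sample data standing in for config_db.json and the list of
# running containers (for example, collected from `docker ps`).
sample_config = json.loads(
    '{"FEATURE": {"X": {"state": "disabled"}, "swss": {"state": "enabled"}}}'
)
sample_running = ["X", "swss", "database"]

mismatches = find_state_mismatches(sample_config, sample_running)
print(mismatches)
```

A non-empty result after `config reload -y` would indicate that the race was hit for the listed features.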

Describe the results you received:

Example log for arbitrary service X:

Nov 21 22:24:36.665181 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'stop', 'X.service']'

Nov 21 22:24:36.688430 sonic INFO systemd[1]: Stopped X container.
Nov 21 22:24:36.699137 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'disable', 'X.service']'

Nov 21 22:24:36.691220 sonic INFO systemd[1]: Starting X service...    <===== Start triggered by WantedBy=syncd.service

Nov 21 22:24:36.926058 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'mask', 'X.service']'

As a result, container X remains running: it was started via syncd.service, and hostcfgd masked it only afterwards.
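
The ordering can be confirmed by sorting the log excerpt by timestamp. This is only an illustrative sketch: it reuses the five log lines above (with the inline annotation dropped), compares time of day only because all lines share the same date, and the helper names `parse` and `first_time` are hypothetical.

```python
from datetime import datetime

LOG = """\
Nov 21 22:24:36.665181 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'stop', 'X.service']'
Nov 21 22:24:36.688430 sonic INFO systemd[1]: Stopped X container.
Nov 21 22:24:36.699137 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'disable', 'X.service']'
Nov 21 22:24:36.691220 sonic INFO systemd[1]: Starting X service...
Nov 21 22:24:36.926058 sonic INFO hostcfgd: Running cmd: '['sudo', 'systemctl', 'mask', 'X.service']'
"""


def parse(line):
    """Split a log line into (time of day, message)."""
    # Fields: month, day, time, host, severity, then the rest of the message.
    parts = line.split(None, 5)
    stamp = datetime.strptime(parts[2], "%H:%M:%S.%f").time()
    return stamp, parts[5]


def first_time(events, needle):
    """Return the time of the first event whose message contains needle."""
    return next(t for t, msg in events if needle in msg)


# Sort by timestamp, since the excerpt is not in chronological order.
events = sorted(parse(line) for line in LOG.splitlines())

stop_t = first_time(events, "'stop'")
start_t = first_time(events, "Starting X service")
mask_t = first_time(events, "'mask'")

# The race: systemd starts the service after the stop but before the mask.
race_detected = stop_t < start_t < mask_t
print(race_detected)
```

In this excerpt the start lands between the stop and the mask, which is the window the issue describes.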

Describe the results you expected:

Feature container does not start.

Ideally, we'd like to see the following boot/config reload flow:

  1. Configure desired states of services
  2. Start sonic.target

Therefore, we could eliminate the need for systemd-sonic-generator and the mask_disabled_services.py script, which configure the initial service states.

We need to consider all flows, including upgrade and first boot. Ideally, with this approach, service state is synced at a very early stage of the boot.
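
The proposed two-step flow could be sketched roughly as below. This is an illustration under stated assumptions, not the hostcfgd implementation: the function `plan_boot_commands` is hypothetical, unit names are assumed to be `<feature>.service`, and the real logic would also have to handle enable/disable, upgrade, and first boot. The commands are only built, not executed.

```python
def plan_boot_commands(feature_table):
    """Build systemctl commands: settle service states first, then start sonic.target."""
    commands = []
    for name, attrs in sorted(feature_table.items()):
        unit = f"{name}.service"  # assumed unit naming, for illustration only
        if attrs.get("state") in ("disabled", "always_disabled"):
            # Masked units cannot be pulled in by WantedBy= dependencies.
            commands.append(["systemctl", "mask", unit])
        else:
            commands.append(["systemctl", "unmask", unit])
    # Only after every desired state is applied is the target started.
    commands.append(["systemctl", "start", "sonic.target"])
    return commands


# Hypothetical FEATURE table content.
plan = plan_boot_commands({"X": {"state": "disabled"}, "Y": {"state": "enabled"}})
for cmd in plan:
    print(" ".join(cmd))
```

Because the mask is applied before sonic.target starts, a dependency such as WantedBy=syncd.service would no longer be able to start a disabled feature in between.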

Output of show version:

SONiC Software Version: SONiC.202305_RC.36-4e4396e96_Internal
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 4e4396e96
Build date: Sun Nov 26 09:28:13 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci03-244

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700-D48C8
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1822K07815
Model Number: MSN2700-CS2FO
Hardware Revision: A1
Uptime: 16:52:44 up  1:49,  1 user,  load average: 0.57, 0.70, 0.87
Date: Mon 27 Nov 2023 16:52:44

Docker images:
REPOSITORY                                         TAG                               IMAGE ID       SIZE
docker-syncd-mlnx                                  202305_RC.36-4e4396e96_Internal   5fa17071be2a   836MB
docker-syncd-mlnx                                  latest                            5fa17071be2a   836MB
docker-platform-monitor                            202305_RC.36-4e4396e96_Internal   6bd3faaaaf54   827MB
docker-platform-monitor                            latest                            6bd3faaaaf54   827MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.6.0-202305-12                   3b67dd4aebad   433MB
docker-orchagent                                   202305_RC.36-4e4396e96_Internal   dc8a72449afd   328MB
docker-orchagent                                   latest                            dc8a72449afd   328MB
docker-fpm-frr                                     202305_RC.36-4e4396e96_Internal   e028f1635caa   348MB
docker-fpm-frr                                     latest                            e028f1635caa   348MB
docker-nat                                         202305_RC.36-4e4396e96_Internal   d26fe14af4fb   320MB
docker-nat                                         latest                            d26fe14af4fb   320MB
docker-sflow                                       202305_RC.36-4e4396e96_Internal   469d8a988bab   318MB
docker-sflow                                       latest                            469d8a988bab   318MB
docker-teamd                                       202305_RC.36-4e4396e96_Internal   cd8e61bdb85f   317MB
docker-teamd                                       latest                            cd8e61bdb85f   317MB
docker-macsec                                      202305_RC.35-4e4396e96_Internal   4c3075927439   319MB
docker-dhcp-relay                                  202305_RC.35-4e4396e96_Internal   2a276664f14d   307MB
docker-eventd                                      202305_RC.36-4e4396e96_Internal   1a925ba903eb   299MB
docker-eventd                                      latest                            1a925ba903eb   299MB
docker-sonic-telemetry                             202305_RC.36-4e4396e96_Internal   b9abaa617279   386MB
docker-sonic-telemetry                             latest                            b9abaa617279   386MB
docker-snmp                                        202305_RC.36-4e4396e96_Internal   db8e6dcbb985   338MB
docker-snmp                                        latest                            db8e6dcbb985   338MB
docker-lldp                                        202305_RC.36-4e4396e96_Internal   7147b2ceb97f   341MB
docker-lldp                                        latest                            7147b2ceb97f   341MB
docker-mux                                         202305_RC.36-4e4396e96_Internal   a64edb0e0ecf   348MB
docker-mux                                         latest                            a64edb0e0ecf   348MB
docker-router-advertiser                           202305_RC.36-4e4396e96_Internal   01f823df9295   299MB
docker-router-advertiser                           latest                            01f823df9295   299MB
docker-database                                    202305_RC.36-4e4396e96_Internal   e7ab4d434eff   299MB
docker-database                                    latest                            e7ab4d434eff   299MB
docker-sonic-mgmt-framework                        202305_RC.36-4e4396e96_Internal   9f630d481095   415MB
docker-sonic-mgmt-framework                        latest                            9f630d481095   415MB

bingwang-ms commented 11 months ago

@qiluo-msft Can you please help take a look? Thanks!

liat-grozovik commented 11 months ago

@stepanblyschak is this kind of change in the sonic design in 202305?

stepanblyschak commented 11 months ago

@liat-grozovik

Are you talking about this idea?

Ideally, we'd like to see the following boot/config reload flow:

  1. Configure desired states of services
  2. Start sonic.target

I'd say it is a rather big change that requires some design work rather than a simple bug fix. However, per my understanding, it would let us solve a couple of issues at once.

qiluo-msft commented 11 months ago

Could you give detailed command lines used in step "config feature disabled" and "config feature enabled"? Is this issue a regression or day one issue?

stepanblyschak commented 11 months ago

@qiluo-msft I think it is a day one issue. The commands are regular sonic commands "config feature state disabled" and "config feature state enabled". Are you asking which feature is affected?

qiluo-msft commented 9 months ago

@dgsudharsan @vivekrnv Are you able to help resolve this issue?

dgsudharsan commented 9 months ago

Hi @qiluo-msft, I don't think it is trivial. It needs a discussion in a subgroup to understand how we can address this.

volodymyrsamotiy commented 9 months ago

@prsunny will check on what subgroup meeting we can raise this issue

dgsudharsan commented 8 months ago

@prsunny Any update on which subgroup to discuss this issue?

bingwang-ms commented 6 months ago

The group name is sonic-common-infra https://lists.sonicfoundation.dev/g/sonic-common-infra . @arlakshm FYI.

liat-grozovik commented 4 months ago

Following the workgroup discussion, @arlakshm, is there a community owner who is taking it?