sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[ntp] manually added servers on frontpanel and mgmt not connecting after applying mgmt vrf with no ntp from dhcp #6858

Closed Hedgehog-Guru closed 3 years ago

Hedgehog-Guru commented 3 years ago

Description: NTP servers connected to front-panel ports stopped working after applying the mgmt VRF configuration.

Steps to reproduce the issue

  1. remove "ntp-servers" from /usr/share/sonic/templates/dhclient.conf.j2
  2. replace the string "interface ignore wildcard" with "interface listen wildcard" in /usr/share/sonic/templates/ntp.conf.j2
  3. reboot
  4. configure servers reachable from mgmt (eth0) and a front-panel port (Ethernet64)
  5. the switch is able to sync with all NTP servers
  6. apply the mgmt VRF configuration
  7. reboot
  8. no server is seen anymore; the device is not synchronised.
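
Steps 1 and 2 are plain text edits to two templates. A minimal Python sketch of those edits follows. It assumes that "ntp-servers" appears as one entry in a comma-separated option list, which this issue does not show; the function names and the sample line are hypothetical.

```python
import re


def drop_ntp_servers(dhclient_template: str) -> str:
    """Remove the ntp-servers option from a comma-separated option list."""
    # Case 1: the option is preceded by a comma (middle or end of the list).
    text = re.sub(r",\s*ntp-servers\b", "", dhclient_template)
    # Case 2: the option is first in the list and followed by a comma.
    return re.sub(r"\bntp-servers\s*,\s*", "", text)


def listen_on_wildcard(ntp_template: str) -> str:
    """Apply the exact string replacement described in step 2."""
    return ntp_template.replace(
        "interface ignore wildcard", "interface listen wildcard"
    )


if __name__ == "__main__":
    # Hypothetical sample line, not copied from the real template.
    print(drop_ntp_servers("request subnet-mask, ntp-servers, routers;"))
    print(listen_on_wildcard("interface ignore wildcard"))
```

On a real device the functions would be applied to the contents of the two template files named in steps 1 and 2, followed by the reboot in step 3.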

The mgmt VRF was added by merging the following into the existing configuration:

{
    "MGMT_PORT": {
        "eth0": {
            "alias": "eth0",
            "admin_status": "up"
        }
    },
    "MGMT_VRF_CONFIG": {
        "vrf_global": {
            "mgmtVrfEnabled": "true"
        }
    }
}
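
To make "merging into the existing configuration" concrete, here is a small sketch of a recursive dictionary merge in Python. It only illustrates the merge semantics and is not the tool SONiC itself uses to load configuration; the `existing` dictionary, including its `mtu` field, is a hypothetical stand-in for the real configuration.

```python
import json

# The fragment from this report.
MGMT_VRF_FRAGMENT = {
    "MGMT_PORT": {"eth0": {"alias": "eth0", "admin_status": "up"}},
    "MGMT_VRF_CONFIG": {"vrf_global": {"mgmtVrfEnabled": "true"}},
}


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict: nested tables are merged, other values are replaced."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Hypothetical existing configuration, for illustration only.
    existing = {"MGMT_PORT": {"eth0": {"mtu": "1500"}}}
    print(json.dumps(deep_merge(existing, MGMT_VRF_FRAGMENT), indent=4))
```

Existing keys that the fragment does not mention are preserved, and the input dictionary is left unmodified.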

Describe the results you received: the show ntp command lists no servers

    admin@r-qa-sw-eth-2142:~$ show ntp
    unsynchronised
       polling server every 8 s

Describe the results you expected: all configured servers should be listed:

synchronised to NTP server (10.7.77.135) at stratum 4  
   time correct to within 285 ms
   polling server every 64 s

      remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+1.0.0.2         10.211.0.124     4 u   39   64    1    0.099   -2.932   1.192
*10.7.77.135     192.114.62.250   3 u   38   64    1    5.258    1.074   0.352
+1::2            10.211.0.124     4 u   37   64    1    0.188   -2.520   1.277

Output of show version

    sonic_dump_r-qa-sw-eth-2142_20210223_075752.tar.gz

    SONiC Software Version: SONiC.SONIC.202012.21-4ec748f_Internal
    Distribution: Debian 10.8
    Kernel: 4.19.0-12-2-amd64
    Build commit: 4ec748f3
    Build date: Thu Feb 18 19:22:36 UTC 2021
    Built by: sw-r2d2-bot@r-build-sonic-ci02

    Platform: x86_64-mlnx_msn3700c-r0
    HwSKU: ACS-MSN3700C
    ASIC: mellanox
    ASIC Count: 1
    Serial Number: MT1935X01905
    Uptime: 08:26:29 up 1:08, 1 user, load average: 2.23, 1.98, 1.78

    Docker images:
    REPOSITORY                    TAG                               IMAGE ID       SIZE
    docker-teamd                  SONIC.202012.21-4ec748f_Internal  7760ab5f4867   410MB
    docker-teamd                  latest                            7760ab5f4867   410MB
    docker-nat                    SONIC.202012.21-4ec748f_Internal  0734519b4db2   413MB
    docker-nat                    latest                            0734519b4db2   413MB
    docker-orchagent              SONIC.202012.21-4ec748f_Internal  0e32e6c78fce   428MB
    docker-orchagent              latest                            0e32e6c78fce   428MB
    docker-fpm-frr                SONIC.202012.21-4ec748f_Internal  b3933c2dd313   428MB
    docker-fpm-frr                latest                            b3933c2dd313   428MB
    docker-sflow                  SONIC.202012.21-4ec748f_Internal  db0f7ce0b317   411MB
    docker-sflow                  latest                            db0f7ce0b317   411MB
    docker-syncd-mlnx             SONIC.202012.21-4ec748f_Internal  fb89f8ef4ba6   542MB
    docker-syncd-mlnx             latest                            fb89f8ef4ba6   542MB
    docker-snmp                   SONIC.202012.21-4ec748f_Internal  8d78574b5fb7   438MB
    docker-snmp                   latest                            8d78574b5fb7   438MB
    docker-sonic-mgmt-framework   SONIC.202012.21-4ec748f_Internal  a7a3146a4aff   615MB
    docker-sonic-mgmt-framework   latest                            a7a3146a4aff   615MB
    docker-router-advertiser      SONIC.202012.21-4ec748f_Internal  f811608edb05   397MB
    docker-router-advertiser      latest                            f811608edb05   397MB
    docker-platform-monitor       SONIC.202012.21-4ec748f_Internal  1f8965b709d9   689MB
    docker-platform-monitor       latest                            1f8965b709d9   689MB
    docker-lldp                   SONIC.202012.21-4ec748f_Internal  6528498f9d34   437MB
    docker-lldp                   latest                            6528498f9d34   437MB
    docker-database               SONIC.202012.21-4ec748f_Internal  8d5e7b90e3b6   397MB
    docker-database               latest                            8d5e7b90e3b6   397MB
    docker-dhcp-relay             SONIC.202012.21-4ec748f_Internal  7469a84d51da   404MB
    docker-dhcp-relay             latest                            7469a84d51da   404MB
    docker-sonic-telemetry        SONIC.202012.21-4ec748f_Internal  1c52c9fdba6e   472MB
    docker-sonic-telemetry        latest                            1c52c9fdba6e   472MB

ghost commented 3 years ago

Investigating the issue.

ghost commented 3 years ago

@Hedgehog-Guru, here is the current investigation status.

The root cause of the defect is a conflict between cgroups v1 (net_prio, net_cls) and v2. The system currently uses v1 cgroups, for example for docker, but the ip vrf utility, which is used to run the NTP daemon when Mgmt-VRF is enabled, requires v2 cgroups to work properly. Because the system currently uses net_prio and net_cls, the Linux kernel disables cgroup2 socket matching on startup. Here is the related syslog:

    Feb 24 15:17:14.725376 sonic INFO kernel: [ 14.057746] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
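
To check other builds for the same symptom, the kernel message can be searched for in the log. The Python sketch below does only that; the function name is hypothetical, and which log file to feed it is left to the reader because log locations differ between systems.

```python
# Message text taken from the syslog line quoted in this comment.
MARKER = "disabling cgroup2 socket matching due to net_prio or net_cls activation"


def cgroup2_matching_disabled(log_lines) -> bool:
    """Return True if any log line contains the kernel's cgroup2 warning."""
    return any(MARKER in line for line in log_lines)


if __name__ == "__main__":
    sample = [
        "Feb 24 15:17:14.725376 sonic INFO kernel: [ 14.057746] cgroup: cgroup: "
        "disabling cgroup2 socket matching due to net_prio or net_cls activation",
    ]
    print(cgroup2_matching_disabled(sample))
```

An open log file can be passed directly, since iterating over a file object yields its lines.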

As there were no obvious changes that could have made the system start using those net cgroups, I tested different builds from Jenkins (both the master and 202012 branches), and it seems the issue was introduced on February 17th. I found that there was a Linux kernel security update which could theoretically cause such an issue: https://github.com/Azure/sonic-linux-kernel/commit/11f0da688d5bd7e206c3e50fd408d0717e9626d1. @lguohan, could you clarify whether this security update causes the issue?

I have an idea for how to fix the issue: we can disable net_prio and net_cls in the Linux kernel config. I tested a similar solution on the newest master build by adding the following boot parameter to the /host/grub/grub.cfg configuration file: cgroup_no_v1=net_prio,net_cls. It is just a workaround, but it confirms the root cause: the NTP daemon works fine with this boot parameter. A more proper solution is to exclude those cgroups when building the Linux kernel. Here is a draft PR with those exclusions: https://github.com/Azure/sonic-linux-kernel/pull/198
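
For anyone who wants to try the workaround, the following Python sketch appends the boot parameter to the kernel lines of a grub.cfg text. It assumes kernel entries are lines whose first word is `linux`, which is the usual GRUB 2 layout but was not verified against the SONiC file here; the function name and the sample entry are hypothetical.

```python
PARAM = "cgroup_no_v1=net_prio,net_cls"


def add_boot_param(grub_cfg: str, param: str = PARAM) -> str:
    """Append param to every 'linux' line that does not already carry it."""
    out = []
    for line in grub_cfg.splitlines():
        words = line.split()
        if words and words[0] == "linux" and param not in words:
            line = f"{line.rstrip()} {param}"
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if grub_cfg.endswith("\n") else "")


if __name__ == "__main__":
    # Hypothetical menu entry, for illustration only.
    sample = (
        "menuentry 'example' {\n"
        "    linux /boot/vmlinuz root=/dev/sda1 rw\n"
        "    initrd /boot/initrd.img\n"
        "}\n"
    )
    print(add_boot_param(sample), end="")
```

The function is idempotent, so running it twice does not duplicate the parameter. A reboot is needed for the change to take effect, as with any kernel command-line parameter.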

Currently, I have a couple of concerns:

  1. I know that the best way to solve the issue is to stop using those v1 cgroups, but it seems, for example, that our revision of docker works only with cgroups v1, and I am not sure that we can change this. Does anybody know how to make the system not use the net_prio and net_cls cgroups, or use only cgroups v2?
  2. I am not sure that disabling net_prio and net_cls will not affect some other functionality of the system. It would be great if somebody could confirm that the solution is OK.

ghost commented 3 years ago

A third-party source with a comment regarding the conflict between cgroup1 and cgroup2: https://elixir.bootlin.com/linux/v4.19.156/source/include/linux/cgroup-defs.h#L745

anshuv-mfst commented 3 years ago

@chitra-raghavan/Kannan/DELL team, could you please look into this issue?