sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[sflow] sflow samples are not received any more after warm-reboot in sonic 202012 #10442

Open maulik-marvell opened 2 years ago

maulik-marvell commented 2 years ago

Description

When the test case sflow/test_sflow.py::TestReboot::testWarmreboot is run, it fails because no sflow samples are received after warm-reboot.

Root cause:

The host sflow daemon running in the sflow docker is no longer listening on the netlink multicast group ‘packets’ (hostif ‘psample’) after warm-reboot.

Detailed flow of test:

  1. The test case sflow/test_sflow.py::TestReboot::testWarmreboot configures sflow on the DUT.
  2. A SAI hostif of type netlink is created with the name psample, and the ASIC driver registers the netlink multicast group ‘packets’ in the Linux netlink subsystem.
  3. The host sflow daemon running in the sflow docker connects to this netlink as a client and listens for the multicast packet events that the ASIC driver triggers when sflow samples are sent.
  4. All good so far. The test now performs a warm-reboot.
  5. As per the design, db_migrator deletes all CoPP configs from the APP DB during warm-reboot.
  6. The DUT warm-reboots, the kernel restarts, and the ASIC driver re-establishes the netlink state as it was before the warm shutdown.
  7. The sflow docker restarts, and the host sflow daemon inside it re-establishes the client connection by joining the multicast group.
  8. As part of warm-reboot, SONiC compares the temporary view and the current view of objects and finds the following mismatches:
14455 Mar 16 15:38:29.800927 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF on current view 33 is different than on temporary view: 32
14456 Mar 16 15:38:29.800927 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP on current view 6 is different than on temporary view: 1
14457 Mar 16 15:38:29.800927 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_POLICER on current view 4 is different than on temporary view: 0
14458 Mar 16 15:38:29.801055 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_FDB_ENTRY on current view 1 is different than on temporary view: 0
14459 Mar 16 15:38:29.801055 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP on current view 13 is different than on temporary view: 1
14460 Mar 16 15:38:29.801055 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TABLE_ENTRY on current view 2 is different than on temporary view: 1
14461 Mar 16 15:38:29.804960 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count is different on both view, there will be ASIC OPERATIONS!

**As mentioned above, SONiC finds one extra hostif (the netlink one) after warm-reboot and deletes it:**

  9. **SAI hostif remove is called and, as part of that, the ASIC driver de-registers the netlink group from the Linux netlink subsystem. The sflow packet path is broken here.**
  10. The orchagent CoPP manager recreates the deleted traps, trap groups, and hostif by reading /etc/sonic/copp_cfg.json after warm init.
  11. The ASIC driver again registers the netlink interface ‘psample’ to the multicast group ‘packets’ in the Linux netlink subsystem.
  12. However, since the sflow docker is already up and running (connected to the socket), the host sflow daemon does not re-establish the client connection to this multicast group.
  13. Hence packets are not received by the agent running in the sflow docker.
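
The syncd warnings quoted in step 8 can be summarized per object type to see which objects will be removed. The following is a minimal sketch, assuming the log format shown above; `view_mismatches` and `SAMPLE` are hypothetical names used only for illustration, and `SAMPLE` holds three of the lines quoted above.

```python
import re

# Matches the syncd "logViewObjectCount" warnings in the format quoted above.
COUNT_RE = re.compile(
    r"object count for (?P<type>SAI_OBJECT_TYPE_\w+) "
    r"on current view (?P<current>\d+) "
    r"is different than on temporary view: (?P<temporary>\d+)"
)


def view_mismatches(log_text):
    """Return {object_type: (current, temporary, current - temporary)}."""
    result = {}
    for match in COUNT_RE.finditer(log_text):
        current = int(match.group("current"))
        temporary = int(match.group("temporary"))
        result[match.group("type")] = (current, temporary, current - temporary)
    return result


# Three of the warnings from the issue, copied verbatim.
SAMPLE = """\
14455 Mar 16 15:38:29.800927 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF on current view 33 is different than on temporary view: 32
14456 Mar 16 15:38:29.800927 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_HOSTIF_TRAP_GROUP on current view 6 is different than on temporary view: 1
14457 Mar 16 15:38:29.800927 sonic-dut WARNING syncd#syncd: :- logViewObjectCount: object count for SAI_OBJECT_TYPE_POLICER on current view 4 is different than on temporary view: 0
"""

if __name__ == "__main__":
    for obj_type, (cur, tmp, extra) in view_mismatches(SAMPLE).items():
        print(f"{obj_type}: current={cur} temporary={tmp} extra={extra}")
```

The single extra SAI_OBJECT_TYPE_HOSTIF entry is the netlink hostif whose removal is described in step 9.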

As mentioned in step 12 above, the host sflow daemon running in the sflow docker should have opened a fresh socket connection to the multicast group ‘packets’, but this does not happen.
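
The sequence described above can be replayed with a toy model. This is only a sketch, not kernel or hsflowd code: `ToyMulticastGroup`, `ToyListener` and `replay_warm_reboot` are hypothetical names, and the model assumes that de-registering a group drops every existing membership, which is the behaviour this report describes.

```python
class ToyListener:
    """Stand-in for the host sflow daemon's netlink client socket."""

    def __init__(self):
        self.received = []


class ToyMulticastGroup:
    """Toy stand-in for the 'packets' multicast group; not kernel code."""

    def __init__(self):
        self.registered = False
        self.members = set()

    def register(self):
        self.registered = True

    def unregister(self):
        # Assumption: removing the group drops all existing memberships.
        self.registered = False
        self.members.clear()

    def join(self, listener):
        if not self.registered:
            raise RuntimeError("group is not registered")
        self.members.add(listener)

    def publish(self, sample):
        """Deliver a sample to current members; return the delivery count."""
        for listener in self.members:
            listener.received.append(sample)
        return len(self.members)


def replay_warm_reboot():
    group = ToyMulticastGroup()
    daemon = ToyListener()
    group.register()                    # step 6: driver restores netlink state
    group.join(daemon)                  # step 7: daemon joins after restart
    group.unregister()                  # step 9: SAI hostif remove
    group.register()                    # step 11: hostif recreated
    lost = group.publish("sample-1")    # steps 12-13: daemon never rejoined
    group.join(daemon)                  # what a fresh connection would do
    delivered = group.publish("sample-2")
    return lost, delivered, daemon.received
```

In this model the first sample reaches no listener, and delivery resumes only after the explicit second `join`, which corresponds to the fresh connection the daemon is expected to make.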

Workaround tried:

After warm-reboot and before sending traffic, if we disable the sflow feature on the DUT and re-enable it, the host sflow daemon re-establishes the connection to the multicast group and sampling starts working.
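
The workaround could be scripted roughly as follows. This is a sketch under assumptions: the issue does not say which command was used, so mapping "disable the sflow feature" to `config feature state sflow disabled` is an assumption, and `bounce_sflow`, `SFLOW_BOUNCE_COMMANDS` and the 10-second pause are hypothetical choices for illustration.

```python
import subprocess
import time

# Assumption: the feature is toggled through the SONiC feature CLI.
SFLOW_BOUNCE_COMMANDS = [
    ["sudo", "config", "feature", "state", "sflow", "disabled"],
    ["sudo", "config", "feature", "state", "sflow", "enabled"],
]


def bounce_sflow(run=subprocess.run, sleep=time.sleep, settle_seconds=10):
    """Disable and re-enable the sflow feature so the daemon reconnects.

    `run` and `sleep` are injectable so the sequence can be dry-run.
    `settle_seconds` is a guessed pause between the two commands.
    """
    for index, command in enumerate(SFLOW_BOUNCE_COMMANDS):
        run(command, check=True)  # raises CalledProcessError on failure
        if index == 0:
            sleep(settle_seconds)
```

Calling `bounce_sflow()` on the DUT after warm-reboot, before traffic is sent, mirrors the manual steps. Passing a recording function as `run` shows the commands without executing them.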

Steps to reproduce the issue:

  1. Run the sflow/test_sflow.py::TestReboot::testWarmreboot pytest on a t0 topology with ‘--enable_sflow_feature’.

Describe the results you received:

E "sflow_test.SflowTest ... FAIL",
E "",
E "======================================================================",
E "FAIL: sflow_test.SflowTest",
E "----------------------------------------------------------------------",
E "Traceback (most recent call last):",
E " File \"ptftests/sflow_test.py\", line 267, in runTest",
E " self.packet_analyzer(self.collector0_samples,'collector0',self.poll_tests)",
E " File \"ptftests/sflow_test.py\", line 170, in packet_analyzer",
E " self.analyze_flow_sample(data,collector)",
E " File \"ptftests/sflow_test.py\", line 208, in analyze_flow_sample",
E " \"Expected Number of samples are not collected collected from Interface %s in collector %s , Received %s\" %(port,collector,data['flow_port_count'][index]))",
E "AssertionError: Expected Number of samples are not collected collected from Interface Ethernet232 in collector collector0 , Received 46",
E "",
E "----------------------------------------------------------------------",
E "Ran 1 test in 82.339s",
E "",
E "FAILED (failures=1)"

Describe the results you expected:

Expected the test to PASS, as the DUT should be able to receive sflow samples after warm-reboot.

Output of show version:

root@sonic-device1-dut:~# show version
SONiC Software Version: SONiC.202012.Innovium.2.0.0.20220208.095204
Distribution: Debian 10.11
Kernel: 4.19.0-12-2-amd64
Build commit: 743561321
Build date: Tue Feb  8 20:58:58 UTC 2022
Built by: admin@sonic 

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

After warm-reboot and before sending traffic, if we disable the sflow feature on the DUT and re-enable it (or restart the sflow docker), the host sflow daemon re-establishes the connection to the multicast group and sampling starts working.

prsunny commented 2 years ago

[Issue Triage] Is this issue seen with master image?

maulik-marvell commented 2 years ago

> [Issue Triage] Is this issue seen with master image?

Did not try with master image, seen in 202012 build.