sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
734 stars 1.41k forks source link

Telemetry Swss Events are not Triggered #19603

Open wumiaont opened 3 months ago

wumiaont commented 3 months ago

Description

Currently Telemetry test involves swss events tests. There are 3 kind of swss events under test now. 1) sonic-events-swss:if-state 2) sonic-events-swss:pfc-storm 3) sonic-events-swss:chk_crm_threshold

It's found that all 3 tests failed because there's no events gets created and then be caught by the subscribers to those events.

Steps to reproduce the issue:

  1. i) Run on ptf server: python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m subscribe -x all[heartbeat=2] -xt EVENTS -o ndastreamingservertest -n --subscribe_mode 0 --submode 1 --interval 0 --update_count 1 --filter_event_regex sonic-events-swss:if-state ii) Get a list of ports which admin and operational states are all up. Execute: config interface -n asic0/1 shutdown Ethernet8 (8 is up for example, for multi asic card), or config interface shutdown Ethernet8 (if not multi asic). Wait for the port is down, then config interface -n asic0/1 startup Ethernet8 or config interface startup Ethernet8 iii) Should expect swss event of if-state be generated and got by the above subscriber. It did not happen.

  2. i) Run on ptf server: python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m subscribe -x all[heartbeat=2] -xt EVENTS -o ndastreamingservertest -n --subscribe_mode 0 --submode 1 --interval 0 --update_count 1 --filter_event_regex sonic-events-swss:pfc-storm ii) Choose one port with admin and operational status are up (Ethernet8 as example). Find the oid of that port. Exec: sonic-db-cli COUNTERS_DB HSET "COUNTERS:xxxxxx" "DEBUG_STORM" "enabled" xxxxxx is oid of that port pfcwd start --action drop Ethernet8 300 --restoration-time 300 pfcwd stop sonic-db-cli COUNTERS_DB HDEL "COUNTERS:xxxxxx" "DEBUG_STORM" iii) Swss pfc-storm event should be generated and got by above subscriber. Which is not happening.

  3. i) Run on ptf server: python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m subscribe -x all[heartbeat=2] -xt EVENTS -o ndastreamingservertest -n --subscribe_mode 0 --submode 1 --interval 0 --update_count 1 --filter_event_regex sonic-events-swss:chk_crm_threshold ii) Execute the following command from shell: (The corresponding parameters can be found from sonic-mgmt/tests/telemetry/events/swss_events.py) duthost.shell("crm config polling interval {}".format(CRM_TEST_POLLING_INTERVAL)) duthost.shell("crm config thresholds acl group high {}".format(CRM_TEST_ACL_GROUP_HIGH)) duthost.shell("crm config thresholds ipv4 route type free") duthost.shell("crm config thresholds ipv4 route low {}".format(CRM_TEST_IPV4_ROUTE_FREE_LOW)) duthost.shell("crm config thresholds ipv4 route high {}".format(CRM_TEST_IPV4_ROUTE_FREE_HIGH)) duthost.shell("crm config thresholds ipv4 route type used") duthost.shell("crm config thresholds ipv4 route low {}".format(CRM_TEST_IPV4_ROUTE_USED_LOW)) duthost.shell("crm config thresholds ipv4 route high {}".format(CRM_TEST_IPV4_ROUTE_USED_HIGH)) time.sleep(WAIT_TIME) # give time for crm threshold exceed to be detected duthost.shell("crm config polling interval {}".format(CRM_DEFAULT_POLLING_INTERVAL)) duthost.shell("crm config thresholds acl group high {}".format(CRM_DEFAULT_ACL_GROUP_HIGH))

    duthost.shell("crm config thresholds ipv4 route low {}".format(CRM_DEFAULT_IPV4_ROUTE_LOW)) duthost.shell("crm config thresholds ipv4 route high {}".format(CRM_DEFAULT_IPV4_ROUTE_HIGH)) duthost.shell("crm config thresholds ipv4 route type free") duthost.shell("crm config thresholds ipv4 route low {}".format(CRM_DEFAULT_IPV4_ROUTE_LOW)) duthost.shell("crm config thresholds ipv4 route high {}".format(CRM_DEFAULT_IPV4_ROUTE_HIGH)) iii) Should expect swss chk_crm_threshold event be created and got by above subscriber. It does not happen.

For detailed information about the above swss tests, sonic-mgmt/tests/telemetry/events/swss_events.py

Describe the results you received:

swss_events tests failed

Describe the results you expected:

swss_events tests pass

Output of show version:

SONiC Software Version: SONiC.202405.0-8e1b8cb6a SONiC OS Version: 12 Distribution: Debian 12.6 Kernel: 6.1.0-11-2-amd64 Build commit: 8e1b8cb6a Build date: Thu Jul 11 00:25:18 UTC 2024 Built by: fountzou@wfsonicdev1

deepak-singhal0408 commented 2 months ago

New Testcase, failing on Chassis/Multi-asic. To Be assigned.

judyjoseph commented 2 months ago

@wumiaont Please can you check further -- is it related to namespace for multi-asic.

judyjoseph commented 2 months ago

@zbud-msft can you check if this handles multi-asic platforms as well

wumiaont commented 2 months ago

Form log we can see the events for if_state is triggered. But client did not receive it. 2024 Aug 1 01:11:59.807917 ixre-egl-board29 NOTICE swss0#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:if-state":{"ifname":"Ethernet80","status":"down","timestamp":"2024-08-01T01:11:59.807716Z"}}

2024 Aug 1 01:12:34.218469 ixre-egl-board29 NOTICE swss0#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:if-state":{"ifname":"Ethernet80","status":"up","timestamp":"2024-08-01T01:12:34.218029Z"}}

I ran gnmi client from the ptf server. No if-state events received. root@6b065efa96a8:~# python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m subscribe -x all[heartbeat=2] -xt EVENTS -o ndastreamingservertest --subscribe_mode 0 --submode 1 --interval 0 --update_count 1 --filter_event_regex sonic-events-swss:if-state Sending SubscribeRequest subscribe { prefix { target: "EVENTS" } subscription { path { elem { name: "all" key { key: "heartbeat" value: "2" } } } mode: ON_CHANGE } }

2024-08-01 01:15:17.837697 response received: 2024-08-01 01:15:19.643316 response received: 2024-08-01 01:15:21.449377 response received: 2024-08-01 01:15:23.254555 response received:

wumiaont commented 2 months ago

With the multi-asic support changes for pfc-storm, we can see from syslog the pfc-storm event are published.

2024 Aug 1 13:51:52.068577 ixre-egl-board29 NOTICE swss0#orchagent: :- report_pfc_storm: PFC Watchdog detected PFC storm on port Ethernet80, queue index 4, queue id 0x150000000003e6 and port id 0x100000000000c 2024 Aug 1 13:51:52.068701 ixre-egl-board29 NOTICE swss0#orchagent: :- publish: EVENT_PUBLISHED: {"sonic-events-swss:pfc-storm":{"ifname":"Ethernet80","port_id":"281474976710668","queue_id":"5910974510924774","queue_index":"4","timestamp":"2024-08-01T13:51:52.68480Z"}}

gnmi client did not received this event so test failed.

root@6b065efa96a8:~# python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m subscribe -x all[heartbeat=2] -xt EVENTS -o ndastreamingservertest --subscribe_mode 0 --submode 1 --interval 0 --update_count 1 --filter_event_regex sonic-events-swss:pfc-storm Sending SubscribeRequest subscribe { prefix { target: "EVENTS" } subscription { path { elem { name: "all" key { key: "heartbeat" value: "2" } } } mode: ON_CHANGE } }

2024-08-01 14:00:28.315384 response received: 2024-08-01 14:00:30.121351 response received: 2024-08-01 14:00:31.926300 response received: 2024-08-01 14:00:33.731765 response received: 2024-08-01 14:00:35.536942 response received:

wumiaont commented 2 months ago

@zbud-msft What could be wrong here? It seems the event has been published.

wumiaont commented 2 months ago

I can get heartbeat events if not do filter on gnmi client. Do not see swss events even can see from log the swss event was published.

root@6b065efa96a8:~# python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.6.231 -p 8080 -m subscribe -x all[heartbeat=2] -xt EVENTS -o ndastreamingservertest --subscribe_mode 0 --submode 1 --interval 0 --update_count 100 Sending SubscribeRequest subscribe { prefix { target: "EVENTS" } subscription { path { elem { name: "all" key { key: "heartbeat" value: "2" } } } mode: ON_CHANGE } }

2024-08-01 14:50:48.240529 response received: update { timestamp: 1722523848226213469 prefix { target: "EVENTS" } update { path { elem { name: "all" key { key: "heartbeat" value: "2" } } } val { json_ietf_val: "{\"sonic-events-eventd:heartbeat\":{\"timestamp\":\"2024-08-01T14:50:48.226150Z\"}}" } } } ......................................

wumiaont commented 2 months ago

@zbud-msft It looks to me that if the event is published by a global service that will work. It it's published by service under certain namespace then client will not receive it. Below log you can see bgp-state events are published from bgp0 or bgp1. Which failed to be received by client. notification is published by rsyslog_plugin, which works.

swss has similar issue of failure as swss is with each namespace.

Looks the publish code has an issue to handle services under namespace.

2024 Aug 1 16:53:04.209148 ixre-egl-board29 NOTICE bgp0#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::2","status":"down","timestamp":"2024-08-01T16:53:04.208989Z"}} 2024 Aug 1 16:53:04.209213 ixre-egl-board29 NOTICE bgp0#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::a","status":"down","timestamp":"2024-08-01T16:53:04.209080Z"}} 2024 Aug 1 16:53:04.218601 ixre-egl-board29 NOTICE bgp1#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"10.0.0.7","status":"down","timestamp":"2024-08-01T16:53:04.218323Z"}} 2024 Aug 1 16:53:04.218802 ixre-egl-board29 NOTICE bgp1#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::16","status":"down","timestamp":"2024-08-01T16:53:04.218654Z"}} 2024 Aug 1 16:53:04.218907 ixre-egl-board29 NOTICE bgp1#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::e","status":"down","timestamp":"2024-08-01T16:53:04.218753Z"}} 2024 Aug 1 16:53:05.290234 ixre-egl-board29 NOTICE bgp0#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"10.0.0.1","status":"up","timestamp":"2024-08-01T16:53:05.289958Z"}} 2024 Aug 1 16:53:05.293026 ixre-egl-board29 NOTICE bgp1#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"10.0.0.11","status":"up","timestamp":"2024-08-01T16:53:05.292833Z"}} 2024 Aug 1 16:53:05.406129 ixre-egl-board29 NOTICE bgp0#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"10.0.0.5","status":"up","timestamp":"2024-08-01T16:53:05.405738Z"}} 2024 Aug 1 16:53:05.406342 ixre-egl-board29 NOTICE bgp0#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::2","status":"up","timestamp":"2024-08-01T16:53:05.405859Z"}} 2024 Aug 1 16:53:05.406460 ixre-egl-board29 NOTICE bgp0#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::a","status":"up","timestamp":"2024-08-01T16:53:05.406214Z"}} 2024 Aug 1 16:53:05.411090 ixre-egl-board29 NOTICE bgp1#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"10.0.0.7","status":"up","timestamp":"2024-08-01T16:53:05.410553Z"}} 2024 Aug 1 16:53:05.411447 ixre-egl-board29 NOTICE bgp1#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::16","status":"up","timestamp":"2024-08-01T16:53:05.411204Z"}} 2024 Aug 1 16:53:05.411546 ixre-egl-board29 NOTICE bgp1#rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:bgp-state":{"ip":"fc00::e","status":"up","timestamp":"2024-08-01T16:53:05.411296Z"}} 2024 Aug 1 16:59:47.205043 ixre-egl-board29 NOTICE rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:notification":{"ip":"3.3.3.1","is_sent":"true","major_code":"5","minor_code":"0","timestamp":"2024-08-01T16:59:47.204266Z"}} 2024 Aug 1 16:59:47.205251 ixre-egl-board29 NOTICE rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:notification":{"ip":"3.3.3.2","is_sent":"true","major_code":"5","minor_code":"0","timestamp":"2024-08-01T16:59:47.204888Z"}} 2024 Aug 1 16:59:49.207720 ixre-egl-board29 NOTICE rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:notification":{"ip":"3.3.3.2","is_sent":"true","major_code":"6","minor_code":"7","timestamp":"2024-08-01T16:59:49.207207Z"}} 2024 Aug 1 16:59:49.207871 ixre-egl-board29 NOTICE rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-bgp:notification":{"ip":"3.3.3.1","is_sent":"true","major_code":"6","minor_code":"7","timestamp":"2024-08-01T16:59:49.207769Z"}}