sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[T2][202405] Orchagent crashes when running `acl/test_acl.py` sonic-mgmt test #20605

Open arista-nwolfe opened 4 weeks ago

arista-nwolfe commented 4 weeks ago

When running acl/test_acl.py, specifically TestAclWithPortToggle, we've seen orchagent crash due to attempts to delete next hop group members that no longer exist.

Backtrace:

#0  0x00007f7e66107e3c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f7e660b8fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f7e660a3472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x0000556e9607ce9d in handleSaiFailure (abort_on_failure=abort_on_failure@entry=true) at ./orchagent/saihelper.cpp:834
#4  0x0000556e9631afc5 in handleSaiRemoveStatus (api=api@entry=SAI_API_NEXT_HOP_GROUP, status=<optimized out>, context=context@entry=0x0)
    at ./orchagent/saihelper.cpp:733
#5  0x0000556e961a6fe7 in RouteOrch::removeNextHopGroup (this=this@entry=0x556e9835c430, nexthops=...) at ./orchagent/routeorch.cpp:1470
#6  0x0000556e961afa1b in RouteOrch::doTask (this=0x556e9835c430, consumer=...) at ./orchagent/routeorch.cpp:1038
#7  0x0000556e9617441d in Consumer::drain (this=0x556e9835dca0) at ./orchagent/orch.cpp:258
#8  Consumer::drain (this=0x556e9835dca0) at ./orchagent/orch.cpp:255
#9  Consumer::execute (this=0x556e9835dca0) at ./orchagent/orch.cpp:252
#10 0x0000556e9616034a in OrchDaemon::start (this=this@entry=0x556e9829a480) at ./orchagent/orchdaemon.cpp:873
#11 0x0000556e960cdd59 in main (argc=<optimized out>, argv=<optimized out>) at ./orchagent/main.cpp:810

Syslog:

2024 Oct 23 18:11:23.271424 cmp206-5 NOTICE swss0#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.93@Ethernet-IB0,10.0.0.99@Ethernet-IB0,10.0.0.101@Ethernet-IB0,10.0.0.103@Ethernet-IB0,10.0.0.105@Ethernet-IB0,10.0.0.107@Ethernet-IB0,10.0.0.109@Ethernet-IB0,10.0.0.111@Ethernet-IB0,10.0.0.113@Ethernet-IB0,10.0.0.115@Ethernet-IB0,10.0.0.117@Ethernet-IB0,10.0.0.119@Ethernet-IB0,10.0.0.121@Ethernet-IB0,10.0.0.123@Ethernet-IB0,10.0.0.125@Ethernet-IB0,10.0.0.127@Ethernet-IB0,10.0.0.167@Ethernet-IB0,10.0.0.169@Ethernet-IB0,10.0.0.171@Ethernet-IB0,10.0.0.173@Ethernet-IB0,10.0.0.175@Ethernet-IB0,10.0.0.177@Ethernet-IB0,10.0.0.179@Ethernet-IB0,10.0.0.181@Ethernet-IB0,10.0.0.183@Ethernet-IB0,10.0.0.185@Ethernet-IB0,10.0.0.187@Ethernet-IB0,10.0.0.189@Ethernet-IB0,10.0.0.191@Ethernet-IB0

2024 Oct 23 18:11:23.271424 cmp206-5 ERR swss0#orchagent: :- meta_sai_validate_oid: object key SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x2d000000001ee7 doesn't exist
2024 Oct 23 18:11:23.271424 cmp206-5 ERR swss0#orchagent: :- flush_removing_entries: ObjectBulker.flush remove entries failed, number of entries to remove: 29, status: SAI_STATUS_ITEM_NOT_FOUND
2024 Oct 23 18:11:23.271448 cmp206-5 ERR swss0#orchagent: :- removeNextHopGroup: Failed to remove next hop group member[0] 2d0000000020fd, rv:-23
2024 Oct 23 18:11:23.271448 cmp206-5 ERR swss0#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP_GROUP, status: SAI_STATUS_NOT_EXECUTED

When I look for that object ID oid:0x2d000000001ee7 in the sairedis logs, I see that it was created and then removed about 20 seconds later:

2024-10-23.18:08:35.276522|C|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER||oid:0x2d000000001ed1|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID=oid:0x5000000001ed0|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_ID=oid:0x4000000000c27||…||oid:0x2d000000001ee7|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID=oid:0x5000000001ed0|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_ID=oid:0x400000000100c||…||

2024-10-23.18:08:55.274258|r|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x2d000000001ee7
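For reference, one way to pull every record for a given OID out of the rotated sairedis recordings and view them in time order (this assumes the usual /var/log/swss/sairedis.asicN.rec* naming on the host; adjust for your setup):

# zgrep reads both plain and gzipped recordings; 'c'/'C' records are creates, 'r'/'R' are removes
zgrep -h '0x2d000000001ee7' /var/log/swss/sairedis.asic0.rec* | sort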

So the SAI error itself makes sense (the entry isn't in ASIC_DB), but I can't find any obvious reason in the syslog for that next hop group member being removed at 18:08:55. The only noteworthy event is that TestAclWithPortToggle triggered a shutdown of all ports at 18:08:08:

2024 Oct 23 18:08:08.458143 cmp206-5 INFO python[23588]: ansible-shell_cmds Invoked with cmds=['config interface -n asic0 shutdown Ethernet0', 'config interface -n asic0 shutdown Ethernet8', ...

And I see this next hop group brought down around the same time, but it's not the same group as the one listed at the crash:

2024 Oct 23 18:08:53.408694 cmp206-5 NOTICE swss0#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.1@Ethernet-IB0,10.0.0.5@Ethernet-IB0,10.0.0.9@Ethernet-IB0,10.0.0.13@Ethernet-IB0,10.0.0.17@Ethernet-IB0,10.0.0.21@Ethernet-IB0,10.0.0.25@Ethernet-IB0,10.0.0.29@Ethernet-IB0,10.0.0.33@Ethernet-IB0,10.0.0.35@Ethernet-IB0,10.0.0.37@Ethernet-IB0,10.0.0.39@Ethernet-IB0,10.0.0.41@Ethernet-IB0,10.0.0.43@Ethernet-IB0,10.0.0.47@Ethernet-IB0,10.0.0.49@Ethernet-IB0,10.0.0.51@Ethernet-IB0,10.0.0.53@Ethernet-IB0,10.0.0.55@Ethernet-IB0,10.0.0.57@Ethernet-IB0,10.0.0.59@Ethernet-IB0,10.0.0.61@Ethernet-IB0,10.0.0.63@Ethernet-IB0
2024 Oct 23 18:08:53.408880 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.1@Ethernet-IB0 with next_hop_id 2d0000000010ac
2024 Oct 23 18:08:53.409005 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.5@Ethernet-IB0 with next_hop_id 2d0000000010ad
2024 Oct 23 18:08:53.409125 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.9@Ethernet-IB0 with next_hop_id 2d0000000010ae
2024 Oct 23 18:08:53.409243 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.13@Ethernet-IB0 with next_hop_id 2d0000000010af
2024 Oct 23 18:08:53.409361 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.17@Ethernet-IB0 with next_hop_id 2d0000000010b0
2024 Oct 23 18:08:53.409478 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.21@Ethernet-IB0 with next_hop_id 2d0000000010b1
2024 Oct 23 18:08:53.409593 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.25@Ethernet-IB0 with next_hop_id 2d0000000010b2
2024 Oct 23 18:08:53.409730 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.29@Ethernet-IB0 with next_hop_id 2d0000000010b3
2024 Oct 23 18:08:53.409847 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.33@Ethernet-IB0 with next_hop_id 2d0000000010b4
2024 Oct 23 18:08:53.409962 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.35@Ethernet-IB0 with next_hop_id 2d0000000010b5
2024 Oct 23 18:08:53.410096 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.37@Ethernet-IB0 with next_hop_id 2d0000000010b6
2024 Oct 23 18:08:53.410213 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.39@Ethernet-IB0 with next_hop_id 2d0000000010b7
2024 Oct 23 18:08:53.410331 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.41@Ethernet-IB0 with next_hop_id 2d0000000010b8
2024 Oct 23 18:08:53.410448 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.43@Ethernet-IB0 with next_hop_id 2d0000000010b9
2024 Oct 23 18:08:53.410563 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.47@Ethernet-IB0 with next_hop_id 2d0000000010ba
2024 Oct 23 18:08:53.410678 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.49@Ethernet-IB0 with next_hop_id 2d0000000010bb
2024 Oct 23 18:08:53.410793 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.51@Ethernet-IB0 with next_hop_id 2d0000000010bc
2024 Oct 23 18:08:53.410907 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.53@Ethernet-IB0 with next_hop_id 2d0000000010bd
2024 Oct 23 18:08:53.411024 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.55@Ethernet-IB0 with next_hop_id 2d0000000010be
2024 Oct 23 18:08:53.411140 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.57@Ethernet-IB0 with next_hop_id 2d0000000010bf
2024 Oct 23 18:08:53.411257 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.59@Ethernet-IB0 with next_hop_id 2d0000000010c0
2024 Oct 23 18:08:53.411375 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.61@Ethernet-IB0 with next_hop_id 2d0000000010c1
2024 Oct 23 18:08:53.411491 cmp206-5 WARNING swss0#orchagent: :- removeNextHopGroup: NHFLAGS_IFDOWN set for next hop group member 10.0.0.63@Ethernet-IB0 with next_hop_id 2d0000000010c2

Steps to reproduce the issue:

  1. Run acl/test_acl.py::TestAclWithPortToggle on a T2 system (not 100% reproducible, it may take a few tries); an example invocation is shown below.

NOTE: This is also seen on t2-min topologies
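For reference, the sonic-mgmt invocation that has reproduced the crash for me (the testbed name -n ardut and the -x / -e '--pdb' flags are specific to my environment):

sudo ./run_tests.sh -n ardut -u -c acl/test_acl.py::TestAclWithPortToggle::test_ingress_unmatched_blocked -x -e '--pdb'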

arlakshm commented 4 weeks ago

@Javier-Tan

Javier-Tan commented 4 weeks ago

Hi @arista-nwolfe, do you have any specific testcases to trigger this? We see failures but no crash when running consecutive tests targeting the same ACL e.g.

arista-nwolfe commented 3 weeks ago

> Hi @arista-nwolfe, do you have any specific testcases to trigger this? We see failures but no crash when running consecutive tests targeting the same ACL e.g.

  • acl/test_acl.py::TestAclWithReboot::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]
  • acl/test_acl.py::TestAclWithPortToggle::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]

This is what has worked for me. I've also reproduced it by running all the ACL tests but with only the TestAclWithReboot and TestAclWithPortToggle classes, i.e. I commented out TestIncrementalAcl and TestBasicAcl (to do this you need to copy setup_rules from TestBasicAcl into the other classes and change their parent to BaseAclTest).

Could you confirm that you're running 202405? I forgot to mention that in the original post.

arista-nwolfe commented 3 weeks ago

Update here: I don't think the TestAclWithReboot test matters for reproducing this issue. I was able to reproduce the orchagent crash by running just acl/test_acl.py::TestAclWithPortToggle::test_ingress_unmatched_blocked, i.e.

sudo ./run_tests.sh -n ardut -u -c acl/test_acl.py::TestAclWithPortToggle::test_ingress_unmatched_blocked -x -e '--pdb'

I'll update the bug description.

arista-nwolfe commented 3 weeks ago

Adding the syslog and sairedis logs from the reproduced issue to the description:

sairedis.asic0.rec.22.gz sairedis.asic0.rec.23.gz syslog.8.gz syslog.9.gz

Javier-Tan commented 3 weeks ago

> > Hi @arista-nwolfe, do you have any specific testcases to trigger this? We see failures but no crash when running consecutive tests targeting the same ACL e.g.
>
>   • acl/test_acl.py::TestAclWithReboot::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]
>   • acl/test_acl.py::TestAclWithPortToggle::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]
>
> This is what has worked for me. I've also reproduced it by running all the ACL tests but with only the TestAclWithReboot and TestAclWithPortToggle classes, i.e. I commented out TestIncrementalAcl and TestBasicAcl (to do this you need to copy setup_rules from TestBasicAcl into the other classes and change their parent to BaseAclTest).
>
> Could you confirm that you're running 202405? I forgot to mention that in the original post.

@arista-nwolfe Can confirm I'm running 202405; I will try testing on an alternate SKU to see if I can reproduce. The test failures are consistent even without a crash, so there's a possibility they're linked anyway. Will keep triaging.

Javier-Tan commented 3 weeks ago

Hi @arista-nwolfe, you mention "(Not 100% reproducible, may take a few tries)". I'm having trouble reproducing an orchagent crash. How often are you seeing these crashes, and do you use any specific commands to reproduce?

Some commands tested:

sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithReboot::test_icmp_match_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]" -c "acl/test_acl.py::TestAclWithPortToggle::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]" -O ....

sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithPortToggle" -O ..

sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithReboot" -c "acl/test_acl.py::TestAclWithPortToggle" -O ...

sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithPortToggle::test_ingress_unmatched_blocked" ...
Javier-Tan commented 3 weeks ago

As an update, the only orchagent problem I've managed to recreate a few times is a TestAclWithReboot orchagent crash on reboot: acl_syslog.txt

arista-nwolfe commented 3 weeks ago

The crash you saw doesn't look quite the same as the crash I'm seeing.

Since I'm able to reproduce this on my side, are there any other files you'd like me to grab from the failure state? Or we could jump on a call and look at the failure state together, if that would be helpful.

arista-nwolfe commented 3 weeks ago

Uploading syslog and sairedis files from the reproduced failure we debugged today:

sairedis.asic0.rec.1.gz sairedis.asic0.rec.2.gz syslog.gz

arlakshm commented 1 week ago

The crash is no longer seen with the ACL tests after the fix to support more than 64 members in an ECMP group. However, the same crash is seen in other tests. @arista-nwolfe to add more logs.

arista-nwolfe commented 1 week ago

We are now seeing this crash when running pc/test_po_update.py::test_po_update. Here are the relevant logs:

2024 Nov 13 17:16:51.309956 cmp206-4 INFO python[836388]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member del PortChannel102 Ethernet0
2024 Nov 13 17:16:52.335950 cmp206-4 INFO python[836416]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member del PortChannel102 Ethernet8
2024 Nov 13 17:16:53.995323 cmp206-4 INFO python[836457]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config interface -n asic0 ip remove PortChannel102 10.0.0.0/31

2024 Nov 13 17:17:31.628176 cmp206-4 INFO python[837560]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 add PortChannel999
2024 Nov 13 17:17:32.951272 cmp206-4 INFO python[837591]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member add PortChannel999 Ethernet0
2024 Nov 13 17:17:34.169119 cmp206-4 INFO python[837628]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member add PortChannel999 Ethernet8
2024 Nov 13 17:17:35.356295 cmp206-4 INFO python[837737]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config interface -n asic0 ip add PortChannel999 10.0.0.0/31

2024 Nov 13 17:17:39.283207 cmp206-4 NOTICE swss1#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.1@Ethernet-IB1,10.0.0.5@Ethernet-IB1,10.0.0.9@Ethernet-IB1,10.0.0.13@Ethernet-IB1,10.0.0.17@Ethernet-IB1,10.0.0.21@Ethernet-IB1,10.0.0.25@Ethernet-IB1,10.0.0.29@Ethernet-IB1,10.0.0.33@Ethernet-IB1,10.0.0.35@Ethernet-IB1,10.0.0.37@Ethernet144,10.0.0.41@Ethernet168,10.0.0.43@Ethernet176,10.0.0.45@Ethernet184,10.0.0.47@Ethernet192,10.0.0.49@Ethernet200,10.0.0.51@Ethernet208,10.0.0.53@Ethernet224,10.0.0.55@Ethernet232,10.0.0.57@Ethernet240,10.0.0.59@Ethernet264,10.0.0.61@Ethernet272,10.0.0.63@Ethernet280
2024 Nov 13 17:17:39.283907 cmp206-4 ERR swss1#orchagent: :- meta_sai_validate_oid: object key SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x12d0000000012df doesn't exist
2024 Nov 13 17:17:39.284118 cmp206-4 ERR swss1#orchagent: :- flush_removing_entries: ObjectBulker.flush remove entries failed, number of entries to remove: 23, status: SAI_STATUS_ITEM_NOT_FOUND
2024 Nov 13 17:17:39.284320 cmp206-4 ERR swss1#orchagent: :- removeNextHopGroup: Failed to remove next hop group member[0] 12d0000000012df, rv:-23
2024 Nov 13 17:17:39.284517 cmp206-4 ERR swss1#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP_GROUP, status: SAI_STATUS_NOT_EXECUTED
2024 Nov 13 17:17:39.284714 cmp206-4 NOTICE swss1#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Nov 13 17:17:39.285471 cmp206-4 NOTICE syncd1#syncd: :- processNotifySyncd: Invoking SAI failure dump
2024 Nov 13 17:17:39.298181 cmp206-4 NOTICE swss1#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded

I'll also attach the syslog and sairedis logs: syslog.txt sairedis.asic1.rec.2.gz
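For reference, the PortChannel churn that pc/test_po_update.py::test_po_update drives just before the crash (taken from the syslog excerpts above) boils down to roughly the following sequence on asic0; I haven't verified whether replaying it by hand is enough to reproduce the crash:

# remove the members and IP from the original PortChannel
sudo config portchannel -n asic0 member del PortChannel102 Ethernet0
sudo config portchannel -n asic0 member del PortChannel102 Ethernet8
sudo config interface -n asic0 ip remove PortChannel102 10.0.0.0/31
# re-add the same members and IP under a new PortChannel
sudo config portchannel -n asic0 add PortChannel999
sudo config portchannel -n asic0 member add PortChannel999 Ethernet0
sudo config portchannel -n asic0 member add PortChannel999 Ethernet8
sudo config interface -n asic0 ip add PortChannel999 10.0.0.0/31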