arista-nwolfe opened 4 weeks ago
@Javier-Tan
Hi @arista-nwolfe, do you have any specific testcases to trigger this? We see failures but no crash when running consecutive tests targeting the same ACL e.g.
- acl/test_acl.py::TestAclWithReboot::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]
- acl/test_acl.py::TestAclWithPortToggle::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]
This is what has worked for me. I've also reproduced it running all the ACL tests but with only the `TestAclWithReboot` and `TestAclWithPortToggle` classes, i.e. I've commented out `TestIncrementalAcl` and `TestBasicAcl` (you need to copy `setup_rules` from `TestBasicAcl` to the other classes and change their parent to `BaseAclTest` to do this).
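The class change described above can be sketched roughly as follows. This is a hypothetical stub, not the real code: the actual classes live in sonic-mgmt's `acl/test_acl.py` and take pytest fixtures, and the real `setup_rules` body applies ACL rules to the DUT.

```python
# Hypothetical sketch of the refactor described in the comment above.
# Class/method names follow the thread; bodies here are stubs only.

class BaseAclTest:
    """Shared ACL test scaffolding (stub)."""

# Parent changed from TestBasicAcl to BaseAclTest, with setup_rules
# copied in from TestBasicAcl, so that TestBasicAcl itself can be
# commented out of the run.
class TestAclWithPortToggle(BaseAclTest):
    def setup_rules(self, dut, acl_table):
        # Placeholder for the body copied from TestBasicAcl.setup_rules;
        # here we just record which table would have rules applied.
        dut.setdefault("applied_tables", []).append(acl_table)
```

The same change would be applied to `TestAclWithReboot`.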
Could you confirm that you're running 202405? I forgot to mention that in the original post
Update here: I don't think the `TestAclWithReboot` test is important in reproducing this issue.
I was able to reproduce the orchagent crash by just running `acl/test_acl.py::TestAclWithPortToggle::test_ingress_unmatched_blocked`, i.e.:
`sudo ./run_tests.sh -n ardut -u -c acl/test_acl.py::TestAclWithPortToggle::test_ingress_unmatched_blocked -x -e '--pdb'`
I'll update the bug description.
Adding syslog and sairedis logs from the reproduced issue in the description
sairedis.asic0.rec.22.gz sairedis.asic0.rec.23.gz syslog.8.gz syslog.9.gz
@arista-nwolfe Can confirm I'm running 202405; I will try testing on an alternate SKU to see if I can reproduce. Test failures are consistent even without crashing, so there's a possibility it's linked anyway. Will keep triaging.
Hi @arista-nwolfe, you mention "(Not 100% reproducible, may take a few tries)". I'm having trouble reproducing an orchagent crash. How often are you experiencing these crashes, and do you use any specific commands to reproduce?
Some commands tested:
- `sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithReboot::test_icmp_match_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]" -c "acl/test_acl.py::TestAclWithPortToggle::test_egress_unmatched_forwarded[ipv6-egress-downlink->uplink-default-no_vlan]" -O ....`
- `sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithPortToggle" -O ..`
- `sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithReboot" -c "acl/test_acl.py::TestAclWithPortToggle" -O ...`
- `sudo ./run_tests.sh -c "acl/test_acl.py::TestAclWithPortToggle::test_ingress_unmatched_blocked" ...`
As an update, the only orchagent problem I've managed to recreate a few times is the `TestAclWithReboot` orchagent crash on reboot: acl_syslog.txt
The crash you saw doesn't look quite the same as the crash I'm seeing.
As I'm able to reproduce this on my side, are there any other files you'd like me to grab from the failure state? Or we could jump on a call and look at the failure state together, if that would be helpful.
Uploading syslog and sairedis files from the reproduced failure we debugged today:
Crash not seen with the `acl` tests anymore with the fix to support > 64 members in an ECMP group. However, the same crash is seen on other tests. @arista-nwolfe to add more logs.
We are seeing this crash when running `pc/test_po_update.py::test_po_update` now.
Here are the relevant logs:
2024 Nov 13 17:16:51.309956 cmp206-4 INFO python[836388]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member del PortChannel102 Ethernet0
2024 Nov 13 17:16:52.335950 cmp206-4 INFO python[836416]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member del PortChannel102 Ethernet8
2024 Nov 13 17:16:53.995323 cmp206-4 INFO python[836457]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config interface -n asic0 ip remove PortChannel102 10.0.0.0/31
2024 Nov 13 17:17:31.628176 cmp206-4 INFO python[837560]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 add PortChannel999
2024 Nov 13 17:17:32.951272 cmp206-4 INFO python[837591]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member add PortChannel999 Ethernet0
2024 Nov 13 17:17:34.169119 cmp206-4 INFO python[837628]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config portchannel -n asic0 member add PortChannel999 Ethernet8
2024 Nov 13 17:17:35.356295 cmp206-4 INFO python[837737]: ansible-ansible.legacy.command Invoked with _raw_params=sudo config interface -n asic0 ip add PortChannel999 10.0.0.0/31
2024 Nov 13 17:17:39.283207 cmp206-4 NOTICE swss1#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.1@Ethernet-IB1,10.0.0.5@Ethernet-IB1,10.0.0.9@Ethernet-IB1,10.0.0.13@Ethernet-IB1,10.0.0.17@Ethernet-IB1,10.0.0.21@Ethernet-IB1,10.0.0.25@Ethernet-IB1,10.0.0.29@Ethernet-IB1,10.0.0.33@Ethernet-IB1,10.0.0.35@Ethernet-IB1,10.0.0.37@Ethernet144,10.0.0.41@Ethernet168,10.0.0.43@Ethernet176,10.0.0.45@Ethernet184,10.0.0.47@Ethernet192,10.0.0.49@Ethernet200,10.0.0.51@Ethernet208,10.0.0.53@Ethernet224,10.0.0.55@Ethernet232,10.0.0.57@Ethernet240,10.0.0.59@Ethernet264,10.0.0.61@Ethernet272,10.0.0.63@Ethernet280
2024 Nov 13 17:17:39.283907 cmp206-4 ERR swss1#orchagent: :- meta_sai_validate_oid: object key SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x12d0000000012df doesn't exist
2024 Nov 13 17:17:39.284118 cmp206-4 ERR swss1#orchagent: :- flush_removing_entries: ObjectBulker.flush remove entries failed, number of entries to remove: 23, status: SAI_STATUS_ITEM_NOT_FOUND
2024 Nov 13 17:17:39.284320 cmp206-4 ERR swss1#orchagent: :- removeNextHopGroup: Failed to remove next hop group member[0] 12d0000000012df, rv:-23
2024 Nov 13 17:17:39.284517 cmp206-4 ERR swss1#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP_GROUP, status: SAI_STATUS_NOT_EXECUTED
2024 Nov 13 17:17:39.284714 cmp206-4 NOTICE swss1#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Nov 13 17:17:39.285471 cmp206-4 NOTICE syncd1#syncd: :- processNotifySyncd: Invoking SAI failure dump
2024 Nov 13 17:17:39.298181 cmp206-4 NOTICE swss1#orchagent: :- sai_redis_notify_syncd: invoked DUMP succeeded
I'll also attach the syslog and sairedis logs: syslog.txt sairedis.asic1.rec.2.gz
When running `acl/test_acl.py`, specifically `TestAclWithPortToggle`, we've seen orchagent crashes due to attempts to delete nexthops that don't exist:
Backtrace:
Syslog:
When I look for that object ID `oid:0x2d000000001ee7` in the sairedis logs, I see that it was created and deleted ~20s apart. So this justifies the SAI error (the entry isn't in ASIC_DB), but I can't find any obvious reason for the nexthop being removed at 18:08:55 in the syslog. The only noteworthy thing is that `TestAclWithPortToggle` triggered the shutdown of all ports at 18:08:08:
And I see this nexthop brought down around the same time, but it's not the same as the next hop listed at the crash:
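Tracing an OID's create/remove lifecycle like this can be done with a plain grep over the recordings. A minimal sketch, using a made-up recording file rather than the real `sairedis.asic*.rec` files; in sairedis recordings, `c` lines are creates and `r` lines are removes:

```shell
# Build a tiny stand-in recording (the real files come from the DUT).
REC=/tmp/sairedis.rec.sample
cat > "$REC" <<'EOF'
2024-11-13.18:08:35.100000|c|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x2d000000001ee7|SAI_NEXT_HOP_GROUP_MEMBER_ATTR_NEXT_HOP_GROUP_ID=oid:0x5000000000abc
2024-11-13.18:08:40.100000|c|SAI_OBJECT_TYPE_NEXT_HOP:oid:0x400000000aaaa|SAI_NEXT_HOP_ATTR_IP=10.0.0.57
2024-11-13.18:08:55.200000|r|SAI_OBJECT_TYPE_NEXT_HOP_GROUP_MEMBER:oid:0x2d000000001ee7
EOF
# Every operation on the suspect OID, with timestamps, in order:
grep -n 'oid:0x2d000000001ee7' "$REC"
```

Running the same grep across all decompressed `sairedis.*.rec*` files shows whether the member was removed before orchagent's bulk delete tried to remove it again.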
Steps to reproduce the issue: run `acl/test_acl.py::TestAclWithPortToggle` on a T2 system (not 100% reproducible, may take a few tries).
NOTE: This is also seen on t2-min topologies.
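Since the repro is flaky, a simple retry loop helps. A sketch, where `TEST_CMD` is a stand-in for the `run_tests.sh` invocation above (stubbed to `true` here so the snippet is self-contained):

```shell
# Hypothetical retry helper: re-run a flaky test command until it
# exits non-zero, up to MAX_TRIES attempts.
TEST_CMD="${TEST_CMD:-true}"   # stand-in; set to the run_tests.sh command
MAX_TRIES="${MAX_TRIES:-5}"
i=1
reproduced=0
while [ "$i" -le "$MAX_TRIES" ]; do
    echo "attempt $i"
    if ! $TEST_CMD; then
        echo "failure reproduced on attempt $i"
        reproduced=1
        break
    fi
    i=$((i + 1))
done
```

With the real test command in `TEST_CMD`, the loop stops on the first failing run so the DUT is left in the failure state for log collection.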