sonic-net / sonic-mgmt

Configuration management examples for SONiC

[Bug]: platform_tests/api/test_sfp.py:test_tx_disable_channel #13945

Open snider-nokia opened 2 months ago

snider-nokia commented 2 months ago

Issue Description

This test often fails when run against interfaces whose link is already oper up. The failure occurs because the test operates on transceiver modules in parallel with the production PMON Xcvrd CMIS Manager code. If the CMIS Manager is not quiesced or disabled before the module tests start, the tests compete with the CMIS Manager task for module provisioning. Many, if not all, of the tests in this transceiver module suite are at risk of failing for the same reason. The same could occur with SFF Manager, which on some platforms dynamically provisions non-CMIS modules.

Results you see

The failure occurs as test_tx_disable_channel walks the bit vector of applicable transceiver module channels, disabling each one and then verifying its state. Meanwhile, the production PMON Xcvrd CMIS Manager task receives a Redis subscription callback indicating that the link has gone oper down. The CMIS Manager task then attempts to re-provision the module under test, because that is its job. The test and the production task are contending for the same module, and this contention must be resolved.
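The interleaving described above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions: `FakeModule`, `FakeCmisManager`, and `walk_channels` are hypothetical names, not sonic-mgmt or xcvrd code, and the model assumes that any disabled channel takes the link oper down and that re-provisioning clears every TX-disable bit.

```python
class FakeModule:
    """Hypothetical stand-in for a transceiver module with per-channel TX-disable bits."""

    def __init__(self, num_channels=8):
        self.num_channels = num_channels
        self.tx_disable_mask = 0  # bit i set => TX disabled on channel i

    def tx_disable_channel(self, channel_mask, disable):
        # Set or clear the TX-disable bits selected by channel_mask.
        if disable:
            self.tx_disable_mask |= channel_mask
        else:
            self.tx_disable_mask &= ~channel_mask

    def link_oper_up(self):
        # Assumption: the link is up only while no channel is TX-disabled.
        return self.tx_disable_mask == 0


class FakeCmisManager:
    """Hypothetical stand-in for the production task that re-provisions on link down."""

    def __init__(self, module):
        self.module = module
        self.enabled = True

    def on_link_event(self):
        # Assumption: re-provisioning re-enables TX on every channel.
        if self.enabled and not self.module.link_oper_up():
            self.module.tx_disable_mask = 0


def walk_channels(module, manager):
    """Disable each channel in turn, then verify it; return channels whose check failed."""
    failed = []
    for channel in range(module.num_channels):
        bit = 1 << channel
        module.tx_disable_channel(bit, True)
        manager.on_link_event()  # the manager reacts before the test reads back
        if not module.tx_disable_mask & bit:
            failed.append(channel)
    return failed
```

In this model, every channel check fails while the manager is enabled, and none fails once `manager.enabled` is set to `False` before the walk.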

The DUT syslog snapshot below captures the resource contention in play here for SFP index 23 (AKA Ethernet176) at test failure time.

[image: DUT syslog snapshot]

Results you expected to see

Before transceiver module tests are run, the production PMON Xcvrd CMIS Manager task (at least) should first be cleanly quiesced or disabled, so that the test code and the production CMIS Manager code, each unaware of the other, do not contend for the same module. SFF Manager should be handled similarly.
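One possible shape for this, shown only as a sketch: a context manager that stops the production task before the test body runs and always restarts it afterwards, even if the test raises. The name `quiesced` and the `stop`/`start` callables are hypothetical. The actual mechanism for pausing the CMIS Manager or SFF Manager on a DUT is not specified here and would need to be chosen separately.

```python
from contextlib import contextmanager


@contextmanager
def quiesced(stop, start):
    """Run the enclosed block with the production task stopped.

    stop and start are hypothetical callables supplied by the caller;
    they wrap whatever mechanism is chosen to pause and resume the task.
    """
    stop()
    try:
        yield
    finally:
        # Always resume the production task, even when the test body raises.
        start()
```

A module test would then run inside `with quiesced(stop_manager, start_manager):`, where both callables are supplied by the test framework.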

Is it platform specific

generic

Relevant log output

Output of show version

No response

Attach files (if any)

No response

snider-nokia commented 2 months ago

@judyjoseph, @prgeor, @vikshaw-Nokia