sonic-net / sonic-swss

SONiC Switch State Service (SwSS)
https://azure.github.io/SONiC

Do not apply QoS mapping object until it is resolved #3163

Closed stephenxs closed 3 months ago

stephenxs commented 3 months ago

What I did

Do not apply the global DSCP to TC map to the switch object until the mapping object has been created.

Why I did it

Fix the following issue: if orchagent handles the tables in the order below, step 1 fails because the DSCP to TC map it references has not been created yet, and the configuration is never applied.

  1. PORT_QOS_MAP|global object
  2. DSCP_TO_TC_MAP object
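
The ordering problem and the intended behavior can be illustrated with a toy model. This is only a minimal sketch, not the orchagent implementation: the class `ToyQosOrch`, its methods, the field name `dscp_to_tc_map`, and the map name `AZURE` are assumptions made for illustration. A task that references a map which does not exist yet stays pending and is retried once a later task creates the map.

```python
from collections import OrderedDict


class ToyQosOrch:
    """Toy model (hypothetical, not orchagent code) of deferring a task
    until the object it references has been created."""

    def __init__(self):
        self.dscp_to_tc_maps = {}      # map name -> mapping created so far
        self.switch_dscp_to_tc = None  # name of the map applied to the switch
        self.pending = OrderedDict()   # tasks kept for retry, keyed by (table, key)

    def handle(self, table, key, fields):
        """Queue one task, then retry everything that is still pending."""
        self.pending[(table, key)] = fields
        self._drain()

    def _drain(self):
        # Keep sweeping the pending tasks until a full pass makes no progress.
        progressed = True
        while progressed:
            progressed = False
            for (table, key), fields in list(self.pending.items()):
                if self._try_apply(table, key, fields):
                    del self.pending[(table, key)]
                    progressed = True

    def _try_apply(self, table, key, fields):
        if table == "DSCP_TO_TC_MAP":
            # Creating the mapping object always succeeds in this sketch.
            self.dscp_to_tc_maps[key] = dict(fields)
            return True
        if table == "PORT_QOS_MAP" and key == "global":
            name = fields["dscp_to_tc_map"]  # assumed field name
            if name not in self.dscp_to_tc_maps:
                return False  # not resolved yet: keep the task and retry later
            self.switch_dscp_to_tc = name
            return True
        return True  # other tables are ignored in this sketch


orch = ToyQosOrch()
# Step 1 arrives first: the referenced map does not exist, so nothing is applied.
orch.handle("PORT_QOS_MAP", "global", {"dscp_to_tc_map": "AZURE"})
# Step 2 creates the map; the pending global task is then retried and applied.
orch.handle("DSCP_TO_TC_MAP", "AZURE", {"0": "1"})
```

In this model the global entry is never dropped on the first attempt, so the final state is the same regardless of the order in which the two tables are handled.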

How I verified it

Mock tests and manual testing.

Details if related

stephenxs commented 3 months ago

The vs test failure is unrelated to my change.

test_chassis_system_lag_id_allocator_table_full failed (1 runs remaining out of 2).
    <class 'AssertionError'>
    LAG ID allocator table full error is not returned
assert '0' == '1'
  - 0
  + 1
    [<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:695>]
test_chassis_system_lag_id_allocator_table_full failed; it passed 0 out of the required 1 times.
    <class 'AssertionError'>
    LAG ID allocator table full error is not returned
assert '0' == '1'
  - 0
  + 1
    [<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:695>]
test_chassis_system_lag_id_allocator_del_id failed (1 runs remaining out of 2).
    <class 'AssertionError'>
    Unexpected number of keys: expected=1, received=2 (('oid:0x200000000098c', 'oid:0x200000000098b')), table="ASIC_STATE:SAI_OBJECT_TYPE_LAG"
    [<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:778>, <TracebackEntry /agent/_work/1/s/tests/dvslib/dvs_database.py:402>]
test_chassis_system_lag_id_allocator_del_id failed; it passed 0 out of the required 1 times.
    <class 'AssertionError'>
    Unexpected number of keys: expected=1, received=0 ([]), table="ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER"
    [<TracebackEntry /agent/_work/1/s/tests/test_virtual_chassis.py:763>, <TracebackEntry /agent/_work/1/s/tests/dvslib/dvs_database.py:402>]
test_chassis_add_remove_ports passed 1 out of the required 1 times. Success!
test_voq_egress_queue_counter passed 1 out of the required 1 times. Success!
test_chassis_wred_profile_on_system_ports passed 1 out of the required 1 times. Success!
test_nonflaky_dummy passed 1 out of the required 1 times. Success!

stephenxs commented 3 months ago

The vs job failed while installing .NET Core.

Hit:6 http://security.ubuntu.com/ubuntu focal-security InRelease
Fetched 3632 B in 1s (5025 B/s)
Reading package lists...
+ sudo apt-get install -y dotnet-sdk-7.0
Reading package lists...
##[debug]Agent environment resources - Disk: / Available 21392.00 MB out of 29598.00 MB, Memory: Used 1360.00 MB out of 32114.00 MB, CPU: Usage 10.12%
Building dependency tree...
Reading state information...
dotnet-sdk-7.0 is already the newest version (7.0.410-1).
0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded.
+ sudo dotnet tool install dotnet-reportgenerator-globaltool --tool-path /usr/bin
Tool 'dotnet-reportgenerator-globaltool' is already installed.

##[debug]Exit code 1 received from tool '/usr/bin/bash'
##[debug]STDIO streams have closed for tool '/usr/bin/bash'
##[error]Bash exited with code '1'.
##[debug]Processed: ##vso[task.issue type=error;source=TaskInternal;]Bash exited with code '1'.
##[debug]task result: Failed
##[debug]Processed: ##vso[task.complete result=Failed;done=true;]

bingwang-ms commented 3 months ago

The change LGTM. Just wondering how the issue was triggered (PORT_QOS_MAP|global before DSCP_TO_TC_MAP)?

stephenxs commented 3 months ago

> The change LGTM. Just wondering how the issue was triggered (PORT_QOS_MAP|global before DSCP_TO_TC_MAP)?

In theory, the order in which updates reach orchagent from the Redis DB is not guaranteed. We observed the issue only once in regression testing, so I believe it occurred by chance. However, it can be reproduced by setting PORT_QOS_MAP first and then the QoS mapping, with a delay in between.

mssonicbld commented 3 months ago

Cherry-pick PR to 202311: https://github.com/sonic-net/sonic-swss/pull/3184