sonic-net / sonic-sairedis

SAI object interface to Redis database, as used in the SONiC project
Other
56 stars 259 forks source link

Fix flow counter out-of-order issue by notifying counter operations using SelectableChannel #1362

Closed stephenxs closed 4 months ago

stephenxs commented 5 months ago

What I did

Fix flow counter out-of-order issue by notifying counter operations using SelectableChannel

Signed-off-by: Stephen Sun stephens@nvidia.com

Why I did it

Currently, the operations of SAI objects and their counters (if any) are triggered by different channels, which introduces racing conditions:

The new solution is to extend SAI redis attributes on the SAI_SWITCH_OBJECT to notify counter polling. As a result, all the objects and their counters are notified using a unified channel, which is the SelectableChannel.

How I verified it

Unit test Manual test Regressions test

Details if related

There are two SAI Redis attributes introduced as below. There are some fields with const char * type for each attribute. Passing a field as nullptr means not to change it.

Both SAI attributes are terminated by the RedisRemoteSaiInterface object in the swss context, which serializes the SAI API call into the selectable channel.

The Syncd will call flex counter functions to handle them on receiving the above-extended commands (representing both SAI extended attributes).

Gearbox flex counter database Pass the Phy OID, an OID of a SAI switch object in syntax, when calling the SAI set API to set the extended attributes. By doing so, the SAI redis objects can choose in which context the SAI API call should be invoked and the corresponding gearbox syncd docker container will handle it. (ps: THE ORIGINAL GEARBOX FLEX COUNTER IMPLEMENTATION IS BUGGY)

Context and critical section analysis It does not change the critical section hierarchy

Performance analysis The counter operations are handled in the same thread in both the new and old solutions. In swss, the counter operation was asynchronous in the old solution and is synchronous now, which can introduce a bit more latency. However, as the number of counter operations is small, no performance degradation is observed.

kcudnik commented 5 months ago

please satisfy code coverage test

stephenxs commented 4 months ago

Hi @kcudnik can you help to merge it? thanks