Open alpeshspatel opened 2 months ago
Analysis:
SONiC calculates Rx/Tx bps, pps, and link utilization as an exponentially weighted moving average, by default with a 10-second smoothing interval and counter polling every 1 second.
The script that calculates the rate is at: https://github.com/sonic-net/sonic-swss/blob/master/orchagent/port_rates.lua
At a line rate of 400 Gbps, the reported utilization cannot reach 100%, because the utilization is computed from L2 rates rather than L1 rates and therefore excludes the interframe gap of 0.24 ns.
At 400 Gbps, a 0.24 ns gap is equivalent to 12 bytes of overhead per packet, in addition to the 7-byte preamble and the 1-byte start-of-frame delimiter (Ref: Interpacket gap).
Thus, for a 400G interface carrying 8192-byte packets, the maximum Rx Util is: 8192/(8192+20) ≈ 99.756%
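The arithmetic above can be checked with a short Python sketch. The function names and constants are illustrative and are not taken from SONiC. The sketch assumes the 20 bytes of per-packet overhead described above (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte interpacket gap).

```python
# Per-packet L1 overhead that is not counted in the L2 octet counters.
PREAMBLE_BYTES = 7
SFD_BYTES = 1
IPG_BYTES = 12


def ipg_time_ns(line_rate_bps: float) -> float:
    """Time on the wire taken by the 12-byte interpacket gap, in nanoseconds."""
    return IPG_BYTES * 8 / line_rate_bps * 1e9


def max_l2_utilization(packet_bytes: int) -> float:
    """Highest L2 utilization (0..1) reachable at line rate for a given packet size."""
    overhead = PREAMBLE_BYTES + SFD_BYTES + IPG_BYTES
    return packet_bytes / (packet_bytes + overhead)


if __name__ == "__main__":
    print(f"IPG at 400G: {ipg_time_ns(400e9):.2f} ns")
    print(f"Max L2 util, 8192-byte packets: {max_l2_utilization(8192) * 100:.3f}%")
```

Smaller packets lower the ceiling further, since the fixed 20-byte overhead is a larger share of each packet.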
SONiC logic:
Default values of parameters (configurable)
Counterpoll interval: 1 sec
Smoothing interval for port: 10 seconds (config rate smoothing-interval)
Calculated values
port_alpha = 2.0/(smooth_interval + 1) = 2.0/11 ≈ 0.18, using the default smooth_interval of 10 seconds
The (rx) bps is calculated using the moving average as:
new_rx_bps = (current SAI_PORT_STAT_IF_IN_OCTETS - previous SAI_PORT_STAT_IF_IN_OCTETS ) / counter_poll_interval_sec * 8 # read every counter_poll interval
retrieve old_rx_bps from last iteration
rx_bps = port_alpha * new_rx_bps + (1 - port_alpha) * old_rx_bps
Save the calculated value as old_rx_bps for next run
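The steps above can be sketched in Python as follows. This is a minimal illustration of the listed formulas, not a port of port_rates.lua. The function names are hypothetical, and the handling of the very first sample (where no old_rx_bps exists yet) is left out.

```python
def port_alpha(smooth_interval_sec: float) -> float:
    """Smoothing factor derived from the smoothing interval."""
    return 2.0 / (smooth_interval_sec + 1)


def update_rx_bps(curr_octets: int, prev_octets: int,
                  counter_poll_interval_sec: float,
                  old_rx_bps: float, alpha: float) -> float:
    """One moving-average step for the Rx bit rate.

    curr_octets / prev_octets stand for two successive reads of
    SAI_PORT_STAT_IF_IN_OCTETS, one poll interval apart.
    """
    new_rx_bps = (curr_octets - prev_octets) / counter_poll_interval_sec * 8
    return alpha * new_rx_bps + (1 - alpha) * old_rx_bps


if __name__ == "__main__":
    alpha = port_alpha(10)   # default 10-second smoothing interval
    rx_bps = 0.0             # assumed starting value for this illustration
    prev = 0
    # Hypothetical counter reads: a steady 50e9 octets per 1-second poll (400 Gbps).
    for curr in (50_000_000_000 * i for i in range(1, 6)):
        rx_bps = update_rx_bps(curr, prev, 1.0, rx_bps, alpha)
        prev = curr
        print(f"rx_bps = {rx_bps / 1e9:.1f} Gbps")
```

With a smoothing interval of 1 second, port_alpha is 1 and the old value is discarded entirely, so rx_bps equals the latest sample.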
Accuracy can be changed by config tuning:
Tune port_alpha (increase the smoothing interval to reduce recency bias; decrease the smoothing interval to increase recency bias)
Tune counter_poll interval
A larger interval averages out the traffic and does not show the peak/instantaneous rates
A small interval (e.g. 1 second) causes more variability, because a slight drift in polling time has a larger impact
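To illustrate why polling drift matters more at short intervals, here is a rough Python sketch under a simplifying assumption: the counter delta covers the actual elapsed time (nominal interval plus drift), while the rate is computed by dividing by the nominal interval. The 50 ms drift is a hypothetical value chosen for illustration, not a measurement, and the function name is not from SONiC.

```python
def apparent_utilization(true_util: float,
                         nominal_interval_sec: float,
                         drift_sec: float) -> float:
    """Utilization a single poll would report if it fires drift_sec late
    (positive) or early (negative), under the model described above."""
    actual_elapsed = nominal_interval_sec + drift_sec
    return true_util * actual_elapsed / nominal_interval_sec


if __name__ == "__main__":
    true_util = 8192 / (8192 + 20)   # theoretical L2 ceiling from the analysis
    drift = 0.05                     # hypothetical 50 ms polling drift
    for interval in (1.0, 10.0):
        late = apparent_utilization(true_util, interval, drift)
        early = apparent_utilization(true_util, interval, -drift)
        print(f"interval {interval:>4.0f} s: {early * 100:.2f}% .. {late * 100:.2f}%")
```

Under this model the same absolute drift produces a relative error ten times smaller at a 10-second interval than at a 1-second interval.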
Some examples:
counterpoll port interval of 1 second, smoothing interval of 10 seconds (port alpha 0.18): RX_UTIL oscillates in [83%, 107%]
counterpoll port interval of 10 seconds, smoothing interval of 1 second (port alpha 1): RX_UTIL oscillates in [99.74%, 99.79%]. This is very close to the theoretical value of about 99.756%.
viz: @neethajohn @kevinwangsk
@prsunny to follow up. @alpeshspatel, are you seeing this only with 400G interfaces and 100% util?
Description
On platform(s) with 400G interfaces, show interface counters sometimes shows port utilization above 100%
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
See the analysis below.