sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

RX and TX util sometimes shows util values above 100% #19644

Open alpeshspatel opened 2 months ago

alpeshspatel commented 2 months ago

Description

On platforms with 400G interfaces, show interface counters sometimes reports port utilization above 100%.

$ show interface counters | grep -v "0.00 B/s"

Last cached time was 2024-02-27T22:49:51.688583

      IFACE    STATE          RX_OK         RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR          TX_OK         TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR

-----------  -------  -------------  -------------  ---------  --------  --------  --------  -------------  -------------  ---------  --------  --------  --------

  Ethernet0        U  1,910,629,135  50905.16 MB/s    101.81%         0         1         0  1,910,628,891  50908.28 MB/s    101.82%         0         0         0

  Ethernet4        U  1,910,616,855  51016.26 MB/s    102.03%         0         1         0  1,910,616,871  51019.40 MB/s    102.04%         0         0         0

  Ethernet8        U  1,910,614,764  51227.79 MB/s    102.46%         0         1         0  1,910,615,069  51230.94 MB/s    102.46%         0         0         0

 Ethernet12        U  1,910,613,326  51102.78 MB/s    102.21%         0         1         0  1,910,613,346  51105.92 MB/s    102.21%         0         0         0

 Ethernet16        U  1,910,620,738  51110.81 MB/s    102.22%         0         1         0  1,910,620,757  51113.95 MB/s    102.23%         0         0         0

 Ethernet20        U  1,910,618,653  51110.10 MB/s    102.22%         0         1         0  1,910,618,490  51113.22 MB/s    102.23%         0         0         0

 Ethernet32        U  1,910,757,295  51809.41 MB/s    103.62%         0         1         0  1,910,757,568  51812.59 MB/s    103.63%         0         0         0

 Ethernet36        U  1,910,755,618  51401.27 MB/s    102.80%         0         1         0  1,910,755,633  51404.42 MB/s    102.81%         0         0         0

 Ethernet40        U  1,910,754,215  51366.47 MB/s    102.73%         0         1         0  1,910,753,951  51369.62 MB/s    102.74%         0         0         0

 Ethernet44        U  1,910,751,846  51338.15 MB/s    102.68%         0         1         0  1,910,751,866  51341.30 MB/s    102.68%         0         0         0

 Ethernet48        U  1,910,749,918  51330.52 MB/s    102.66%         0         1         0  1,910,749,936  51333.68 MB/s    102.67%         0         0         0

 Ethernet52        U  1,910,749,525  51320.27 MB/s    102.64%         0         1         0  1,910,749,725  51323.43 MB/s    102.65%         0         0         0

Ethernet192        U  1,910,581,369  53512.57 MB/s    107.03%         0         1         0  1,910,581,436  53515.86 MB/s    107.03%         0         0         0

Ethernet196        U  1,910,579,936  53509.45 MB/s    107.02%         0         1         0  1,910,579,993  53512.74 MB/s    107.03%         0         0         0

Ethernet208        U  1,910,576,076  53521.53 MB/s    107.04%         0         1         0  1,910,576,065  53524.81 MB/s    107.05%         0         0         0

Ethernet212        U  1,910,574,407  53518.76 MB/s    107.04%         0         1         0  1,910,574,424  53522.05 MB/s    107.04%         0         0         0

Ethernet216        U  1,910,572,597  53516.62 MB/s    107.03%         0         1         0  1,910,572,622  53519.91 MB/s    107.04%         0         0         0

Ethernet220        U  1,910,570,916  53496.57 MB/s    106.99%         0         1         0  1,910,570,933  53499.85 MB/s    107.00%         0         0         0

Ethernet224        U  1,910,568,758  53487.16 MB/s    106.97%         0         1         0  1,910,568,722  53490.44 MB/s    106.98%         0         0         0

Ethernet228        U  1,910,566,892  53484.85 MB/s    106.97%         0         1         0  1,910,566,876  53488.13 MB/s    106.98%         0         0         0

Ethernet240        U  1,910,562,214  53477.48 MB/s    106.95%         0         1         0  1,910,562,263  53480.76 MB/s    106.96%         0         0         0

Ethernet244        U  1,910,560,539  53472.34 MB/s    106.94%         0         1         0  1,910,560,555  53475.62 MB/s    106.95%         0         0         0

Ethernet248        U  1,910,558,880  53469.98 MB/s    106.94%         0         1         0  1,910,558,889  53473.26 MB/s    106.95%         0         0         0

Ethernet252        U  1,910,557,061  53467.93 MB/s    106.94%         0         1         0  1,910,557,084  53471.21 MB/s    106.94%         0         0         0

Steps to reproduce the issue:

  1. Send a continuous stream of traffic at line rate and periodically run show interface counters

Describe the results you received:

  1. See the output above

Describe the results you expected:

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

See the analysis below.

alpeshspatel commented 2 months ago

Analysis:

SONiC calculates Rx/Tx bps, pps, and link utilization as an exponentially weighted moving average: by default, counters are polled every 1 second and smoothed over a 10 second interval.

The script that calculates the rate is at: https://github.com/sonic-net/sonic-swss/blob/master/orchagent/port_rates.lua

At a line rate of 400 Gbps, once the interframe gap of 0.24 ns is accounted for, the reported utilization cannot reach 100%, because the utilization is computed from L2 rates, not L1 rates.

The 0.24 ns gap is equivalent to 12 bytes of overhead per packet (96 bits / 400 Gbps = 0.24 ns), in addition to the 7-byte preamble and the 1-byte start-of-frame delimiter. Ref: Interpacket gap

Thus, for a 400G interface carrying 8192-byte frames, the Rx util can be at most: 8192/(8192+20) ≈ 99.756%, i.e. roughly 99.75%
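The arithmetic above can be checked with a short Python sketch. The 8192-byte frame size is taken from the formula in this comment and is an assumption about the test traffic; the function names are made up for illustration and are not part of SONiC.

```python
# Per-packet L1 overhead on Ethernet, in bytes.
IPG_BYTES = 12        # interpacket gap
PREAMBLE_BYTES = 7    # preamble
SFD_BYTES = 1         # start-of-frame delimiter


def ipg_duration_ns(line_rate_bps: float) -> float:
    """Time on the wire taken by the 12-byte interpacket gap, in nanoseconds."""
    return IPG_BYTES * 8 / line_rate_bps * 1e9


def max_l2_utilization(frame_bytes: int) -> float:
    """Largest L2 utilization (0..1) reachable at L1 line rate for a given frame size."""
    overhead = IPG_BYTES + PREAMBLE_BYTES + SFD_BYTES
    return frame_bytes / (frame_bytes + overhead)


if __name__ == "__main__":
    print(f"IPG at 400G: {ipg_duration_ns(400e9):.2f} ns")
    print(f"Max L2 util, 8192-byte frames: {max_l2_utilization(8192) * 100:.3f}%")
```

Smaller frames lower the ceiling further, since the 20 bytes of overhead are fixed per packet.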

SONiC logic:

Default values of parameters (configurable)

Counterpoll interval: 1 sec

Smoothing interval for port: 10 seconds (config rate smoothing-interval port), determines the weight in calculating averages

Calculated values :

port_alpha = 2.0/(smooth_interval + 1) ≈ 0.18, using the default smooth_interval of 10 seconds

The (rx) bps is calculated using the moving average as:

new_rx_bps = (current SAI_PORT_STAT_IF_IN_OCTETS - previous SAI_PORT_STAT_IF_IN_OCTETS ) / counter_poll_interval_sec * 8 # read every counter_poll interval

retrieve old_rx_bps from last iteration

rx_bps = port_alpha * new_rx_bps + (1 - port_alpha) * old_rx_bps

Save the calculated value as old_rx_bps for next run
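The update described above can be sketched in Python as follows. This is a paraphrase of the formulas in this comment, not a port of port_rates.lua; the function and variable names are chosen for illustration, and the starting value of 0 for old_rx_bps is an assumption.

```python
def port_alpha(smooth_interval_sec: float) -> float:
    """Weight given to the newest sample, derived from the smoothing interval."""
    return 2.0 / (smooth_interval_sec + 1)


def update_rx_bps(old_rx_bps: float, curr_octets: int, prev_octets: int,
                  poll_interval_sec: float, alpha: float) -> float:
    """One moving-average step, following the formula in the comment above."""
    # Rate over the last poll interval, computed from the octet counter delta.
    new_rx_bps = (curr_octets - prev_octets) / poll_interval_sec * 8
    # Blend the newest sample with the previously saved average.
    return alpha * new_rx_bps + (1 - alpha) * old_rx_bps


if __name__ == "__main__":
    alpha = port_alpha(10)              # default smoothing interval
    print(f"port_alpha = {alpha:.2f}")
    rx_bps = 0.0                        # assumed starting value
    octets = 0
    for second in range(1, 31):
        prev, octets = octets, octets + 1_000_000   # steady 1 MB per poll
        rx_bps = update_rx_bps(rx_bps, octets, prev, 1, alpha)
        if second % 10 == 0:
            print(f"t={second:2d}s rx_bps={rx_bps:,.0f}")
```

With a steady input, the average approaches the true rate gradually, which is the recency trade-off controlled by port_alpha.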

Accuracy can be changed by config tuning:

Tune port_alpha (increase the smoothing interval to reduce recency bias; decrease the smoothing interval to increase recency bias)

Tune counter_poll interval

a larger interval averages out the traffic and does not show the peak/instantaneous rates

a small interval (e.g. 1 second) causes more variability, as a slight drift in polling time has a larger impact
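To illustrate why a short poll interval is more sensitive to timing drift, here is a minimal Python sketch. It assumes the rate is obtained by dividing the counter delta by the nominal poll interval, as the formula above suggests, so a poll that actually lands late accumulates extra bytes. The 70 ms drift is an arbitrary illustrative figure, not a measured one, and the function name is hypothetical.

```python
def apparent_utilization(true_util_pct: float, poll_interval_sec: float,
                         timing_error_sec: float) -> float:
    """Utilization of a single sample when the bytes accumulated over
    (interval + error) seconds are divided by the nominal interval."""
    actual_window = poll_interval_sec + timing_error_sec
    return true_util_pct * actual_window / poll_interval_sec


if __name__ == "__main__":
    true_util = 99.75       # theoretical L2 ceiling from the analysis above
    drift = 0.070           # assumed 70 ms late poll, for illustration only
    for interval in (1, 10):
        sample = apparent_utilization(true_util, interval, drift)
        print(f"interval={interval:2d}s single-sample util={sample:.2f}%")
```

The same absolute drift is a ten times smaller fraction of a 10 second window than of a 1 second window, so the per-sample error shrinks accordingly. The smoothing step then dampens, but does not remove, whatever per-sample error remains.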

Some examples:

counterpoll port interval of 1 second, smoothing interval of 10 seconds (port alpha 0.18): RX_UTIL oscillates in [83%, 107%]

counterpoll port interval of 10 seconds, smoothing interval of 1 second (port alpha 1): RX_UTIL oscillates in [99.74%, 99.79%]. This is very close to the 99.75% theoretical value.

alpeshspatel commented 2 months ago

viz: @neethajohn @kevinwangsk

judyjoseph commented 1 month ago

@prsunny to follow up. @alpeshspatel, are you seeing this only with 400G interfaces and 100% util?