sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
711 stars 1.36k forks source link

Cached counter file is not cleared during config reload, causing portstat give wrong counter values #9817

Closed stephenxs closed 2 years ago

stephenxs commented 2 years ago

Description

Steps to reproduce the issue:

  1. send some traffic to generate a non-zero counter value. let's say the amount of traffic is A. The counter content is:
    root@arc-mtbc-1001:/tmp/portstat-0# portstat 
      IFACE    STATE    RX_OK      RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
    -----------  -------  -------  ----------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
    Ethernet0        X        0    0.00 B/s      0.00%         0         0         0        0    0.00 B/s      0.00%         0         0         0
    Ethernet4        U      159    0.83 B/s      0.00%         0         0         0    1,432   14.85 B/s      0.00%         0         0         0
    Ethernet8        U      159    0.83 B/s      0.00%         0         0         0    1,433   14.85 B/s      0.00%         0         0         0
    Ethernet12        U      159    0.83 B/s      0.00%         0         0         0    1,433   14.90 B/s      0.00%         0         0         0
    Ethernet16        U      159    0.83 B/s      0.00%         0         0         0    1,431   14.90 B/s      0.00%         0         0         0
    Ethernet20        U      159    0.83 B/s      0.00%         0         0         0    1,431   14.90 B/s      0.00%         0         0         0
    Ethernet24        U      159    0.83 B/s      0.00%         0         0         0    1,433   14.90 B/s      0.00%         0         0         0
    Ethernet28        U      159    0.83 B/s      0.00%         0         0         0    1,434   14.90 B/s      0.00%         0         0         0
  2. portstat -c to clear counter
  3. config reload -y
  4. send some traffic to generate a non-zero counter value. let's say the amount of traffic is B. B should be less than A
  5. portstat The content of admin user is like this (which imo is the correct value since we didn't clear counter on admin)
    admin@arc-mtbc-1001:~$ portstat 
      IFACE    STATE    RX_OK      RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
    -----------  -------  -------  ----------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
    Ethernet0        X        0    0.00 B/s      0.00%         0         0         0        0    0.00 B/s      0.00%         0         0         0
    Ethernet4        U        3    0.00 B/s      0.00%         0         0         0       29    3.53 B/s      0.00%         0         0         0
    Ethernet8        U        3    0.00 B/s      0.00%         0         0         0       29    3.53 B/s      0.00%         0         0         0
    Ethernet12        U        3    0.00 B/s      0.00%         0         0         0       29    4.31 B/s      0.00%         0         0         0
    Ethernet16        U        3    0.00 B/s      0.00%         0         0         0       29    4.32 B/s      0.00%         0         0         0
    Ethernet20        U        3    0.00 B/s      0.00%         0         0         0       29    3.54 B/s      0.00%         0         0         0
    Ethernet24        U        3    0.00 B/s      0.00%         0         0         0       29    9.56 B/s      0.00%         0         0         0

    However, for root user the output is:

    root@arc-mtbc-1001:/home/admin# portstat 
    Last cached time was 2022-01-20 14:44:31.430297
      IFACE    STATE    RX_OK      RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
    -----------  -------  -------  ----------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
    Ethernet0        X        0    0.00 B/s      0.00%         0         0         0        0    0.00 B/s      0.00%         0         0         0
    Ethernet4        U        0    0.00 B/s      0.00%         0         0         0        0   14.16 B/s      0.00%         0         0         0
    Ethernet8        U        0    0.00 B/s      0.00%         0         0         0        0   14.16 B/s      0.00%         0         0         0
    Ethernet12        U        0    0.00 B/s      0.00%         0         0         0        0   17.31 B/s      0.00%         0         0         0
    Ethernet16        U        0    0.00 B/s      0.00%         0         0         0        0   17.34 B/s      0.00%         0         0         0
    Ethernet20        U        0    0.00 B/s      0.00%         0         0         0        0   14.19 B/s      0.00%         0         0         0
    Ethernet24        U        0    0.00 B/s      0.00%         0         0         0        0   38.34 B/s      0.00%         0         0         0

We observe the issue on 202012 but I think it should be reproduced on other branches as well.

Describe the results you received:

In portstat in step 5, all counter contains 0, which is wrong since traffic has been sent.

Describe the results you expected:

In portstat in step 5, counters should contain the correct value.

Output of show version:

root@arc-mtbc-1001:/home/admin# show version

SONiC Software Version: SONiC.202012.208-8e924b9a7_Internal
Distribution: Debian 10.11
Kernel: 4.19.0-12-2-amd64
Build commit: 8e924b9a7
Build date: Wed Jan 19 10:16:34 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci02-244

Platform: x86_64-mlnx_msn2410-r0
HwSKU: ACS-MSN2410
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1848K10624
Uptime: 15:17:27 up  6:51,  1 user,  load average: 2.41, 3.80, 2.68

Docker images:
REPOSITORY                    TAG                                     IMAGE ID            SIZE
docker-syncd-mlnx             202012.208-8e924b9a7_Internal           91e0857fe0eb        970MB
docker-syncd-mlnx             latest                                  91e0857fe0eb        970MB
docker-platform-monitor       202012.208-8e924b9a7_Internal           bf71face5dd7        685MB
docker-platform-monitor       latest                                  bf71face5dd7        685MB
docker-snmp                   202012.208-8e924b9a7_Internal           c56ccfac940b        427MB
docker-snmp                   latest                                  c56ccfac940b        427MB
docker-teamd                  202012.208-8e924b9a7_Internal           df70c714874b        395MB
docker-teamd                  latest                                  df70c714874b        395MB
docker-wjh                    202012.202012.0-dirty-20211230.065251   592ae65e7d63        520MB
docker-wjh                    latest                                  592ae65e7d63        520MB
docker-sflow                  202012.208-8e924b9a7_Internal           69fb819114b5        391MB
docker-sflow                  latest                                  69fb819114b5        391MB
docker-router-advertiser      202012.208-8e924b9a7_Internal           7c14699b69ca        380MB
docker-router-advertiser      latest                                  7c14699b69ca        380MB
docker-nat                    202012.208-8e924b9a7_Internal           d5d60742e3ce        393MB
docker-nat                    latest                                  d5d60742e3ce        393MB
docker-lldp                   202012.208-8e924b9a7_Internal           4086facc8fc3        420MB
docker-lldp                   latest                                  4086facc8fc3        420MB
docker-database               202012.208-8e924b9a7_Internal           eb326db92704        380MB
docker-database               latest                                  eb326db92704        380MB
docker-sonic-mgmt-framework   202012.208-8e924b9a7_Internal           ad183bf610a8        793MB
docker-sonic-mgmt-framework   latest                                  ad183bf610a8        793MB
docker-orchagent              202012.208-8e924b9a7_Internal           9fa6d1f72aa2        409MB
docker-orchagent              latest                                  9fa6d1f72aa2        409MB
docker-sonic-telemetry        202012.208-8e924b9a7_Internal           e03068e245e7        469MB
docker-sonic-telemetry        latest                                  e03068e245e7        469MB
docker-mux                    202012.208-8e924b9a7_Internal           7dcd0b71d7dd        432MB
docker-mux                    latest                                  7dcd0b71d7dd        432MB
docker-fpm-frr                202012.208-8e924b9a7_Internal           c280b5311557        409MB
docker-fpm-frr                latest                                  c280b5311557        409MB
docker-dhcp-relay             202012.208-8e924b9a7_Internal           edcf26f69c29        393MB
docker-dhcp-relay             latest                                  edcf26f69c29        393MB

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

In SONiC, the utilities like "portstat" show counter on a per-user basis especially after a user clears the counters.

It works like this: if a user clears counters, it caches the current values fetched from COUNTER_DB in the /tmp/ folder. Each time it is executed, it fetches the counters from COUNTER_DB, and then compares them with the cache, and then displays the difference. Eg. the cached counter is 10, and the counter in COUNTER_DB is 15, it will display 5 which is 15 - 10. However, when a user executes "config reload", the COUNTER_DB will be zeroed, which means the cached counter will be greater than the one fetched from COUNTER_DB. After that, the utilities can give wrong data.

sonic_dump_arc-mtbc-1001_20220120_151843.tar.gz

stephenxs commented 2 years ago

This kind of cached files should be cleared during config reload.

nazariig commented 2 years ago

The bug is also valid for ACL/PBH counters:

  1. /tmp/.pbh_counters.txt
  2. /tmp/.counters_acl.p
stepanblyschak commented 2 years ago

I see that for some counters (e.g dropstat) the cache file is cleared on config reload, code. This has to be done for every counter cache file, however there is another issue, e.g swss restart due to a crash, counters cache isn't cleared.

To solve this I propose:

  1. Have a single cache registry/directory where every application has to put their cache files. To do this, in utilities_common we add a UserCache class that provides a per-application, per-user directory in a global cache directory (e.g /tmp/cache/portstat/1000/, where 1000 is an UID of a user). Every stats application uses this API.
  2. swss.sh script flushing COUNTERS DB and after that also clears the entire /tmp/cache/. This should solve swss restarts as well.
nazariig commented 2 years ago

@stepanblyschak how about /var/cache/sonic/ path?