sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC

[DPB][FLEX Counters] Error seen in SDK reading counters for removed ports #19105

Open pavannaregundi opened 4 months ago

pavannaregundi commented 4 months ago

Description

Errors are seen in the SDK when reading the queue and buffer (priority group) counters of deleted ports after a dynamic port breakout (DPB) CLI execution.

2024 May 28 04:17:07.173224 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.173291 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 19.
2024 May 28 04:17:07.173291 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060013: -19
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 20.
2024 May 28 04:17:07.188896 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060014: -19
2024 May 28 04:17:07.188946 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.188946 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.188974 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 21.
2024 May 28 04:17:07.188974 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060015: -19
2024 May 28 04:17:07.188974 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.189002 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.189056 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 22.
2024 May 28 04:17:07.189056 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060016: -19
2024 May 28 04:17:07.189056 sonic ERR syncd#syncd: xpSaiQueue.c:130 Error: Entry not found |retVal: 0
2024 May 28 04:17:07.189086 sonic ERR syncd#syncd: xpSaiQueue.c:2355 Error: Queue does not exist, xpStatus: 23
2024 May 28 04:17:07.189136 sonic ERR syncd#syncd: xpSaiQueue.c:2550 Could not store the statistics for the port 6 queue 23.
2024 May 28 04:17:07.189136 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Queue Counter 0x15000000060017: -19
2024 May 28 04:17:09.879108 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477064
2024 May 28 04:17:09.879108 sonic ERR syncd#syncd: xpSaiBuffer.c:1806 Error: Ingress priority group entry does not exist: oid - 7318349394477064
2024 May 28 04:17:09.879108 sonic ERR syncd#syncd: xpSaiBuffer.c:4977 Error: Failed to get state data for ingress pg 0x1a000000000008, saiStatus: -7
2024 May 28 04:17:09.879176 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Priority Group Counter 0x1a000000000008: -7
2024 May 28 04:17:09.879176 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477065
2024 May 28 04:17:09.879176 sonic ERR syncd#syncd: xpSaiBuffer.c:1806 Error: Ingress priority group entry does not exist: oid - 7318349394477065
2024 May 28 04:17:09.879195 sonic ERR syncd#syncd: xpSaiBuffer.c:4977 Error: Failed to get state data for ingress pg 0x1a000000000009, saiStatus: -7
2024 May 28 04:17:09.879209 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Priority Group Counter 0x1a000000000009: -7
2024 May 28 04:17:09.882426 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477066
2024 May 28 04:17:09.882470 sonic ERR syncd#syncd: xpSaiBuffer.c:1806 Error: Ingress priority group entry does not exist: oid - 7318349394477066
2024 May 28 04:17:09.882470 sonic ERR syncd#syncd: xpSaiBuffer.c:4977 Error: Failed to get state data for ingress pg 0x1a00000000000a, saiStatus: -7
2024 May 28 04:17:09.882470 sonic ERR syncd#syncd: :- collectData: Failed to get stats of Priority Group Counter 0x1a00000000000a: -7
2024 May 28 04:17:09.882495 sonic ERR syncd#syncd: xpSaiBuffer.c:1855 Could not Get Ingress Priority Group Info for 7318349394477067

Steps to reproduce the issue:

  1. Collect redis db dumps for reference

    redis-dump -d 5 -y -o redis_flex_before_breakout.txt
    redis-dump -d 1 -y -o redis_asic_before_breakout.txt
  2. Run DPB with a breakout mode that removes more ports than it creates. The example below shows converting 4x100G to 1x400G.

 # config interface breakout Ethernet0 1x400G -v -f
Do you want to Breakout the port, continue? [y/N]: y

Running Breakout Mode : 4x100G
Target Breakout Mode : 1x400G

Ports to be deleted :
 {
    "Ethernet0": "100000",
    "Ethernet2": "100000",
    "Ethernet4": "100000",
    "Ethernet6": "100000"
}
Ports to be added :
 {
    "Ethernet0": "400000"
}
  3. Collect redis db dumps again

    redis-dump -d 5 -y -o redis_flex_after_breakout.txt
    redis-dump -d 1 -y -o redis_asic_after_breakout.txt
  4. Check syslog for errors (see the grep sketch below).
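
For step 4, one quick way to pull the relevant errors out of syslog (a sketch, assuming the default log location /var/log/syslog):

    # Filter syncd errors for deleted queues and ingress priority groups
    grep -E "Queue does not exist|Ingress priority group entry does not exist|Failed to get stats" /var/log/syslog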

Describe the results you received:

From the collected redis-dump outputs:

redis_asic_before_breakout.txt: entries for one of the original ports (Ethernet2, oid:0x10000000005a2) and the VID-to-RID mappings for its queues.

  "ASIC_STATE:SAI_OBJECT_TYPE_HOSTIF:oid:0xd000000002bb2": {
    "expireat": 1716868800.1728191,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "SAI_HOSTIF_ATTR_NAME": "Ethernet2",
      "SAI_HOSTIF_ATTR_OBJ_ID": "oid:0x10000000005a2",
      "SAI_HOSTIF_ATTR_OPER_STATUS": "false",
      "SAI_HOSTIF_ATTR_TYPE": "SAI_HOSTIF_TYPE_NETDEV"
    }
  },

"ASIC_STATE:SAI_OBJECT_TYPE_PORT:oid:0x10000000005a2": {
    "expireat": 1716868800.1954298,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "NULL": "NULL",
      "SAI_PORT_ATTR_ADMIN_STATE": "true",
      "SAI_PORT_ATTR_AUTO_NEG_MODE": "true",
      "SAI_PORT_ATTR_FEC_MODE": "SAI_PORT_FEC_MODE_RS",
      "SAI_PORT_ATTR_HW_LANE_LIST": "2:2,3",
      "SAI_PORT_ATTR_MTU": "9122",
      "SAI_PORT_ATTR_SPEED": "100000"
    }
  },

"VIDTORID": {
    "expireat": 1716868800.2920315,
    "ttl": -0.001,
    "type": "hash",
    "value": {

port:
      "oid:0x10000000005a2": "oid:0x1000000000002",

queue:
      "oid:0x150000000006b2": "oid:0x15000000020000",
      "oid:0x150000000006b3": "oid:0x15000000020001",
      "oid:0x150000000006b4": "oid:0x15000000020002",
      "oid:0x150000000006b5": "oid:0x15000000020003",
      "oid:0x150000000006b6": "oid:0x15000000020004",
      "oid:0x150000000006b7": "oid:0x15000000020005",
      "oid:0x150000000006b8": "oid:0x15000000020006",
      "oid:0x150000000006b9": "oid:0x15000000020007",

redis_flex_after_breakout.txt: stale entries remain in FLEX_COUNTER_DB for the queues mapped to the deleted ports (the same queue OIDs listed under VIDTORID above, e.g. oid:0x150000000006b2).

"FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b2": {
    "expireat": 1716870051.3594844,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b3": {
    "expireat": 1716870051.7402668,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

"FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b4": {
    "expireat": 1716870051.3640532,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b5": {
    "expireat": 1716870051.7292657,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b6": {
    "expireat": 1716870051.728027,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }
  },

  "FLEX_COUNTER_TABLE:QUEUE_STAT_COUNTER:oid:0x150000000006b7": {
    "expireat": 1716870051.3757164,
    "ttl": -0.001,
    "type": "hash",
    "value": {
      "QUEUE_COUNTER_ID_LIST": "SAI_QUEUE_STAT_DROPPED_BYTES,SAI_QUEUE_STAT_DROPPED_PACKETS,SAI_QUEUE_STAT_BYTES,SAI_QUEUE_STAT_PACKETS"
    }

Describe the results you expected:

No SDK errors after the breakout; the FLEX_COUNTER entries for the removed ports' queues and priority groups are expected to be removed along with the ports.

Output of show version:

Attached files: port_breakout_info.txt, redis_asic_after_breakout.txt, redis_asic_before_breakout.txt, redis_flex_after_breakout.txt, redis_flex_before_breakout.txt, syslog.txt


Output of show techsupport:


Additional information you deem important (e.g. issue happens only occasionally):

arlakshm commented 4 months ago

@dgsudharsan to start an offline discussion on the change needed in sairedis.

arlakshm commented 4 months ago

This PR has the fix for this issue: https://github.com/sonic-net/sonic-swss/pull/3076. Please retest with the latest master image.

pavannaregundi commented 4 months ago

> This PR has the fix for this issue: sonic-net/sonic-swss#3076. Please retest with the latest master image.

@arlakshm Thanks for your comment. Using the following master commit, https://github.com/sonic-net/sonic-buildimage/tree/a7ab698f1c7218b4ddc4db63c42918a8c3eb9eb4, I see that the above PR is already part of it.

dgsudharsan commented 4 months ago

@pavannaregundi From the internally attached PR I see the backref was missing and hence the queue removal didn't happen in the first place. The PR 3076 in SWSS addresses a different race condition which is a statistical issue. I believe we need the yang fix that is linked to this bug.

pavannaregundi commented 4 months ago

> @pavannaregundi From the internally attached PR I see the backref was missing and hence the queue removal didn't happen in the first place. The PR 3076 in SWSS addresses a different race condition which is a statistical issue. I believe we need the yang fix that is linked to this bug.

I directly patched the changes into /usr/local/yang-models/sonic-buffer-queue.yang on the SONiC switch and tried that change. It did not work either, so we are still checking it internally.

stephenxs commented 3 months ago

> @pavannaregundi From the internally attached PR I see the backref was missing and hence the queue removal didn't happen in the first place. The PR 3076 in SWSS addresses a different race condition which is a statistical issue. I believe we need the yang fix that is linked to this bug.

> I directly patched the changes into /usr/local/yang-models/sonic-buffer-queue.yang on the SONiC switch and tried that change. It did not work either, so we are still checking it internally.

Can you try configuring create_only_config_db_buffers in DEVICE_METADATA|localhost? I think it should work with that flag configured. Currently, DPB doesn't remove queue/PG counters after the port is removed if the flag is not configured.
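
For reference, a minimal sketch of setting the flag from the switch shell (the exact workflow may vary by image; a config save/reload may be needed for it to take effect):

    # Set the flag in CONFIG_DB under DEVICE_METADATA|localhost
    sonic-db-cli CONFIG_DB hset "DEVICE_METADATA|localhost" create_only_config_db_buffers true
    # Persist the change to /etc/sonic/config_db.json
    config save -y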

pavannaregundi commented 3 months ago

> @pavannaregundi From the internally attached PR I see the backref was missing and hence the queue removal didn't happen in the first place. The PR 3076 in SWSS addresses a different race condition which is a statistical issue. I believe we need the yang fix that is linked to this bug.

> I directly patched the changes into /usr/local/yang-models/sonic-buffer-queue.yang on the SONiC switch and tried that change. It did not work either, so we are still checking it internally.

> Can you try configuring create_only_config_db_buffers in DEVICE_METADATA|localhost? I think it should work with that flag configured. Currently, DPB doesn't remove queue/PG counters after the port is removed if the flag is not configured.

@stephenxs Thanks. We will try this and get back.

pavannaregundi commented 3 months ago

@stephenxs Adding create_only_config_db_buffers.json is working. However, I am not sure if this is how it is supposed to work: in general, if a port is removed from ASIC_DB, its FLEX_COUNTER entries should also be removed. Also, are there any other implications of using 'create_only_config_db_buffers'?

stephenxs commented 3 months ago

> @stephenxs Adding create_only_config_db_buffers.json is working. However, I am not sure if this is how it is supposed to work: in general, if a port is removed from ASIC_DB, its FLEX_COUNTER entries should also be removed. Also, are there any other implications of using 'create_only_config_db_buffers'?

Hi, the orchagent should remove the PG and queue counters when a port is removed; I think this is missing logic in the DPB feature. We fixed it partially for the case where create_only_config_db_buffers is true while fixing another issue, but in general we should expect it to be fixed by the owner of DPB, especially when the flag is not set. When create_only_config_db_buffers is set, counters are only created for queues/PGs that are configured in the BUFFER_QUEUE and BUFFER_PG tables.
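
For reference, a sketch of how to inspect those CONFIG_DB tables on the switch; the example key below is a placeholder, not taken from this issue:

    # Entries in these tables drive counter creation when create_only_config_db_buffers is set
    sonic-db-cli CONFIG_DB keys "BUFFER_QUEUE|*"
    sonic-db-cli CONFIG_DB keys "BUFFER_PG|*"
    # Show the buffer profile bound to a specific queue range (hypothetical key)
    sonic-db-cli CONFIG_DB hgetall "BUFFER_QUEUE|Ethernet0|3-4"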