sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
741 stars 1.44k forks source link

Buffer Queue Not Changed to Use Queue Data from Config #19624

Open wumiaont opened 4 months ago

wumiaont commented 4 months ago

Description

Issue is found when running Telemetry test case of test_telemetry_queue_buffer_cnt. This test sets "create_only_config_db_buffers" to true in config db, to create only relevant counters. Then remove one of the buffer queues. After that it Uses gnmi to query COUNTERS_QUEUE_NAME_MAP for Ethernet0 compare number of queue counters on Ethernet0. It is expected that it will less than previous count.

There were multi-asic support issues with the sonic-mgmt for test_telemetry_queue_buffer_cnt. After issues have been resolved on sonic-mgmt side it found that COUNTERS_QUEUE_NAME_MAP from COUNTERS_DB under proper namespace still uses maximum number of buffer queue SAI support (16) even when create_only_config_db_buffersis set to true in CONFIG file. This causes test failed.

For example: With multi-asic chassis: sonic-db-cli -n asic0 COUNTERS_DB hgetall "COUNTERS_QUEUE_NAME_MAP"

Steps to reproduce the issue:

1) On Multi-Asic chassis, set create_only_config_db_buffers in config to be true. "DEVICE_METADATA": { "localhost": { "create_only_config_db_buffers ": "true", ................................ } }

2) Reload the config. (sudo config reload -y)

3) After reload check COUNTERS_QUEUE_NAME_MAP sonic-db-cli -n asic0 COUNTER_DB hgetall COUNTERS_QUEUE_NAME_MAP.

You will see Ethernet0:0 to Ethernet0:15 exists. From config under BUFFER_QUEUE it's defined only from 0-6.

Describe the results you received:

test_telemetry_queue_buffer_cnt failed without able to detect COUNTERS_QUEUE_NAME_MAP changes with config change.

Describe the results you expected:

test_telemetry_queue_buffer_cnt successes

Output of show version:

SONiC Software Version: SONiC.HEAD.770825-nokia-master-470b90a084 SONiC OS Version: 12 Distribution: Debian 12.6 Kernel: 6.1.0-11-2-amd64 Build commit: 470b90a084 Build date: Tue Jul 16 17:31:07 UTC 2024 Built by: gitlab-runner@sonic-bld2

-->

wumiaont commented 3 months ago

It is found that GNMI db_client does support multi-asic handling. This issue is on the client side. Need to handle on the gnmi client to indicate which namespace. The fix will be on sonic-mgmt side. Close this one.

There's really one issue still exist for COUNTERS_QUEUE_NAME_MAP on COUNTERS_DB. From the design if create_only_config_db_buffers is set to true, the COUNTERS_QUEUE_NAME_MAP should get the data from config. Otherwise it's get from SAI maximum numbers of buffer queue support.

With current observation, no matter create_only_config_db_buffers is set to true or not, COUNTERS_QUEUE_NAME_MAP always contains maximum number of queues (16). This makes the telemetry test test_telemetry_queue_buffer_cnt fails even with the multi-asic support code added into sonic-mgmt.

judyjoseph commented 3 months ago

@qiluo-msft could you check @wumiaont finding here on the flag create_only_config_db_buffers.

qiluo-msft commented 3 months ago

@wumiaont Do you intend to close this issue?

wumiaont commented 3 months ago

@wumiaont Do you intend to close this issue?

I would like to use this one to track the COUNTERS_QUEUE_NAME_MAP issue mentioned in the comments. I will modify this Issue description to have more accurate info.

wumiaont commented 3 months ago

@qiluo-msft Please help to narrow down the issue to see what could be not right here.

wumiaont commented 3 months ago

Further analysis found that FlexCounterOrch need support for multi-asic/multi-linecard scenario. For example, in a multi-linecard and multi-asic scenario. The keys of BUFFER_QUEUE looks like this: [ixre-egl-board40|asic0|Ethernet24|5-6]. In a single card single asic scenario this buffer queue keys is like this: Ethernet24|5-6. There could be various different combinations such as single card multi asic | multi cards single asic etc. The class needs to handle those situation properly.

More of a concern is not just FlexCounterOrch. Looks to me Orchagent needs to support those above different scenarios from an infrastructure level.

wumiaont commented 3 months ago

Does not see the code logic is correct for PortsOrch::generateQueueMapPerPort(const Port& port, FlexCounterQueueStates& queuesState, bool voq). The port is coming from generateQueueMap() which is walking through all ports. queuesState contains configured data from BUFFER_QUEUE. But it seems this queuesState is not get used properly at all so no matter what you configured on BUFFER_QUEUE it would not take effect. @qiluo-msft

saksarav-nokia commented 3 months ago

let me take a look at it

saksarav-nokia commented 3 months ago

@wumiaont The way BUFFER_QUEUE works in Voq system is different than the pizza boxes. In Voq systems, there are voqs and egress queues. The voqs are fixed and set up during switch create. The flex counters for voq and egress queue are added to FLEX_COUNTER_DB when the flex_counter status is enabled via the following code path. FlexCounterOrch::doTask -> generateQueueMap -> generateQueueMapPerPort (true) -> addQueueFlexCountersPerPortPerQueueIndex

If you look at the code in BufferOrch::processQueue, we added the following code under ~voq check after the discussion with MSFT, Nokia and Arista.(https://github.com/sonic-net/sonic-swss/pull/3050) // create/remove a port queue counter for the queue buffer. // For VOQ chassis, flexcounterorch adds the Queue Counters for all egress and VOQ queues of all front panel and system ports // to the FLEX_COUNTER_DB irrespective of BUFFER_QUEUE configuration. So Port Queue counter needs to be updated only for non VOQ switch. else if (gMySwitchType != "voq") { auto flexCounterOrch = gDirectory.get<FlexCounterOrch*>(); auto queues = tokens[1]; if (op == SET_COMMAND && (flexCounterOrch->getQueueCountersState() || flexCounterOrch->getQueueWatermarkCountersState())) { gPortsOrch->createPortBufferQueueCounters(port, queues); } else if (op == DEL_COMMAND && (flexCounterOrch->getQueueCountersState() || flexCounterOrch->getQueueWatermarkCountersState())) { gPortsOrch->removePortBufferQueueCounters(port, queues); } }

So when the Port BUFFER_QUEUE config is added or removed, you will not see change in FLEX_COUNTER_DB and COUNTERS_DB. I believe Sanjay already updated the sonic-mgmt test to handle this. Please check with him

-Sakthi

saksarav-nokia commented 3 months ago

@sanjair-git

wumiaont commented 3 months ago

In this case test_telemetry_queue_buffer test won't apply when the switch type is voq. I will add check in test_telemetry_queue_buffer to check switch type and skip test if type is 'voq'.

wumiaont commented 3 months ago

@vmittal-msft When I am doing research on test failure of test_telemetry_queue_buffer_cnt, I found that the current swss portOrch code does not provide support for this. The test is trying to check when create_only_config_db_buffers is set to true in config, the relevant counters should be created based on the BUFFER_QUEUE in config. When I look into swss code to see why it's not working I found 2 issues.

The code flow I am checking is this: FlexCounterOrch::doTask --> getQueueConfigurations() (This is the only place to check if create_only_config_db_buffers is set or not, then get queueState info accordingly) ---> PortsOrch::generateQueueMap() ----> PortsOrch::generateQueueMapPerPort(const Port& port, FlexCounterQueueStates& queuesState, bool voq)

1) Issues with getQueueConfigurations(). Following code does not take multi line card /multi asic scenario into consideration. As in our Nokia 7250 case, the key is set as ixre-egl-board40|asic0|Ethernet24|5-6. The following code check bypasses everything from config in multi LC and asic case. if (toks.size() != 2) { SWSS_LOG_ERROR("Invalid BUFFER_QUEUE key: [%s]", portQueueKey.c_str()); continue; } 2) Even we can put a fix above to let getQueueConfigurations() create queuesStateVector from config by adding proper check, PortsOrch::generateQueueMapPerPort does not use the passed in parameter queueState at all. generateQueueMapPerPort will generate the full map no matter create_only_config_db_buffers is true or false. It writes into database COUNTERS_DB table COUNTERS_QUEUE_NAME_MAP or COUNTERS_VOQ_NAME_MAP if switch_type is 'voq'.

I do not see create_only_config_db_buffers could take effect based on the above code logic. My question to you is that is this a bug or it's designed this way?

https://github.com/sonic-net/sonic-buildimage/issues/19624 was created for this issue. I closed this one yesterday as I am told that in voq scenario the queue name map should always have full queue mapping it will never get from config. I am double checking with you to get it confirmed. Also want to know for non voq switch_type case should this flag be supported or not?

Please advice. Thanks.

wumiaont commented 3 months ago

Reopen this one as we currently skip test because of this PR. When this is closed, the test seems to not be skipped and result in failure. We need answer for the above concern.