sonic-net / sonic-mgmt

Configuration management examples for SONiC

Logic for checking whether flooding has stopped for fast reboot in advanced_reboot.py #5016

Open vikneels opened 2 years ago

vikneels commented 2 years ago

We are hitting an intermittent issue where the advanced_reboot fast-reboot test case bails out with "Data plane didn't stop flooding within warm up timeout". It happens about once in every 10-15 iterations of this test case. After looking at the logs and the logic in advanced_reboot.py, I see an issue with the way we check for flooding.

On the DUT, the FDB aging timeout is 600 s and warm_up_timeout_secs is 300 s. If FDB entries age out on the DUT just before the elapsed > warm_up_timeout_secs check, we see flooding at that moment, and the test case fails.

I increased the FDB aging timeout on the DUT so that entries would not expire, ran the test case multiple times, and did not see the issue. I also looked into whether we can change the FDB timeout on the fly before running this test case, but it looks like the only way to change the aging time is to modify switch.json in orchagent, which requires a docker restart.

So I am wondering: a) can we increase the warm-up timeout to a higher value? b) If we hit a flooding case, can we retry and check once more at a shorter interval to see whether flooding stops, so that testing can proceed? c) Alternatively, can we provide a configurable way to increase the FDB aging timeout on the DUT before we run this test?

    # check until flooding is over. Flooding happens when FDB entry of
    # certain host is not yet learnt by the ASIC, therefore it sends
    # packet to all vlan ports.
    uptime = datetime.datetime.now()
    while True:
        elapsed = (datetime.datetime.now() - start_time).total_seconds()
        if not self.asic_state.is_flooding() and elapsed > dut_stabilize_secs:
            break
        if elapsed > warm_up_timeout_secs:
            if self.allow_vlan_flooding:
                break
            raise Exception("Data plane didn't stop flooding within warm up timeout")
        time.sleep(1)
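
To make option (b) concrete, here is a rough sketch of what a retry after the timeout could look like. It is not the code in advanced_reboot.py: `wait_for_flooding_to_stop`, `retries`, `retry_interval_secs`, `poll_secs`, and the injected `clock`/`sleep` callables are all hypothetical names used for illustration, and `is_flooding` stands in for `self.asic_state.is_flooding`.

```python
import time


def wait_for_flooding_to_stop(is_flooding, stabilize_secs, timeout_secs,
                              retries=5, retry_interval_secs=0.2,
                              poll_secs=1.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Return True once flooding has stopped, False if it never did.

    Hypothetical helper, not part of sonic-mgmt. All parameter names
    other than the two timeouts are assumptions made for this sketch.
    """
    start = clock()
    # Phase 1: same polling logic as the existing loop.
    while True:
        elapsed = clock() - start
        if not is_flooding() and elapsed > stabilize_secs:
            return True
        if elapsed > timeout_secs:
            break
        sleep(poll_secs)

    # Phase 2: the timeout was hit while flooding was observed. A burst of
    # flooding caused by FDB aging should clear quickly once entries are
    # relearnt, so re-check a few times at a shorter interval before failing.
    for _ in range(retries):
        sleep(retry_interval_secs)
        if not is_flooding():
            return True
    return False
```

The caller would keep the current behaviour on a `False` result: break if `self.allow_vlan_flooding` is set, otherwise raise the existing "Data plane didn't stop flooding within warm up timeout" exception. Injecting `clock` and `sleep` is only there so the logic can be exercised without real waiting.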

vikneels commented 2 years ago

@yxieca @vaibhavhd your thoughts?