telstra / open-kilda

OpenKilda is an open-source OpenFlow controller initially designed for use in a global network with high control-plane latency and a heavy emphasis on latency-centric data path optimisation.
Apache License 2.0

[Flaky] When main path ISL is UP, and ISL of protected path becomes active, and other non-involved ISLs have not enough bandwidth, the flow does not become UP, it stays degraded with “protected-path”: “Down” #5655

Open izadorozhna opened 1 month ago

izadorozhna commented 1 month ago

Steps to reproduce with the automated test:

  1. Open the test spec "Flow swaps to protected path when main path gets broken, becomes DEGRADED if protected path is unable to reroute(no bw)"
  2. Change the code to select switches 7 and 8 as the pair:
         given: "Two switches with 2 diverse paths at least"
         //def switchPair = switchPairs.all().withAtLeastNNonOverlappingPaths(2).random()
         //https://github.com/telstra/open-kilda/issues/5608
    -        def switchesWhere5608IsReproducible = topology.activeSwitches.findAll {it.dpId.toString().endsWith("08")
    -        ||it.dpId.toString().endsWith("09")}
    +        def switches_7_and_8 = topology.activeSwitches.findAll {it.dpId.toString().endsWith("07")
    +                ||it.dpId.toString().endsWith("08")}
         def switchPair = switchPairs.all()
    -                .excludeSwitches(switchesWhere5608IsReproducible)
    +                .includeSwitch(switches_7_and_8[0])
    +                .includeSwitch(switches_7_and_8[1])
                 .withAtLeastNNonOverlappingPaths(2).random()
  3. Run the test. It will now be executed with switches 7 and 8.
  4. If the test passes, repeat step 3.
  5. The issue is reproduced when the test fails at the step where the main ISL is restored: the flow is expected to be UP, but it is Degraded.
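
Steps 3 to 5 amount to re-running a flaky test until the first failure. A minimal sketch of that loop is below. The `run_test` callable is a hypothetical placeholder for whatever launches the spec in your environment and returns `True` on a pass; it is not part of the OpenKilda test suite.

```python
from typing import Callable, Optional


def repeat_until_failure(run_test: Callable[[], bool],
                         max_attempts: int = 20) -> Optional[int]:
    """Call run_test repeatedly until it reports a failure.

    Returns the 1-based number of the first failing attempt,
    or None if all max_attempts attempts passed.
    """
    for attempt in range(1, max_attempts + 1):
        if not run_test():
            return attempt
    return None
```

A `None` result does not prove the issue is absent; it only means it did not show up within `max_attempts` runs.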

Steps to reproduce manually:

  1. Select switches 7 and 8 and create a flow with a protected path. Usually, such a flow has path size 2 (7<-->8) for both main and protected paths.
  2. Select all ISLs that are not involved in the main or protected path of the flow and decrease their bandwidth to a minimum.
  3. Break the ISL(s) of the main path, so that the original protected path becomes the main path and, vice versa, the original main path (with the broken ISL) becomes the protected path, which is down.
  4. Check that the flow now has Degraded status because no protected path can be found (the original main path ISL is broken and cannot serve as the new protected path, and the other, non-involved ISLs do not have enough bandwidth).
  5. Restore the original main ISL broken on step 3.
  6. Check that the flow becomes active with main and protected paths UP.
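
The expected outcome of these steps can be illustrated with a toy model. This is only a sketch of the reasoning, not Kilda's path computation: the `Path` class, the path names, and the bandwidth numbers are all hypothetical, and each path is collapsed into a single object with an up/down state and an available bandwidth.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Path:
    name: str
    up: bool = True
    available_bw: int = 1000


def find_protected(main: Path, candidates: List[Path],
                   required_bw: int) -> Optional[Path]:
    """Return the first candidate that is up, has enough bandwidth,
    and is not the main path itself (toy non-overlap rule)."""
    for path in candidates:
        if path is not main and path.up and path.available_bw >= required_bw:
            return path
    return None


def flow_status(main: Path, protected: Optional[Path]) -> dict:
    """Build a status payload shaped like the ones quoted in this issue."""
    main_state = "Up" if main.up else "Down"
    protected_state = "Up" if protected is not None and protected.up else "Down"
    if main_state == "Down":
        status = "Down"
    elif protected_state == "Down":
        status = "Degraded"
    else:
        status = "Up"
    return {"status": status,
            "status-details": {"main-path": main_state,
                               "protected-path": protected_state}}


def run_scenario(required_bw: int = 500) -> Tuple[dict, dict]:
    direct_a = Path("7-8 via ISL A")   # original main path
    direct_b = Path("7-8 via ISL B")   # original protected path
    detour = Path("7-8 via other switches", available_bw=1)  # step 2
    candidates = [direct_a, direct_b, detour]

    direct_a.up = False                # step 3: break the main ISL
    main = direct_b                    # protected path takes over as main
    degraded = flow_status(main, find_protected(main, candidates, required_bw))

    direct_a.up = True                 # step 5: restore the ISL
    restored = flow_status(main, find_protected(main, candidates, required_bw))
    return degraded, restored
```

In this model the flow is Degraded after step 3 and returns to Up after step 5, which is the behaviour step 6 checks for.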

Expected result:

The flow becomes UP:

"status": "Up",
"status-details": {
  "main-path": "Up",
  "protected-path": "Up"
}

The flow history should contain a reroute action triggered after the ISL becomes Active. The protected path already exists: it was down only because of the broken ISL, and now that this ISL is up again, the same protected path is found. So Kilda skips creating a new protected path:

image

Actual result:

When the same test is executed several times (with the same switch pair 7-8), the result is not consistent. Sometimes the expected result is received. But sometimes, after the main ISL is restored, the flow stays in the Degraded state with "protected-path": "Down":

"status": "Degraded",
"status-details": {
  "main-path": "Up",
  "protected-path": "Down"
},
"status_info": "Couldn't find non overlapping protected path",

However, the history does contain the reroute action after the ISL became active:

image

But for some reason, this time it does not have the "Found the same protected path. Skipped creating of it" message; it has "Couldn't find non overlapping protected path. Skipped creating it" instead.
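
The two outcomes can be told apart mechanically from the flow status payload and from the history message. The sketch below uses the field names and message texts quoted in this issue. The assumption that history messages are available as a plain list of strings is hypothetical, since the layout of the attached history JSON is not shown here.

```python
from typing import Iterable

SAME_PATH_MSG = "Found the same protected path"
NO_PATH_MSG = "Couldn't find non overlapping protected path"


def is_issue_reproduced(flow: dict) -> bool:
    """True if the payload matches the failing state described above:
    the flow is Degraded with the main path Up and the protected path Down."""
    details = flow.get("status-details", {})
    return (flow.get("status") == "Degraded"
            and details.get("main-path") == "Up"
            and details.get("protected-path") == "Down")


def classify_history(messages: Iterable[str]) -> str:
    """Classify reroute history messages (assumed to be plain strings).

    Returns "reproduced" if the failing message is present, "expected"
    if only the normal message is present, and "unknown" otherwise.
    """
    messages = list(messages)
    if any(NO_PATH_MSG in m for m in messages):
        return "reproduced"
    if any(SAME_PATH_MSG in m for m in messages):
        return "expected"
    return "unknown"
```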

Also, an explicit manual reroute via the Northbound V2 API does help: the flow is rerouted and becomes UP, and the flow history now has a new reroute action started via Northbound. However, the API response to the reroute request has rerouted: false for some reason:

image

Attaching the flow history JSON, which also includes the manual explicit reroute action: 07May180118_375_cinnamon9255.json

Attaching topology.yaml: topology.yaml.log

P.S. Please note that the test case is flaky, so the steps may need to be repeated several times to reproduce the issue. Also note that there is a separate, similar test, "Flow swaps to protected path when main path gets broken, becomes DEGRADED if protected path is unable to reroute(no bw)", which has similar steps, except that the other ISLs (those not involved in the main or protected paths) are broken instead of having their bandwidth decreased. That test also sometimes fails with switch pair 7-8.