telstra / open-kilda

OpenKilda is an open-source OpenFlow controller initially designed for use in a global network with high control-plane latency and a heavy emphasis on latency-centric data path optimisation.
Apache License 2.0
78 stars 53 forks source link

An acceptance test fails on path-recomputation #47

Closed carmine closed 6 years ago

carmine commented 7 years ago

We've adjusted one of the ATs to @MVP.Future wrt path recomputation. This task includes:

nikitamarchenko commented 7 years ago
  @MVP.FUTURE
  Scenario: Port Failover followed by failure followed by recovery

  This scenario checks that failover and recovery happens orderly and that
  failures do not break things apart. Also it checks whether alternate route
  is picked after complet failure.

    Given a clean flow topology
    And a clean controller
    And a multi-path topology
    And topology contains 16 links
    And a flow ffr is successfully created
    And flow ffr in UP state
    And flow ffr path is shortest
    And traffic flows through ffr flow

    When a switch 00:00:00:00:00:00:00:03 port 1 is disabled
    And flow ffr in UP state
    And flow ffr path is alternate
    Then traffic flows through ffr flow

    When a switch 00:00:00:00:00:00:00:05 port 1 is disabled
    And flow ffr in DOWN state
    And flow ffr path is alternate
    Then traffic does not flow through ffr flow

    When a switch 00:00:00:00:00:00:00:03 port 1 is enabled
>>>FAILED    And flow ffr in UP state
    And flow ffr path is shortest
    Then traffic flows through ffr flow

I don't know what is wrong so my estimate 5d

nikitacherevko commented 7 years ago

I think this issue may takes up to 5 days because we don't know the reason why this error is occurred

surabujin commented 7 years ago

I don't dig in this direction yet... so my very rude estimation is 5d.

a-ovchinnikov commented 7 years ago

This is a missing feature. Currently the only way to restore a flow is to bring up a port in the last path installed i.e. in p1 of s5. To make this test pass the controller needs to keep track of which port belongs to which path and on a port up event check if any path is affected. If a path is affected and if a flow with such path is down then a set of rules for the path should be created and installed or confirmed. Also this will pave way for flow path optimization feature in the future. I was thinking about this feature and I believe that it would be better to query some database for this data rather then mess with all the mappings needed. Simple queries should reduce future bugs' severity and make code cleaner and more maintainable overall. However this requires adding an extra service (a DB) and amending the rest of the code to use it. So I estimate this to be at least a 5d long exercise. If we choose some ad hoc solution then it will take about 5d as well, but in this case there is more chance that it will actually take longer to finish it and it will have greater potential to introduce new bugs.

nikitamarchenko commented 6 years ago

In https://github.com/telstra/open-kilda/blob/master/services/wfm/src/main/java/org/openkilda/wfm/topology/cache/CacheBolt.java#L330

I found a code

affectedFlows = flowCache.dumpFlows().stream()
                        .filter(flow -> FlowState.DOWN.equals(flow.getLeft().getState()))
                        .collect(Collectors.toSet());

emitRerouteCommands(affectedFlows, tuple, "ISL", FlowOperation.UPDATE);

But I don't see reroute command in logs, so will try to find what is wrong tomorrow.

nikitamarchenko commented 6 years ago

affectedFlows size is 0

nikitamarchenko commented 6 years ago

error in logs https://gist.github.com/nikitamarchenko/18b71438d6915cbc43749430cafc97c0

nikitamarchenko commented 6 years ago

affectedFlows size is 0 because the flow is on UP state.