kytos / flow_manager

MIT License
1 stars 16 forks source link

Consistency check keeps removing and inserting flows on a switch during the first N runs #124

Closed italovalcy closed 3 years ago

italovalcy commented 3 years ago

Hi,

The consistency check keeps inserting and removing flows on a switch, even in the simplest running scenario: one switch, only OF_LLDP flows. This seems to happen in the first N runs (from my tests, in the first 5 runs), then it gets "stable."

How to reproduce:

  1. Have kytos running:
    docker pull kytos/nightly:latest
    docker run -d --name kytos --privileged kytos/nightly:latest /usr/bin/tail -f /dev/null
    docker exec -it kytos bash
    for cada in storehouse of_core flow_manager topology of_lldp pathfinder mef_eline maintenance; do python3 -m pip install -e git+https://github.com/kytos/$cada#egg=kytos-$cada; done
    sed -i 's/CONSISTENCY_INTERVAL = 60/CONSISTENCY_INTERVAL = 3/g' /var/lib/kytos/napps/kytos/flow_manager/settings.py
    kytosd -E
  2. Run Mininet single topology:
    mn --topo single,2 --controller=remote,ip=127.0.0.1
  3. check the duration of the flows on the Mininet console:
    sh while true; do ovs-ofctl dump-flows s1; sleep 1; done

Expected output: the "duration" of the LLDP flows should be only increasing.

Actual output: the flows are recreated during the first N runs and then gets "stable", i.e., the duration start increasing:

mininet> sh while true; do ovs-ofctl dump-flows s1; sleep 1; done
 cookie=0x0, duration=0.053s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.061s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.069s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.069s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.074s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.080s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.077s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.083s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.091s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.093s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.100s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.105s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.100s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.105s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.111s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=0.103s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=1.116s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=2.127s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=3.140s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=4.151s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=5.162s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=6.174s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=7.186s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=8.197s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
 cookie=0x0, duration=9.209s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

From Kytos logs:

2021-04-22 13:04:48,985 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flows loaded.
2021-04-22 13:04:48,989 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) A consistency problem was detected in switch 00:00:00:00:00:00:00:01.
2021-04-22 13:04:49,000 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flow forwarded to switch 00:00:00:00:00:00:00:01 to be installed.
2021-04-22 13:04:52,007 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flows loaded.
2021-04-22 13:04:52,011 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) A consistency problem was detected in switch 00:00:00:00:00:00:00:01.
2021-04-22 13:04:52,022 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flow forwarded to switch 00:00:00:00:00:00:00:01 to be installed.
2021-04-22 13:04:55,028 - INFO [kytos.napps.kytos/flow_manager] (flow_manager) Flows loaded.

This behavior is not correct because it means that the network traffic will be unstable (flapping) every time the controller gets restarted.

italovalcy commented 3 years ago

Hi folks, just some additional information:

cmagnobarbosa commented 3 years ago

Hi @italovalcy, I'll check why that happens

italovalcy commented 3 years ago

Hi @cmagnobarbosa and folks,

I think I narrowed down why this is happening. The consistency routine in the way it is implemented today depends on the FLOW_STATS reply to be able to check the stored flows against the switch's flow table. The FLOW_STATS is run by of_core and has nothing to do with flow_manager consistency check, which means that it can be run in a totally different frequency. Suppose the flow_stats run less frequently than consistency_check. The problem then happens because consistency_check compares with an empty array for the switch's flow table and then send a FLOW_MOD to install the apparently missing flows (they are not actually missing, but they were just not seen by of_core at that point).

In fact, the FLOW_STATS is run according to the setting of_core.settings.STATS_INTERVAL and consistency check is run based on flow_manager.settings.CONSISTENCY_INTERVAL.

One way to work around this issue is by changing the of_core.settings.STATS_INTERVAL to something lower than CONSISTENCY_INTERVAL (keeping in mind that the consistency interval could be run right after or right before flow_stats)...

In terms of a proposal for this issue, my suggestions would be:

cmagnobarbosa commented 3 years ago

Hello @italovalcy, I agree with you, I'll implement a new event in the of_core (https://github.com/kytos/of_core/issues/122) NApp to notify interested NApps when new flow statistics are available, making it possible to improve and fix the consistency check.

italovalcy commented 3 years ago

Hi folks,

Another thing is the way flow_manager persists the flows: looks like it has a list of flows and the flow_mod command (add, delete, delete_strict). This is dangerous because then flow_manager will have to guarantee the order in which the commands should be run and looks like it is not the case (see below).

I've created some debug messages to understand how consistency check works:

diff --git a/main.py b/main.py
index c737f85..ab86ff4 100644
--- a/main.py
+++ b/main.py
@@ -88,6 +98,8 @@ class Main(KytosNApp):
         """Check consistency of installed flows for a specific switch."""
         dpid = switch.dpid

+        log.info("ITALO-DEBUG check_switch_consistency dpid=%s stored_flows=%s switch_flowtable=%s" % (dpid, self.stored_flows.get(d
pid, None), switch.flows))
+
         # Flows stored in storehouse
         stored_flows = self.stored_flows[dpid]['flow_list']
@@ -117,6 +129,7 @@ class Main(KytosNApp):
         """Check consistency of installed flows for a specific switch."""
         dpid = switch.dpid

+        log.info("ITALO-DEBUG check_storehouse_consistency dpid=%s stored_flows=%s switch_flowtable=%s" % (dpid, self.stored_flows.g
et(dpid, None), switch.flows))
         for installed_flow in switch.flows:
             if dpid not in self.stored_flows:
                 log.info('A consistency problem was detected in '

Then one of the messages shows the following information:

2021-04-23 17:13:12,159 - INFO [kytos.napps.kytos/flow_manager] (Thread-178) ITALO-DEBUG check_switch_consistency dpid=00:00:00:00:00:00:00:01 stored_flows={'flow_list': [{'command': 'add', 'flow': {'priority': 1000, 'table_id': 0, 'match': {'dl_type': 35020, 'dl_vlan': 3799}, 'actions': [{'action_type': 'output', 'port': 4294967293}]}}, {'command': 'delete', 'flow': {'cookie': 12280154957751673672, 'cookie_mask': 18446744073709551615}}, {'command': 'add', 'flow': {'match': {'in_port': 1, 'dl_vlan': 101}, 'cookie': 12280154957751673672, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 101}, {'action_type': 'output', 'port': 2}]}}, {'command': 'add', 'flow': {'match': {'in_port': 2, 'dl_vlan': 101}, 'cookie': 12280154957751673672, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 101}, {'action_type': 'output', 'port': 1}]}}]} switch_flowtable=[<napps.kytos.of_core.v0x04.flow.Flow object at 0x7fb51427d438>]

As you can see in the "stored_flows" there is a list of commands [add, delete, add, add]. Those commands/flows were created by the following sequence of events: start kytos (so the LLDP will be created), create an intra-switch EVC (so the two add flows), and then disable the EVC. The order in the stored_flows doesn't match with the sequence of events. Indeed, after disabling the EVC, the flows are recreated:

mininet> sh while true; do ovs-ofctl dump-flows s1; echo; sleep 1; done
...
 cookie=0xaa6bdfdefb8a6348, duration=47.632s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=101 actions=mod_vlan_vid:101,output:"s
1-eth2"
 cookie=0xaa6bdfdefb8a6348, duration=47.619s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth2",dl_vlan=101 actions=mod_vlan_vid:101,output:"s
1-eth1"
 cookie=0x0, duration=81.748s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

 cookie=0x0, duration=82.752s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

 cookie=0xaa6bdfdefb8a6348, duration=0.702s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1
-eth2"
 cookie=0xaa6bdfdefb8a6348, duration=0.688s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth2",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1
-eth1"
 cookie=0x0, duration=83.756s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
italovalcy commented 3 years ago

Hi folks,

Another thing is the way flow_manager persists the flows: looks like it has a list of flows and the flow_mod command (add, delete, delete_strict). This is dangerous because then flow_manager will have to guarantee the order in which the commands should be run and looks like it is not the case (see below).

I've created some debug messages to understand how consistency check works:

diff --git a/main.py b/main.py
index c737f85..ab86ff4 100644
--- a/main.py
+++ b/main.py
@@ -88,6 +98,8 @@ class Main(KytosNApp):
         """Check consistency of installed flows for a specific switch."""
         dpid = switch.dpid

+        log.info("ITALO-DEBUG check_switch_consistency dpid=%s stored_flows=%s switch_flowtable=%s" % (dpid, self.stored_flows.get(d
pid, None), switch.flows))
+
         # Flows stored in storehouse
         stored_flows = self.stored_flows[dpid]['flow_list']
@@ -117,6 +129,7 @@ class Main(KytosNApp):
         """Check consistency of installed flows for a specific switch."""
         dpid = switch.dpid

+        log.info("ITALO-DEBUG check_storehouse_consistency dpid=%s stored_flows=%s switch_flowtable=%s" % (dpid, self.stored_flows.g
et(dpid, None), switch.flows))
         for installed_flow in switch.flows:
             if dpid not in self.stored_flows:
                 log.info('A consistency problem was detected in '

Then one of the messages shows the following information:

2021-04-23 17:13:12,159 - INFO [kytos.napps.kytos/flow_manager] (Thread-178) ITALO-DEBUG check_switch_consistency dpid=00:00:00:00:00:00:00:01 stored_flows={'flow_list': [{'command': 'add', 'flow': {'priority': 1000, 'table_id': 0, 'match': {'dl_type': 35020, 'dl_vlan': 3799}, 'actions': [{'action_type': 'output', 'port': 4294967293}]}}, {'command': 'delete', 'flow': {'cookie': 12280154957751673672, 'cookie_mask': 18446744073709551615}}, {'command': 'add', 'flow': {'match': {'in_port': 1, 'dl_vlan': 101}, 'cookie': 12280154957751673672, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 101}, {'action_type': 'output', 'port': 2}]}}, {'command': 'add', 'flow': {'match': {'in_port': 2, 'dl_vlan': 101}, 'cookie': 12280154957751673672, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 101}, {'action_type': 'output', 'port': 1}]}}]} switch_flowtable=[<napps.kytos.of_core.v0x04.flow.Flow object at 0x7fb51427d438>]

As you can see in the "stored_flows" there is a list of commands [add, delete, add, add]. Those commands/flows were created by the following sequence of events: start kytos (so the LLDP will be created), create an intra-switch EVC (so the two add flows), and then disable the EVC. The order in the stored_flows doesn't match with the sequence of events. Indeed, after disabling the EVC, the flows are recreated:

mininet> sh while true; do ovs-ofctl dump-flows s1; echo; sleep 1; done
...
 cookie=0xaa6bdfdefb8a6348, duration=47.632s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=101 actions=mod_vlan_vid:101,output:"s
1-eth2"
 cookie=0xaa6bdfdefb8a6348, duration=47.619s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth2",dl_vlan=101 actions=mod_vlan_vid:101,output:"s
1-eth1"
 cookie=0x0, duration=81.748s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

 cookie=0x0, duration=82.752s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

 cookie=0xaa6bdfdefb8a6348, duration=0.702s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth1",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1
-eth2"
 cookie=0xaa6bdfdefb8a6348, duration=0.688s, table=0, n_packets=0, n_bytes=0, in_port="s1-eth2",dl_vlan=101 actions=mod_vlan_vid:101,output:"s1
-eth1"
 cookie=0x0, duration=83.756s, table=0, n_packets=0, n_bytes=0, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535

@cmagnobarbosa I was working deeper with this and I realized it is related to something else: archiving the EVC through PATCH and not through the DELETE method. So please, forget about the above comment. I will register a new issue in mef_eline for this.