kytos-ng / of_core

Kytos Main OpenFlow Network Application (NApp)
MIT License

Overlapping multipart request check can lead to delayed flow stats messages - false positive after switch reconnect #91

Closed · italovalcy closed this issue 1 year ago

italovalcy commented 1 year ago

Hi,

I was investigating a failure in the end-to-end tests, as shown below:

rerun: 0
tests/test_e2e_20_flow_manager.py::TestE2EFlowManager::test_060_flow_another_table: 2022-12-07,06:20:27.345563 - 2022-12-07,06:20:43.261851
self = <tests.test_e2e_20_flow_manager.TestE2EFlowManager object at 0x7f1dbaf61af0>

   def test_060_flow_another_table(self):
     self.flow_another_table()

tests/test_e2e_20_flow_manager.py:524: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <tests.test_e2e_20_flow_manager.TestE2EFlowManager object at 0x7f1dbaf61af0>
restart_kytos = False

   def flow_another_table(self, restart_kytos=False):
       """Tests if, after adding a flow in another
       table outside kytos, the flow is removed."""

       s1 = self.net.net.get('s1')
       s1.dpctl('add-flow', 'table=2,in_port=1,actions=output:2')
       if restart_kytos:
           # restart controller keeping configuration
           self.net.start_controller(enable_all=True)
           self.net.wait_switches_connect()
       else:
           self.net.reconnect_switches()

       time.sleep(10)

       s1 = self.net.net.get('s1')
       flows_s1 = s1.dpctl('dump-flows')
     assert len(flows_s1.split('\r\n ')) == BASIC_FLOWS, flows_s1
E       AssertionError:  cookie=0xac00000000000001, duration=26.490s, table=0, n_packets=0, n_bytes=0, priority=50000,dl_src=ee:ee:ee:ee:ee:03 actions=CONTROLLER:65535
E          cookie=0xac00000000000001, duration=26.424s, table=0, n_packets=0, n_bytes=0, priority=50000,dl_src=ee:ee:ee:ee:ee:02 actions=CONTROLLER:65535
E          cookie=0xab00000000000001, duration=14.940s, table=0, n_packets=17, n_bytes=714, priority=1000,dl_vlan=3799,dl_type=0x88cc actions=CONTROLLER:65535
E          cookie=0x0, duration=15.893s, table=2, n_packets=0, n_bytes=0, in_port="s1-eth1" actions=output:"s1-eth2"
E         
E       assert 4 == 3
E        +  where 4 = len([' cookie=0xac00000000000001, duration=26.490s, table=0, n_packets=0, n_bytes=0, priority=50000,dl_src=ee:ee:ee:ee:ee:...5535', 'cookie=0x0, duration=15.893s, table=2, n_packets=0, n_bytes=0, in_port="s1-eth1" actions=output:"s1-eth2"\r\n'])
E        +    where [' cookie=0xac00000000000001, duration=26.490s, table=0, n_packets=0, n_bytes=0, priority=50000,dl_src=ee:ee:ee:ee:ee:...5535', 'cookie=0x0, duration=15.893s, table=2, n_packets=0, n_bytes=0, in_port="s1-eth1" actions=output:"s1-eth2"\r\n'] = <built-in method split of str object at 0x7f1da0013d30>('\r\n ')
E        +      where <built-in method split of str object at 0x7f1da0013d30> = ' cookie=0xac00000000000001, duration=26.490s, table=0, n_packets=0, n_bytes=0, priority=50000,dl_src=ee:ee:ee:ee:ee:0...5535\r\n cookie=0x0, duration=15.893s, table=2, n_packets=0, n_bytes=0, in_port="s1-eth1" actions=output:"s1-eth2"\r\n'.split

tests/test_e2e_20_flow_manager.py:521: AssertionError

And I realized that the error above happened because the consistency check routine didn't execute as expected to remove the extra flow on table=2:

# grep -i consistency kytos.log3 | grep kytos.napps.kytos/flow_manager
Dec  7 06:21:01 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:01,065 (thread_pool_sb_4) kytos.napps.kytos/flow_manager:DEBUG main:332:  check_consistency on switch 00:00:00:00:00:00:00:01 has started
Dec  7 06:21:01 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:01,068 (thread_pool_sb_4) kytos.napps.kytos/flow_manager:DEBUG main:335:  check_consistency on switch 00:00:00:00:00:00:00:01 is done
Dec  7 06:21:01 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:01,226 (thread_pool_sb_6) kytos.napps.kytos/flow_manager:DEBUG main:332:  check_consistency on switch 00:00:00:00:00:00:00:03 has started
Dec  7 06:21:01 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:01,229 (thread_pool_sb_6) kytos.napps.kytos/flow_manager:DEBUG main:335:  check_consistency on switch 00:00:00:00:00:00:00:03 is done
Dec  7 06:21:01 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:01,419 (thread_pool_sb_1) kytos.napps.kytos/flow_manager:DEBUG main:332:  check_consistency on switch 00:00:00:00:00:00:00:02 has started
Dec  7 06:21:01 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:01,426 (thread_pool_sb_1) kytos.napps.kytos/flow_manager:DEBUG main:335:  check_consistency on switch 00:00:00:00:00:00:00:02 is done
Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,765 (thread_pool_sb_5) kytos.napps.kytos/flow_manager:DEBUG main:332:  check_consistency on switch 00:00:00:00:00:00:00:02 has started
Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,769 (thread_pool_sb_5) kytos.napps.kytos/flow_manager:DEBUG main:335:  check_consistency on switch 00:00:00:00:00:00:00:02 is done

As we can see from the output above, for switches 01 and 03 the consistency check ran only at 06:21:01. It should have run again around 06:21:13, after the reconnection, as it did for switch 02, but it didn't.

If we look for the time when switches reconnected:

# grep HANDSHAKE kytos.log3
Dec  7 06:21:00 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:00,197 (thread_pool_sb_0) kytos.napps.kytos/of_core:INFO main:138:  Connection ('127.0.0.1', 33628), Switch 00:00:00:00:00:00:00:01: OPENFLOW HANDSHAKE COMPLETE
Dec  7 06:21:00 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:00,223 (thread_pool_sb_1) kytos.napps.kytos/of_core:INFO main:138:  Connection ('127.0.0.1', 33630), Switch 00:00:00:00:00:00:00:03: OPENFLOW HANDSHAKE COMPLETE
Dec  7 06:21:00 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:00,236 (thread_pool_sb_2) kytos.napps.kytos/of_core:INFO main:138:  Connection ('127.0.0.1', 33632), Switch 00:00:00:00:00:00:00:02: OPENFLOW HANDSHAKE COMPLETE
Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,175 (thread_pool_sb_2) kytos.napps.kytos/of_core:INFO main:138:  Connection ('127.0.0.1', 33676), Switch 00:00:00:00:00:00:00:01: OPENFLOW HANDSHAKE COMPLETE
Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,327 (thread_pool_sb_0) kytos.napps.kytos/of_core:INFO main:138:  Connection ('127.0.0.1', 33678), Switch 00:00:00:00:00:00:00:02: OPENFLOW HANDSHAKE COMPLETE
Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,548 (thread_pool_sb_4) kytos.napps.kytos/of_core:INFO main:138:  Connection ('127.0.0.1', 33680), Switch 00:00:00:00:00:00:00:03: OPENFLOW HANDSHAKE COMPLETE

The switches reconnected at 06:21:13, but only switch 02 could perform the flow stats request successfully.

Checking for overlapping multipart requests, we can clearly see that sw01 and sw03 got stuck in a pending request:

# grep -i Overlapping kytos.log3
Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,556 (thread_pool_sb_3) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828
Dec  7 06:21:14 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:14,283 (thread_pool_sb_1) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:03 flows_xid 13648313 ports_xid 2164488504
Dec  7 06:21:20 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:20,058 (Thread-101) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828
Dec  7 06:21:20 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:20,412 (Thread-102) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:03 flows_xid 13648313 ports_xid 2164488504
Dec  7 06:21:27 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:27,072 (Thread-104) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828
Dec  7 06:21:27 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:27,426 (Thread-105) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:03 flows_xid 13648313 ports_xid 2164488504
Dec  7 06:21:27 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:27,913 (Thread-107) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828
Dec  7 06:21:28 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:28,263 (Thread-109) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:03 flows_xid 13648313 ports_xid 2164488504

Searching for the transaction ID, we can confirm that of_core was waiting on this request, but the request was probably never delivered to the switch because the reconnection happened at the same time:

Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,072 (thread_pool_sb_5) kytos.napps.kytos/flow_manager:ERROR main:763:  Aborting retries, xid: 2584312572 not found on flow mods sent
Dec  7 06:21:13 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:13,556 (thread_pool_sb_3) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828
Dec  7 06:21:20 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:20,058 (Thread-101) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828
Dec  7 06:21:27 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:27,072 (Thread-104) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828
Dec  7 06:21:27 runner-rmtk68xb-project-107-concurrent-0 2022-12-07 06:21:27,913 (Thread-107) kytos.napps.kytos/of_core:INFO main:76:  Overlapping stats request: switch 00:00:00:00:00:00:00:01 flows_xid 2584312572 ports_xid 273634828

It is interesting to note that flow_manager did capture the connection.error event carrying this XID, but since the XID belonged to a FlowStats request (not a FlowMod), flow_manager rightly discarded the error.
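
For context, the overlap guard presumably works along the lines of the standalone sketch below (the pending_xids structure, function name, and control flow are hypothetical, chosen only to mirror the log messages above): once a cycle's xids are recorded as pending and never answered, every subsequent stats cycle is skipped.

# Standalone Python sketch; the stale xids are the ones logged for switch 01.
pending_xids = {'00:00:00:00:00:00:00:01': {2584312572, 273634828}}

def request_stats(dpid):
    """Skip the stats request while a previous cycle is still unanswered."""
    if pending_xids.get(dpid):
        # Nothing ever clears the stale xids after the reconnect, so this
        # branch fires on every periodic cycle and no new request goes out.
        print(f'Overlapping stats request: switch {dpid}')
        return False
    # ... otherwise send the multipart requests and record their new xids ...
    return True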

Maybe we can improve OF_Core in two ways:

  1. Reset the list of pending multipart requests upon switch reconnect (I'm not sure the multipart requests will be answered after switch reconnection)
  2. Add a handler for connection error and remove the XID from the list of multipart requests waiting to be answered (which is used to check for overlapping requests); a sketch of both handlers follows this list
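
To illustrate both ideas, here is a minimal sketch of what the two handlers could look like. The event names come straight from this discussion, but the _multipart_pending structure, the payload keys, and the attribute access paths are assumptions, not of_core's actual internals:

from collections import defaultdict

from kytos.core import KytosNApp, log
from kytos.core.helpers import listen_to


class Main(KytosNApp):
    """Sketch only, not the real of_core Main NApp."""

    def setup(self):
        # Hypothetical structure: dpid -> set of outstanding multipart xids.
        self._multipart_pending = defaultdict(set)

    @listen_to('.*.connection.lost')
    def on_connection_lost(self, event):
        """Enhancement 1: reset pending multipart requests on disconnect."""
        # Payload layout assumed: 'source' is expected to lead to the switch.
        switch = getattr(event.content.get('source'), 'switch', None)
        if switch:
            self._multipart_pending.pop(switch.id, None)
            log.info(f'Reset pending multipart requests for {switch.id}')

    @listen_to('kytos/core.openflow.connection.error')
    def on_connection_error(self, event):
        """Enhancement 2: stop treating a failed request's xid as pending."""
        # Payload layout assumed: the error event is expected to carry the
        # original outgoing event, whose OpenFlow message holds the xid and
        # whose destination leads back to the switch.
        failed = event.content.get('event')
        if not failed:
            return
        message = failed.content.get('message')
        switch = failed.destination.switch
        if message and switch:
            self._multipart_pending[switch.id].discard(int(message.header.xid))

    def execute(self):
        """Not relevant to this sketch."""

    def shutdown(self):
        """Not relevant to this sketch."""
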
viniarck commented 1 year ago

@italovalcy, great catch, much appreciated you looking into this.

Maybe we can improve OF_Core in two ways:

  1. Reset the list of pending multipart requests upon switch reconnect (I'm not sure the multipart requests will be answered after switch reconnection)

This upfront reset would be very beneficial, I believe, minimizing the chances of discarding a subsequent iteration. Maybe we could reset the pending multipart requests whenever a connection is lost, subscribing to '.*.connection.lost' like topology does?

  2. Add a handler for connection error and remove the XID from the list of multipart requests waiting to be answered (which is used to check for overlapping requests)

I think we'll also need to subscribe to kytos/core.openflow.connection.error since (OSError, SocketError) can be raised as a result; having this as well would help to cover cases where a socket crashed mid-transmission but hasn't closed yet. It seems that asyncio.Protocol's connection_lost will catch most things, though.

italovalcy commented 1 year ago

Thank you very much for the feedback, @viniarck. Comments inline:

This upfront reset would be very beneficial, I believe, minimizing the chances of discarding a subsequent iteration. Maybe we could reset the pending multipart requests whenever a connection is lost, subscribing to '.*.connection.lost' like topology does?

Yes, sounds like a good enhancement!

I think we'll also need to subscribe to kytos/core.openflow.connection.error since (OSError, SocketError) can be raised as a result; having this as well would help to cover cases where a socket crashed mid-transmission but hasn't closed yet. It seems that asyncio.Protocol's connection_lost will catch most things, though.

Agreed! Listening to connection.error would give of_core much more robustness in corner cases like the one we saw in the e2e tests. Then we can keep evolving the error-handling actions as we learn throughout the process (for now, resetting the pending multipart requests seems like one action to take; maybe we can come up with others).

viniarck commented 1 year ago

Right. Let's add these two enhancements then.

@italovalcy, I'll go ahead and assign this one to myself. I have capacity, and I think it's doable to ship it in this release; it's also aligned with the enhancements/stabilization of things that can impact e2e that we have in this version. Let's go for it.