kytos-ng / mef_eline

Kytos NApp to create and manage point-to-point L2 circuits
https://kytos-ng.github.io/api/mef_eline.html
MIT License

Link down affecting current_path and failover_path can result in inconsistencies #575

Open italovalcy opened 1 day ago

italovalcy commented 1 day ago

While troubleshooting some inconsistencies reported for an EVC, I found a bug: when a Link Down affects both the current_path and the failover_path, the failover_path can be left behind in an inconsistent state. Please see more details below.

The symptom was that the external consistency-check script, which runs every 30 minutes, reported the following inconsistencies:

inconsistency found on NNI/Failover flows for EVC 0c65a33c20ef4a VLAN_xxxx at 00:00:00:00:00:28:00:01
==> expected flow: {'match': {'in_port': 6, 'dl_vlan': 18}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 37}, {'action_type': 'output', 'port': 2}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}
==> 0 candidates: []
inconsistency found on NNI/Failover flows for EVC 0c65a33c20ef4a VLAN_xxxx at 00:00:00:00:00:28:00:01
==> expected flow: {'match': {'in_port': 2, 'dl_vlan': 37}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 18}, {'action_type': 'output', 'port': 6}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}
==> 0 candidates: []
inconsistency found on UNI/Failover flows for EVC 0c65a33c20ef4a VLAN_xxxx at 00:00:00:00:00:00:00:17
==> expected flow: {'match': {'in_port': 5, 'dl_vlan': 47}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'output', 'port': 15}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}
==> 0 candidates: []
inconsistency found on UNI/Failover flows for EVC 0c65a33c20ef4a VLAN_xxxx at 00:00:00:00:00:32:00:04
==> expected flow: {'match': {'in_port': 1, 'dl_vlan': 37}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'output', 'port': 7}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}
==> 0 candidates: []
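
For context, the check that produced the report above is conceptually simple: for each switch on the EVC's paths, every flow mef_eline is expected to have installed must have at least one matching candidate among the flows actually present on the switch. Below is a minimal Python sketch of that matching logic (a hypothetical helper, not the actual external script; it assumes flow dicts shaped like the ones printed in the report):

def find_missing_flows(expected_by_dpid, installed_by_dpid):
    """Return (dpid, expected_flow) pairs that have zero candidates.

    Both arguments map a switch dpid to a list of flow dicts such as
    the ones printed in the consistency report above.
    """
    missing = []
    for dpid, expected_flows in expected_by_dpid.items():
        installed = installed_by_dpid.get(dpid, [])
        for flow in expected_flows:
            candidates = [
                inst for inst in installed
                if inst.get("cookie") == flow["cookie"]
                and inst.get("match") == flow["match"]
            ]
            if not candidates:  # reported as "0 candidates: []" above
                missing.append((dpid, flow))
    return missing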

After investigating what happened at that date/time, I found a link flap that indeed impacted the EVC:

2024-11-24 17:29:45,325 - INFO [kytos.napps.kytos/mef_eline] [main.py:822:handle_link_down] (dynamic_single_0) Event handle_link_down Link(Interface('novi_port_2', 2, Switch('00:00:00:00:00:28:00:01')), Interface('novi_port_1', 1, Switch('00:00:00:00:00:32:00:04')), 8686cb2050fa1d9840b8eb4d5df56d196de0e5490743319fef3926ca732a3914)
2024-11-24 17:29:46,195 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:32:00:04, command: delete, force: True,  flows[0, 1]: [{'cookie': 12253280437931142986, 'cookie_mask': 18446744073709551615}]
2024-11-24 17:29:46,584 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:00:00:14, command: delete, force: True,  flows[0, 1]: [{'cookie': 12253280437931142986, 'cookie_mask': 18446744073709551615}]
2024-11-24 17:29:46,966 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:24:00:02, command: delete, force: True,  flows[0, 1]: [{'cookie': 12253280437931142986, 'cookie_mask': 18446744073709551615}]
2024-11-24 17:29:47,242 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:00:00:17, command: delete, force: True,  flows[0, 1]: [{'cookie': 12253280437931142986, 'cookie_mask': 18446744073709551615}]
2024-11-24 17:29:47,438 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:28:00:01, command: delete, force: True,  flows[0, 1]: [{'cookie': 12253280437931142986, 'cookie_mask': 18446744073709551615}]
2024-11-24 17:29:47,677 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:24:00:03, command: delete, force: True,  flows[0, 1]: [{'cookie': 12253280437931142986, 'cookie_mask': 18446744073709551615}]
2024-11-24 17:29:48,454 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:00:00:14, command: add, force: False,  flows[0, 2]: [{'match': {'in_port': 30, 'dl_vlan': 1}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 36}, {'action_type': 'output', 'port': 32}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}, {'match': {'in_port': 32, 'dl_vlan': 36}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 1}, {'action_type': 'output', 'port': 30}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}]
2024-11-24 17:29:48,610 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:16:00:03, command: add, force: False,  flows[0, 2]: [{'match': {'in_port': 32, 'dl_vlan': 36}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 58}, {'action_type': 'output', 'port': 30}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}, {'match': {'in_port': 30, 'dl_vlan': 58}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 36}, {'action_type': 'output', 'port': 32}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}]
2024-11-24 17:29:48,719 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:40:00:03, command: add, force: False,  flows[0, 2]: [{'match': {'in_port': 32, 'dl_vlan': 58}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 38}, {'action_type': 'output', 'port': 3}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}, {'match': {'in_port': 3, 'dl_vlan': 38}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 58}, {'action_type': 'output', 'port': 32}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}]
2024-11-24 17:29:48,885 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:40:00:04, command: add, force: False,  flows[0, 2]: [{'match': {'in_port': 3, 'dl_vlan': 38}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 2}, {'action_type': 'output', 'port': 1}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}, {'match': {'in_port': 1, 'dl_vlan': 2}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 38}, {'action_type': 'output', 'port': 3}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}]
2024-11-24 17:29:49,031 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:32:00:03, command: add, force: False,  flows[0, 2]: [{'match': {'in_port': 1, 'dl_vlan': 2}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 31}, {'action_type': 'output', 'port': 2}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}, {'match': {'in_port': 2, 'dl_vlan': 31}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'set_vlan', 'vlan_id': 2}, {'action_type': 'output', 'port': 1}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}]
2024-11-24 17:29:49,177 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:00:00:17, command: add, force: False,  flows[0, 2]: [{'match': {'in_port': 15, 'dl_vlan': 3002}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'push_vlan', 'tag_type': 's'}, {'action_type': 'set_vlan', 'vlan_id': 1}, {'action_type': 'output', 'port': 4}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}, {'match': {'in_port': 4, 'dl_vlan': 1}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'output', 'port': 15}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}]
2024-11-24 17:29:49,309 - INFO [kytos.napps.kytos/flow_manager] [utils.py:196:flows_to_log_info] (AnyIO worker thread) Send FlowMod from request dpid: 00:00:00:00:00:32:00:04, command: add, force: False,  flows[0, 2]: [{'match': {'in_port': 7, 'dl_vlan': 3002}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'push_vlan', 'tag_type': 's'}, {'action_type': 'set_vlan', 'vlan_id': 31}, {'action_type': 'output', 'port': 2}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}, {'match': {'in_port': 2, 'dl_vlan': 31}, 'cookie': 12253280437931142986, 'actions': [{'action_type': 'pop_vlan'}, {'action_type': 'output', 'port': 7}, {'action_type': 'set_queue', 'queue_id': 1}], 'owner': 'mef_eline', 'table_group': 'evpl', 'table_id': 0, 'priority': 20000}]
2024-11-24 17:29:49,353 - INFO [kytos.napps.kytos/mef_eline] [evc.py:874:deploy_to_path] (thread_pool_app_61) EVC(0c65a33c20ef4a, VLAN_xxx) was deployed.
2024-11-24 17:29:49,353 - INFO [kytos.napps.kytos/mef_eline] [main.py:927:handle_evc_affected_by_link_down] (thread_pool_app_61) EVC(0c65a33c20ef4a, VLAN_xxx) redeployed due to link down 8686cb2050fa1d9840b8eb4d5df56d196de0e5490743319fef3926ca732a3914
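
Note the delete FlowMods above: each one carries the EVC cookie with cookie_mask 18446744073709551615 (all bits set) and force: True, so it removes every flow carrying that EVC's cookie on the targeted switch, regardless of which path installed it. A small illustration of that matching rule (hypothetical names; the point is the OpenFlow-style cookie match, which makes these deletes dangerous on switches shared by both paths):

EVC_COOKIE = 12253280437931142986
FULL_MASK = 0xFFFFFFFFFFFFFFFF  # 18446744073709551615, as in the logs

def delete_matches(installed_flow, cookie=EVC_COOKIE, mask=FULL_MASK):
    # OpenFlow cookie matching for a delete with cookie_mask: a flow is
    # removed when (flow.cookie & mask) == (cookie & mask).
    return (installed_flow["cookie"] & mask) == (cookie & mask)

# On a switch shared by both paths (e.g. 00:00:00:00:00:28:00:01), a
# current_path flow and a failover_path flow carry the same cookie, so a
# delete aimed at the current path removes the failover flow as well:
current_flow = {"cookie": EVC_COOKIE, "match": {"in_port": 2, "dl_vlan": 36}}
failover_flow = {"cookie": EVC_COOKIE, "match": {"in_port": 6, "dl_vlan": 18}}
assert delete_matches(current_flow) and delete_matches(failover_flow)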

Checking the previously saved state of this EVC, I found that both the current_path and the failover_path share some switches/links:

cat mef_eline_backup_20241124_0900.json | jq -r '.["0c65a33c20ef4a"].current_path[]|.endpoint_a.id + " " + .endpoint_b.id + " " + (.metadata|tostring)'
00:00:00:00:00:00:00:14:30 00:00:00:00:00:00:00:17:4 {"s_vlan":{"value":1,"tag_type":"vlan"}}
00:00:00:00:00:00:00:14:1 00:00:00:00:00:24:00:02:1 {"s_vlan":{"value":25,"tag_type":"vlan"}}
00:00:00:00:00:24:00:02:2 00:00:00:00:00:24:00:03:2 {"s_vlan":{"value":24,"tag_type":"vlan"}}
00:00:00:00:00:24:00:03:1 00:00:00:00:00:28:00:01:1 {"s_vlan":{"value":29,"tag_type":"vlan"}}
00:00:00:00:00:28:00:01:2 00:00:00:00:00:32:00:04:1 {"s_vlan":{"value":36,"tag_type":"vlan"}}

cat mef_eline_backup_20241124_0900.json | jq -r '.["0c65a33c20ef4a"].failover_path[]|.endpoint_a.id + " " + .endpoint_b.id + " " + (.metadata|tostring)'
00:00:00:00:00:00:00:15:30 00:00:00:00:00:00:00:17:5 {"s_vlan":{"value":47,"tag_type":"vlan"}}
00:00:00:00:00:00:00:15:3 00:00:00:00:00:00:00:16:3 {"s_vlan":{"value":97,"tag_type":"vlan"}}
00:00:00:00:00:00:00:10:31 00:00:00:00:00:00:00:16:31 {"s_vlan":{"value":53,"tag_type":"vlan"}}
00:00:00:00:00:00:00:10:16 00:00:00:00:00:28:00:01:6 {"s_vlan":{"value":18,"tag_type":"vlan"}}
00:00:00:00:00:28:00:01:2 00:00:00:00:00:32:00:04:1 {"s_vlan":{"value":37,"tag_type":"vlan"}}

So, in the case above, part of the failover path was most likely removed while handling the redeploy of the current path, because the two paths share some switches (ultimately leaving the failover path inconsistent, with some flows missing).
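
The overlap can be confirmed programmatically from the dumps above: collect the DPIDs touched by each path and intersect them (a quick hypothetical helper; interface ids are '<dpid>:<port>' strings as printed by jq):

def path_switches(links):
    # links: list of (endpoint_a_id, endpoint_b_id); an interface id is
    # '<dpid>:<port>', so dropping the trailing ':<port>' yields the dpid.
    dpids = set()
    for a, b in links:
        dpids.add(a.rsplit(":", 1)[0])
        dpids.add(b.rsplit(":", 1)[0])
    return dpids

current_path = [
    ("00:00:00:00:00:00:00:14:30", "00:00:00:00:00:00:00:17:4"),
    ("00:00:00:00:00:00:00:14:1", "00:00:00:00:00:24:00:02:1"),
    ("00:00:00:00:00:24:00:02:2", "00:00:00:00:00:24:00:03:2"),
    ("00:00:00:00:00:24:00:03:1", "00:00:00:00:00:28:00:01:1"),
    ("00:00:00:00:00:28:00:01:2", "00:00:00:00:00:32:00:04:1"),
]
failover_path = [
    ("00:00:00:00:00:00:00:15:30", "00:00:00:00:00:00:00:17:5"),
    ("00:00:00:00:00:00:00:15:3", "00:00:00:00:00:00:00:16:3"),
    ("00:00:00:00:00:00:00:10:31", "00:00:00:00:00:00:00:16:31"),
    ("00:00:00:00:00:00:00:10:16", "00:00:00:00:00:28:00:01:6"),
    ("00:00:00:00:00:28:00:01:2", "00:00:00:00:00:32:00:04:1"),
]
print(path_switches(current_path) & path_switches(failover_path))
# -> {'00:00:00:00:00:00:00:17', '00:00:00:00:00:28:00:01',
#     '00:00:00:00:00:32:00:04'} (set order may vary) -- exactly the
# three switches flagged by the consistency check above.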

Checking the code: when an EVC is affected by a Link Down and its failover_path is affected by that same Link Down, the EVC link down handler goes through the normal process; but when the EVC is redeployed and the kytos/mef_eline.redeployed_link_down KytosEvent is generated, evc.try_setup_failover_path() results in an early return because the failover_path is still present: https://github.com/kytos-ng/mef_eline/blob/a47e289c1896b913bf524d4357efc64b94d93ff1/models/evc.py#L898-L902
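
In other words, the guard at the top of try_setup_failover_path only checks whether failover_path is non-empty, not whether it is still intact. A simplified paraphrase of the behavior described above (illustrative only; see the linked evc.py lines for the real code):

class EVC:
    def try_setup_failover_path(self, *args, **kwargs):
        if self.failover_path:
            # Early return: the stale failover_path (computed before the
            # link down) is still set, so the method assumes it is healthy
            # and never rebuilds the failover flows that the current_path
            # redeploy just deleted on the shared switches.
            return
        # ... otherwise compute and deploy a new failover path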