kytos-ng / mef_eline

Kytos NApp to create and manage point-to-point L2 circuits
https://kytos-ng.github.io/api/mef_eline.html
MIT License

bug: after a link went down with 300+ static EVCs, it didn't converge to `backup_path` #485

Open viniarck opened 1 month ago

viniarck commented 1 month ago

Scenario: with 300+ static EVCs provisioned, a link went down and several EVCs never converged to their backup_path; the flow_manager requests issued during convergence hit the API concurrency limit and failed with 503.

It's expected that increasing api_concurrency_limit makes this go away; I'll cover this with a test to confirm it behaves as expected. However, this is already a sign that the default value isn't suitable for handling 300+ EVCs, so it needs to be better documented. Finally, mef_eline should handle this error more completely: logging it is the minimum expected in this case, but doing nothing after that is problematic. It should probably mark the EVC for a next deployment attempt to try to recover from this failure (this isn't a consistency check, just part of the rest of error handling).

2024-07-25 13:10:58,456 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,457 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,458 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,461 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,462 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,463 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,463 - WARNING [uvicorn.error] (MainThread) Exceeded concurrency limit.
2024-07-25 13:10:58,493 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49008 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A02 HTTP/1.1" 202
2024-07-25 13:10:58,500 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49056 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 202
2024-07-25 13:10:58,508 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49526 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,509 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49558 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,511 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49588 - "GET /api/kytos/flow_manager/v2/stored_flows?state=installed&state=pending HTTP/1.1" 503
2024-07-25 13:10:58,513 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_82) Error removing flows from switch 00:00:00:00:00:00:00:02 forEVC EVC(d4caab37f41d43, evc89): Service Unavailable
2024-07-25 13:10:58,512 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49732 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A02 HTTP/1.1" 503
2024-07-25 13:10:58,515 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_253) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(44e9dd2ce52142, evc254): Service Unavailable
2024-07-25 13:10:58,513 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49758 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,519 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_248) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(f74e8067494148, evc247): Service Unavailable
2024-07-25 13:10:58,520 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_232) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(d97b5c7966994d, evc231): Service Unavailable
2024-07-25 13:10:58,516 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49784 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,522 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_255) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(0680b6a67c6d45, evc255): Service Unavailable
2024-07-25 13:10:58,523 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_300) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(0bcef7fe92bf44, evc295): Service Unavailable
2024-07-25 13:10:58,517 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49786 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,525 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_265) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(a2ffb85c69ec4d, evc264): Service Unavailable
2024-07-25 13:10:58,527 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_153) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(30b819b1572b4e, evc152): Service Unavailable
2024-07-25 13:10:58,520 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49842 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,530 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_296) Error removing flows from switch 00:00:00:00:00:00:00:02 forEVC EVC(5bcfdb8283f447, evc299): Service Unavailable
2024-07-25 13:10:58,521 - INFO [uvicorn.access] (MainThread) 127.0.0.1:49936 - "POST /api/kytos/flow_manager/v2/delete/00%3A00%3A00%3A00%3A00%3A00%3A00%3A06 HTTP/1.1" 503
2024-07-25 13:10:58,532 - ERROR [kytos.napps.kytos/mef_eline] (thread_pool_app_148) Error removing flows from switch 00:00:00:00:00:00:00:06 forEVC EVC(5c168c6ec4ca4a, evc143): Service Unavailable
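
For context, the 503s above come straight from the API server's concurrency limit: once more requests are in flight than api_concurrency_limit allows, uvicorn logs "Exceeded concurrency limit." and answers 503, which mef_eline then reports as "Service Unavailable". A minimal standalone sketch that reproduces the same warnings (a toy ASGI app, not kytos code; the assumption that api_concurrency_limit maps to uvicorn's limit_concurrency is inferred from the warnings above):

import asyncio

import uvicorn


async def app(scope, receive, send):
    # Toy ASGI app: hold every request for a second so concurrent calls pile up.
    if scope["type"] != "http":
        return  # ignore lifespan events in this sketch
    await asyncio.sleep(1)
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


if __name__ == "__main__":
    # With 300+ clients hitting the API at once, a limit of 160 is exceeded and
    # uvicorn starts logging "Exceeded concurrency limit." and returning 503.
    uvicorn.run(app, host="127.0.0.1", port=8000, limit_concurrency=160)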

(attached screenshot: 20240725_131549)

{
    "active": false,
    "archived": false,
    "backup_path": [
        {
            "id": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:01:2",
                "name": "novi_port_2",
                "port_number": 2,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:01",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:02:2",
                "name": "novi_port_2",
                "port_number": 2,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:02",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "cf0f4071be426b3f745027f5d22bc61f8312ae86293c9b28e7e66015607a9260"
            },
            "metadata": {},
            "active": true,
            "enabled": true,
            "status": "UP",
            "status_reason": []
        },
        {
            "id": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:02:1",
                "name": "novi_port_1",
                "port_number": 1,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:02",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:06:5",
                "name": "novi_port_5",
                "port_number": 5,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:06",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": true,
                "enabled": true,
                "status": "UP",
                "status_reason": [],
                "link": "192e11d8563e9059d5d1d4bb6fd5f2e80bdf1bd56638e1a980ccb785a21ed372"
            },
            "metadata": {},
            "active": true,
            "enabled": true,
            "status": "UP",
            "status_reason": []
        }
    ],
    "bandwidth": 0,
    "circuit_scheduler": [],
    "current_path": [],
    "dynamic_backup_path": false,
    "enabled": true,
    "failover_path": [],
    "id": "004fdf5c1c8b41",
    "metadata": {
        "telemetry_request": {},
        "telemetry": {
            "enabled": true,
            "status": "DOWN",
            "status_reason": [
                "redeployed_link_down_no_path"
            ],
            "status_updated_at": "2024-07-25T16:11:03"
        }
    },
    "name": "evc59",
    "primary_path": [
        {
            "id": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96",
            "endpoint_a": {
                "id": "00:00:00:00:00:00:00:01:11",
                "name": "novi_port_11",
                "port_number": 11,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:01",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": false,
                "enabled": true,
                "status": "DOWN",
                "status_reason": [
                    "deactivated"
                ],
                "link": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96"
            },
            "endpoint_b": {
                "id": "00:00:00:00:00:00:00:06:11",
                "name": "novi_port_11",
                "port_number": 11,
                "mac": "00:00:00:00:00:00",
                "switch": "00:00:00:00:00:00:00:06",
                "type": "interface",
                "nni": true,
                "uni": false,
                "speed": 12500000000.0,
                "metadata": {},
                "lldp": true,
                "active": false,
                "enabled": true,
                "status": "DOWN",
                "status_reason": [
                    "deactivated"
                ],
                "link": "d7e3aade462df35a656320b717e8603a7fd624c4cfb7dcb72feafe95c6be1c96"
            },
            "metadata": {},
            "active": false,
            "enabled": true,
            "status": "DOWN",
            "status_reason": [
                "deactivated"
            ]
        }
    ],
    "service_level": 0,
    "uni_a": {
        "tag": {
            "tag_type": "vlan",
            "value": 59
        },
        "interface_id": "00:00:00:00:00:00:00:01:15"
    },
    "uni_z": {
        "tag": {
            "tag_type": "vlan",
            "value": 59
        },
        "interface_id": "00:00:00:00:00:00:00:06:22"
    },
    "sb_priority": null,
    "execution_rounds": 0,
    "owner": null,
    "queue_id": -1,
    "primary_constraints": {},
    "secondary_constraints": {},
    "primary_links": [],
    "backup_links": [],
    "start_date": "2024-07-25T16:00:51",
    "creation_time": "2024-07-25T16:00:51",
    "request_time": "2024-07-25T16:00:51",
    "end_date": null,
    "flow_removed_at": "2024-07-25T16:11:00",
    "updated_at": "2024-07-25T16:11:03"
}
viniarck commented 1 month ago

It's expected that increasing api_concurrency_limit makes this go away

With api_concurrency_limit and the API thread pool both set to 512, it didn't result in issues.

The default value in kytos.conf as of July 25th is 160.
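
For reference, a rough sketch of that tuning in kytos.conf. The api_concurrency_limit key and the 160/512 values come from this issue; the [daemon] section and the thread pool key name are assumptions and should be checked against your kytos.conf:

[daemon]
# ASGI concurrency limit for the API; the default (160) was exceeded with 300+ EVCs
api_concurrency_limit = 512
# size of the "app" thread pool used by NApp handlers (key name assumed; other pools omitted)
thread_pool_max_workers = {"app": 512}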

viniarck commented 1 month ago

It should probably mark the EVC for a next deployment attempt to try to recover from this failure (this isn't a consistency check, just part of the rest of error handling)

This needs to be assessed and discussed to see whether it would work reliably, and to consider whether anything else can still be done.
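
To make the discussion concrete, a rough sketch of what "marking for a next deployment" could look like (not existing mef_eline code; the class, method names, and the periodic-retry assumption are hypothetical):

import logging
import threading

log = logging.getLogger(__name__)


class FailedDeployTracker:
    """Hypothetical helper: remember EVCs whose removal/redeploy failed so a later pass retries them."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: set[str] = set()

    def mark(self, evc_id: str, reason: str) -> None:
        """Record a failed (re)deployment instead of only logging it."""
        log.error("EVC %s (re)deployment failed: %s; queued for retry", evc_id, reason)
        with self._lock:
            self._pending.add(evc_id)

    def drain(self) -> set[str]:
        """Return and clear the pending EVC ids; a periodic job would redeploy them."""
        with self._lock:
            pending, self._pending = self._pending, set()
        return pending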

viniarck commented 1 month ago

It's probably also worth considering retries with tenacity, and pacing the method that sends the requests.
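
A hedged sketch of that idea: wrap the flow_manager call with tenacity so a transient 503 is retried with exponential backoff, which also paces the requests. The endpoint path is the one from the logs above; the function name, exception class, port, and payload shape are assumptions for illustration only:

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


class FlowManagerUnavailable(Exception):
    """Raised when flow_manager answers 503 (e.g. concurrency limit exceeded)."""


@retry(
    retry=retry_if_exception_type(FlowManagerUnavailable),
    wait=wait_exponential(multiplier=0.5, max=8),  # backoff also paces the requests
    stop=stop_after_attempt(5),
    reraise=True,
)
def remove_flows_from_switch(dpid: str, flows: list[dict]) -> None:
    """Ask flow_manager to delete flows on a switch, retrying transient 503s."""
    resp = httpx.post(
        f"http://127.0.0.1:8181/api/kytos/flow_manager/v2/delete/{dpid}",
        json={"flows": flows},
        timeout=30,
    )
    if resp.status_code == 503:
        raise FlowManagerUnavailable(f"flow_manager unavailable for switch {dpid}")
    resp.raise_for_status()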