kytos-ng / maintenance

Kytos Maintenance Window NApp
https://kytos-ng.github.io/api/maintenance.html

Refine existing blueprint #59

Closed viniarck closed 1 year ago

viniarck commented 2 years ago

Refine existing blueprint, make sure existing use cases are covered.

italovalcy commented 2 years ago

Here is the blueprint as defined on our internal gitlab:

:EP: 001
:Title: Maintenance Window for items on the network
:Authors: - Jeronimo Bezerra jbezerra@fiu.edu;
          - Arturo Quintana arquinta@fiu.edu ;
          - Italo Valcy idasilva@fiu.edu

:Issued Date: to be defined
:Status: Draft
:Type: Standards Track

########
Abstract
########

This blueprint details the main features, workflows and requirements for a Network Team to operate and orchestrate the backbone, specifically with regard to a Maintenance Window on an item of the network. Any application that implements the orchestration of a network should therefore follow this specification to provide a Maintenance Mode feature.

##########
Motivation
##########

When a Network Operator plans, schedules and executes maintenance operations, there are activities that need to be accomplished before, during and after the Maintenance Window. The Network Orchestration Tool should provide features to help the Network Operator with those activities, such as:

- Disable user notifications: it's common to have service oscillation or flapping during the MW (e.g. links going up and down, switch reboots, ports up/down), and the user (Customer, Partner or Operator) doesn't want to be flooded with notifications during the MW (i.e. no e-mails, no SMS, no media alerts).

- List of services/users affected by the MW: it's important to have a clear view of who is going to be affected by the MW before it is even scheduled, so the Operator can send notifications to its customers/partners in advance.

- Move services away from items under MW: from the orchestration perspective it's essential to move all possible production services away from the item under MW, so the customer/partner has no outage. Furthermore, it may be useful not to allow Customer/Partner users to request new services that rely on items under MW. Any change/movement of services on the network due to items under MW should respect user requirements.

- Test Plan: a well-planned MW has a Test Plan to ensure that deployed changes meet service health and customer expectations. From the Network Orchestration perspective, it's important to have ways to test the items under MW to ensure they are working properly. Thus the Operator may request the creation of services using the items under MW just to run his/her Test Plan.

#############
Specification
#############

1. Admin requests a MW on a LINK, UNI or SWITCH for a specific period of time
2. The orchestration tool should disable user notifications for the provided item
3. When the MW begins, the orchestration tool should run a set of steps as detailed below:

3.1. For a Link: take all the Services that use the link under MW as part of the path (either primary or backup) and apply the following table to each of them:

+------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+
|                  | Dynamic Path                                                                                                                                                                                                                                | Static Path                                                           |
+------------------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+                                                                       |
|                  | No User requirements                                      | With User Requirements                                                                                                                                                          |                                                                       |
+------------------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------+
| Has alternative  | 3.1.1) Find a new path                                    | 3.1.1) Find a new Path                                                                                                                                                          | Move the service to the alternative Route                             |
| route            |                                                           |                                                                                                                                                                                 | ** Pay attention to check if the MW affects                           |
|                  | 3.1.2) move the service to the new path When MW Ends: 5.A | 3.1.2) Move the service to the new path                                                                                                                                         | all static path (primary and backup)                                  |
|                  |                                                           |                                                                                                                                                                                 |                                                                       |
|                  |                                                           | When MW Ends: 5.A                                                                                                                                                               | When MW Ends: 5.C                                                     |
+------------------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+-----------------------------------+
| No alternative   | 3.1.1) ask for user confirmation                          | 3.1.1) No alternative Route that fulfils the User Requirements. So, check for the user provided configuration during provisioning about "open mind" or not to find a new path:  | User is not "open mind" or        | There is an alternative physical  |
|  route           |                                                           |                                                                                                                                                                                 | there is no alternative physical  | path and User is "open mind"      |
|                  | 3.1.2) disable service during the MW                      |                                                                                                                                                                                 | path                              |                                   |
|                  |                                                           | - if user is "open mind", then: i) find a new path; ii) move the circuit; iii) When MW                                                                                          |                                   | 3.1.1) Find a new path            |
|                  | When MW Ends: B                                           | Ends: C                                                                                                                                                                         | 3.1.1) ask for confirmation       |                                   |
|                  |                                                           |                                                                                                                                                                                 |                                   | 3.1.2) Move the service           |
|                  |                                                           | - if user is not "open mind", then: i) ask for confirmation; ii) disable the circuit; iii)                                                                                      |                                   |                                   |
|                  |                                                           | When MW Ends: B                                                                                                                                                                 | 3.1.2) disable the service        | When MW Ends: C                   |
|                  |                                                           |                                                                                                                                                                                 |                                   |                                   |
|                  |                                                           | 3.1.2) No Alternative Route because there is no PATH:                                                                                                                           |                                   |                                   |
|                  |                                                           | i) Ask confirmation; ii) disable service;                                                                                                                                       | When MW Ends: B                   |                                   |
|                  |                                                           | iii) When MW Ends: B                                                                                                                                                            |                                   |                                   |
+------------------+-----------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------+-----------------------------------+

.. The table above was generated using https://www.tablesgenerator.com/text_tables  (see saved table on ./static/table/ep001-table1.tgn)

3.2. For a UNI: disable the service during the MW. When MW Ends: 5.B

3.3. For a Switch: 

3.3.1. take all the Services whose UNIs are on the switch and apply the same as step 3.2

3.3.2. take all the Services whose path (either primary or backup) passes through the switch under MW and apply the same logic as 3.1 (pay attention to remove all the links connected to the switch under MW before finding a new path, i.e. do not consider any link on the switch under MW as an alternative path)

4. Testing Phase: The Operator should be able to create services using the items under MW to run tests and validate the maintenance activities
5. When the MW ends, or when the Operator explicitly asks for the end of MW, the orchestration tool should run a set of steps as detailed below:

- 5.A - Leave the service as it is currently 
- 5.B - Enable the service 
- 5.C - Restore the service AS IT WAS BEFORE the MW (i.e. it should use the "saved setup" from before the MW and not ask path_finder to find a path)
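
The table in 3.1 and the end-of-MW actions 5.A to 5.C can be read as a small decision function. The sketch below is one possible reading, not an implementation taken from any NApp: the ``Service`` fields and ``plan_link_mw`` are hypothetical names, and it assumes the bare "B" and "C" in the lower table rows mean 5.B and 5.C.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Service:
    """Hypothetical view of one service that crosses a link under MW."""
    dynamic_path: bool           # True: dynamic path, False: static path
    has_requirements: bool       # user requirements (bandwidth, delay, ...)
    has_alternative_route: bool  # table row "Has alternative route"
    physical_path_exists: bool   # a path exists even if it breaks requirements
    open_mind: bool = True       # PA3: flexible backup path allowed by default


MOVE = ["find a new path", "move the service"]
DISABLE = ["ask for confirmation", "disable the service"]


def plan_link_mw(svc: Service) -> Tuple[List[str], str]:
    """Return (steps when the MW begins, action when the MW ends)."""
    if svc.has_alternative_route:
        if svc.dynamic_path:
            return MOVE, "5.A"
        return ["move the service to the alternative route"], "5.C"
    # From here on there is no alternative route.
    if svc.dynamic_path and not svc.has_requirements:
        return DISABLE, "5.B"
    if svc.physical_path_exists and svc.open_mind:
        # Flexible backup path: connectivity without meeting requirements.
        return MOVE, "5.C"
    return DISABLE, "5.B"
```

For example, a dynamic service with user requirements, no compliant alternative route, an existing physical path and an "open mind" user is moved at MW start and restored from the saved setup (5.C) at MW end.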

Points of Attention (PA):

- PA1. It should be possible to generate a report of Services and Users that will be affected by a future MW. The report should take into consideration items under MW mode in primary or backup PATH. For instance, if the MW will affect LinkA but LinkA is a primary path for EVC 1 and the only backup path for EVC 2, then the report should issue a warning about EVC 1 and EVC 2. It should not be necessary to create a MW to accomplish this.
- PA2. Every action should be logged and reported at the end of the MW
- PA3. The services available on the Orchestration Tool (e.g. MEF e-Line) should have a user configuration option to allow or disallow a flexible backup path (a.k.a. open mind user), with the default being to allow a flexible backup path.
- PA4. The orchestration tool should be able to support multiple MWs at the same time (e.g. two links, many UNIs, etc)
- PA5. The orchestration tool should be able to support multiple items under MW in the same operation (e.g. a link, a switch and many UNIs).
- PA6. When scheduling a new MW, the orchestration tool should take into consideration other scheduled MWs and how the topology of the network is supposed to be at that time in order to verify alternative routes and affected services by that new MW. For example, the orchestration tool may picture the future network topology *without links under the MW already scheduled* and, using that future topology, check how the new MW will affect services. Thus, if a scheduled MW will affect a Link A or a Switch X, that Link A or Switch X should not be considered as part of an alternative route for the new MW being scheduled. The same logic should be applied for a report of possible affected Users/Services.
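
As an illustration of PA1 and PA6, the report could be computed from nothing more than the items each path uses. This is a minimal sketch under assumed data shapes: ``mw_impact_report`` and the ``primary``/``backups`` layout are hypothetical, and paths are reduced to sets of item ids.

```python
from typing import Dict, Iterable, List


def mw_impact_report(
    services: Dict[str, dict],
    new_mw_items: Iterable[str],
    scheduled_mw_items: Iterable[str] = (),
) -> Dict[str, List[str]]:
    """Return {service name: [warnings]} without creating a MW (PA1)."""
    new_items = set(new_mw_items)
    # PA6: items of already scheduled MWs cannot serve as alternative routes.
    unavailable = new_items | set(scheduled_mw_items)
    report: Dict[str, List[str]] = {}
    for name, paths in services.items():
        primary = set(paths["primary"])
        backups = [set(backup) for backup in paths.get("backups", [])]
        warnings = []
        if primary & new_items:
            warnings.append("primary path affected")
        if any(backup & new_items for backup in backups):
            warnings.append("backup path affected")
        if warnings and all(path & unavailable for path in [primary] + backups):
            warnings.append("no unaffected path left")
        if warnings:
            report[name] = warnings
    return report


services = {
    "EVC 1": {"primary": {"LinkA"}, "backups": [{"LinkB"}]},
    "EVC 2": {"primary": {"LinkC"}, "backups": [{"LinkA"}]},
    "EVC 3": {"primary": {"LinkD"}, "backups": []},
}
report = mw_impact_report(services, {"LinkA"})
```

With the PA1 example, both EVC 1 (primary) and EVC 2 (backup) receive a warning, while EVC 3 is not listed. Passing ``scheduled_mw_items={"LinkB"}`` additionally flags EVC 1 as having no unaffected path left.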

##############
Rejected Ideas
##############

[Why certain ideas that were brought while discussing this PEP were not ultimately pursued.]

###########
Open Issues
###########

[Any points that are still being decided/discussed.]

########
Glossary
########

- Maintenance - Maintenance activities are focused on facilitating
  repairs and upgrades -- for example, when equipment must be
  replaced, when a router needs a patch for an operating system
  image, a link needs to be repaired, or a customer is going to
  make some change on its side. Maintenance also involves corrective 
  and preventive measures to make the managed network run more 
  effectively, e.g., adjusting device configuration and parameters [rfc6291].

- Maintenance Window (MW) - time slot between the start of the maintenance
  and its end. Usually the Network Ops Team sends a notification to all
  users/customers/partners affected by the MW so that they are
  aware of whether the service will be available during that time slot.

- Item under MW - an item under MW is the network component/equipment
  that is going to be affected by the maintenance activities during 
  the time slot. Item under MW can be: UNI (User Network Interface), 
  Link or Switch.

- User requirements - Set of parameters required by the user when the 
  service was created: bandwidth, delay, localization (Atlantic, 
  Pacific, terrestrial / submarine), not shared with EVC XYZ, etc

- Fulfils user requirements - service provisioning is compliant
  with user requirements

- Dynamic Path - the user requested a circuit and specified only the
  end-points, no matter what path it is going to take (the
  orchestrator can select a dynamic path)

- Static Path - the user requested a circuit and specified the end-points
  as well as the path that should be taken (i.e. a static path was
  chosen by the user)

- Alternative Route - A physical path that does not share any item 
  under MW

- Flexible backup path - an alternative route that may not fulfil the
  user requirements, but at least offers connectivity.

- Open mind user - the user requested a service with a flexible backup
  path, i.e. the user is open to an alternative route in
  case the primary one is not available, even though the alternative
  route does not fulfil the user requirements.

- Disable a service - Remove all flows related with the service 

- Enable a Service - Install all flows related with the service 

- Saved setup - the setup of the orchestration tool saved as a 
  snapshot considering all circuits, requirements, paths (primary and 
  secondary), flow mods that should be running on the switches, and 
  all other important information 

- Network Operator - a person who administers the network and has
  knowledge and autonomy to decide how the network should behave

##########
References
##########

[A collection of URLs used as references.]

#########
Copyright
#########

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.

Ktmi commented 1 year ago

Here are my thoughts on refining the blueprint.

Formalizing the Blueprint

At this point, the main functionality of maintenance is complete, that being informing other NApps that a given set of items are undergoing maintenance. However, we still lack some features desired in the original blueprint, such as being able to see which services will be affected by maintenance windows or being able to execute test plans on devices under maintenance.

In order to get these features, we will need a centralized component to manage the interaction between services and maintenance notifications. What I propose is adding new functionality to topology for managing these interactions.

How To Implement

A new component should be added to topology, the ServicePlanner. The ServicePlanner will manage interactions between services and interruptions through ServicePlans and ServiceTaggers. ServicePlans specify the devices needed to satisfy a service, any backup plans that could also satisfy the service, as well as the service's compatibility with ServiceTags. ServiceTaggers mark devices with ServiceTags to indicate device conditions.

When a ServiceTagger is added, it is not immediately active. It can be queried for which ServicePlans it will affect if activated. Once activated, an event will be generated for all ServicePlans affected, and which backup plan they should switch to.

What does this gain us?

This implementation has a few key advantages. First, we would be able to see which services are affected by interruptions like maintenance windows, and this functionality could be extended to other types of interruptions like link flapping. Second, services can intentionally ignore various types of interruptions, allowing for executing test plans, such as creating an eline EVC over links undergoing maintenance. Third, it allows for quickly finding backup paths, as it's built into the process of checking compatibility.

@italovalcy and @viniarck What are your thoughts?

viniarck commented 1 year ago

@Ktmi, I'll leave it to @italovalcy to confirm requirements or provide further feedback, but here's my initial feedback:

1) #### being able to see which services will be affected by maintenance.

We can evolve topology to answer this question and start providing this functionality. However, to avoid diverging too much from how each NApp is already responsible for finding backup or failover paths during network convergence, let's try to keep topology as stateless as we can, in a request-response cycle. For example, topology could send (async) requests to all NApps that are capable of provisioning network services (mef_eline at the moment; we could take a list of services and also document which ones are supported), then parse all responses and send a response back. That way it could answer: if the given network items (switches, links, interfaces) were to become unavailable, which services would be impacted, with their respective owners (each provisioning service is expected to have ownership data; on mef_eline this field can be None/null though).
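
A minimal sketch of that stateless fan-out, assuming each service-provisioning NApp can be reached through an async callable (a stand-in for an HTTP request). `preview_impact` and `fake_mef_eline` are hypothetical names for illustration, not existing topology or mef_eline code, and ownership data is left out for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# A provider takes the network items and returns the impacted service names.
Preview = Callable[[dict], Awaitable[List[str]]]


async def preview_impact(
    providers: Dict[str, Preview], items: dict
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Query every provider concurrently and merge the answers."""
    names = list(providers)
    results = await asyncio.gather(
        *(providers[name](items) for name in names), return_exceptions=True
    )
    affected: Dict[str, List[str]] = {}
    errors: Dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            errors[name] = repr(result)  # provider failed or was unreachable
        else:
            affected[name] = result
    return affected, errors


async def fake_mef_eline(items: dict) -> List[str]:
    """Hypothetical provider: EVCs mapped to the links they use."""
    evcs = {"evc_1": {"link_a"}, "evc_2": {"link_b"}}
    down = set(items.get("links", []))
    return sorted(name for name, links in evcs.items() if links & down)
```

Calling `asyncio.run(preview_impact({"mef_eline": fake_mef_eline}, {"links": ["link_a"]}))` returns the impacted services per provider plus a separate map of providers that failed to answer, so one broken NApp does not hide the others' results.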

Let me know what you think or if you see another viable simpler way to solve this problem.

2) #### being able to execute test plans on devices under maintenance.

For any sort of mid-maintenance path/connectivity tests once the protocol layer is UP, I'd recommend conducting tests by submitting requests directly to whichever NApp is provisioning a service, but with static paths, and also optionally using NApps with sdn trace capability to verify the control plane or data plane. Would this approach solve what network operators need? Is there any simpler way we could go for without deviating much from what we currently have?

Since network operators will know which services will be impacted, on mef_eline for instance, they can either provision a new EVC to test what they need or edit an existing EVC accordingly. So we could adapt NApps (currently only mef_eline) to optionally allow provisioning a static EVC over a Path whose resources are considered either EntityStatus.UP or EntityStatus.DOWN (since the status_funcs outcome is expected to result in link status down).

3) #### and which backup plan they should switch to

All current (and future) services that provision resources should be able to converge accordingly, based on whatever requirements and/or constraints they were provisioned with. I don't think that services should be deactivated to force a failover. If a tool (an external tool or another NApp) tries to interfere with the control plane convergence, it will do a worse job, since it doesn't own the resources and so doesn't know how to provision or fail them over in an optimal way (or it would have to replicate code). For all of the "ask for confirmation" related requirements, I'd recommend discussing with network operators to provide a solution in a different way, such as a new endpoint to run queries to understand what will be impacted, followed by a call to the endpoint that will provision the MW. Then in the UI they could have this pre-confirmation if they want to pre-check the impact on services/users (or just directly use the endpoints provided).

4) #### disable/enable services

Will network operators also want the network items disabled during a maintenance? Do they also want the actual physical device's lower layer to be set down automatically to minimize up/down notifications?

If they want the devices' lower layer disabled and also want to be able to run tests mid maintenance, they'll need to enable the interfaces again, and we'd also have to implement sending PortMod.

5) ### 5. When the MW ends, or when the Operator explicitly asks for the end of MW, the orchestration tool should run a set of steps as detailed below:

When network operators realize a MW will impact a service that can't afford downtime, I think they could either pre-edit the services before the MW starts and make sure there is a failover or backup, or tolerate the expected downtime. This requirement of restoring a service assumed that this orchestration tool would manage the services' convergence paths. In most cases it would do a worse job if it tried to fight, or just add extra steps to, the convergence that NApps were designed to handle efficiently based on network events. There are also unrecoverable cases, since we never know whether whatever went into maintenance will recover from our southbound protocol perspective. Realistically, even if network operators always want to try to auto re-enable, if the hardware is damaged its active state shouldn't go up.

Heads up that mef_eline doesn't yet switch over to the primary path when it becomes available again if the failover or backup is still active and hasn't failed. Will network operators want services to switch over again to their primary paths? If they do, they will need to force an EVC redeploy to switch back to the primary path when it's available, unless we implement switchover policies for switching to the primary again once it's eventually available after some time.

6) #### PA6. When scheduling a new MW, the orchestration tool should take into consideration other scheduled MWs and how the topology of the network is supposed to be at that time in order to verify alternative routes and affected services by that new MW.

We already have this covered, since we agreed not to support overlapping MWs, and NApps will converge the control plane accordingly in this iteration. I don't think we need to model what a future topology will look like at this time to meet the intention behind what was asked. PA6 assumed that the convergence algorithm would be driven by an orchestration tool (or a centralized NApp), when in practice it won't be, based on how NApps and their responsibilities are currently designed. A MW performs a side effect in the network, the network will be updated accordingly, and any affected service like mef_eline will react: it's responsible for either using whatever it has precomputed as a failover or trying to dynamically find another available path. Each NApp that provisions a service knows best how to converge; trying to defer that responsibility to another NApp or an external orchestration tool will lead to complications.

Ktmi commented 1 year ago

Here's where I'm at with the blueprint:

########
Abstract
########

This blueprint details a mechanism for NApps providing services
to receive information on interruptions,
and provide information on the affected services back
to the source of the interruption.
Additionally, this blueprint details how NApps which produce
interruptions should signal them in order to take advantage
of this new mechanism.

##########
Motivation
##########

In order for NApps which produce service interruptions
to receive information on what services they are affecting,
there needs to be a two way communication channel between
the service provider and the interruption source.
The kytos event bus provides only one way communication,
making it unsuitable for this task.

#############
Specification
#############

---------------

Endpoint: ``POST topology/v3/preview_interruption``

Purpose:

- Previewing affected services if the specified interruption is added.

Expected Input:

.. code-block:: python3

    {
        "type": <interruption type str>,
        "switches": [<switch id>],
        "interfaces": [<interface id>],
        "links": [<link id>],
    }

Expected Response:

.. code-block:: python3

    {
        "type": <interruption type str>,
        "switches": [<switch id>],
        "interfaces": [<interface id>],
        "links": [<link id>],
        "affected_services": {
            <provider name str>: [<service name str>],
        },
    }

Side Effects:

- None

---------------

Endpoint: ``POST topology/v3/interruptions``

Purpose:

- Add in a new interruption

Expected Input:

.. code-block:: python3

    {
        "type": <interruption type str>,
        "switches": [<switch id>],
        "interfaces": [<interface id>],
        "links": [<link id>],
    }

Expected Response:

.. code-block:: python3

    {
        "id": <interruption id>,
        "type": <interruption type str>,
        "switches": [<switch id>],
        "interfaces": [<interface id>],
        "links": [<link id>],
        "affected_services": {
            <provider name str>: [<service name str>],
        },
    }

Side Effects:

- New interruption is stored to DB.
- All known service providers are notified of the new interruption.
- Targeted devices have the interruption type added to their `status_reason` set.

---------------

Endpoint: ``DELETE topology/v3/interruptions/<interruption id>``

Purpose:

- Remove an interruption

Expected Input:

.. code-block:: python3

    None

Expected Response:

.. code-block:: python3

    None

Side Effects:

- Interruption is removed from DB
- All known service providers are notified of the removal of the interruption.
- Targeted devices have the interruption type removed from their `status_reason` set,
  if no other interruption of the same type exists for that device.
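
The add/remove side effects above amount to bookkeeping per device: an interruption type stays in a device's `status_reason` set until the last interruption of that type covering the device is removed. A minimal in-memory sketch, assuming device ids are unique across switches, interfaces and links; `InterruptionTracker` is a hypothetical name and stands in for the DB-backed behaviour.

```python
import itertools
from collections import defaultdict
from typing import Dict, Set


class InterruptionTracker:
    """In-memory stand-in for the stored interruptions (assumption)."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self.interruptions: Dict[int, dict] = {}
        # device id -> interruption types currently applied to it
        self.status_reason: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def _devices(interruption: dict) -> Set[str]:
        keys = ("switches", "interfaces", "links")
        return {dev for key in keys for dev in interruption.get(key, [])}

    def add(self, interruption: dict) -> int:
        """Store the interruption and tag every targeted device."""
        new_id = next(self._ids)
        self.interruptions[new_id] = interruption
        for dev in self._devices(interruption):
            self.status_reason[dev].add(interruption["type"])
        return new_id

    def remove(self, interruption_id: int) -> None:
        """Drop the interruption; untag devices no longer covered by its type."""
        removed = self.interruptions.pop(interruption_id)
        for dev in self._devices(removed):
            still_tagged = any(
                other["type"] == removed["type"] and dev in self._devices(other)
                for other in self.interruptions.values()
            )
            if not still_tagged:
                self.status_reason[dev].discard(removed["type"])
```

If two `maintenance` interruptions both target the same link, removing one of them leaves the link tagged until the second is removed as well.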

---------------

Endpoint: ``POST topology/v3/service_providers``

Purpose:

- Add a new service provider to listen for interruptions

Expected Input:

.. code-block:: python3

    {
        "name": <provider name str>,
        "update_url": <url for updating>,
        "preview_url": <url for previewing>,
    }

Expected Response:

.. code-block:: python3

    {
        "id": <provider id>,
        "name": <provider name str>,
        "update_url": <url for updates>,
        "preview_url": <url for previews>,
    }

Side Effects:

- New service provider is stored to DB.
- Service provider is informed of all active interruptions.
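
The second side effect means a provider that registers late still learns about interruptions that started before it registered. A minimal sketch of that replay, where a plain callable stands in for the POST to the provider's update URL; `ProviderRegistry` and its methods are hypothetical names.

```python
from typing import Callable, Dict, List

Notify = Callable[[dict], None]  # stand-in for a POST to the update URL


class ProviderRegistry:
    """Keeps active interruptions and the providers to notify (assumption)."""

    def __init__(self) -> None:
        self.active: List[dict] = []
        self.providers: Dict[str, Notify] = {}

    def add_interruption(self, interruption: dict) -> None:
        """Record the interruption and notify all known providers."""
        self.active.append(interruption)
        for notify in self.providers.values():
            notify(interruption)

    def register(self, name: str, notify: Notify) -> None:
        """Add a provider and replay every interruption that is still active."""
        self.providers[name] = notify
        for interruption in self.active:
            notify(interruption)
```

Replaying on registration keeps providers consistent regardless of NApp load order, at the cost of providers having to tolerate receiving an interruption they may already know about.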

---------------

Endpoint: ``POST <url for previewing>``

Purpose:

- Previewing affected services if the specified interruption is added.

Expected Input:

.. code-block:: python3

    {
        "type": <interruption type str>,
        "switches": [<switch id>],
        "interfaces": [<interface id>],
        "links": [<link id>],
    }

Expected Output:

.. code-block:: python3

    [<service name str>]

Side Effects:

- None

---------------

Endpoint: ``POST <url for updating>``

Purpose:

- Informing service providers of an added interruption, and viewing the affected services.

Expected Input:

.. code-block:: python3

    {
        "type": <interruption type str>,
        "switches": [<switch id>],
        "interfaces": [<interface id>],
        "links": [<link id>],
    }

Expected Output:

.. code-block:: python3

    [<service name str>]

Side Effects:

- Services are informed of interruption starting.

---------------

Endpoint: ``DELETE <url for updating>``

Purpose:

- Informing service providers of a removed interruption.

Expected Input:

.. code-block:: python3

    {
        "type": <interruption type str>,
        "switches": [<switch id>],
        "interfaces": [<interface id>],
        "links": [<link id>],
    }

Expected Output:

.. code-block:: python3

    None

Side Effects:

- Services are informed of the end of an interruption.

I'm also making an implementation in topology matching the blueprint.