networkservicemesh / sdk

Apache License 2.0
35 stars 35 forks source link

Destructive Chaos Testing: Refactor healing #950

Open Bolodya1997 opened 3 years ago

Bolodya1997 commented 3 years ago

Structure

  1. Healing overview
  2. Chains
    1. Networkservice healing
      1. Connect
      2. Monitor
      3. Heal
    2. NS, NSE Registry healing
      1. Connect
      2. Heal

1. Healing overview

There are 2 types of healing we will cover in this issue:

  1. Restore - remote side is not dead, it is just restarting, so we are waiting for it and reconnecting again.
  2. Heal - remote side is dead, so we need to select a new one.

And 2 models of application we will cover in this issue:

  1. Pure client - common NSC, NSE applications that are leftmost clients in networkservice/registry chains.
  2. Server-client - NSMgr, Forwarder, passthrough NSE, Registry, ... applications that are both a server and a client in networkservice/registry chains.
Restore Heal
Pure client + -
Server-client -(?) +

2. Chains

2.1. Networkservice healing

We have different cases for networkservice healing:

  1. Healing in NSC - pure client case, we have no heal server, connect server; in case of monitor client break/event received path, endpoint name should be cleaned up and a new Connection should be requested.
    c := new(networkservice.NetworkServiceClient)
    *c = chain.NewNetworkServiceClient( 
    connect.NewClient(ctx, connectTo,
      connect.WithAdditionalFunctionality(
         ...,
         heal.NewClient(ctx, c),
         monitor.NewClient(ctx)
      ),
      connect.WithDialOptions(dialOptions...),
    ),
    )
  2. Healing in local NSMgr - in case of monitor client break/event received, if index == 1, should be the same as [1]; in the other way - close the connection.
    mgr := new(nsmgrServer)
    mgr.Endpoint = endpoint.NewServer(ctx, tokenGenerator,
    ...,
    endpoint.WithAdditionalFunctionality(
      ...,
      heal.NewServer(ctx, mgr),
      connect.NewServer(ctx,
         connect.WithAdditionalFunctionality(
            ...,
            monitor.NewClient(ctx)
         ),
         connect.WithDialOptions(dialOptions...),
      ),
    ),
    )
  3. Healing in Forwarder - in case of monitor client break just do nothing; in case of event received do the same as [2].
      ...,
      heal.NewServer(ctx, forwarder,
         heal.WithIgnoreMonitorBreak(true)),
      ...,
  4. Healing in rest applications - just the same as [2], but here we always have index != 1.

2.1.1. Connect

package connect

type connectOptions struct {
   additionalFunctionality []networkservice.NetworkServiceClient
   dialOptions []grpc.DialOption
   dialTimeout time.Duration // [=100ms]
}

// Creates a chain of [connectClient, additionalFunctionality, grpcClient].
// connectClient:
// * on Request - takes existing (or create a new if none exists) *grpc.ClientConn and injects it into the
//                Request context.
// * on Close - the same if exists, does nothing if not exists.
// * on *grpc.ClientConn create - starts listening for cc.WaitForStateChange, removes it on
//                                connectivity.Shutdown.
// grpcClient:
// * on Request - receives *grpc.ClientConn from the Request context, creates
//                networkservice.NewNetworkServiceClient and makes a Request.
// * on Close - the same.
func NewClient(ctx context.Context, connectTo *url.Url, options ...Option) {...}

// Handles refcount map of clients for different remote URLs.
// * on Request - receives clientURL from the Request context;
//                checks if there is already existing client for the connection:
//                * exists - checks if client URL matches clientURL:
//                  * matches - selects this client for Request;
//                  * doesn't match - sends Close to the existing client;
//                                    closes existing client & decrements refcount;
//                                    goes to the `not exists` case;
//                * not exists - checks if there is already existing client for the clientURL:
//                  * yes - increments refcount for this client;
//                          selects this client for Request;
//                  * no - creates a new client with NewClient;
//                         selects this client for Request;
//                performs Request with the selected client:
//                * on error on first Request for the connection - closes the client;
//                * on error on refresh Request for the connection - returns the error;
//                * on success - updates Request with the received Connection;
//                               requests next.
// * on Close - checks if there is already existing client for the connection:
//              * exists - sends Close to the existing client;
//                         closes this client & decrements refcount;
//              * not exists - does nothing;
//              closes next.
func NewServer(ctx context.Context, options ...Option) {...}

// Returns injected *grpc.ClientConn from context. Only applicable in connect client chain Request/Close
// context.
func CCFromContext(ctx context.Context) *grpc.ClientConn {...}

2.1.2. Monitor

package monitor

// * on Request - if not exists, creates a new networkservice.MonitorConnectionClient using *grpc.ClientConn
//                from the Request context;
//                stores the connection;
//                sends a Requests and waits for initial/update event for it;
//                returns error if no corresponding event was received for the connection.
// * on Close - deletes stored connection.
// * on initial/update event - updates stored connection.
// * on delete event - invokes delete event handler from the Request context for the connection.
// * on monitor break - invokes break event handler from the Request context for all stored connections and
//                      *grpc.ClientConn.
func NewClient(ctx context.Context) {...}

// Delete event handler type.
type DeleteEventHandlerFunc func(conn *networkservice.Connection)

// Monitor break handler type.
type BreakHandlerFunc func(cc *grpc.ClientConn, conns []*networkservice.Connection)

// Injects delete event handler into the context.
func WithDeleteEventHandler(parent context.Context, handler DeleteEventHandlerFunc ) context.Context {...}

// Injects monitor break handler into the context.
func WithBreakHandler(parent context.Context, handler BreakHandlerFunc) context.Context {...}

2.1.3. Heal

package heal

type healOptions struct {
   ignoreMonitorBreak bool // [=false]
}

// * on Request - injects event handler, break handler into the Request context;
//                performs a Request;
//                stores/updates Request mechanism preferences;
//                cancels heal process if any exists.
// * on Close - cancels heal process if any exists.
// * on event - cleans up path, endpoint name and selected mechanism;
//              asynchronously requests given onHeal with passed conn.
// * on break - closes passed *grpc.ClientConn;
//              same as on event for all passed conns.
func NewClient(ctx context.Context) {...}

// * on Request - same as client.
// * on Close - same as client.
// * on event - checks path.Index:
//              * <= 1 - same as client;
//              * > 2 - asynchronously closes given onHeal with passed conn.
// * on break - closes passed *grpc.ClientConn;
//              same as on event, but if ignoreMonitorBreak == true, does nothing.
func NewServer(ctx context.Context) {...}

2.2. NS, NSE registry healing

We have different cases for NS, NSE registry healing:

  1. Healing in NSE - pure client case, we have no heal server, connect server; in case of watching find break we just should try to refresh the existing registration.
c := new(registry.NetworkServiceRegistryClient)
*c = chain.NewNetworkServiceRegistryClient(
   connect.NewNetworkServiceRegistryClient(ctx, connectTo,
      connect.WithNSAdditionalFunctionality(
         ...,
         heal.NewNetworkServiceRegistryClient(ctx, c),
      ),
      connect.WithDialOptions(dialOptions...),
   ),
)
  1. Healing in rest applications - just pass watching find break down to the NSE.
nsServer := chain.NewNetworkServiceRegistryServer(
   ...,
   connect.NewNetworkServiceRegistryServer(ctx,
      connect.WithDialOptions(dialOptions...)
   ),
)

2.2.1. Connect

package connect

type connectOptions struct {
   nsAdditionalFunctionality []registry.NetworkServiceRegistryClient
   nseAdditionalFunctionality []registry.NetworkServiceEndpointRegistryClient
   dialOptions []grpc.DialOption
}

// Creates a chain of [connectClient, nsAdditionalFunctionality, grpcClient].
// connectClient:
// * on Register - takes existing (or create a new if none exists) *grpc.ClientConn and injects it into the
//                 Register context.
// * on Find - the same.
// * on Unregister - the same.
// * on *grpc.ClientConn create - starts listening for cc.WaitForStateChange, removes it on
//                                connectivity.Shutdown.
// grpcClient:
// * on Register - receives *grpc.ClientConn from the Register context, creates
//                 registry.NewNetworkServiceRegistryClient and makes a Register.
// * on Find - the same.
// * on Unregister - the same.
func NewNetworkServiceRegistryClient(ctx context.Context, connectTo *url.Url, options ...Option) {...}

// Handles refcount map of clients for different remote URLs.
// * on Register - receives clientURL from the Register context;
//                 checks if there is already existing client for the service:
//                 * exists - checks if client URL matches clientURL:
//                   * matches - selects this client for Register;
//                   * doesn't match - closes existing client & decrements refcount;
//                                     goes to the `not exists` case;
//                 * not exists - checks if there is already existing client for the clientURL:
//                   * yes - increments refcount for this client;
//                           selects this client for Register;
//                   * no - creates a new client with NewNetworkServiceRegistryClient;
//                          selects this client for Register;
//                 performs Register with the selected client:
//                 * on error on first Register for the service - closes the client;
//                 * on error on refresh Register for the connection - returns the error;
//                 * on success - returns received service.
// * on Find - the same as for Register without `exists` branch;
//             closes created client & decrements refcount.
// * on Unregister - the same as for Register;
//                   closes selected client & decrements refcount.
func NewNetworkServiceRegistryServer(ctx context.Context, options ...Option) {...}

// Returns injected *grpc.ClientConn from context. Only applicable in connect client chain
// Register/Find/Unregister context.
func CCFromContext(ctx context.Context) *grpc.ClientConn {...}

2.2.2. Heal

package heal

// * on Request - if not exists, creates a new networkservice.MonitorConnectionClient using *grpc.ClientConn
//                from the Request context;
//                stores the connection;
//                sends a Requests and waits for initial/update event for it;
//                returns error if no corresponding event was received for the connection.
// * on Close - deletes stored connection.
// * on initial/update event - updates stored connection.
// * on delete event - invokes delete event handler from the Request context for the connection.
// * on monitor break - invokes break event handler from the Request context for all stored connections and
//                      *grpc.ClientConn.

// * on Register - stores the service;
//                 cancels heal process if any exists;
//                 if not exists, creates a new registry.NetworkServiceRegistry_FindClient using *grpc.ClientConn
//                 from the Register context;
//                 performs a Register.
// * on Find - if watching:
//             * yes - creates find client, performing onHeal.Find() on next find client failure;
//             * no - just calls next.Find().
// * on Unregister - cancels heal process if any exists.
// * on watching find break - closes passed *grpc.ClientConn;
//                            asynchronously calls onHeal.Register() with all stored services.
func NewNetworkServiceClient(ctx context.Context) {...}
Bolodya1997 commented 3 years ago

From my point of view there is a huge problem with healing - it is not fully covered with tests and so it can be difficult to rework it from the existing solution to a new one without creating new hidden bugs.

We already have:

It is pretty clear, that we need to make some rework both for the Networkservice and Registry healing, because it is not good to use server subchains inside of the client chain. But if we will start this rework right now, there still will be a problem that we don't have well-working tests covering this.

I suppose that it would be better to do all of this in the following way:

  1. Create a simple solution for the Client healing and cover it with tests - these tests will be reused in further heal client rework.
  2. Finish NS, NSE registry healing based on connect.Server - further it will be reworked with the connect.Client in the same time with the Networkservice healing.
  3. Increase coverage by sandbox tests, k8s tests.
  4. Rework connect server + client.
  5. Rework heal server + client.
Bolodya1997 commented 3 years ago

@edwarnicke @denis-tingaikin Thoughts?

edwarnicke commented 3 years ago

@Bolodya1997 I presume you are suggesting:

  1. Create a simple solution for the Client healing and cover it with tests - these tests will be reused in further heal client rework.
  2. Finish NS, NSE registry healing based on connect.Server - further it will be reworked with the connect.Client in the same time with the Networkservice healing.
  3. Increase coverage by sandbox tests, k8s tests.

For the 1.0 Release and

  1. Rework connect server + client.
  2. Rework heal server + client.

For post NSM 1.0?

edwarnicke commented 3 years ago

@Bolodya1997 One other thing. At this stage, start with system level tests on K8s before you add more sandbox tests. It will help you uncover systemic issues. Those systemic issues may require changes that either:

  1. Are not in the healing code, and can proceed in parallel to writing more sandbox unit tests
  2. Are actually in the healing code at a design level. In which case your sandbox tests would have simply been burning time into locking in those design bugs, as well as making them harder to correct.
Bolodya1997 commented 3 years ago

@denis-tingaikin @edwarnicke I have updated the healing scheme, according to the last pre-release changes and ideas. Please feel free to share your thoughts and comment on this :)

Bolodya1997 commented 3 years ago

@denis-tingaikin @edwarnicke Few schemes how it should work in local case:

Update: There was a mistake in the old scheme, when heal client / heal server receiving monitor break didn't call gRPC close connection. Fixed.

Request

image

Response

image

Refresh

image

NSE healing

image image image image image

NSMgr healing

image image image