wasmCloud / wascc-host

Library for hosting actors and capability providers in a host process
Apache License 2.0

Rationalize the nature of bindings in the context of the lattice #83

Closed by autodidaddict 4 years ago

autodidaddict commented 4 years ago

The original concept of bindings was created for an isolated, single-process system that had no knowledge of a lattice. Right now, the way we're handling bindings isn't quite in accordance with the kind of functionality we want to support in the lattice. There are a number of things that we need to do in order to bring bindings up to par with lattice functionality:

Acceptance Criteria

To verify that this is working as planned:

autodidaddict commented 4 years ago

Thoughts on the new flow:

  1. During the set_binding call
    1. If the binding differs from what the bus already sees as the existing binding, the host will tell the bus to publish a BindingSet event (which will trigger cache updates in all listening hosts)
    2. If it differs, the bus will also deliver the OP_BIND_ACTOR message to the named provider instance
  2. During the remove_binding call
    1. Send the OP_REMOVE_ACTOR message to the named provider instance
    2. Publish a BindingRemoved event on the bus (which will trigger cache updates in all listening hosts)
  3. When an actor is removed from a host
    1. Unsubscribe all of that actor's active bus subscriptions, remove local claims, etc.
  4. When a capability provider is removed from a host
    1. Unsubscribe all of that provider's active subscriptions
  5. When an actor is added to a host
    1. Subscribe to all actor-provider private topics based on the bus's awareness of existing bindings
  6. When a capability provider is added to a host
    1. Receive an OP_BIND_ACTOR message (locally!) for each of the bindings known for that provider. (Thoughts: I worry that bypassing the normal bus mechanisms to attempt local-only delivery of messages could ultimately be a source of skew over time and/or split-brain syndrome. It might be a bigger burden on providers, but it might also be far easier to simply require providers to enforce idempotency on all binding calls - e.g. politely do nothing for redundant binds. This would let 3 instances of the same provider ignore the dupe message, while a 4th, newly reconstituted provider would see it as a new binding and provision the appropriate resources.)
  7. The bus will always be listening for BindingSet and BindingRemoved events, which it will use to maintain a local cache of known bindings, which is what will be used to respond to inventory queries made by lattice clients or lattice members attempting to restore binding data to restarting actors or providers.
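The cache maintenance in step 7 could look roughly like this — a minimal sketch with hypothetical event and key shapes, not the actual wascc-host types:

```rust
use std::collections::HashSet;

// Hypothetical bus events; real payloads would carry signed claims, config, etc.
enum BusEvent {
    BindingSet { actor: String, capability: String, binding: String },
    BindingRemoved { actor: String, capability: String, binding: String },
}

// Local cache of known bindings, maintained by every listening host and
// used to answer inventory queries from lattice clients or members.
#[derive(Default)]
struct BindingCache {
    known: HashSet<(String, String, String)>,
}

impl BindingCache {
    // Fold a bus event into the cache; safe to replay (idempotent both ways).
    fn apply(&mut self, event: &BusEvent) {
        match event {
            BusEvent::BindingSet { actor, capability, binding } => {
                self.known
                    .insert((actor.clone(), capability.clone(), binding.clone()));
            }
            BusEvent::BindingRemoved { actor, capability, binding } => {
                self.known
                    .remove(&(actor.clone(), capability.clone(), binding.clone()));
            }
        }
    }

    fn len(&self) -> usize {
        self.known.len()
    }
}
```

Because applying the same event twice is a no-op, a host that replays the event stream after a restart converges on the same cache.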

This can obviously create scenarios where actors will get timeout failures when attempting to communicate with non-existent providers even though their bindings exist (the old system would've removed the binding, so the RPC call would fail immediately due to lookup failure). I think I'm okay with that, as some other entity should be able to monitor the system and attempt to ensure that there are always enough provider instances for the actors, etc.

autodidaddict commented 4 years ago

@ewbankkit @rylev @bacongobbler @brooksmtownsend any thoughts on this? I'm looking for edge cases where this new world where the lattice maintains a distributed cache of known bindings falls down.

autodidaddict commented 4 years ago

:thinking: I might be overthinking this entire thing. If providers can be trusted to safely ignore "re-bind" operations, then we might be able to boil this down to:

  1. When an actor starts, publish OP_BIND_ACTOR for all its existing bindings
  2. When a capability provider starts, publish OP_BIND_ACTOR for all its existing bindings
  3. When a binding is set, publish the BindingSet event and OP_BIND_ACTOR for the new binding
  4. When a binding is removed, publish BindingRemoved event and OP_REMOVE_ACTOR to all matching providers

:thinking:
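This simplification leans entirely on providers treating re-binds as no-ops. A sketch of what that enforcement could look like on the provider side (hypothetical names, not the real provider trait):

```rust
use std::collections::HashSet;

// Hypothetical provider-side state; a real provider would also provision
// resources (HTTP listeners, broker subscriptions, etc.) on a new bind.
#[derive(Default)]
struct ProviderBindings {
    bound: HashSet<(String, String)>, // (actor public key, binding name)
}

impl ProviderBindings {
    // Handle OP_BIND_ACTOR. Returns true only when the binding is new,
    // so redundant re-binds (e.g. replayed at provider startup) politely do nothing.
    fn handle_bind_actor(&mut self, actor: &str, binding: &str) -> bool {
        self.bound.insert((actor.to_string(), binding.to_string()))
    }

    // Handle OP_REMOVE_ACTOR; removing an unknown binding is also a no-op.
    fn handle_remove_actor(&mut self, actor: &str, binding: &str) -> bool {
        self.bound.remove(&(actor.to_string(), binding.to_string()))
    }
}
```

With this contract, 3 existing instances of a provider can ignore a replayed bind while a 4th, freshly started instance treats it as new and provisions its resources.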

autodidaddict commented 4 years ago

Further thoughts: what are the tradeoffs between having a "binding service" that each host queries in order to get updated binding data versus having each host maintain a cache? Off the top of my head, the points that bug me most:

  1. For a lattice with hundreds of actors and hundreds of capability bindings, that data overhead needs to be maintained by all of the lattice hosts. It's probably not more than 100KB of consumption per host, but it still could be considered wasteful.

  2. For portions of the lattice that exist on the edge or at endpoints, like a Raspberry Pi deployed in the field, that device needs to maintain pretty reliable and constant contact with the lattice in order to function properly. If it simply queried the lattice for bindings, and the closest binding service responded to the query, then the host on that device would need only concern itself with bindings that are immediately relevant to it.

  3. A binding service could be a SPOF (single point of failure), but running on a lattice we could deploy multiple binding services in multiple leaf cells to reduce traffic and increase resiliency

  4. If a host only queries binding data when it is necessary for either an actor or a provider being loaded into the host, then its memory only ever holds the configuration relevant to the providers running in-process. In the "distributed cache" model, all hosts contain all data, so compromising the memory space of a host in that model has a much bigger blast radius.

? :thinking:
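Points 2 and 4 amount to a pull model: a host asks a binding service only for the slice of bindings relevant to what it is about to load. A minimal sketch (hypothetical record shape, not a real wascc-host type):

```rust
// Hypothetical binding record as a binding service might store it.
#[derive(Clone, PartialEq, Debug)]
struct Binding {
    actor: String,
    capability: String, // e.g. "wascc:http_server"
    binding_name: String,
}

// A host loading one provider queries for just that provider's slice,
// instead of caching the entire lattice's binding table in memory.
fn bindings_for_provider(all: &[Binding], capability: &str, binding_name: &str) -> Vec<Binding> {
    all.iter()
        .filter(|b| b.capability == capability && b.binding_name == binding_name)
        .cloned()
        .collect()
}
```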

autodidaddict commented 4 years ago

Even further thoughts. In a scaled situation, we can conceivably have two instances of the same provider running in the lattice. If this is an HTTP provider, and we have 2 different actors, we need to be able to tell both of those instances to spin up the appropriate resources for each of the unique actors. In other words, actors must be able to scale on their own, on demand, and providers must be able to scale on their own, on demand. When a provider scales, it must be able to accept the same binding information that other instances previously accepted.

I think a potential alternative to the various highly complex solutions in the previous comments is to go the auction route: to establish a binding between a group of n actor instances and y provider instances, we hold an auction. The instigator (e.g. a lattice client or the host API) publishes the auction request for the binding, and the first host to respond affirmatively will be issued a control command to establish that binding (this will ultimately result in the OP_BIND_ACTOR invocation being performed on the provider residing in that host). After the first auction, the binding will be only partially applied, because not all of the y provider instances are bound. To reach a fully bound state, the binding auction can take place y-1 more times, until all of the available providers have accepted the binding.
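A toy model of the repeated auction (all names hypothetical; real auctions would go over the bus with timeouts and signed responses): each round, the first host that has not yet accepted wins and is issued the control command, so after y rounds the binding is fully applied:

```rust
use std::collections::HashSet;

// Toy auction loop over the hosts that run an instance of the target
// provider. Returns the order in which hosts won a round (i.e. the order
// in which OP_BIND_ACTOR would be issued).
fn run_binding_auctions(provider_hosts: &[&str]) -> Vec<String> {
    let mut bound: HashSet<String> = HashSet::new();
    let mut order = Vec::new();
    // One auction per provider instance: y hosts -> y rounds total.
    for _round in 0..provider_hosts.len() {
        // "First affirmative response" is simulated as the first host
        // that has not yet accepted this binding.
        if let Some(winner) = provider_hosts.iter().copied().find(|h| !bound.contains(*h)) {
            bound.insert(winner.to_string());
            order.push(winner.to_string()); // issue the control command here
        }
    }
    order
}
```

The useful property is that a host which already accepted simply declines the next round, so the loop terminates once every instance is bound.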

Some concrete examples:

Some benefits:

Potential problems:

autodidaddict commented 4 years ago

New proposal providing what could be a more stable, iterative foundation:

| Action | Host Impact | Lattice/Local Bus Impact |
| --- | --- | --- |
| API `add_actor` | Module loaded, listener thread started | Subscribes to actor topic |
| API `remove_actor` | Module removed, thread terminated | Unsubscribes from actor topic. If the actor being removed is the last of its instances in the lattice, it will call OP_REMOVE_ACTOR to unbind from the cap provider |
| API `add_capability` | Plugin loaded, listener thread started | Subscribes to capability main topic. Queries bus for any existing bindings and re-subscribes to those topics |
| API `remove_capability` | Plugin unloaded, thread terminated | Unsubscribes from all cap topics |
| API `set_binding` | None | OP_BIND_ACTOR invoked on ALL matching caps in lattice (not random via queue subscribe) |
| API `remove_binding` | None | OP_REMOVE_ACTOR invoked on ALL matching caps in lattice (not random via queue subscribe) |
| Lattice schedule actor | None | Auction held, actor bytes downloaded from Gantry, actor started. No effect on bindings |
| Lattice stop actor | None | Specific host is told to terminate an actor. Identical to host's `remove_actor`. Only impacts a single instance of an actor |
| Lattice set binding | None | OP_BIND_ACTOR invoked on ALL matching caps in lattice. Identical to host API call |
| Lattice remove binding | None | OP_REMOVE_ACTOR invoked on ALL matching caps in lattice. Identical to host API call |
| Lattice add capability | N/A | Unsupported until Gantry supports the storage/retrieval of cap providers |

This should produce the following high-level behaviors:

Multiplicity of Bindings

In the above-described feature, bindings will expand to fill the space they are given. If you are running 9 instances of a single named capability provider (e.g. `wascc:http_server,default` or `wascc:messaging,foobar`), then every bound actor group will be bound to all 9 instances of that provider. If you run 3 actor groups that all need bindings to the default message broker, each of those 3 groups will be bound to each of the 9 instances of the provider, and you cannot sub-divide by giving more or fewer instances to specific actor groups.

This can have consequences developers need to be aware of. For example, if you bind an actor group to a message broker provider that is using a straight subscription rather than a queue subscription, then each message from a subscription will be delivered to a random actor within the group n times, once for each running provider instance. If you don't want duplicates, you'll need to use a queue subscription.
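That duplicate-delivery math can be made concrete with a toy counting model (not the actual broker behavior, just the arithmetic): with n provider instances on a straight subscription the actor group sees each published message n times, while a queue subscription collapses it to one delivery:

```rust
// Toy counting model: how many times one published message reaches
// the bound actor group, given n running provider instances.
fn deliveries_per_message(provider_instances: usize, queue_subscribe: bool) -> usize {
    if queue_subscribe {
        1 // the broker delivers to exactly one queue-group member
    } else {
        provider_instances // every provider instance forwards its own copy
    }
}
```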