InDieTasten commented 4 years ago

Problem set

As of right now, the samples only include deployment settings and implementations, that would require downtime for scaling in or out. Furthermore, no matter the durability or reliability levels of the cluster/VMs in the service fabric, calls would be dropped in case of cluster upgrades.

Service instances are deployed using "-1" node count (singleton per node) and exposed via load balancer individually. Even the signaling is exposed individually in a way, that the load balancer isn't load balancing anything anymore. It's just firewall at this point. Distributing calls evenly across the cluster is currently not a concern covered in the samples. It is currently up to the consumer to decide, and the consumer needs to know about all ports.

It is very sad to see, that service fabric is promoted so heavily in the samples, but it's being used completely wrong such that all the benefits that you could get are thrown out the window. Availability and scalability are not considered at all at the moment.

Goal

I want to propose the creation of an additional local media bot, that showcases how to create a bot that can

be in a call for an extended period of time while receiving and sending local media
load balance calls across nodes with a single signaling endpoint
perform rolling updates to the cluster or bot services themselves without "dropping" calls
self-heal, if a node goes down
fully utilize the cluster when healthy
fully utilize healthy nodes, when nodes are unhealthy (as long as quorum is given obviously)

Concept

A smart stateful Reliable Service with reliable meta data for storing call parameters needed to rejoin calls that get lost via node failure, cluster operations or service upgrades.

Lifecycle

First of all, the bot service instances need to be declared stateful. Having an active call list is state and throwing away calls is not acceptable. It is not possible to serialize the state of calls into something like a reliable collection. But all meta data required to rejoin the call can be serialized and stored in a reliable collection.

Call replication

When there are multiple replicas of a call metadata list partition, one of the replicas will always be primary. This primary instance is responsible for sending and receiving media. If a primary is demoted to secondary, they close their connections and let the new primary take over. The new primary therefor needs to establish connection, when promoted.

This process should enable the migration of calls from one service instance to another. This will facilitate the cluster to drain nodes entirely for cluster operations.

Endpoint ports and load balancer rules

Media control ports need to be dynamic, such that multiple instances can run in parallel on a single node. Feasibility of dynamic media control ports needs to be confirmed. RFC!

Signaling itself should be fairly easy, as this could easily be put into a stateless service that resolves the relevant replica for the current request to complete it. This would function like a gateway, but is required to be inside the cluster to be able to consume the naming service. Partitioning of the stateful service acts as load balancing this way.

Looking for comments regarding this concept, especially regarding dynamic ports and how one would address them from the outside / route connections to them afterwards. There's lots of info about routing, api management, probing, proxying and port sharing with HTTP, but for raw TCP connections I wasn't able to find much.

Alternatively, fixed port can be used still together with a LB probe for the media connections, while signaling could take care of balancing calls across nodes and rejoining lost calls. This would require the signaling to detect node failures and similar events. Bot instances would have to remote call the signaling service to migrate it's calls upon closing, which seems like a lot of work to implement.

Another alternative would be to update the LB NAT rules dynamically. That would be very powerful, but would require the bot to have permission to change NAT rules of it's own cluster, which sounds a bit scary.

jsweiler commented 4 years ago

@InDieTasten Just curious if you figured any of these out? I'm especially interested in this:

perform rolling updates to the cluster or bot services themselves without "dropping" calls

@ssulzer Do you know how to do a rolling upgrade with policy based recording and not lose calls?

kieronlanning commented 4 years ago

@InDieTasten / @ssulzer has this be raised internally or the MS Teams partner channel? It's something we've been discussing internally, and a question we've asked directly.

InDieTasten commented 4 years ago

@kieronlanning No, this stems entirely from myself trying to work with the SDK. My company is part of the MPN, but I don't have any contacts personally. Feel free to do so, although I would appreciate, if this kind of communication would happen here on GitHub accessible to everyone.

@jsweiler No, apart from the time spent to come up and document my idea here, I've not invested more time. I am still very interested, but currently my company is not willing to fund the resources to pursue this idea, as I assessed the risk of it not working how I intended to by quite large. If @ssulzer or anyone else could clear up some of the vague portions of the concept, or provide some best practices for the problems described, that might change.

InDieTasten commented 1 week ago

For anyone coming across this, a production grade sample that's scalable and highly-available has been implemented here: https://github.com/LM-Development/aks-sample/tree/main/Samples/PublicSamples/RecordingBot

I am the lead maintainer of that project. It's running in K8s rather than Service Fabric.

microsoftgraph / microsoft-graph-comms-samples

RFC: Concept for scalable high-availability local-media-bots in Service Fabric #336