microsoftgraph / microsoft-graph-comms-samples

Microsoft Graph Communications Samples
MIT License

Configuring active-active infrastructure for Teams compliance recording bot #666

Closed RiccardoGMoschetti closed 1 year ago

RiccardoGMoschetti commented 1 year ago

Describe the issue

I have set up a working Teams compliance recording bot based on this sample. I converted it into a console application (rather than an Azure classic worker role) and deployed it on an Azure Windows VM.

For redundancy, I deployed the bot on a second machine behind an Azure Load Balancer.

Unfortunately, we saw problems when the two virtual machines receive different API calls and packets for the same call. The bot is not always engaged correctly, and some calls drop unless a single machine receives all of the API calls and packets. That's why we switched to an active-passive configuration: one machine gets all of the traffic, and if it goes down, the other (previously passive) machine gets all of the new traffic.

How is it possible to have an active-active infrastructure, with a variable number of virtual machines doing the compliance recording?

Expected behavior

Ability to deploy the compliance bot on multiple active VMs.

InDieTasten commented 1 year ago

Are you trying to scale horizontally, or do you want to achieve redundancy?

Scaling horizontally

This works by distributing only the initial incoming call notification across machines. Each instance also needs its own endpoint that always routes to that instance. When answering a call, the instance supplies its unique endpoint in the Answer request as the notification URL.

The Teams platform will use the supplied endpoint to talk to your bot from there on.
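To make the idea concrete, here is a minimal Python sketch of what an instance might send when it answers. It only builds the request and sends nothing. The Graph `answer` action takes a `callbackUri` in its body, and that is where the per-instance endpoint goes. The helper name `build_answer_request`, the example host name, and the placeholder media blob are assumptions for illustration; the actual samples do this through the .NET SDK rather than raw HTTP.

```python
GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_answer_request(call_id, instance_callback_url, media_blob="<media-config-blob>"):
    """Build the URL and JSON body for answering an incoming call.

    instance_callback_url must be the endpoint of THIS instance, not the
    load balancer, so later notifications for the call come straight back here.
    media_blob is a placeholder; a real bot gets it from its media library.
    """
    if not instance_callback_url.startswith("https://"):
        raise ValueError("callback URL must be an HTTPS endpoint")

    url = f"{GRAPH_BASE}/communications/calls/{call_id}/answer"
    body = {
        # The per-instance notification endpoint (hypothetical host name below).
        "callbackUri": instance_callback_url,
        "acceptedModalities": ["audio"],
        "mediaConfig": {
            "@odata.type": "#microsoft.graph.appHostedMediaConfig",
            "blob": media_blob,
        },
    }
    return url, body


if __name__ == "__main__":
    url, body = build_answer_request(
        "example-call-id",
        "https://bot-a.example.com/api/calling/notification",
    )
    print(url)
    print(body["callbackUri"])
```

The point is that the load balancer address never appears in the answer: each machine advertises its own address, so no shared call state between machines is needed for notifications.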

Redundancy

If you want multiple bot instances in the same call, you have to split your deployment into failure domains, each with its own app registration. When setting up policies, you can then specify multiple recording applications for a policy, and the platform will ensure that at least one of the instances stays in the call.

You will end up with roughly double the compute, bandwidth and storage cost. This way of adding redundancy via policy is usually known as 2N recording.

RiccardoGMoschetti commented 1 year ago

Hi Max, thanks for your answer. I am trying to scale horizontally.

What I understand is that:

Did you write that paragraph yourself, or is it part of a larger document I have been missing?

thanks again.

InDieTasten commented 1 year ago

@RiccardoGMoschetti I think you understand correctly.

Load balancer with endpoint L passes requests to endpoints A and B via round robin

Machine A with public endpoint A

Machine B with public endpoint B

The initial request goes to L. The selected machine answers the call with notification URL A or B, depending on where L routed the request. Subsequent communication is between the Teams platform and A or B directly.
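The flow above can be simulated in a few lines of Python. This is a toy model, not the real platform: `BotInstance`, `RoundRobinBalancer`, `FakeTeamsPlatform`, and the example URLs are all made-up names, and the "platform" here just remembers which callback URL each call was answered with.

```python
from itertools import cycle


class BotInstance:
    """One bot machine with its own public endpoint (A or B)."""

    def __init__(self, name, public_url):
        self.name = name
        self.public_url = public_url
        self.received = []  # (call_id, event) pairs seen by this machine

    def answer(self, call_id):
        # Answer the call and hand back OUR endpoint as the notification URL.
        self.received.append((call_id, "incoming"))
        return self.public_url

    def notify(self, call_id, event):
        self.received.append((call_id, event))


class RoundRobinBalancer:
    """Endpoint L: only sees the initial incoming call notification."""

    def __init__(self, instances):
        self._next = cycle(instances)

    def route(self):
        return next(self._next)


class FakeTeamsPlatform:
    """Stand-in for the platform: remembers the callback URL per call."""

    def __init__(self, balancer, instances):
        self.balancer = balancer
        self.by_url = {i.public_url: i for i in instances}
        self.callback_for_call = {}

    def start_call(self, call_id):
        instance = self.balancer.route()  # initial request goes to L
        self.callback_for_call[call_id] = instance.answer(call_id)

    def send_update(self, call_id, event):
        # Later traffic bypasses L and goes to the answering machine.
        self.by_url[self.callback_for_call[call_id]].notify(call_id, event)


if __name__ == "__main__":
    a = BotInstance("A", "https://a.example.com/notify")
    b = BotInstance("B", "https://b.example.com/notify")
    platform = FakeTeamsPlatform(RoundRobinBalancer([a, b]), [a, b])
    platform.start_call("call-1")
    platform.start_call("call-2")
    platform.send_update("call-1", "established")
    print(a.received, b.received)
```

In this model every event for a given call lands on the machine that answered it, which is the property that was missing when the load balancer spread all traffic across both VMs.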

I wrote this myself. The docs do not describe this behavior; however, the samples contain this logic. E.g., the AKS sample does this with ingress-nginx and multiple pods, with public-facing ports for each pod.

RiccardoGMoschetti commented 1 year ago

thanks, appreciated.