microsoftgraph / microsoft-graph-comms-samples

Microsoft Graph Communications Samples

Bot sometimes does not join meeting: Server Internal Error. DiagCode: 500#7117 #750

Open leroypijlman opened 3 weeks ago

leroypijlman commented 3 weeks ago

Describe the issue
Sometimes when inviting the bot to a meeting, the call stays stuck in "Establishing" before eventually being "terminated" with an HTTP 500 server error.

Scenario
We have the sample set up and it works most of the time, but there seem to be random moments/hours when we cannot get the bot to join the call, and it's the one thing stopping us from implementing this successfully.

It's highly unpredictable when our integration does and does not work: the same software, with no changes to infrastructure or code, sometimes works 100% of the time, and on certain days the success rate drops to 20%.

By "not working" I mean the bot not joining the call. This is the only issue: once the bot actually joins and the call is established, all audio media is received just as expected.

We can usually get it to work by re-inviting the bot many times, but this is not practical for our users.

Scenario:

We invite the bot to the meeting (see the code snippet below) and receive an "Establishing" message back. However, we sometimes never receive the "Established" callback. After a period of about a minute, we end up receiving a "Terminated" status update with the error code:

Server Internal Error. DiagCode: 500#7117

Code Snippet

var statefulCall = await this.Client.Calls().AddAsync(joinParams).ConfigureAwait(false);

This runs successfully, but we never receive the Established callback.
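
For completeness, here is a rough sketch of how the join parameters are built around that call, loosely following the sample's JoinCallAsync. JoinInfo.ParseJoinURL and GraphLogger come from the sample code; joinUrl, tenantId and CreateLocalMediaSession() are placeholders for our own setup:

// Sketch only: joining the meeting with the stateful client, roughly as in the sample's
// JoinCallAsync. joinUrl, tenantId and CreateLocalMediaSession() are placeholders.
var (chatInfo, meetingInfo) = JoinInfo.ParseJoinURL(joinUrl);

var mediaSession = this.CreateLocalMediaSession();
var joinParams = new JoinMeetingParameters(chatInfo, meetingInfo, mediaSession)
{
    TenantId = tenantId,
};

var statefulCall = await this.Client.Calls().AddAsync(joinParams).ConfigureAwait(false);

// The stateful SDK surfaces the Establishing -> Established / Terminated transitions here;
// in the failing case the Terminated update with 500#7117 arrives instead of Established.
statefulCall.OnUpdated += (call, args) =>
    this.GraphLogger.Info($"Call {call.Id} state: {call.Resource.State}");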

Expected behavior
I expect Microsoft to (eventually) send us the "Established" callback to our platform signaling endpoint, which it sometimes does not.

Graph SDK:

Microsoft.Graph.Communications.Calls" Version="1.2.0.850"
Microsoft.Graph.Communications.Client" Version="1.2.0.850"
Microsoft.Graph.Communications.Common" Version="1.2.0.850"
Microsoft.Graph.Communications.Core" Version="1.2.0.850"
Microsoft.Graph.Communications.Calls.Media" Version="1.2.0.850"
Microsoft.Skype.Bots.Media" Version="1.27.0.2-alpha"

Call Chain ID adb8687d-12fc-4878-a1c9-66a28a168403

Logs

Establishing received:

{
    "@odata.type":"#microsoft.graph.commsNotifications",
    "value":[
       {
          "@odata.type":"#microsoft.graph.commsNotification",
          "changeType":"updated",
          "resource":"/app/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceUrl":"/communications/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceData":{
             "@odata.type":"#microsoft.graph.call",
             "state":"establishing",
             "chatInfo":{
                "@odata.type":"#microsoft.graph.chatInfo",
                "threadId":"19:meeting_N2Y0YmIyZGEtNjNjZS00ZDQ4LTgxMTktMzYxYmI2ZTExNDk5@thread.v2",
                "messageId":"0"
             },
             "meetingInfo":{
                "@odata.type":"#microsoft.graph.organizerMeetingInfo",
                "organizer":{
                   "@odata.type":"#microsoft.graph.identitySet",
                   "user":{
                      "@odata.type":"#microsoft.graph.identity",
                      "id":"6200feed-a0e8-49c0-95a9-e5bbb4d3d1c5",
                      "tenantId":"<ommitted>"
                   }
                },
                "callChainId":"adb8687d-12fc-4878-a1c9-66a28a168403"
             }
          }
       }
    ]
}

Terminated 500#7117 received:

{
    "@odata.type":"#microsoft.graph.commsNotifications",
    "value":[
       {
          "@odata.type":"#microsoft.graph.commsNotification",
          "changeType":"deleted",
          "resource":"/app/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceUrl":"/communications/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceData":{
             "@odata.type":"#microsoft.graph.call",
             "state":"terminated",
             "resultInfo":{
                "@odata.type":"#microsoft.graph.resultInfo",
                "code":500,
                "subcode":7117,
                "message":"Server Internal Error. DiagCode: 500#7117.@"
             },
             "chatInfo":{
                "@odata.type":"#microsoft.graph.chatInfo",
                "threadId":"19:meeting_N2Y0YmIyZGEtNjNjZS00ZDQ4LTgxMTktMzYxYmI2ZTExNDk5@thread.v2",
                "messageId":"0"
             },
             "meetingInfo":{
                "@odata.type":"#microsoft.graph.organizerMeetingInfo",
                "organizer":{
                   "@odata.type":"#microsoft.graph.identitySet",
                   "user":{
                      "@odata.type":"#microsoft.graph.identity",
                      "id":"6200feed-a0e8-49c0-95a9-e5bbb4d3d1c5",
                      "tenantId":"<ommitted>"
                   }
                },
                "callChainId":"adb8687d-12fc-4878-a1c9-66a28a168403"
             }
          }
       }
    ]
}

Additional context

InDieTasten commented 3 weeks ago

Please take a look at this item and make sure you are not doing anything silly with TLS termination: https://github.com/microsoftgraph/microsoft-graph-comms-samples/issues/457

tamagoeggo commented 3 weeks ago

Hey @leroypijlman, if you're trying to build a meeting bot, it might be worth checking out the Recall.ai API.

It's a simple third-party API that lets you use meeting bots to get raw audio/video/metadata from meetings without needing to spend months building, scaling, and maintaining those bots yourself.

Here are the API docs: https://docs.recall.ai/page/tutorials

leroypijlman commented 3 weeks ago

Please take a look at this item and make sure you are not doing anything silly with TLS termination: #457

Thank you @InDieTasten, I did come across that thread. But if this were the root cause, wouldn't I experience consistent failures rather than intermittent ones?

I'm really hoping that @ssulzer or someone from MSFT can find in the logs what is going on during these terminations, as we're blocked by this big time. 😞

ssulzer commented 3 weeks ago

Hello @leroypijlman, I've taken a look but so far cannot tell what the reason for your sporadic call setup failures is. Is your bot running in Azure? We do not support media bots running outside of Azure. Thanks.

leroypijlman commented 3 weeks ago

@ssulzer Thank you for your answer. I was afraid of hearing that...

We have the option to migrate to Azure; however, this will be a big operation for us (think compliance and internal policies). Are there any clues in the MSFT logging that this is indeed the cause?

Some form of reassurance that this is indeed related to the bot not running in Azure would be most welcome. Migrating to Azure only to have the issue keep occurring would be ... "hell". 🔥

ssulzer commented 3 weeks ago

Our platform service is hitting TCP connection timeouts trying to reach your bot during call session setup. If you can capture incoming TCP traffic to your bot VM(s), there should be TCP packets from the following Office365 IP address ranges during call session setup (after you have started the call join process and received the "Establishing" notification): 52.112.0.0/14 and 52.122.0.0/15.

On the bot's local network interface, these TCP packets should be going to the InstanceInternalPort specified in your MediaPlatformInstanceSettings.
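
For anyone checking this, a minimal sketch of where that port is configured, using MediaPlatformSettings from Microsoft.Skype.Bots.Media; the concrete values below are placeholders, not recommendations:

// Sketch: media platform configuration. The Office365 ranges above should be seen connecting
// to InstanceInternalPort on the local interface (after any public-port/NAT mapping).
// All values here are placeholders.
var mediaPlatformSettings = new MediaPlatformSettings
{
    ApplicationId = "<bot application id>",
    MediaPlatformInstanceSettings = new MediaPlatformInstanceSettings
    {
        CertificateThumbprint = "<certificate thumbprint>",
        InstanceInternalPort = 8445,        // the port the packet capture should show traffic on
        InstancePublicPort = 12332,         // the externally exposed port that maps to it
        InstancePublicIPAddress = System.Net.IPAddress.Parse("203.0.113.10"),
        ServiceFqdn = "bot.contoso.com",    // must match the bot's certificate
    },
};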

One option to consider: if you have a lot of application "machinery" running in another cloud which post-processes the audio/video received from the Teams meeting, can you at least move the portion of the bot which uses the media SDK (Audio/VideoSocket APIs) into Azure? So that at least the application's real-time media endpoint connected to the Teams meeting is running in Azure. And then have that bot app transport the received media over to the rest of the application running elsewhere, ideally over a lossless, reliable transport (like a websocket).
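
In case it helps anyone going that route, here is a very rough sketch of that split, assuming the in-Azure bot forwards the audio it receives on the media SDK's AudioSocket to the downstream system over a websocket. The downstream endpoint, framing, and back-pressure handling are hypothetical:

// Sketch only: relay audio received on the media SDK's AudioSocket to a downstream service
// over a websocket. The downstream endpoint and framing are hypothetical; a real bot would
// also queue sends and handle reconnects/back-pressure.
using System;
using System.Net.WebSockets;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Skype.Bots.Media;

public class AudioForwarder : IDisposable
{
    private readonly ClientWebSocket downstream = new ClientWebSocket();

    public async Task StartAsync(IAudioSocket audioSocket, Uri downstreamUri)
    {
        await this.downstream.ConnectAsync(downstreamUri, CancellationToken.None);
        audioSocket.AudioMediaReceived += this.OnAudioMediaReceived;
    }

    private void OnAudioMediaReceived(object sender, AudioMediaReceivedEventArgs e)
    {
        try
        {
            // Copy the unmanaged media buffer into a managed array before returning it to the SDK.
            var payload = new byte[e.Buffer.Length];
            Marshal.Copy(e.Buffer.Data, payload, 0, (int)e.Buffer.Length);

            // Fire-and-forget send keeps the media callback non-blocking (no overlap handling here).
            _ = this.downstream.SendAsync(
                new ArraySegment<byte>(payload),
                WebSocketMessageType.Binary,
                endOfMessage: true,
                CancellationToken.None);
        }
        finally
        {
            // The SDK requires the buffer to be disposed once the app is done with it.
            e.Buffer.Dispose();
        }
    }

    public void Dispose() => this.downstream.Dispose();
}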

InDieTasten commented 3 weeks ago

@leroypijlman Perhaps a bit of a shameless plug, but if you decide to migrate to Azure, do not hesitate to hit me up. While nobody will be able to help you with internal processes directly, I could at least offer assistance with technical hurdles during the migration. Options like a zero-downtime migration can, for example, alleviate some pain points in internal discussions.

leroypijlman commented 2 weeks ago

Really appreciative of your answers and overall helpfulness, @InDieTasten @ssulzer. Thank you so much.

I managed to deploy the bot to a basic VM2022 (quick manual setup), just to see if I could reproduce the issue. As of right now, it's working well. We'll assess this over the next couple of days with some actual user load. If it proves to alleviate the problem, we'll follow @ssulzer's advice and just host the bot instances in Azure and send the media downstream through websockets.

The only downside is... having to manage an extra k8s (AKS) cluster with a separate cloud provider. (We already have 5 of our own data centers and 1 AWS cluster.) Ops is not going to like this. 😅

One concern I do have is whether my current basic VM setup is a good representation, e.g. maybe the issue was really just occurring because the bot was running in k8s (which should not matter, but still...).

I find it odd that those call signaling messages timed out. Whenever the call did get established, 99.9999% of the audio packets were received successfully, and QualityOfExperience metrics were close to lossless even during heavy load. The same goes for all other messages, which never seemed to get lost. This makes me wonder whether this isn't a bug in this particular callback.

I'll post back on whether it worked, and I know where to find you if I do get stuck. 🙏

ssulzer commented 2 weeks ago

Given a choice, my preference/recommendation is to run a media bot endpoint in a "basic VM setup" versus running within a Windows container.

InDieTasten commented 2 weeks ago

If all your load can run on a single machine and you aren't concerned about uptime during Windows updates, configuration changes, or application updates, I would agree with ssulzer.

If automated scaling, high-availability and rolling deployment strategies are of interest, I would recommend AKS.