Bot sometimes does not join meeting: Server Internal Error. DiagCode: 500#7117

leroypijlman commented 3 months ago

Describe the issue Sometimes when inviting the bot to a meeting, the call stays stuck in "Establishing", before eventually getting "terminated" with an HTTP 500 server error.

Scenario We have the sample setup and it works pretty most of the time. But there seem to be random moments/hours where we can not get the bot to join the call and it's the one thing stopping us from implementing this successfully.

Because it's highly unpredictable when our integration does and does not work. The same software, without any new changes in the infrastructure or code. Sometimes works 100% of the time and on certain days it drops down to 20%.

By not working I mean - the bot not joining the call. This is the only issue. Once the bot actually joins and the call is established, all audio media is received just as expected.

We can usually get it to work by re-inviting the bot many times, but this is not practical for our users.

Scenario:

Create a meeting
Invite the bot to the call through the JoinCallController, which runs all the way up to the .AddAsync() of the BotService class as shown in the sample. This also executes successfully and the request is sent to microsoft.

We then receive an "Establishing" message back. However we then sometimes never receive the "Established" callback. After a period of about a minute, we end up receiving a "Terminated" status update with the error code:

Server Internal Error. DiagCode: 500#7117

Code Snippet var statefulCall = await this.Client.Calls().AddAsync(joinParams).ConfigureAwait(false);

This runs successfully, but we never receive the Established callback.

Expected behavior I expect microsoft to (eventually) send us the "Established" callback to our platform signaling endpoint. Which it sometimes does not.

Graph SDK:

Microsoft.Graph.Communications.Calls" Version="1.2.0.850"
Microsoft.Graph.Communications.Client" Version="1.2.0.850"
Microsoft.Graph.Communications.Common" Version="1.2.0.850"
Microsoft.Graph.Communications.Core" Version="1.2.0.850"
Microsoft.Graph.Communications.Calls.Media" Version="1.2.0.850"
Microsoft.Skype.Bots.Media" Version="1.27.0.2-alpha"

Call Chain ID adb8687d-12fc-4878-a1c9-66a28a168403

Logs

Establishing received:

{
    "@odata.type":"#microsoft.graph.commsNotifications",
    "value":[
       {
          "@odata.type":"#microsoft.graph.commsNotification",
          "changeType":"updated",
          "resource":"/app/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceUrl":"/communications/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceData":{
             "@odata.type":"#microsoft.graph.call",
             "state":"establishing",
             "chatInfo":{
                "@odata.type":"#microsoft.graph.chatInfo",
                "threadId":"19:meeting_N2Y0YmIyZGEtNjNjZS00ZDQ4LTgxMTktMzYxYmI2ZTExNDk5@thread.v2",
                "messageId":"0"
             },
             "meetingInfo":{
                "@odata.type":"#microsoft.graph.organizerMeetingInfo",
                "organizer":{
                   "@odata.type":"#microsoft.graph.identitySet",
                   "user":{
                      "@odata.type":"#microsoft.graph.identity",
                      "id":"6200feed-a0e8-49c0-95a9-e5bbb4d3d1c5",
                      "tenantId":"<ommitted>"
                   }
                },
                "callChainId":"adb8687d-12fc-4878-a1c9-66a28a168403"
        }
    ]
}

Terminated 500 7117 received

{
    "@odata.type":"#microsoft.graph.commsNotifications",
    "value":[
       {
          "@odata.type":"#microsoft.graph.commsNotification",
          "changeType":"deleted",
          "resource":"/app/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceUrl":"/communications/calls/39003780-ad9d-42d2-a09c-bee3a0721389",
          "resourceData":{
             "@odata.type":"#microsoft.graph.call",
             "state":"terminated",
             "resultInfo":{
                "@odata.type":"#microsoft.graph.resultInfo",
                "code":500,
                "subcode":7117,
                "message":"Server Internal Error. DiagCode: 500#7117.@"
             },
             "chatInfo":{
                "@odata.type":"#microsoft.graph.chatInfo",
                "threadId":"19:meeting_N2Y0YmIyZGEtNjNjZS00ZDQ4LTgxMTktMzYxYmI2ZTExNDk5@thread.v2",
                "messageId":"0"
             },
             "meetingInfo":{
                "@odata.type":"#microsoft.graph.organizerMeetingInfo",
                "organizer":{
                   "@odata.type":"#microsoft.graph.identitySet",
                   "user":{
                      "@odata.type":"#microsoft.graph.identity",
                      "id":"6200feed-a0e8-49c0-95a9-e5bbb4d3d1c5",
                      "tenantId":"<ommitted>"
                   }
                },
                "callChainId":"adb8687d-12fc-4878-a1c9-66a28a168403"
             }
          ]
       }

Additional context

We only run a single instance (one on our Dev environment, one on our prod environment). During these 'broken' moments, it happens in either environment.
This happens regardless whether this is a new or old meeting
This happens regardless of the participant state in the meeting in question
This only happens SOMETIMES. Sometimes the bot joins 100% of the time for a whole day, then on some days, the bot joins 20% of the time, under the exact same conditions.

InDieTasten commented 3 months ago

Please take a look at this item and make sure you are not doing anything silly with TLS termination: https://github.com/microsoftgraph/microsoft-graph-comms-samples/issues/457

tamagoeggo commented 3 months ago

Hey @leroypijlman, if you're trying to build a meeting bot might be worth checking out the Recall.ai API.

It’s a simple 3rd party API that lets you use meeting bots to get raw audio/video/metadata from meetings without you needing to spend months to build, scale and maintain these bots.

Here are the API docs: https://docs.recall.ai/page/tutorials

leroypijlman commented 3 months ago

Please take a look at this item and make sure you are not doing anything silly with TLS termination: #457

Thank you @InDieTasten I did come across that thread. But if this were the root cause, wouldn't I experience consistent failures rather than intermittent ones?

I'm really hoping that @ssulzer or someone from MSFT can find in the logs what is going on during these terminations. As we're blocked by this big time. 😞

ssulzer commented 3 months ago

Hello @leroypijlman I've taken a look but so far cannot tell what the reason for your sporadic call setup failures is. Is your bot running in Azure? We do not support media bots running outside of Azure. Thanks.

leroypijlman commented 3 months ago

@ssulzer Thank you for your answer. I was afraid of hearing that...

We have the option to migrate to Azure, however this will be a big operation for us (think complicance and internal policies). Is there any clues in the MSFT logging that this is indeed related to this?

Some form of reassurance that this is indeed related to the bot not running in Azure would be most welcome. Migrating to Azure and still issue still occurring would be ... "hell". 🔥

ssulzer commented 3 months ago

Our platform service has TCP connection timeouts trying to reach your bot during call session setup. If you can capture incoming TCP traffic to your bot VM(s), there should be TCP packets from the following Office365 IP address ranges during call session setup (after you have started the call join process and received the "Establishing" notification): 52.112.0.0/14 52.122.0.0/15

On the bot's local network interface, these TCP packets should be going to the InstanceInternalPort specified in your MediaPlatformInstanceSettings.

One option to consider: if you have a lot of application "machinery" running in another cloud which post-processes the audio/video received from the Teams meeting, can you at least move the portion of the bot which uses the media SDK (Audio/VideoSocket APIs) into Azure? So that at least the application's real-time media endpoint connected to the Teams meeting is running in Azure. And then have that bot app transport the received media over to the rest of the application running elsewhere, ideally over a lossless, reliable transport (like a websocket).

InDieTasten commented 3 months ago

@leroypijlman Perhaps a bit of a shameless plug, but if you decide to migrate to Azure, do not hesitate to hit me up. While nobody will be able to help you with internal processes directly, I could at least offer assistance regarding technical hurdles during migration. Options like no downtime migration can alleviate some pain points for internal discussions for example.

leroypijlman commented 2 months ago

Really appreciative of your answers and overall helpfulness @InDieTasten @ssulzer Thank you so much.

I managed to deploy the bot to a basic VM2022 (quick manual setup), just to see if I could reproduce the issue. As of right now - it's working well. We'll assess this over the next couple of days with some actual user load. If it proves to alleviate the problem, we'll follow @ssulzer advice and just host the bot instances in Azure and send it downstream through websockets.

The Only downside being... having to manage an extra k8s(AKS) cluster with a separate cloud provider. (We already have 5 of our own data centers and 1 aws cluster.) Ops is not going to like this. 😅

One concern that I do have, is, whether my current basic VM setup is a good representation. e.g. maybe the issue was really just occurring due to the bot running in k8s (which should not matter, but still..)

Because I find it odd that those call signaling messages timed-out. Whenever the call did get established, 99,9999% of the audio packets were being received successfully. QualityOfExperience metrics were close to being lossless even during heavy load. Same goes for all other messages, which never seemed to get lost. Making me think wonder if this wasn't a bug in this particular callback.

I'll post back whether it worked and know where to find you if I do get stuck. 🙏

ssulzer commented 2 months ago

Given a choice, my preference/recommendation is to run a media bot endpoint in a "basic VM setup" versus running within a Windows container.

InDieTasten commented 2 months ago

If all your load is capable to run on a single machine and you aren't concerned about uptime during Windows updates, configuration changes or application updates, I would agree with ssulzer.

If automated scaling, high-availability and rolling deployment strategies are of interest, I would recommend AKS.

leroypijlman commented 3 weeks ago

Over 2 months later I return to this thread to again seek for your assistance. Our infra department has been working OT to get an Azure cluster up and running which is now finally live.

We have deployed our teams bot software there, similarly to the way we did this in our previous cluster. Unfortunately we are still plagued with the issue as mentioned in the original post of this thread.

The bot sometimes does not receive the callback. On subsequent invite attempts it joins eventually. But after about a minute of that bot joining, it's instructed by teams to leave. Which gives me the impression that "reinviting until it works" is not intended since it causes the bot to get kicked. So retrying the invite is not really an option.

Can you please dig into the logs to see if you can find any meta data for us that can be of assistance?

Here's some logging:

Establishing

{
    "@odata.type": "#microsoft.graph.commsNotifications",
    "value": [
        {
            "@odata.type": "#microsoft.graph.commsNotification",
            "changeType": "updated",
            "resource": "/app/calls/05005180-709a-43c2-8264-3d12938e1155",
            "resourceUrl": "/communications/calls/05005180-709a-43c2-8264-3d12938e1155",
            "resourceData": {
                "@odata.type": "#microsoft.graph.call",
                "state": "establishing",
                "chatInfo": {
                    "@odata.type": "#microsoft.graph.chatInfo",
                    "threadId": "19:meeting_NzQ2MzdkNWYtYzIwYi00NWY1LWI0NjEtODFkMTA0NjVjZTA0@thread.v2",
                    "messageId": "0"
                },
                "meetingInfo": {
                    "@odata.type": "#microsoft.graph.organizerMeetingInfo",
                    "organizer": {
                        "@odata.type": "#microsoft.graph.identitySet",
                        "user": {
                            "@odata.type": "#microsoft.graph.identity",
                            "id": "fbe232de-fdb4-40fb-bbd4-222b2bc44d3a",
                            "tenantId": "<redacted>"
                        }
                    },
                    "callChainId": "e2cc7f79-0326-436c-a2b1-80f381ba5a20"
                }
            ]
        }

Terminated

{
    "@odata.type": "#microsoft.graph.commsNotifications",
    "value": [
        {
            "@odata.type": "#microsoft.graph.commsNotification",
            "changeType": "deleted",
            "resource": "/app/calls/05005180-709a-43c2-8264-3d12938e1155",
            "resourceUrl": "/communications/calls/05005180-709a-43c2-8264-3d12938e1155",
            "resourceData": {
                "@odata.type": "#microsoft.graph.call",
                "state": "terminated",
                "resultInfo": {
                    "@odata.type": "#microsoft.graph.resultInfo",
                    "code": 500,
                    "subcode": 7117,
                    "message": "Server Internal Error. DiagCode: 500#7117.@"
                },
                "chatInfo": {
                    "@odata.type": "#microsoft.graph.chatInfo",
                    "threadId": "19:meeting_NzQ2MzdkNWYtYzIwYi00NWY1LWI0NjEtODFkMTA0NjVjZTA0@thread.v2",
                    "messageId": "0"
                },
                "meetingInfo": {
                    "@odata.type": "#microsoft.graph.organizerMeetingInfo",
                    "organizer": {
                        "@odata.type": "#microsoft.graph.identitySet",
                        "user": {
                            "@odata.type": "#microsoft.graph.identity",
                            "id": "fbe232de-fdb4-40fb-bbd4-222b2bc44d3a",
                            "tenantId": "<redacted>"
                        }
                    },
                    "callChainId": "e2cc7f79-0326-436c-a2b1-80f381ba5a20"
                }
            ]
        }

Where a successful instance usually contains an "Established" message.

@ssulzer can you confirm whether or not this is still related to TCP connection timeouts? and if so - how long is this timeout configured for?

I want to have our infra department analyze this for us and I want to be able to give enough details to aid their search.

arantar commented 2 weeks ago

@leroypijlman Hi Were you able to identify the source of the problem and resolve it? I only get this same error ("Server Internal Error. DiagCode: 500#7117.@") when running the bot locally.

leroypijlman commented 2 weeks ago

@arantar Unfortunately not. Can you describe your local setup? Are you using ngrok?

For us it only seems to happen when running it inside of kubernetes.

microsoftgraph / microsoft-graph-comms-samples

Bot sometimes does not join meeting: Server Internal Error. DiagCode: 500#7117 #750