[Dev support]: Teams bot conversations - Frequent downtime (401 Unauthorized on weekends)

penguinsource commented 1 week ago

Question

Hello everyone! I made a related post in a different repository (Bot Framework) and was directed here. This is the original issue, raised a year ago: https://github.com/microsoft/botframework-sdk/issues/6601. The issue persists today.

Describe the bug

I am sending POST requests to https://smba.trafficmanager.net/amer and receiving timeouts on my remote server (Google Cloud, App Engine). The same does not happen locally when using ngrok.

Nothing has changed in our code, and it all started failing yesterday (June 14th).

Expected behavior

Our platform proactively sends messages to an existing conversation (through the REST url) and then continues the conversation through a bot-builder implementation.

Screenshots

Additional context

The issue does not occur with ngrok, but it does occur on App Engine instances.
The bot has been working fine for the past 2 years, but it has suddenly started getting all these timeouts.

 error: FetchError: request to https://smba.trafficmanager.net/amer/v3/conversations/a:17cILmBogFKvwOhSb3Qb8s8s8Mvi8-v28V74WaXT2do5qpcvX2iWx7u9TIFKgjUZCbC2Sbuy3LltGjgAxKHPhYDjwOaEarjchGEhQOtN2Rax4hMDG8YjSjl_rpBR-2yaD/activities/ failed, reason: connect ETIMEDOUT 52.114.142.186:443
      at ClientRequest.<anonymous> (/workspace/node_modules/node-fetch/lib/index.js:1491:11)
      at ClientRequest.emit (events.js:400:28)
      at TLSSocket.socketErrorListener (_http_client.js:475:9)
      at TLSSocket.emit (events.js:400:28)
      at emitErrorNT (internal/streams/destroy.js:106:8)
      at emitErrorCloseNT (internal/streams/destroy.js:74:3)
      at processTicksAndRejections (internal/process/task_queues.js:82:21) {
    type: 'system',
    errno: 'ETIMEDOUT',
    code: 'ETIMEDOUT'
  },

Code Snippets

I store each user's MS Teams Authentication object in my database (this is our prod test user):

Every time we send a (proactive) message or update an existing conversation message, we use the serviceUrl defined in this msTeamsAuth object. All that code works fine throughout the week, then starts receiving a 401 Unauthorized, as shown in the above screenshots, on the weekend (Saturday to Monday) and sometimes on weekdays too.
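Editor's note: the original snippets did not survive in this copy of the thread. As a minimal sketch of the kind of request described above, and not the author's actual code, the helper below builds (without sending) a POST to the `v3/conversations/{id}/activities` path that appears in the error trace. The function name `buildProactiveMessageRequest`, its parameters, and the bearer-token handling are assumptions made for illustration.

```javascript
// Hypothetical helper (not from the original issue): builds the URL and
// fetch options for posting a message activity to an existing conversation.
// It does not perform any network call.
function buildProactiveMessageRequest(serviceUrl, conversationId, accessToken, text) {
  // Drop trailing slashes so "…/amer/" and "…/amer" produce the same URL.
  const base = serviceUrl.replace(/\/+$/, "");
  // The conversation id is inserted as-is, matching the URL shape in the trace.
  const url = `${base}/v3/conversations/${conversationId}/activities`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${accessToken}`, // assumed: a token obtained elsewhere
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ type: "message", text }),
    },
  };
}
```

The returned `url` and `options` could then be passed to `fetch(url, options)`. Keeping the request construction separate from the send makes it easier to log exactly which serviceUrl was used when a call fails.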

I have a simple function which sends a message every 5 mins (CRON job), which can more easily be used to debug this issue:

Getting data to make the proactive messaging call:

Sending a proactive message:

Just yesterday I also added a hardcoded version of the serviceUrl, set to 'https://smba.trafficmanager.net/teams/' instead of the usual 'https://smba.trafficmanager.net/amer/', which is what most of our prod users are using (according to their MS Teams auth object).

This issue also occurs when a user answers an interactive message from our bot. This is a Bot Framework implementation. Code snippet for this:

Step 1:

Step 2:

What's happening?

I strongly suspect that the traffic manager is blocking our prod bot during the weekends. The dev bot seems to work? I am not sure how to debug this any further without additional help.

Thank you very much!

corinagum commented 6 days ago

@penguinsource thanks for filing this issue. I'm following up internally with this and will get back to you shortly.

penguinsource commented 6 days ago

thanks @corinagum. if it helps, if i upload my service to another endpoint, it seems like it does not fail. so the most likely issue is with the dns smba traffic manager limiting traffic per IP/domain

corinagum commented 4 days ago

Hi @penguinsource. I've spoken with one of my colleagues and he has asked for further information. Once I get his github username, I will tag him here.

To quote:

for further investigation we would need more details about concrete failed requests. The MS-CV is not applicable here since the requests didn't come through, but at least bot id and timestamp of several problematic requests would be great help here.

Could you provide the bot ID and timestamps for a number of these problematic requests you've had? If possible, the more recent they are, the better.

Thank you!
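Editor's note: one way to collect the requested details is to record the bot ID and a timestamp for every failed call, and to separate network-level failures (such as the `ETIMEDOUT` in the trace, where no HTTP response exists) from HTTP-level failures (such as a 401). The sketch below is an illustration under those assumptions; `describeFailure` and its record shape are hypothetical and not part of any SDK.

```javascript
// Hypothetical helper (not from the original issue): turns a failed send
// into a small record suitable for attaching to a support request.
function describeFailure(botId, err, response, now = new Date()) {
  const record = { botId, timestamp: now.toISOString() };
  if (response) {
    // An HTTP response arrived, so the request reached the service.
    return { ...record, kind: "http", status: response.status };
  }
  // No response at all: a connection-level error such as ETIMEDOUT.
  return { ...record, kind: "network", code: (err && err.code) || "UNKNOWN" };
}
```

Calling it from the `catch` block around the send (with `response` left undefined), and again whenever the response status is not a success, would yield a list of records that can be pasted into the issue.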

corinagum commented 3 days ago

Assigning this to @vsvandelik; could you please leave a comment on this GitHub issue and I can assign it to you for tracking?

vsvandelik commented 3 days ago

@penguinsource As @corinagum mentioned - if you could provide us with the bot id and timestamps where the requests didn't come through, we can investigate on our side what could possibly be the issue.

if it helps, if i upload my service to another endpoint, it seems like it does not fail. so the most likely issue is with the dns smba traffic manager limiting traffic per IP/domain

Can you please confirm that this is still the case and that you don't see this issue on another endpoint? Are you still calling the same endpoint?
