Started receiving "unable to verify the first certificate" when interacting with the sdk

nodkrot commented 1 week ago

Versions

What package version of the SDK are you using. (botbuilder@4.16.0) What nodejs version are you using (v16.20.2) What os are you using (Mac)

Describe the bug

On July 3rd we started receiving "unable to verify the first certificate" error when starting and using botbuilder sdk APIs. It appears to happen to some but not all customers.

To Reproduce

Instantiate and use any operation botbuilder sdk (send or update message for example)

Expected behavior

No error

"exception":{"message":"unable to verify the first certificate","stack":"Error: unable to verify the first certificate\n    at new RestError (/var/www/app/node_modules/@azure/ms-rest-js/lib/restError.ts:18:5)\n    at AxiosHttpClient.<anonymous> (/var/www/app/node_modules/@azure/ms-rest-js/lib/axiosHttpClient.ts:194:15)\n    at step (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:141:27)\n    at Object.throw (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:122:57)\n    at rejected (/var/www/app/node_modules/@azure/ms-rest-js/node_modules/tslib/tslib.js:113:69)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)"},"level":"error","message":"Failure updating message: unable to verify the first certificate"

avilabiel commented 1 week ago

I'm getting the same error right now.

wilsonsantosnet commented 1 week ago

I'm getting the same error right now.

tim-tribble-ai commented 4 days ago

Getting the same error intermittently too!

firecode commented 4 days ago

we are receiving same error as well

BrendanAndrade commented 4 days ago

We are seeing the same on our two apps as well.

tracyboehrer commented 4 days ago

The OP indicates 4.16, which is quite old. What about the others in this thread?

tim-tribble-ai commented 4 days ago

I'm on 4.22.2

at1as commented 4 days ago

Same issue for us.

First occurrence: 2024-07-03 03:37:56 pm PDT.

Since then 290 further occurrences.

For the time being we're catching this error and simply retrying each request up to 3x and virtually all succeed. However we'd prefer to not have to do this.

firecode commented 4 days ago

we are seeing issue on 4.22.3 as well ..

tracyboehrer commented 4 days ago

Since this is happening on older versions, it's probably not related to a recent change in SDK. I'll check for issues with the other end. I'm assuming its failing during the request, especially since retries make it work.

I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

avilabiel commented 4 days ago

Since this is happening on older versions, it's probably not related to a recent change in SDK. I'll check for issues with the other end. I'm assuming its failing during the request, especially since retries make it work.

I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

We receive 500 if I'm not mistaken. Since it's intermittent, it feels like there is a load balancer pointing to a server with an expired certificate. My guess

at1as commented 4 days ago

The trace provided in this issue (third comment for requests.exceptions.SSLError) seems to indicate it's occurring with Python as well:

https://answers.microsoft.com/en-us/msteams/forum/all/unabletoverifyleafsignature-traffic-manager-teams/3ad8a8cb-bb41-4c46-ae4e-e31d0688b06e

In which case it may have nothing to do with bot-framework, which seems likely given these type of certificate issues generally originate on the server.

I have a support ticket open with Microsoft, however I've been informed that I don't have the correct tier of paid support to have anyone from engineering take a look.

If anyone has premium support I'd appreciate them raising the issue through a ticket.

at1as commented 4 days ago

I'm wondering what the HTTP response status code is. Is it being throttled? Though that is a specific status code and handled automatically. We can help mitigate by increasing our retries.

I think it's actually a success response code. Our 429/5XX logic wasn't catching this error. We had to add some custom logic to catch-and-retry it via matching on e.message.includes("unable to verify the first certificate"). As far as I can tell, the server believes the request is valid.

gitnavneet commented 4 days ago

Hi,

We started facing this error with our Teams App on 3/Jul/2024 and it has already impacted an important release. Before this, our App has been running fine since Dec'23. We are using botbuilder@4.20.0 on node.js@18.20.3.

The exception has the format: FetchError: request to \<URL> failed, reason: unable to verify the first certificate

type: system
errno, code: UNABLE_TO_VERIFY_LEAF_SIGNATURE

The target is of the form: "https://smba.trafficmanager.net/amer/v3/conversations/.../activities/..."

We face this intermittently when our Teams app sends or updates an adaptive card using the sendActivity or updateActivity methods of the TurnContext object in the Bot Builder SDK for Node.js.

We were able to replicate this issue in non-production and usually see it occur for 10% of the send/update attempts. Handling the exception and retrying the activity works on the first attempt. I think you should be able to fix this by identifying and fixing the failing certificate chain on the concerned nodes of your distributed infrastructure.

Looking forward to a quick fix! Thanks.

tracyboehrer commented 4 days ago

Are these all Teams bots? 4.21.0 is when JS SDK moved to MSAL auth (from ADAL). But 4.16 would still be ADAL of course.

tim-tribble-ai commented 4 days ago

Ours is a Teams Bot.

tracyboehrer commented 4 days ago

This is being actively investigated by the Teams group.

mclarke-logikor commented 4 days ago

Also experiencing this with a teams bot running 4.21.1

tracyboehrer commented 4 days ago

There has been at least two Sev 2's raised in that group. That is Microsoft terminology for high impact issue. It also means there will be eyes on it. I'll leave this open and post updates as I get them.

nikunj2001 commented 3 days ago

Same issue with two of our application.

YunnyChung commented 3 days ago

Hey team, do you still see the issues? Could you please let us know if you still see the issues?

at1as commented 3 days ago

Hey team, do you still see the issues? Could you please let us know if you still see the issues?

The last instance we observed was at 2024-07-08 04:00:17 pm PDT (~3 hours ago).

However, since it's intermittent and our app traffic volume is lower in the evening, we'll need to wait longer to be sure

YunnyChung commented 3 days ago

Thank you for sharing @at1as! For others, please do share whether you still see the issues or not with me here :)

One more question to the group: could you share your end point you are targeting like @gitnavneet did? I am mainly interested in the region that the end point contains. ex: "https://smba.trafficmanager.net/**amer**/v3/conversations/.../activities/..."

Hi,

We started facing this error with our Teams App on 3/Jul/2024 and it has already impacted an important release. Before this, our App has been running fine since Dec'23. We are using botbuilder@4.20.0 on node.js@18.20.3.

The exception has the format: FetchError: request to failed, reason: unable to verify the first certificate

type: system

errno, code: UNABLE_TO_VERIFY_LEAF_SIGNATURE

The target is of the form: "https://smba.trafficmanager.net/amer/v3/conversations/.../activities/..."

We face this intermittently when our Teams app sends or updates an adaptive card using the sendActivity or updateActivity methods of the TurnContext object in the Bot Builder SDK for Node.js.

We were able to replicate this issue in non-production and usually see it occur for 10% of the send/update attempts. Handling the exception and retrying the activity works on the first attempt. I think you should be able to fix this by identifying and fixing the failing certificate chain on the concerned nodes of your distributed infrastructure.

Looking forward to a quick fix! Thanks.

tim-tribble-ai commented 3 days ago

Haven't seen the issue in the last few hours, but, will wait and see how things go tomorrow!

Our region endpoint was amer

gitnavneet commented 3 days ago

Hey team, do you still see the issues? Could you please let us know if you still see the issues?

Hey @YunnyChung,

We tried replicating the issue again in non-production today, however, it has not recurred! 😄

In summary,

We faced a total of 70 errors in production some of which caused a bad UX.
The first exception occurred on 3-Jul-24 at 21:56:45 UTC.
The last exception was ~12 hours ago on 8-Jul-24 at 20:35:26 UTC.
The target endpoint was always in the amer region.

We will continue monitoring the production logs for a few more days and report back here if the issue recurs.

🤔 Could you please share the root cause analysis? Thanks!

nodkrot commented 3 days ago

Errors stopped for as well

YunnyChung commented 3 days ago

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community: For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

gitnavneet commented 3 days ago

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community: For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Our solution is deployed as an Azure Function app running on Windows.

avilabiel commented 2 days ago

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community: For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

I didn't see any errors yesterday, thanks for that! However, today we received a new one, probably related:

{
  "url": "https://smba.trafficmanager.net/amer/api/v2/[redacted]/channel/messages/[redacted]"
  "response": "read ECONNRESET"
}

gitnavneet commented 2 days ago

We saw this error once too, today at 6:15:57 UTC.

{
    "level": "error",
    "message": "An error occurred while handling the event FetchError: request to https://smba.trafficmanager.net/amer/v3/conversations/.../activities/... failed, reason: read ECONNRESET"
}

FBraz-RMFarma commented 2 days ago

Thank you so much everyone for sharing the information here! Yes, please do let me know whether errors have stopped or still occurring. We will share the root cause analysis when it is ready.

One another small favor I want to ask to this community: For those who encountered this issue, are you using Linux? Could you share the OS of the machine you used when this issue occurred?

Our bot is deployed as an service in Google Cloud Functions

tracyboehrer commented 2 days ago

@FBraz-RMFarma

Our bot is deployed as an service in Google Cloud Functions

No kidding? That's pretty cool.

tim-tribble-ai commented 2 days ago

I saw the ECONNRESET as well yesterday.

avilabiel commented 2 days ago

We saw this error once too, today at 6:15:57 UTC.

{
    "level": "error",
    "message": "An error occurred while handling the event FetchError: request to https://smba.trafficmanager.net/amer/v3/conversations/.../activities/... failed, reason: read ECONNRESET"
}

Yeah, we got 223 issues in the last hour 😞

gitnavneet commented 2 days ago

Yeah, we got 223 issues in the last hour 😞

@avilabiel , now you have me worried! 😮 How did those 223 new issues impact your app or end users?

For our app, it caused an ephemeral error message, but the action succeeded with a retry.

YunnyChung commented 2 days ago

Thank you everyone for sharing the issue here 🙏🏻 For those who started to see the error, could you help me one more time by sharing the following information?

Is this issue still happening? If so, how often does this issue occur? (ex: All requests are failing or some? If this happens intermittently, how often? (ex: 1%, 5%, 10% of the requests etc))
When did you start to see the issue? (date with approximate time (with timezone) will be super appreciated 🙏🏻)
- If the issue stopped happening, could you share the last time you saw the issue? (again date with approximate time will be super helpful here)
Could you share the end point with the region information? ex: https://smba.trafficmanager.net/amer/v3/conversations/.../activities/...

at1as commented 2 days ago

Latest: 2024-07-10 08:00:42 am PDT (50 mins ago) First: 2024-07-09 08:00:03 pm PDT Occurrences: 37

Our app is heavily skewed to triggering on the hour, for certain hours, so ~50 minutes of no occurrences does not tell us much. It is intermittent, but I don't have immediate counts.

We see it for this endpoint (though it's our most heavily used, so this isn't necessarily an exhaustive list):

https://smba.trafficmanager.net/amer/v3/conversations", "method":"POST"

YunnyChung commented 1 day ago

We are in active process for investigation & mitigation -quick two follow up questions here:

(1) For those failed requests, do you see those requests' responses containing header with the ms-cv? (2) Could you let me now if you see the decrease of the errors starting 07/10/24 8:30 pm in UTC?

gitnavneet commented 1 day ago

@YunnyChung ,

The read ECONNRESET error occurred multiple times for our app today (10/Jul/24). Some users missed important updates. So, we have postponed the app launch to a wider group.

(1) I did not notice any header with ms-cv in our logs. (2) Yes, errors have decreased. They have not occurred since 19:05:49 UTC.

Some error timestamps (UTC): 6:15:57 (first), 14:16:51, 14:19:28, 15:21:01, 16:50:57, 19:05:49 (last). Target: https://smba.trafficmanager.net/amer/v3/conversations/.../activites/...

microsoft / botbuilder-js