microsoft / BotFramework-Services

Microsoft Bot Framework Services
Creative Commons Attribution 4.0 International

directline errors #39

Closed ubreddy closed 5 years ago

ubreddy commented 5 years ago

I keep getting Direct Line errors every once in a while, multiple times during the day.

e.g. getaddrinfo ENOTFOUND directline.botframework.com directline.botframework.com:443

and some errors like the ones below:

'https://directline.botframework.com/v3/conversations/I02GjBASIPz4DOmNQeTxKL/activities/I02GjBASIPz4DOmNQeTxKL|0000000' failed: [503] Service Unavailable

'https://directline.botframework.com/v3/conversations/6lg94qXdQbv6WOaUTf42YO/activities/6lg94qXdQbv6WOaUTf42YO|0000000' failed: [500] Internal Server Error

'https://directline.botframework.com/v3/conversations/HBVeZlhbpVx1B1cjKD5QXS/activities/HBVeZlhbpVx1B1cjKD5QXS|0000000' failed: [502] Bad Gateway

Any reason why? It stops our Web Chat conversation flow.

Appid: 3b2bea4a-c6ac-4e54-a926-cbee6a1c1d2a

One thing common to all the errors is that the activity ID always ends with |0000000.

vincec-msft commented 5 years ago

The errors on the 1st and 2nd requests were caused by problems with Direct Line talking to internal storage services. We are improving the retry logic on those calls.

The 3rd request never got to Direct Line. Something on the bot or between it and Direct Line caused the error.

There will be transient errors such as these. I recommend retry logic with appropriate backoff times so that your bot doesn't get throttled. It would be nice if the SDK had such logic built in but I'm not sure if it does.
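A minimal sketch of the retry-with-backoff idea suggested above, written in Python purely for illustration. The names `send_with_retry` and `send`, the default delays, and the set of status codes treated as transient are all assumptions; none of them come from the Bot Framework SDK.

```python
import random
import time
from typing import Callable

# Status codes treated as transient here (an assumption, not an official list).
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def send_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that performs one POST and returns the
    HTTP status code. It may raise OSError for network failures such as a
    failed DNS lookup.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 0
    for attempt in range(1, max_attempts + 1):
        last_attempt = attempt == max_attempts
        try:
            status = send()
        except OSError:
            # No HTTP response at all; give up only once attempts are used up.
            if last_attempt:
                raise
        else:
            if status not in RETRYABLE_STATUS or last_attempt:
                return status
        # Exponential backoff with full jitter: wait a random time between
        # 0 and min(max_delay, base_delay * 2**(attempt - 1)) seconds.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, ceiling))
    return status


# Usage: a fake sender that fails twice with transient errors, then succeeds.
responses = iter([503, 502, 200])
recorded_delays = []
final_status = send_with_retry(lambda: next(responses), sleep=recorded_delays.append)
```

The random jitter spreads retries from many conversations over time, so a brief outage is less likely to be followed by a synchronized burst of requests that triggers throttling.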

Regarding the |0000000s: this is the activity id that the bot is responding to. It looks like every post from your bot uses the same activity id. That's a little strange but not a problem. Especially if your bot is sending proactive messages, then it would only have the initial message to respond to so this pattern would be expected.
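To check this pattern in your own logs, the reply URL can be split into its parts. This is a small sketch that assumes the path layout seen in the logged URLs in this thread (`/v3/conversations/<conversationId>/activities/<activityId>`); `parse_reply_url` and `ReplyTarget` are hypothetical names, not SDK functions.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class ReplyTarget(NamedTuple):
    conversation_id: str
    activity_id: str


def parse_reply_url(url: str) -> ReplyTarget:
    """Split a logged reply URL into conversation ID and reply-to activity ID."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout (taken from the logged URLs, not from a spec):
    # v3 / conversations / <conversationId> / activities / <activityId>
    if len(parts) != 5 or parts[1] != "conversations" or parts[3] != "activities":
        raise ValueError(f"unexpected URL layout: {url}")
    return ReplyTarget(conversation_id=parts[2], activity_id=parts[4])


target = parse_reply_url(
    "https://directline.botframework.com/v3/conversations/"
    "359sMaL8del1UtijuBqT2a/activities/359sMaL8del1UtijuBqT2a|0000000"
)
# True when the bot is replying to the activity numbered 0000000
# in that conversation, as in every error reported here.
replies_to_initial = target.activity_id == f"{target.conversation_id}|0000000"
```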

When posting reports please include the exact time of the errors. Thanks.

ubreddy commented 5 years ago

Yes, the activity IDs ending with 0000000 are proactive messages. If the conversation is closed, would sending proactive messages to the same conversation result in any of the above errors? It should not; Direct Line and the SDK should handle that state gracefully. Strangely, these errors are not getting logged in Application Insights either.

Log: Error: POST to 'https://directline.botframework.com/v3/conversations/359sMaL8del1UtijuBqT2a/activities/359sMaL8del1UtijuBqT2a|0000000' failed: [503] Service Unavailable

Timestamp: 2019-01-25T14:13:05+00:00

Log: Error: POST to 'https://directline.botframework.com/v3/conversations/EMlNAKytTvh3CqGtToBLw3/activities/EMlNAKytTvh3CqGtToBLw3|0000000' failed: [503] Service Unavailable

Timestamp: 2019-01-25T14:13:34+00:00

Log: Error: POST to 'https://directline.botframework.com/v3/conversations/CvuopaqxjEd5qBDXruIx0/activities/CvuopaqxjEd5qBDXruIx0|0000000' failed: [503] Service Unavailable

Timestamp: 2019-01-25T14:12:48+00:00

Log: Error: POST to 'https://directline.botframework.com/v3/conversations/4mudhEhQFeJFqFeeI3mPOn/activities/4mudhEhQFeJFqFeeI3mPOn|0000000' failed: [503] Service Unavailable

Timestamp: 2019-01-25T14:12:49+00:00

Log: Error: POST to 'https://directline.botframework.com/v3/conversations/2lNTiJBLgBkJWXK08goYbJ/activities/2lNTiJBLgBkJWXK08goYbJ|0000000' failed: [503] Service Unavailable

Timestamp: 2019-01-25T14:13:16+00:00

Log: Error: POST to 'https://directline.botframework.com/v3/conversations/4mudhEhQFeJFqFeeI3mPOn/activities/4mudhEhQFeJFqFeeI3mPOn|0000000' failed: [503] Service Unavailable

Timestamp: 2019-01-25T14:12:32+00:00

Log: Error: POST to 'https://directline.botframework.com/v3/conversations/9s47vjxS0uCGDKKz0L7bts/activities/9s47vjxS0uCGDKKz0L7bts|0000000' failed: [503] Service Unavailable

Timestamp: 2019-01-25T14:13:00+00:00

These are the errors with timestamps. They are occurring in a pattern, and more frequently of late, over the last week or so.

vincec-msft commented 5 years ago

The errors around 14:12 were caused by a bug in a Direct Line dependency. We are working with that team to resolve the issue. It is a high priority for all teams involved.

ubreddy commented 5 years ago

Service Unavailable errors at these times:

Timestamp: 2019-01-26T08:50:24+00:00
Timestamp: 2019-01-27T10:33:10+00:00
Timestamp: 2019-01-28T05:55:07+00:00

Internal Server Error at this time:

Timestamp: 2019-01-27T15:58:17+00:00

Any idea when Direct Line will stabilize? These errors were not occurring earlier; they only started appearing in the last 2-3 weeks or so, almost on a daily basis.

ubreddy commented 5 years ago

These are still coming almost every two hours, mostly Service Unavailable or Internal Server Error responses:

Timestamp: 2019-01-29T14:08:19+00:00
Timestamp: 2019-01-29T13:59:15+00:00
Timestamp: 2019-01-29T13:49:46+00:00
Timestamp: 2019-01-29T12:01:27+00:00
Timestamp: 2019-01-29T11:15:44+00:00
Timestamp: 2019-01-29T09:43:45+00:00
Timestamp: 2019-01-29T09:03:52+00:00
Timestamp: 2019-01-29T08:30:20+00:00
Timestamp: 2019-01-28T06:59:13+00:00
Timestamp: 2019-01-28T05:55:06+00:00

I thought these timestamps could help you fix the problem and make the service stable. Any idea when it will be stable?

EricDahlvang commented 5 years ago

Hi @ubreddy

I spoke with one of the Direct Line Service engineers about the issues you've been having. They provided this response:

There are a couple different problems:

  1. Errors from internal dependencies
     a. We are in the process of moving our internal databases to CosmosDB, which will give better performance and resiliency. That should be done in the next couple of weeks.
  2. Errors from the platform
     a. We've been battling a networking problem and have engaged other Azure teams to root-cause it. This is our highest-priority issue.
  3. Errors from deployment
     a. We are deploying some fixes that will make deployments less impactful. These should also be in place in the next couple of weeks.

The good news is that we're aware of each of these and making progress on their fixes. The bad news is that it's taking far too long and for that I apologize. These issues are frustrating for us, too.

Rashree138 commented 5 years ago

Hi Eric Dahlvang,

In our case, we are already using TLS 1.2 and the other recommended settings, but we are still getting the error. Please let me know what changes we need to make.

The line below is raising the error:

using (var response = await client.PostAsync(url, strContent).ConfigureAwait(false))

Thanks, Raj

EricDahlvang commented 5 years ago

Hi @Rashree138

This issue is not related to TLS 1.2, but rather to the service, as explained here: https://github.com/Microsoft/BotFramework-Services/issues/39#issuecomment-463321808

thedrew12 commented 5 years ago

Any update on this? Today I'm getting 502s at one point, and then the API starts working again.

EricDahlvang commented 5 years ago

@thedrew12 The fix for the issue is still in progress.

EricDahlvang commented 5 years ago

Closing this issue. The fixes for this have been completed.