microsoft / BotFramework-Services

Microsoft Bot Framework Services
Creative Commons Attribution 4.0 International
38 stars 11 forks source link

Randomly get "No such host is known" when calling directline #318

Closed rR6kULhc5xgS closed 2 years ago

rR6kULhc5xgS commented 3 years ago

This has been randomly happening with our bot for the past couple weeks where we will suddenly get a burst of System.Net.Http.HttpRequestException with the message "No such host is known". These exceptions will coincide with failed dependency calls to directline.botframework.com/v3/conversations/{conversationId}/activities/{conversationId} that just have a result code of "Faulted".

Usually when this happens it's only a small number and it happens over the course of a few seconds and then everything is fine. At worst, we had 150 such exceptions occur at once though. As far as I know, it can affect every active conversation when it happens. We've seen this happen twice in the same day, but we've also seen it go days without happening.

The last occurrence was on 9/5/2021, it happened 5 times and affected 3 conversations.

Time ConversationId
2021-09-05T20:30:40.344Z BbF399eQPcGBSJYD02Fq8n-j
2021-09-05T20:30:41.819Z DFBDR9Ph63TKKOQcsR1DJl-o
2021-09-05T20:30:44.239Z BbF399eQPcGBSJYD02Fq8n-j
2021-09-05T20:30:45.611Z 4lEDhShN7YnFKQWEJWqpSt-6
2021-09-05T20:30:45.857Z 4lEDhShN7YnFKQWEJWqpSt-6

The Microsoft App Id is e40a7baf-d26a-4255-8506-1228b9d043dc.

Any help is appreciated and if there's any additional information I can provide please let me know.

kevinfms01 commented 3 years ago

I took a look at server logs on our side and didn't see anything suspicious. I did see those conversation Ids show up in the logs in that general timeframe. A few questions for you: 1) When did the problem start happening? Did something change on your side that coincided with when this started happening? 2) Is there a VPN, proxy, or some other layer that is running that would be managing outgoing calls? 3) Are calls to other services impacted by this issue at the same time? 4) Do you do anything to fix the problem or does it correct itself on its own?

Thanks, Kevin

rR6kULhc5xgS commented 3 years ago

Thanks for your help.

  1. It looks like it's something that's randomly happened going back months, but it only started happening with more consistency (happening at least every 2 or 3 days) on 8/29. Nothing significant has changed in our application, but we have been steadily increasing the amount of traffic the bot handles, and it's increased a fair bit in the last month.
  2. Not that I'm aware of. We have a bot built with Composer that's deployed to an Azure App Service and a separate front-end application running BotFramework-WebChat that talks to it, also deployed to an Azure App Service.
  3. I haven't seen any other errors that coincide with this problem.
  4. The errors are normally clustered within around ten seconds of each other, so it appears to self-correct quickly. The time we had ~150 errors they were spread out over 30 seconds before it self-corrected. I've never had to do anything to fix it.

Thanks again

anusharr commented 2 years ago

@rR6kULhc5xgS, can you share your WebChat implementation code? Also, do you get any kind of error/warning on Composer while testing locally?

rR6kULhc5xgS commented 2 years ago

I have not seen any of these errors when running locally through Composer, or had a situation where messages suddenly wouldn't go through for a few seconds and then returned to normal. That's not to say it couldn't happen in that environment, but with how randomly and quickly it happens in production, I'd probably have to be lucky to catch it locally.

We are using ReactWebChat and on initial load we are getting the token and then creating the direct line and storing that in the state. Once that completes we dynamically load the ReactWebChat using the directline from the state.

Our ReactWebChat looks like this

<ReactWebChat
 activityMiddleware={activityMiddleware}
 attachmentMiddleware={attachmentMiddleware}
 directLine={botState.directLine}
 store={store}
 userID={appState.cookieObj?.VisitorID}
 styleOptions={WEBCHAT_STYLE_OPTIONS}
 className='chat-window-component'
 />
ChatApi.getToken(
    pageContext.chatToken,
    (res) => {
        botDispatch({type: 'directLine', payload: createDirectLine({ token: res.token })});
    },
    (err) => {
        //
    }
);

For sending messages we are using everything built into the library. We do sometimes send messages to the bot manually via directline if some events warrant it. In these cases we call the built in dispatch function.

Please let me know if you need any additional information.

Thanks

anusharr commented 2 years ago

@rR6kULhc5xgS, are you making any external api calls which is taking more than 15secs/long running calls? We do see a few operations running more than 15secs.

rR6kULhc5xgS commented 2 years ago

@anusharr We have a few flows within the bot that are making external API calls. I definitely wouldn't expect any of them to normally take 15 seconds or more, and we don't make any purposefully long-running calls. But that's not to say an API request couldn't take a long time to respond on occasion. I can try and see if slow outgoing requests correspond with these errors when they happen. On the front-end we keep a SignalR connection open, but I'm assuming that's outside the scope of your question.

anusharr commented 2 years ago

@rR6kULhc5xgS, a few more clarifying questions to help understand the issue better:

Also, as you mentioned in your above comment, it would be helpful if you can check if the slow outgoing requests correspond with the errors when they happen.

rR6kULhc5xgS commented 2 years ago

@anusharr Thanks for continuing to investigate this. For your first few questions, I've asked the front-end engineer to respond, since he can probably provide better answers than I can into how all of that is set up.

I haven't seen much consistency around when the error happens in the lifetime of a conversation, but anecdotally I'd say it's happening a few minutes in normally. As far as I know, there isn't token refresh logic in the app. But we do have FullStory set up and I've seen where there's an active conversation, they send a message during one of these periods where the "no such host" is happening, their message doesn't go through, but then the next message they send goes through fine. So this isn't confined to conversations that have been sitting idle.

Regarding the long requests, for the timestamps I gave in my original post, I do see four longer requests happening with the bot at that time between 6 and 12 seconds long. However, the dependency calls going out at that time that are longer are the faulted "no such host is known" calls to directline (also ranging from 6 to 12 seconds). I don't see any long-running calls from the bot to any other service around that time.

Full disclosure, I have not seen this error happen since 9/11. Since I can see it in our logs going back months, and it has had long gaps without happening before, I'd be hesitant to say it's resolved, but I wanted to provide that date in case anything has changed since then. On my side, we haven't deployed anything that we think would have affected it.

Thanks again

JDKMX33 commented 2 years ago

@rR6kULhc5xgS, a few more clarifying questions to help understand the issue better:

  • Is botDispatch() the "dispatch function" you are referring to? If not, can you share that code?
  • What is the code behind botDispatch()? Is this where the Direct Line object is being saved to state? Or, is it only returning the token?
  • You mention that you are dynamically loading Web Chat from the Direct Line object stored in state, indicating that Web Chat is getting reloaded periodically. Is the assumption right?
  • How far into a conversation does this error tend to occur? Do you have a method in place for refreshing the token?

Also, as you mentioned in your above comment, it would be helpful if you can check if the slow outgoing requests correspond with the errors when they happen.

Correct, this botDispatch is for storing the direct line in that state, we are using useReduce.

MichaelHaydenAtMicrosoft commented 2 years ago

@rR6kULhc5xgS @JDKMX33 - Could you help fill in the blanks on these other Qs here?

JDKMX33 commented 2 years ago

@rR6kULhc5xgS @JDKMX33 - Could you help fill in the blanks on these other Qs here?

  • What is the code behind botDispatch()? Is this where the Direct Line object is being saved to state? Or, is it only returning the token?
  • You mention that you are dynamically loading Web Chat from the Direct Line object stored in state, indicating that Web Chat is getting reloaded periodically. Is the assumption right?
  • How far into a conversation does this error tend to occur? Do you have a method in place for refreshing the token?

This is just storing the directLine in the state. This does not return a token, just stores it in the state.

It is dynamically reloaded but only after we have received the directline. After that, the chat does not reload.

We do not have an refreshing the token logic.

anusharr commented 2 years ago

@JDKMX33, thanks for the above info. Can you share your complete Webchat code for us to investigate further?

rR6kULhc5xgS commented 2 years ago

@anusharr I am verifying that we can share the code. What is the best way to share that with you? My organization uses Teams if that's easier than something like email.

anusharr commented 2 years ago

@rR6kULhc5xgS, you can share your code via email. I have added my email id in my GitHub bio.

rR6kULhc5xgS commented 2 years ago

@anusharr I sent the code via email. Thanks.

anusharr commented 2 years ago

Investigating.

rR6kULhc5xgS commented 2 years ago

Hey @anusharr, I wanted to check if you had any updates on this issue, or if there was anything additional I could provide to help with your investigation. Thanks!

stevkan commented 2 years ago

@rR6kULhc5xgS, thank you for your patience. Unfortunately, @anusharr had to attend to some outside matters, so I will be assisting in her place going forward. Before doing so, she shared the code with me that you had sent to her. I will take a look at and get back to you with any questions I may have.

stevkan commented 2 years ago

@rR6kULhc5xgS, thank you again for your patience. Unfortunately, some necessary testing was needing to be done on our SDK which blocked my time, but I am now free. I looked at your code and was hoping to run it but it looks like it may only be part of the app. Are you able to provide a reproducible working copy that I can test against?

In the meantime, I'll continue to look over the code already provided to see if I spot anything out of the ordinary.

rR6kULhc5xgS commented 2 years ago

Thanks, @stevkan. I sent the complete code to the email listed on your GitHub profile.

stevkan commented 2 years ago

@rR6kULhc5xgS, can you check that the email sent? I don't see it in my inbox or junk mail folder. If it's not compressed or is sizeable, it may be getting blocked before reaching me. Another option is to store it via a cloud service and provide a link I can download it from.

rR6kULhc5xgS commented 2 years ago

@stevkan Sorry for the delay. I sent you another email with a link to download the project. Thanks.

stevkan commented 2 years ago

@rR6kULhc5xgS, @JDKMX33 - thank you for your patience.

First, I'm curious to know if the issue is still occurring. The last date mentioned was on 9/11.

As for running the provided code, is this something I could run locally with the right appsettings properties entered? If so, can you tell me what the minimum appsettings properties are that need to be filled in. I've taken a few stabs at what I think is right, but am obviously wrong as it's not starting up without exceptions. If you could let me know the properties and the type of value to be entered, that would be great.

With regards to the code itself, I'm not seeing anything that appears out of the ordinary. But, it's also a complex implementation. I will keep looking it over for anything. I also plan on reaching out to a dev on the Web Chat team to see if he notices anything. I don't know if, or when, I may hear from him as he has a lot on his plate, at this time.

Lastly, can you elaborate a bit more on the following? I'm not sure I follow on whether this is a concern or not.

But we do have FullStory set up and I've seen where there's an active conversation, they send a message during one of these periods where the "no such host" is happening, their message doesn't go through, but then the next message they send goes through fine. So this isn't confined to conversations that have been sitting idle.

rR6kULhc5xgS commented 2 years ago

@stevkan Thanks for your continued help with this. I sent you an updated solution link that will hopefully run for you, it just needs your own chat bot token added.

This does still happen to us, but much less frequently than before. The last time this happened was around 12:34 AM on 12/9/2021 (UTC), and seemed to affect at least 5 conversations.

Regarding the comment I made about FullStory, I was just clarifying that it can happen to active conversations and then those same conversations seem to be fine after it happens, though a message sent right at the time of the error can be lost. I think the main point I was trying to make was that the errors weren't caused by conversations sitting idle and timing out.

stevkan commented 2 years ago

@rR6kULhc5xgS, unfortunately, I am unable to get your solution to run. A colleague and I have run thru the code but I simply can't get past the initial conversation update. Any activity sent after the conversation update consistently results in a 502 error regardless of changes we make to the code to try and alleviate the issue.

Also, while I don't think this is the source of the issue, why in the storeHelper.js file do you use DirectLine's postActivity() instead of using one of the built-in Web Chat options such as piggyback-on-outgoing-activities or post-activity-event?

While postActivity() does work, from the Web Chat perspective it is not considered a best practice since it ends up bypassing core Web Chat functionality which could result in unusual results.

JDKMX33 commented 2 years ago

@stevkan thank you for sharing the piggy back and post activity event links. The postActivity() was carried over from previous version of webchat we were using. We only use it in two places, when the connection is fulfilled and on the unload event before the user window closes.

stevkan commented 2 years ago

Revisiting the frequency and occurrence - I'm not sure this was answered:

  • How far into a conversation does this error tend to occur?

Additionally, have you tried repro'ing the issue connecting as a user and communicating with the bot? I'm wondering if any warnings or error messages are generated in the dev console of the browser.

The reason I ask is I've encountered an issue in the past, when using Web Chat's store, when trying to send an activity when connection is fulfilled. It's not a Web Chat issue, per se, but the warning manifests there. Essentially, for example, if my bot is set up to send an activity on conversation update and I also have Web Chat set up to send an activity when connection is fulfilled, the two will collide. The bot's activity wins out and the Web Chat activity time's out (before trying again, if memory serves me right). I don't know but would guess that using DirectLine's postActivity() would result in the same. I'm thinking that this could be a possible cause of the log exceptions.

To be clear, I believe this issue was addressed an alleviated in a subsequent release of Web Chat, at the time. But, it might be worth investigating. As I cannot get your code to run, I'll need you to check this. It may take a few tries as it may be subject to a race condition.

As a side question, does the "before unload" event actually fire when the user closes the window? I ran into this as an "issue" earlier this year where Chrome (and chromium-based Edge), and I think Firefox, added a security measure that stops "before unload" events from blocking the closing of a window. I was able to find a nice workaround to it, though. If you find it is not firing, let me know and I'll give you the details of the workaround.

rR6kULhc5xgS commented 2 years ago

For how far into the conversation it happens, I've never seen any consistency. It could happen at any time, and since it would affect all active conversations at once, it could impact a conversation that had just started as well as one that had been going for a while at the same time. I've never been able to reproduce the problem locally or when testing otherwise. The good news is that this issue seems to happen very rarely for us now, so I think something has helped to alleviate it quite a bit. When I first posted this issue it was happening fairly often and would impact several conversations every time it did. With how little it happens now, and since it doesn't seem like there's a clear path forward on it, it's probably fine to close this issue so you don't waste any more of your time on it.

The before unload event generally works without any issues for us. We just try to get one request sent off to end the conversation if it's with a live agent.

Thank you for all of your help on this issue. If I discover anything new I'll let you know.

stevkan commented 2 years ago

Closing the issue as mitigated/resolved. As you mentioned, if you have new data to share or the frequency of occurrences increases significantly, please feel free to re-open.