First response after bot idling is way too slow

microsoft / botframework-sdk

Bot Framework provides the most comprehensive experience for building conversation applications.

MIT License

7.5k stars 2.44k forks source link

First response after bot idling is way too slow #6519

Closed johnkm516 closed 2 years ago

johnkm516 commented 2 years ago

Github issues should be used for bugs and feature requests. Use Stack Overflow for general "how-to" questions.

Version

What package version of the SDK are you using. 4.16.1 Dot net SDK Bot has been created using Bot Framework Composer. Apps are hosted in West-US region.

Describe the bug

Initial bot replies take way too long when the bot has been idling for a while.

To Reproduce

Steps to reproduce the behavior: 1, Set "Always On" setting in app configuration on Azure to True.

Send a message to the bot after nobody has used the bot for at least 1.5~2 hours (for example, during the night when everyone in the organization is sleeping)
Bot takes at least 15 seconds and upwards of 45 seconds to reply for the first response after idling.

Expected behavior

Not take 45 seconds to reply.

Additional context

I have searched several other issues on this (most of them years old) but it seems this issue still has not been resolved. From my research all of them suggest to set the "Always On" in the app configuration to True, and I have already done this. It seems there is some sort of start-up delay when the bot has been idle for a long time.

My educated guess as to what is causing this problem : https://blog.botframework.com/2017/11/02/optimizing-latency-bot-framework/

While this article is almost 5 years old, I believe the issue that is causing this problem is the token expiring. Some reasons as to why : 1) While the blog post is old there is a possibility periodic token refreshing was not implemented into the SDK for whatever reason and was expected that the user implement this 2) I have tried to time when the bot requires another "start-up" time, and it is always after 1.5~2 hours of idling (when nobody has been using the bot), which might be the amount of time the token takes to expire. 3) As I said above the "Always On" setting is set to True in the app configuration on Azure, and I have checked the app is definitely running before sending the message after idling (so this is a runtime issue) 4) To check if this issue is per user, I let the bot idle for ~2 hours, then sent a message. The first reply as always took ~20 seconds for a response. I then went to a colleague's machine and sent a message to the bot, and the response was near instant.

Is there a reason the periodic token refreshing seen in this blog post is not implemented into the SDK by default?

Edit :

To add, this issue is compounded by the use of skills. Because skills are essentially separate bots on their own, if nobody has used the chatbot for 1.5 ~ 2 hours both the root bot and the skill tokens expire, and the time to reply is compounded. If the first user utterance after idling the bot for a sufficient amount of time invokes a skill, the reply take takes upwards of a minute.

stevkan commented 2 years ago

@johnkm516 - to clarify, this issue occurs irrespective of the bot having a connected skill? In other words, if both the bot and skill run separately does this issue still occur. And, if so, does it occur with both or just one?

Also, would you be able to share your bot (minus any secrets/keys) so I can review it just to be sure it isn't a "code" generated issue?

johnkm516 commented 2 years ago

@stevkan I have not tested the skills separately, but yesterday when I tested an utterance that invokes a skill on my root bot via MS Teams the reply took significantly longer (almost a minute to reply), which leaves me to believe the issue is compounded when calling a skill (if it indeed is a token problem, both the root bot and skill seems to take ~20 seconds each causing the reply to take upwards of a minute). And yes, the issue occurs irrespective of the utterance invoking a skill or not. Invoking a skill happens to take much longer, but calling just the root bot takes upwards of 15 seconds to reply after the bot has been idle for a long time.

I unfortunately cannot share my code as this bot is built for my company and it invokes internal APIs. To track down and see if it's indeed a code issue I have ruled out the following :

To start, the chatbot was built on top of the Enterprise Assistant Template downloaded from the Composer. The issue was noticeable in the very early stages before I connected any external APIs; I began work on this bot about January of this year and one of the first things I did trying to resolve this issue after uploading the bot onto Azure was to change the configuration setting to "Always On".

1) External APIs that my bot calls cannot be the issue since I've been experiencing this issue from the beginning before I implemented any custom API calls, and the API servers that my bot calls are hosted on an on-premise rack and the API server is always running (the HTTP call would fail rather than take ~20 seconds to return) 2) Middleware cannot be an issue - I have Middleware that saves / updates a chatlog and conversation references in a document (per conversation) on CosmosDB, and to test whether or not this Middleware was causing the issue I have deleted the entire container multiple times over my tests to see if this was the cause, but the bot's response was fast even after deleting the container.

You could try downloading the Enterprise Assistant template on Composer, I don't know if the template has been updated since I have made my bot out of it and the problem fixed. But considering there's not many avenues for a composer based bot to cause the delays that I am experiencing I believe the issue resides in the SDK.

Forgive my ignorance here I'm just making guesses to be honest, I am wondering if the blog article I linked in the OP has any plausibility to be the cause of the issue.

My suggestion to test this out is to :

1) Download and upload the Enterprise Assistant template on Azure (as this would be closest to my codebase without me having to share the code) 2) Turn the app configuration setting to "Always On" 3) Just to be safe let the bot idle without any utterances for at least 24 hours 4) Send an utterance that invokes the People skill, like "Who reports to John?" (Invoking the skill should compound the issue further and the response should take much longer)

Please let me know if you experience the same delayed response (at least 10 seconds or so to respond) that I am experiencing after the bot has been idle.

johnkm516 commented 2 years ago

@stevkan As I mentioned in the above post, I have middleware that saves and updates conversation references along with a chatlog to CosmosDB. Here is the log along with timestamps showing how long the first reply took when I tested just now after letting the bot idle for 24 hours. Please excuse the fact that the utterances are in Korean :

            {
                "UserMsg": "(2022-07-07 23:35:47) User : 김상효가 누구야",
                "BotMsg": null,
                "UserActivityId": "1657236945868",
                "BotActivityId": null,
                "HasBotAdaptiveCard": false,
                "botActivity": null
            },
            {
                "UserMsg": null,
                "BotMsg": "(2022-07-07 23:36:02) Bot : Here is what I found for **김상효(John Kim)**:\r\n",
                "UserActivityId": null,
                "BotActivityId": "1657236963450",
                "HasBotAdaptiveCard": false,
                "botActivity": null
            },
            {
                "UserMsg": null,
                "BotMsg": "(2022-07-07 23:36:04) Bot : Adaptive Card Attachment",
                "UserActivityId": null,
                "BotActivityId": "1657236966468",
                "HasBotAdaptiveCard": false,
                "botActivity": null
            }

As you can see the time it took between the user message and the bot's first reply is (23:36:02 - 23:35:47 = 15 seconds). The first response varies wildly in response time, at minimum it takes about 15 seconds, on other times I've seen it take upwards to a minute to respond. The two second delay between the adaptive card and the first bot response is the delay it takes for teams to render the adaptive card.

I then tested again by sending the bot the same utterance right after, and you can immediately see the difference in performance after the bot has "spooled up" :

            {
                "UserMsg": "(2022-07-07 23:39:31) User : 김상효가 누구야",
                "BotMsg": null,
                "UserActivityId": "1657237170622",
                "BotActivityId": null,
                "HasBotAdaptiveCard": false,
                "botActivity": null
            },
            {
                "UserMsg": null,
                "BotMsg": "(2022-07-07 23:39:34) Bot : Here is what I found for **김상효(John Kim)**:\r\n",
                "UserActivityId": null,
                "BotActivityId": "1657237174992",
                "HasBotAdaptiveCard": false,
                "botActivity": null
            },
            {
                "UserMsg": null,
                "BotMsg": "(2022-07-07 23:39:35) Bot : Adaptive Card Attachment",
                "UserActivityId": null,
                "BotActivityId": "1657237182374",
                "HasBotAdaptiveCard": false,
                "botActivity": null
            }

As you can see the reply now only takes 3 seconds. Then I went to a colleague's computer to test the same utterance again (the colleague of course also had not used the bot for more than 24 hours), and it's even faster, proving this is not a per-user issue but a global issue :

           {
                "UserMsg": "(2022-07-07 23:47:39) User : 김상효가 누구야",
                "BotMsg": null,
                "UserActivityId": "1657237658705",
                "BotActivityId": null,
                "HasBotAdaptiveCard": false,
                "botActivity": null
            },
            {
                "UserMsg": null,
                "BotMsg": "(2022-07-07 23:47:41) Bot : Here is what I found for **김상효(John Kim)**:\r\n",
                "UserActivityId": null,
                "BotActivityId": "1657237661912",
                "HasBotAdaptiveCard": false,
                "botActivity": null
            },
            {
                "UserMsg": null,
                "BotMsg": "(2022-07-07 23:47:42) Bot : Adaptive Card Attachment",
                "UserActivityId": null,
                "BotActivityId": "1657237664023",
                "HasBotAdaptiveCard": false,
                "botActivity": null
            }

The data-in chart of the bot's app service on Azure showing the Always On setting periodically pinging the bot every 5 minutes before the spike in data-in where I send messages to the bot :

If : the application is most definitely running, the issue is not per user, and there is some sort of global timeout on the app which is causing the first response after the bot has been idle to take a long time to respond, the only origin of this issue I can think of is some auth token that is expiring. Whether it be the bot's auth token by itself, perhaps it's CosmosDB or any other services that the bot may call such as LUIS. The 5 year old blog article that I linked in the OP strengthens my suspicions, but again these are just guesses on my part. If it was a "code" issue I cannot think of how the response time is variable in this way. Nothing I added in terms of Middleware or dialogs to my bot would result in this type of "spool up" delay behavior.

stevkan commented 2 years ago

@johnkm516 - Thank you for your patience as well as for the detailed responses. I'm in the process of testing the Enterprise Assistant template and will let you know what I find.

Another thought, could this be a firewall / proxy (or related) issue? Do you experience delays of any sort when you try communicating with the bot from a computer that is located off-site? For instance, personal laptop at home running solely on your own network?

As I don't think it was mentioned / asked, which version of Composer have you been developing on? I also see you are using v4.16.1 of the SDK. Were you using any other versions to start?

johnkm516 commented 2 years ago

@johnkm516 - Thank you for your patience as well as for the detailed responses. I'm in the process of testing the Enterprise Assistant template and will let you know what I find.

Another thought, could this be a firewall / proxy (or related) issue? Do you experience delays of any sort when you try communicating with the bot from a computer that is located off-site? For instance, personal laptop at home running solely on your own network?

As I don't think it was mentioned / asked, which version of Composer have you been developing on? I also see you are using v4.16.1 of the SDK. Were you using any other versions to start?

@stevkan I have worked on, tested, and deployed my bot on my personal PC at home on multiple occasions and sent messages to the bot via MS teams on my personal PC, my phone, etc. The symptoms were the same.

My composer version :

My library references :

<ItemGroup>
    <PackageReference Include="Microsoft.AspNetCore.Mvc.NewtonsoftJson" Version="3.1.25" />
    <PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
    <PackageReference Include="Microsoft.Bot.Builder.AI.Luis" Version="4.16.1" />
    <PackageReference Include="Microsoft.Bot.Builder.AI.QnA" Version="4.16.1" />
    <PackageReference Include="Microsoft.Bot.Builder.AI.Orchestrator" Version="4.16.1" />
    <PackageReference Include="Microsoft.Bot.Builder.Dialogs.Adaptive.Runtime" Version="4.16.1" />
    <PackageReference Include="Microsoft.Bot.Builder.Dialogs" Version="4.16.1" />
    <PackageReference Include="Microsoft.Bot.Components.AdaptiveCards" Version="1.2.1" />
    <PackageReference Include="Microsoft.Bot.Components.Graph" Version="1.2.1" />
</ItemGroup>
<ItemGroup>
    <ProjectReference Include="..\CustomComponents\CustomComponents.csproj" />
    <ProjectReference Include="..\ChatLogComponent\ChatLogComponent.csproj" />
</ItemGroup>

I originally had default SDK versions from the Enterprise Template but since I started adding my own Middleware (CustomComponents and ChatLogComponent), I started compiling the project on VSCode since it was much faster to deploy and test (Composer automatically runs a whole bunch of scripts like uploading to LUIS and QnAMaker which makes the compile and deploying locally takes like 10 minutes even when I'm not publishing to Azure). I thought it wouldn't hurt to update the SDK since the Composer's actions themselves are in Dialogs.Adaptive.Runtime which I also updated to 4.16.1. The SDK version that Composer (and project) was on was originally 4.12 I think. Composer still shows this in the about section, but it might have updated during the months while I was working on this project : The symptoms I am experiencing are the same since the very beginning when I first deployed the base Enterprise Assistant template without any changes on Azure and now across the different SDK versions. I also remember modifying the NewtonsoftJson library version and setting the MaxDepth in Startup.cs to 1024 since some of the adaptive cards were showing that dreaded max depth issue, but again I don't believe any of these changes should be relevant since I remember the stock Enterprise Assistant template had this issue long before I made any of these changes.

To note if you plan on running and deploying the bot + skills locally on VSCode to test, Composer also keeps automatically overwriting appsettings.json. But in order to run via VSCode you have to add MicrosoftAppPassword for each bot as well as authoringKey and the endpointKey for LUIS otherwise the bot will not run properly (the Composer will keep deleting this every time you open the project or go to the project settings). What I do is to push the appsettings changes to the repo and revert changes that Composer makes automatically to the appsettings.json every time before I run the project.

I also left out a minor detail in the original post, although you probably assumed it already. I of course set all the settings on all the skills to "Always On" as well, and have verified the 5 minute periodic pings on the skills as well.

Edit : Not that this matters since it's not the point of the test, but just to add if you have any issues with the adaptive card not rendering properly on MS Teams or returning an error in the stock Enterprise Assistant template, try copying the returned JSON in the adaptive card trace in the emulator and pasting it in the Adaptive Card designer website. I remember the stock Enterprise Assistant had some issues with the adaptive card JSON, but verifying the JSON through the Adaptive Card designer helped me debug and make the necessary changes to render the cards properly. I never bothered to post an issue on this and this issue might already be resolved since when I downloaded the template.

stevkan commented 2 years ago

@johnkm516 - Thank you for the additional info. Unfortunately, I am unable to repro the issue you are experiencing. I have tested after waiting an hour, several hours, and several days but the bot (and skills) always responded with an initial response within 7-9 seconds. This could be better, but it was consistent.

Does this issue only occur when you are communicating with the bot via Teams?

And, just to clarify, as it wasn't specifically stated, are you working behind a firewall or proxy? I don't believe you are given you said you tried from your personal device. To this end, when you tried on your personal device, were you connecting to the same tenant (i.e. your work's Teams account)? Or did you deploy the bot to a personal tenant/Teams account? Again, there could be something org related that is interfering.

Do you have Application Insights setup in Azure for the bot? If so, can you check the logs and see if there are any reported errors when communicating after idle time? If not, it's easy to setup: just add the app insights resource to the resource group, and then include the Application Insights API key in your bot's Configuration blade in Azure.

Another option is to install Fiddler to trace the calls that are being made when communicating after idle time. That may show additional information.

A couple other discussion points:

The latest version of the SDK runtime package for Composer is v4.12.0. This won't be an issue for running the bot. Additionally, without cloning the Composer project, updating the SDK version used, and re-building the project, you shouldn't be able to affect this version. You can update the SDK version used in your bot by opening the bot solution in VS/VS Code and updating the packages there.
Regarding the Newtonsoft Json library and the MaxDepth issue, the recommendation from the dev team is to align the Newtonsoft Json library version to that used in the SDK (v13.0.1). The MaxDepth value can then assigned the null value (for reference, see here).
As for Composer overwriting the appsettings file and not saving the secrets/keys, this is by design. Composer will always overwrite the appsettings file when the bot is run in Composer. And the keys are only saved locally via localStorage for security purposes. In this way, the secrets/keys aren't accidentally shared if the project ever changes hands. It is a bit of a nuisance when working outside of Composer, but it is for good reason.

johnkm516 commented 2 years ago

@stevkan Thank you for your detailed response.

Does this issue only occur when you are communicating with the bot via Teams?

Yes. I have only ever deployed the bot on Teams as my organization uses Teams. I cannot say how if this issue only occurs on Teams since Teams is the only channel I have deployed to.

And, just to clarify, as it wasn't specifically stated, are you working behind a firewall or proxy? I don't believe you are given you said you tried from your personal device. To this end, when you tried on your personal device, were you connecting to the same tenant (i.e. your work's Teams account)? Or did you deploy the bot to a personal tenant/Teams account? Again, there could be something org related that is interfering.

My organization does have a firewall, but this cannot be the issue as I still had the issue on devices outside the organization. I am connecting to the same tenant (on my organization's Teams account), and the bot is deployed as a Teams App to the entire organization.

I unfortunately did not have Insights enabled for my root bot. However I happen to have had Insights enabled for the People skill. So I went about searching the requests log for about the time ranges when I had the long delays as seen from the chatlog I posted above. The time range is only 3 days, from July 6-8, but I can see three requests that resulted in a long delay :

Notably the first request here took a whopping 17 seconds. The second and third request seems to be for the metrics not bot messaging, judging by the customDimensions value. So in this time range the delay took place for the one time that I tested above.

Extending the time range further to the entire month of July :

I can see a huge number of requests that are experiencing delays. And while I thought the issue was token related, looking at the timestamp it seems to be quite random and occurring in a cluster at a certain time interval. Perhaps this is Teams channel related as you said.

Edit :

Since I've been updating the bot a lot I found a lot of exceptions that carried into production that are already fixed. So I went about trying to find a time range where the delays occurred but there weren't exceptions related to code bugs. The delays occur very regularly so regardless of whatever time range I choose in a 2-3 day range there is always 2-3 records where a request takes up to ~20 seconds. The most recent range, July 15th - July 20th, has no exceptions but still has delays in the GET Message requests.

stevkan commented 2 years ago

The delays occur very regularly so regardless of whatever time range I choose in a 2-3 day range there is always 2-3 records where a request takes up to ~20 seconds.

Does each set of occurrences happen at roughly the same point in time as on other days (i.e., always around 10:17 am)? Or do you mean they regularly occur every 2-3 days, but the exact time varies (i.e., first set of occurrences: 10:17 am, second set of occurrences: 1:32 pm, third set of occurrences: 12:05 pm)?

Have you checked to see if there are any other processes that might be running, related to this project or not, that might be causing a slowdown in network connectivity?

Are you able to share your Insights logs by either posting them here, sending in an email (my email is in my GH profile), or sharing across a secure folder?

Apologies if I already asked this but I'm not seeing that I have. Do you see the same delays if you run the bot locally and use a tunneling service, like ngrok, to connect the locally running bot to your Azure bot resource? In this scenario, you would replace the messaging endpoint in the bot's Configuration page with the ngrok endpoint + '/api/messages' (e.g., 'https://eef210880a5e.ngrok.io/api/messages'). The tunneling service would allow you to still test in Teams but check if the locally running bot also suffers from network delays.

johnkm516 commented 2 years ago

@stevkan

Sorry for the delay. Due to MS Teams issues a few days ago I couldn't test properly. Does each set of occurrences happen at roughly the same point in time as on other days (i.e., always around 10:17 am)? Or do you mean they regularly occur every 2-3 days, but the exact time varies (i.e., first set of occurrences: 10:17 am, second set of occurrences: 1:32 pm, third set of occurrences: 12:05 pm)?

From what I can see in the logs, it's entirely random. But the pattern is always the same. If nobody uses the bot for a while, the first response / few responses take ~10s to respond, but once users across the organization use the bot more regularly the responses become quite fast.

I have not connected the bot running locally to the Azure bot resource. But from my experience just running the emulator the tunneling service becomes the bottleneck and basically every response becomes slow. I currently reside in South Korea so the latency probably has something to do with it.

I'll monitor the logs a bit more and since I'm not actively changing the bot right now there shouldn't be any code related exceptions causing the delays from here on out. Please give me a week or so to get a more comprehensive set of logs that aren't affected by code related issues or the MS Teams issues that we had recently.

stevkan commented 2 years ago

@johnkm516 - Were you able to collect any logs/data over the last 10 days that we can look over?

johnkm516 commented 2 years ago

@stevkan Sorry for the wait. I sent you the logs for the last 7 days via email that's on your Github profile (steven.kanberg@teksystemsgs.com). In terms of the exceptions, please ignore the socket exceptions as that exception occurs when I forget to revise the skill connections to remote from local in the appsettings. I'm not too sure what the host starting assembly exception is exactly, but I have not seen it affect the bot in any way.

Please let me know which other logs you require, perhaps you could specify a time range that interests you.

stevkan commented 2 years ago

@johnkm516 - Apologies for the long delay. Looking at the exceptions file you sent, it looks like a call to a find a local file is failing.

ProblemId System.IO.FileNotFoundException at System.Reflection.RuntimeAssembly.nLoad

innermostMessage Could not load file or assembly 'DiagnosticServices.HostingStartup, Culture=neutral, PublicKeyToken=null'. The system cannot find the file specified.

Does the above look familiar to you in something you may have coded into the bot? If not, it may be some dependency making a call and failing. This may explain the delays as the process decides what to do next.

As I was unable to repro using the unaltered sample, I am guessing it is related to something that has been added to your bot. At this point, you will need to try and identify what that might be as the logs aren't that granular.

johnkm516 commented 2 years ago

@stevkan

I assumed the diagnostic services issue was caused by application insights. I believe Insights injects some libraries into the app service in order to gather all the logs, but this one in particular isn't loading properly. I found an issue on this here : https://github.com/dotnet/aspnetcore/issues/17370

Because the exceptions don't coincide with the timestamps of long message times, I do not believe this exception is the cause. I'll continue to monitor this issue during timeframes when new code isn't being pushed to production and the application is stable and try to find out the cause.

Please close this issue for now. Thanks for your help.

stevkan commented 2 years ago

Closing, per request, until new information is available.