Closed SenseVision closed 6 years ago
Please share your code, this will be next to impossible to diagnose without your code.
@JasonSowers added as collaborator to the copy of repository
Also today we have tried to downgrade BF to 3.12.4(stable version we used before 3.13.1) right now 30+ full conversations(25-40 interactions) , no bug
You're not going mad SenseVision, I just had a panic attack for most of the afternoon today because I thought I had completely messed up my work (and mediocre understanding of how BotFramework worked).
Exactly the same experience after moving from 3.14.0 to either 3.14.1 or 3.15 (not enough testing between upgrades to determine which) - random mutations in conversation position, sometimes even going back to points in 'completed' (Context.Done) dialogs. I only noticed properly after upgrading again to 3.15.
I'm using Azure Table Storage for state, and an Azure app service for the bot. Also 3.2.5 for Microsoft.Bot.Builder.Azure. Dialogs for conversations. (No LUIS, no FormFlow, or anything fancy. Bog-standard MessagesController.Post -> scope.Resolve
After downgrading Bot.Builder and Bot.Builder.History from 3.15 to 3.13.1 (and BotConnector to 3.14.07, although I can't say right now which is making the difference) everything is OK again. WindowsAzure.Storage is at 9.1.1.
To Jason and the team - apologies, as this kind of random comment that is missing a workable repro is about as welcome as a kick in the teeth :)
If I can find a consistent repro, I will certainly post it, but as SenseVision notes, it is frustratingly random.
I got your invite, I will start looking into this. We recently released a new SDK version 3.15 can you please confirm if you are experiencing the same issue in 3.15. @oflahero I couldn't gather if 3.15 was working or not for you from your post.
Small update: Bot.Connector is not the issue. It's a problem in Bot.Builder (and/or .History) between 3.13.1 (OK) and 3.15 (not OK).
I can't test if it's Bot.Builder.History in isolation as upgrading that nupkg requires Bot.Builder.
I will try 3.14.1.1 at some point tomorrow and try to narrow it down.
I experienced the same issues with 3.14.1.1.
So the problem appears to have been introduced somewhere when moving between 3.13.1 to 3.14.1.1 in Microsoft.Bot.Builder and Microsoft.Bot.Builder.History (unless I was doing something stupid in my code that relied on something that was fixed after 3.13.1!) I don't have any further granularity info.
@oflahero Can you please share your project as well? I'm trying to make a minimum reproducible example and want to see if there are any similarities between your project and SenseVision's
So I have started with the most simple implementation of saving state data and I am adding pieces in from SenseVision's project to try to figure out the root cause, but I haven't been able to reproduce it yet. I will be spending the day looking into this
@oflahero @SenseVision Can either of you confirm this happens in the emulator as well? I have been trying all day to reproduce this with no luck.
Hey there. I spent a bit of time on this today too - I've settled on 3.14.0 for Bot.Builder, Bot.Builder.History and Bot.Connector as the most stable configuration. Anything higher (>=3.14.1) and the issue is back.
I tried to reproduce in the emulator with no luck. It's only happening for me in the channel, which in my case is Facebook Messenger.
My project's commercial, and currently pretty large, so no sharing I'm afraid. I understand the frustration of spending time trying to reproduce an issue like this. I will hopefully try to do a pared-down minimal project that exhibits the problem, but I wouldn't be surprised if it doesn't. The problem project has so many interconnected moving parts - targets full 4.7.1, also NetStandard by some NuGet packages. I have enough issues with assembly bindings as it is.
I can understand putting this on hold til more people claim to have the issue also.
Let me try facebook, no luck with repro in the emulator, webchat, (or facebook now) 👎
I would really prefer to get to the bottom of this rather than put it on hold. I would guess it is affecting more customers than just you 2. So any more help you can give would be greatly appreciated
Crazy busy today but I'll try to do a repro later if I can.
@JasonSowers sorry , i haven't answered , was super busy .
Check 3.15 -> the earliest i'll be able try would be on Tuesday, i guess.
Can either of you confirm this happens in the emulator as well? -> 3.14.1.1 - yes . Emulator and FB.
I've rechecked references history , looks like @oflahero is right 3.13.1 is OK
Hi guys, we experienced the same thing after upgrading from 3.13 to 3.14.1.1. dialogs seem to jump to a previous state. Seemed to be most reliably reproducible in the web versions of Slack and MS Teams (not sure if that has anything to do with it though)
Our guess was that something in the way the bot state is saved has changed between those versions (we are using TableStorage with a custom key for the bot state), but we couldn't find anything suspicious in the Bot.Builder commits...
So we just reverted back to 3.13.1 and everything works again. Would be nice to get this out of the way so we can upgrade again.
Welcome to the club!
I'm about halfway through setting up a minimal project - haven't forgotten about you Jason. No guarantees it'll exhibit the symptoms, mind.
No worries guys, been busy over here as well. I hope to make some more time to try to get to the bottom of this next week. All of my bots are running 3.15 (because I test the SDK before we release it) and I have not noticed any issues with state like you are seeing, so I'm hoping one of you can provide a clue here especially because I haven't been able to reproduce this.
I have a repo at https://github.com/oflahero/315Issue where it looks like I can reproduce the issue.
It's a simple bot that echoes what you type with the length of the input string. If you enter 'find', it also launches a dialog simulating searching for a phone or email contact by name substring.
There are only three contacts hardcoded. Once it either returns a match or tells you there are no matches, you're out of the dialog again.
Using branch 315 (Bot Builder & Connector at 3.15.0), and the Facebook channel, you can eventually (after repeatedly hitting 'find' and responding appropriately) observe it jump out of sync. It may take a good ten minutes of this.
Note: there's some code in the repo that isn't used - I was just trying to grab stuff from the problem project in order to reproduce the issue.
I have a publicly available FB messenger page hosting the 3.15.0 version, if anyone wants to try it without installing themselves. Just send me your FB profile identifier so I can add you as a test user to the app on FB Developer, and I'll let you know where to look.
Branch 314 is the stable (3.14.0) version.
The 'UnrelatedPostIssue' branch is another problem entirely - it's a one-character diff commit, but it causes a 'BadRequest' exception in the bot on PostAsync. It's completely floored me.
Hi Guys I have also noticed that our dialogs are jumping to a previous state ever since I updgraded the bot builder sdk from 3.13.1 to 3.15.0. It appears to specifically occur when there is a delay in the user response. We are storing our state using local SQL TableStorage. I have first noticed this via the directline channel. It also occurs in the emulator.
I went through the Bot.Builder commits between 3.13.1 and 3.14.1.1
The only change related to our issue seems to be CachingBotDataStore changes in commit https://github.com/Microsoft/BotBuilder/commit/992e29f4e0f61b98b31557fdba68d53fb8caabfc (the cache was changed from local Dictionary to MemoryCache)
I couldn't spot anything suspicious with the changes, but @oflahero could you register your TableBotDataStore directly as IBotDataStore and remove the CachingBotDataStore registration in your example and see if that helps?
As I understand it though @thschuetze , you need a CachingBotDataStore sitting on top of your keyed registration for it to work though? At least until we get BotBuilder v4... (see issue https://github.com/Microsoft/BotBuilder/issues/4376)
Maybe there's a simpler way. I'll be the first to admit that the tight integration of Autofac into BotBuilder is a pet hate of mine (mainly because it confuses the hell out of me, and I loathe DI on principle :) )
Awesome work guys, thanks for helping us work this out. I will be spending time on this today, but mainly tomorrow since its late in the day now trying to address the issue.
As a note when I spent nearly a whole day trying to reproduce I tried both with and without CachingBotDataStore
and was unsuccessful.
@seca23 this issue does not seem related can you please open a new issue with details about what you are experiencing.
@oflahero you are right, I wasn't aware of that issue. Could you try to register the old version of the CachingBotDataStore in the module to check if those changes have anything to do with the issue.
Btw. I got your example code and deployed it, but so far haven't been able to reproduce the issue with Slack (web), Teams (web) or Skype (iOS app). I don't have a facebook bot available that I could connect. Is there a specific way to reproduce it more reliably? Answering/typing as quick as possible for example?
Good news I was able to reproduce in the emulator, the bad news is I need to catch it on a turn it fails on this is going to be a tough one to find 👎
@seca23 I actually think your issue is related after looking into this. Are you using multiple dialogs/prompts are you able to share what your dialogs look like if so
@JasonSowers from the MessageController I call a LuisDialog. When an intent is triggered it calls a new dialog [that inherits from IDialog
Our bot endpoint is local and we store the state in local SQL table storage.
@JasonSowers great stuff!
@thschuetze It can be very frustrating trying to reproduce it. I gave up today after five solid minutes of banging the keyboard. I did on two occasions find that the problem appeared when I had left the conversation in the middle of a dialog for a significant time.
How long that has to be I can't say, but my Azure website had definitely spun down by then and took an extra ten secs or so to get going again (don't let anyone fool you that Azure's 'always on' setting really does what it says on the tin)
It seems prompts (which are dialogs themselves) seem to be the culprit here. Is anyone noticing this behavior outside of on a step when a prompt is involved?
That could fit my situation; I'm using a PromptDIalog.Text. My example's not big enough to rule out non-prompt interactions though, might have to add a few more dummy steps in the process.
Hi guys, I am also facing the same issue in my project. Bot Info: SDK Platform: .NET Active Channels: Skype For Business, Emulator Deployment Environment: Azure Tables
Package Version Info: Microsoft.Bot.Builder - 3.14.1.1 Microsoft.Bot.Builder.Azure - 3.2.5 Microsoft.Bot.Builder.CognitiveServices -1.1.2 Microsoft.bot.Builder.History - 3.14.1.1 Microsoft.Bot.Connector - 3.14.1.1
Issue: There are multiple dialogs for various chat flows and prompts used in the project. I have implemented bot state service using Azure Table and used CachingBotDataStore also. I have ended the context wherever required and returned to parent dialogs from child dialog. The issue is intermittent and the bot is deployed on SfB channel. The behavior is random as sometimes the code works without any issue but other times there is problem with context. I have used Prompts extensively in my project as it's a guided flow for user.
Hi, any more news on the issue? We kinda need to update to 3.15 soon because of the GDPR methods...
@oflahero I experienced same issue as you had described .The dialogs jumps to previous state randomly and other times the code runs fine. The issue was persistent even after upgrading to latest packages (upgraded previous week) so i decided downgrading Bot.Builder and Bot.Builder.History from 3.15 to 3.13.1 (and BotConnector to 3.14.07) as you had shared. Everything seems much stable now. There might be some issue with PromptDialogs. I have used many PromptDialogs in the project (PromptDialogs.Confirm, PromptDialog.Choice and PromptDialog.Text due to guided interaction with user). There are reprompts even when prompt has been answered with valid input, at other times the code runs smoothly without any reprompts or context issues. The issue is persistent for emulator and Skype For Business channel.
My hunch is that it's something either in the move to using MemoryCache or the singleton instance of ConnectorClient, in in 3.14.
Can anyone confirm that they are experiencing this in a non-Azure environment? As in, can we rule out the Microsoft.Bot.Builder.Azure package?
we do not use the Bot.Builder.Azure package, but we are using a customized version of the TableBotDataStore.
Apologies all, we haven't forgotten about you. Just busy around here with //build conference and all. Unfortunately, this is a very time-consuming issue to reproduce and try to solve. I will try my best to make sure I can get to this tomorrow.
Has anyone noticed these changes still happening with 3.15.2.2?
So an interesting thing I found when I was trying to hunt this issue down yesterday was that if I put a simple Trace Trace.TraceWarning("some text");
inside a dialog (in some task like below) or a prompt (such as in the code below) I was not able to reproduce the error, leading me to believe there is some race condition that the trace eliminates somehow. I tried for about 30 minutes straight and was unable to reproduce it with this condition. So this might be a possible workaround for the time being. If anyone is able to confirm that they are seeing the same behavior when adding a Trace event
using @oflahero's example: prompt
void PromptForContact(IDialogContext context)
{
Trace.TraceWarning("prompt");
PromptDialog.Text(context, SearchEntered, "Enter a few letters of the contact you're looking for.");
}
Task inside a dialog:
async Task ContactChosen(IDialogContext context, IAwaitable<GenericEntityChoosable> chosenEntity)
{
Trace.TraceWarning("ContactChosen");
var ct = await chosenEntity;
var chosenContact = _searchablePeople.Single(c => c.Id == ct.EntityID);
var sMsg = $"Details are: \nName: {chosenContact.Name} \nEmail: {chosenContact.Email} \nPhone: {chosenContact.Phone}";
await MessageAndContinueAsync(context, sMsg);
context.Done(true);
}
Thanks Jason. I'll have a bit of time on Monday to trial 3.15.2.2 and/or the proposed Trace workaround.
We are actively looking for the root cause, just nothing to report as of yet.
Hi, we are also having this issue, we have narrowed it down the versioning like mentioned and downgrading it to 3.13.1 fixes the issue. I was able to reproduce the issue using an example code from the repository and thats when I knew that the issue wasn't something in the code I wrote. For us, normal dialogs seem to be fine, and the problem only occurs when using the formdialog and formflow.
Hope my input is helpful
I'm so glad I found this, thought I was going crazy. I think my issues are related.
My setup:
ValidationDialog
that checks if the user's response contains an image and returns a message if it doesn't and awaits another input.Here's the code that waits for until a valid input is given:
public async Task StartAsync(IDialogContext context)
{
await context.PostAsync(InputPrompt);
context.Wait(InputGiven);
}
private async Task InputGiven(IDialogContext context, IAwaitable<IMessageActivity> result)
{
var message = await result;
if (!Validator.IsValid(message))
{
await context.PostAsync(WrongInputPrompt);
context.Wait(InputGiven);
}
else
context.Done(new ImageUploadResult(message.Attachments.First().ContentUrl, message.Attachments.First().ContentType));
}
This worked perfectly until upgrading from 3.13.1 to 3.15.2 (Bot Builder and Bot Builder.Azure packages).
Issues experienced:
After downgrading to 3.13.1 both of these issues no longer occur.
I'm happy to share my code and access to my dev environment if it would be helpful to resolve the issue.
The problem we are running into is that we cannot reproduce this on demand. We have tried multiple ways to automate it with no success. If anyone is able to produce a reproduction that happens in an even somewhat predictable or consistent manner please share with us. We will continue trying and update as we have more information.
I do also have this kind of issues. They do happen very often.
@thschuetze said, that the only thing, that has been changed is the use of the MemoryCache
in the CachingBotDataStore
.
I have looked at the code and i am taking a guess, only a guess. Feedback is welcome.
The documentation of MemoryCache PhysicalMemoryLimitPercentag says:
If the cache size exceeds the specified limit, the memory cache implementation removes cache entries.
Then I have looked at CachingBotDataStore
and I have noticed a heavy reliance on the MemoryCache
.
During the loading process LoadAsync
everything is fine. The cache will be used if possible.
During the saving process SaveAsync
on the other side I couldn't find the call of inner.SaveAsync
. It just puts everyhing in the MemoryCache
, which can be erased any time. This is build that way because of FlushAsync
.
To summarize my guess: CachingBotDataStore
puts everything in the MemoryCache
, which is cleared before FlushAsync
is called.
@JasonSowers Even if it is not the problem, I wouldn't recommend a cache for storing the data and then put it in the database.
For the purpose of testing I would try to trigger the MemoryCache clearing. Some ideas would be reducing the PhysicalMemoryLimitPercentag, reducing the available RAM or storing huge binary data in a dialog or the context.
The easiest way is to get the old CachingBotDataStore
from the repository and register ist.
Overriding CachingBotDataStore registration:
builder.RegisterType<OldCachingBotDataStore>().WithParameter((pi, c) => pi.ParameterType == typeof(CachingBotDataStoreConsistencyPolicy), (pi, c) => CachingBotDataStoreConsistencyPolicy.ETagBasedConsistency).Keyed<IBotDataStore<BotData>>(typeof(CachingBotDataStore)).SingleInstance();
Old version of CachingBotDataStore
with a dictionary:
Please be cautious. This seems to activate an old bug, where the bot greets with an error after a long time of inactivity.
I have a chatbot which doesn't use the cache, it's actually registering the TableBotDataStore exactly the same way you've described in your last comment Jason. What solved it for me is downgrading to SDK version 3.13.1. Related SO question: https://stackoverflow.com/questions/50568479/bot-framework-messes-up-dialog-state
@peterbozso How often does this happen with your current implementation? Using this method I tested for about an hour and was not able to reproduce. Whereas I'm usually able to produce this result every 10-15 minutes or so. What you are saying, and the solution on that StackOverflow question lead me to believe my initial thoughts were correct and it is not actually the MemoryCache.
With the current implementation? You mean by downgrading to 3.13.1? Now it doesn't happen.
I mean when you use 3.15, how often does the problem arise.
I see, sorry. :) I use autobot for testing my bot. Out of 10 tests 4 fails on average because of this issue when using the 3.15 SDK.
@peterbozso if you could get me a minimal reproduction that would fail 4 out of 10 times consistently you would be my hero. All the repros I have created I'm lucky to get a failure 1 out of 40 times, the more frustrating part is when I try to automate it, I never trigger it. This is making it very difficult to find time to really get down to the root cause of this.
@JasonSowers
This is the workaround we recommend right now.
Wonder how you can recommend that if CachingBotDataStore is the only thing that makes the state being saved (because of its FlushAsync implementation that loads state instances and calls SaveAsync on them). Even InMemoryDataStore doesn't work without it in tests.
I have the same issue with v3.15.2.2
. I'm using SqlBotDataStore
and it happens regardless I register CachingBotDataStore
or not. The bot uses only one dialog and keeps track of its state in a property, but sometimes the UPDATE [dbo].[SqlBotDataEntities]
is not called and on the next message the state is restored to the previous one.
@JasonSowers I'd be glad to provide you an example, but the code I am working on is non-shareable. I'll try to reproduce it in a small sample project as soon as I'll have some time.
Bot Info
Issue Description
It started about 2 weeks ago .
Bot doesn't save it's current state \ deserialize the previous state instead of current (the last one where he stopped ).
Occurs rarely.(It could be user get this bug , it could be not , user could get this bug several times per conversation)
Occurs in random moment of the workflow (doesn't depend on which dialog is \ context.Wait(Smth) ).
No Exceptions at all ! Neither in workflow nor http requests to Azure Tables.
Bot process looks like :
Bot state is 0 Bot moves to the state 1( send message 1) Bot waiting for an answer( user message 1) // User gets bot message 1 User respond (user message 1) // Bot accept answer (user message 1) , Bot state = 1( expected = 1) Bot moves to the state 2( send message 2) Bot waiting for an answer( user message 2) // User gets bot message 2 User respond (user message 2) // Bot accept answer (user message 2) , Bot state = 1 ( expected = 2)
Actual Results
Not really good example (because use nested dialog), but same was with dialogues , wich doesn't know anything about each other.
Later , didn't changed anything in code \ server config any elsewhere