microsoft / botbuilder-dotnet

Welcome to the Bot Framework SDK for .NET repository, which is the home for the libraries and packages that enable developers to build sophisticated bot applications using .NET.
https://github.com/Microsoft/botframework
MIT License

Failed to continue dialog. A dialog with id xxxx could not be found. #6716

Closed RuanCSoftsure closed 2 days ago

RuanCSoftsure commented 6 months ago


Version

4.21.1

Describe the bug

When a client is in a dialog and responds to a message from the bot, the turn intermittently fails with the "Failed to continue dialog. A dialog with id xxxx could not be found." exception.

EXCEPTION MESSAGE: Failed to continue dialog. A dialog with id xxxx could not be found.

INNER EXCEPTION MESSAGE: no inner exception message provided

SOURCE: Microsoft.Bot.Builder.Dialogs

To Reproduce

I have really struggled to reproduce this. And when I do manage to reproduce it, I still don't really know why it is failing. This only started happening when we upgraded from 4.12 to 4.21.

Expected behavior

The client should just be able to continue with the dialog they are on.

Screenshots

Additional context

Happens specifically on this line of code.

// Run the Dialog with the new message Activity.
await _dialog.RunAsync(turnContext, _conversationState.CreateProperty<DialogState>("DialogState"), cancellationToken);

I don't know whether I should be changing our existing logic, or how child dialogs are added to the parent dialog. Out of more than 100 conversations a day, this happens to roughly 3 to 5 of them. We store the conversation state and user state in Redis. I have looked at the conversations that failed with this error, and the dialog that is supposedly missing is in the list. So I am completely stumped as to where else I should look, and why this is happening intermittently.

I want to mention that I use dependency injection and register each dialog as a transient service. I then only add the dialog to the stack once I get to the waterfall step where I need it. I don't know whether this could be why I am getting this failure now. Before upgrading from 4.12 to 4.21 this was not a problem.

RuanCSoftsure commented 6 months ago

Anyone that could help with this one? Seems like it is getting worse. :(

dmvtech commented 5 months ago

I want to mention that I use dependency injection and register each dialog as a transient service. I then only add the dialog to the stack once I get to the waterfall step where I need it. I don't know whether this could be why I am getting this failure now.

Can you share how you do this? That's not a standard approach. Not sure what side effects that may have.

RuanCSoftsure commented 5 months ago

Thanks for responding. To answer your questions...

- Does this happen with a specific bot channel? Direct Line channel.
- What state storage are you using? We store it in Redis, with a time to live of 24 hours.

Can you share how you do this? That's not a standard approach. Not sure what side effects that may have.

I will show some code snippets yes. Please see below.

In this example I have ManagePolicyDialogBase as a dialog where the client can manage certain policy information.

Startup.cs in the bot has the following:

services.AddTransient<ManagePolicyDialogBase>();

This dialog is a child dialog of my WelcomeDialogBase.

WelcomeDialog is started in the StartWelcomeDialogAsync step of my GreetingsDialog:

        private async Task<DialogTurnResult> StartWelcomeDialogAsync(WaterfallStepContext stepContext, CancellationToken cancellationToken)
        {
            var dialogName = _welcomeDialog.GetType().Name;

            if (FindDialog(dialogName) == null) AddDialog(_welcomeDialog);

            return await stepContext.BeginDialogAsync(dialogName);
        }

My WelcomeDialogBase constructor looks like this. Please note that I am only showing snippets of it; our bot is quite complex, so I don't want to overwhelm you with unnecessary code. You will see managePolicyDialog being injected in the constructor below.

        public WelcomeDialogBase(string dialogId, IBotImplementations botImplementations, IDialogImplementations dialogImplementations,
            ICheckPublicHoliday checkPublicHoliday, ICheckOfficeHours checkOfficeHours,
            Func<BotClient, ClaimsIntentDialogBase> claimIntentDialogResolver,
            Func<BotClient, ProductInfoIntentDialogBase> productInfoIntentDialogResolver,
            PetQuoteDialogBase petQuoteDialog,
            ManagePolicyDialogBase managePolicyDialog) : base(dialogId, botImplementations)
        {
            .....

            _managePolicyDialog = managePolicyDialog;

            .....

            AddDialog(new WaterfallDialog(nameof(WaterfallDialog), new WaterfallStep[] {
                StartDialogAsync,
                PromptOptionsAsync,
                OnOptionSelectedAsync,
                StartClaimsDialogAsync,
                ReturnFromClaimsDialogAsync,
                StartManagePolicyDialogAsync,
                ReturnFromManagePolicyAsync,
                StartQuoteDialogAsync,
                ReturnFromQuoteDialogAsync,
                StartProductInfoDialogAsync,
                ReturnFromProductInfoDialogAsync,
                PromptSpeakToHumanOptionsAsync,
                OnSpeakToHumanOptionSelectedAsync,
                StartCallBackDialogAsync,
                ReturnFromCallBackDialogAsync,
                StartAgentTransferDialogAsync,
                ReturnFromAgentTransferDialogAsync,
                EndWelcomeDialogAsync
            }));
            InitialDialogId = nameof(WaterfallDialog);
        }

The dialog is started in the StartManagePolicyDialogAsync step.

        protected async Task<DialogTurnResult> StartManagePolicyDialogAsync(WaterfallStepContext stepContext, CancellationToken cancellationToken)
        {
            var intent = MenuIntents.ManagePolicy.ToString();

            await _topIntent.LogAsync(stepContext.Context.Activity, intent, intent, _botClientId);

            var dialogName = _managePolicyDialog.GetType().Name;

            if (FindDialog(dialogName) == null) AddDialog(_managePolicyDialog);

            // Navigate to the Manage Policy dialog
            return await stepContext.BeginDialogAsync(dialogName);
        }

Like I mentioned, this does not happen all the time. It happens intermittently. Not a single one of these exceptions came through yesterday. A very strange problem indeed. Maybe I just need to tweak some logic since upgrading to the latest framework, but I don't really know what logic I need to tweak. Before version 4.21.1 I never experienced this exception.
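One tweak that may be worth trying is to register each child dialog in the parent's constructor instead of inside the waterfall step, so the parent's dialog set is complete as soon as the instance exists. The sketch below is a minimal illustration under assumptions, not the code from this bot: `WelcomeDialog` and `ManagePolicyDialog` are hypothetical stand-ins for `WelcomeDialogBase` and `ManagePolicyDialogBase`, and the child is started by its `Id` rather than by type name, in case the two differ. It needs the Microsoft.Bot.Builder.Dialogs package.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Bot.Builder.Dialogs;

// Hypothetical stand-in for ManagePolicyDialogBase.
public class ManagePolicyDialog : ComponentDialog
{
    public ManagePolicyDialog() : base(nameof(ManagePolicyDialog))
    {
        AddDialog(new WaterfallDialog(nameof(WaterfallDialog), new WaterfallStep[] { EndAsync }));
        InitialDialogId = nameof(WaterfallDialog);
    }

    private static Task<DialogTurnResult> EndAsync(WaterfallStepContext stepContext, CancellationToken cancellationToken)
        => stepContext.EndDialogAsync(null, cancellationToken);
}

// Hypothetical stand-in for WelcomeDialogBase.
public class WelcomeDialog : ComponentDialog
{
    private readonly ManagePolicyDialog _managePolicyDialog;

    public WelcomeDialog(ManagePolicyDialog managePolicyDialog) : base(nameof(WelcomeDialog))
    {
        _managePolicyDialog = managePolicyDialog;

        // Register the child here, once per instance, rather than lazily in a step.
        AddDialog(_managePolicyDialog);

        AddDialog(new WaterfallDialog(nameof(WaterfallDialog), new WaterfallStep[]
        {
            StartManagePolicyDialogAsync,
            EndWelcomeDialogAsync,
        }));
        InitialDialogId = nameof(WaterfallDialog);
    }

    private Task<DialogTurnResult> StartManagePolicyDialogAsync(WaterfallStepContext stepContext, CancellationToken cancellationToken)
        // Begin the child by the id it was registered under.
        => stepContext.BeginDialogAsync(_managePolicyDialog.Id, null, cancellationToken);

    private static Task<DialogTurnResult> EndWelcomeDialogAsync(WaterfallStepContext stepContext, CancellationToken cancellationToken)
        => stepContext.EndDialogAsync(null, cancellationToken);
}
```

With this shape, `FindDialog` on the parent finds the child immediately after construction, whether or not the step that begins the child has run in the current process.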

Thanks for your willingness to try and help. I will share as much code as I can to get to the bottom of this.

RuanCSoftsure commented 5 months ago

I don't know why I didn't check the Pod in AKS, which is where the bot is hosted. This error started happening when I upgraded the bot framework from 4.12 to 4.21.1, but at first it was very intermittent. Now that our bot is getting far more traffic via our own custom WhatsApp channel, the error is happening more often, and I have noticed that the Pod restarts a few times a day. Each restart restarts the whole bot, and that is why this error is occurring more regularly.

I am getting Exit Code 139 on Kubernetes.

And I believe this is the reason....

Incompatibilities: This is by far the most common reason for SIGSEGV errors, and luckily one that is very easy to fix. After updating a library, if you forget to change the version number of that library, then your system may attempt to load the older binary library. If this older binary then tries to access memory addresses assigned to the newer library, then an incompatibility error exists across your binaries and libraries. This is a very common mistake. It can be fixed by simply updating the version number whenever you update a library and its binaries.

Source: https://techreport.com/blog/exit-code-139-kubernetes/
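A possible explanation for why a pod restart surfaces as the "dialog could not be found" exception, offered as an assumption rather than something confirmed in this thread: dialog state lives in Redis and survives the restart, while dialogs added lazily inside a waterfall step exist only in the memory of the process that ran that step. The toy model below is not SDK code; `BotProcess`, its methods, and the dictionary standing in for Redis are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Toy model of one bot process. Not the Bot Framework SDK.
public sealed class BotProcess
{
    // Dialog ids known to this process; only the root is registered at startup.
    private readonly HashSet<string> _registered = new HashSet<string> { "WelcomeDialog" };

    // Models a waterfall step that registers a child lazily and begins it.
    // The active dialog id is written to the external store.
    public void BeginLazily(IDictionary<string, string> store, string conversationId, string dialogId)
    {
        _registered.Add(dialogId);
        store[conversationId] = dialogId;
    }

    // Models continuing a conversation: the persisted active dialog id
    // must be registered in this process for the turn to succeed.
    public bool TryContinue(IDictionary<string, string> store, string conversationId)
        => store.TryGetValue(conversationId, out var id) && _registered.Contains(id);
}

public static class Demo
{
    public static void Main()
    {
        var redis = new Dictionary<string, string>(); // stand-in for Redis

        var beforeRestart = new BotProcess();
        beforeRestart.BeginLazily(redis, "c1", "ManagePolicyDialog");
        Console.WriteLine(beforeRestart.TryContinue(redis, "c1")); // True

        // A restart creates a fresh process; the store still points at the child.
        var afterRestart = new BotProcess();
        Console.WriteLine(afterRestart.TryContinue(redis, "c1")); // False
    }
}
```

In this model the second lookup fails only for conversations that were inside a lazily added dialog when the process restarted, which would match an error that affects a few conversations a day rather than all of them.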

dmvtech commented 2 days ago

Closing as resolved (as it sounds like you have it all figured out). If you are still experiencing trouble, comment and we can reopen and revisit.