jason-daly opened this issue 6 months ago

I've recently started using Netherite (Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite, version 1.5.1) and have been seeing a few errors when starting orchestrations. Orchestrations start fine most of the time, but occasionally the call that starts an orchestration fails with the RpcException shown in the logs further down.
The relevant host.json settings are:
"durableTask": {
"hubName": "SeuratAPIDurableTasks",
"UseGracefulShutdown": "true",
"storageProvider": {
"type": "Netherite",
"PartitionCount": "12",
"StorageConnectionName": "AzureWebJobsStorage",
"EventHubsConnectionName": "EventHubsConnection"
},
"tracing": {
"distributedTracingEnabled": true,
"Version": "V2"
},
"TraceToBlob": true
}
where "AzureWebJobsStorage" is a connection string to a Azure storage account: DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=srtfuncdevaue;AccountKey=this_part_is_secret
and "EventHubsConnection" is a connection string to an EventHub, as follows: Endpoint=sb://evhns-srt-netherite-dev-aue.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=secret_goes_here
More info:
Logs show that the error is raised in response to the host instance being shut down by the Scale Controller, but it appears that the orchestration actually started ok.
15/05/2024, 5:05:56.497 pm Scheduling new IngestionsCreateAsyncOrchestrator orchestration with instance ID 'ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873' and 3140 bytes of input data.
15/05/2024, 5:05:56.498 pm {"durabletask.task.execution_id":"b9e17ec1fb2a4464ae5ec25fbb3265cc","durabletask.task.name":"IngestionsCreateAsyncOrchestrator","durabletask.type":"orchestration","otel.status_description":"Client request was cancelled because host is shutting down.","durabletask.task.instance_id":"ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873"}
15/05/2024, 5:05:56.504 pm ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' scheduled. Reason: NewInstance. IsReplay: False. State: Scheduled. RuntimeStatus: Pending. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 21.
15/05/2024, 5:05:56.550 pm An instance was removed because all functions are either idle or seeing a steady decrease in load.
15/05/2024, 5:05:56.550 pm Instance count changed
15/05/2024, 5:05:56.551 pm ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873: Function 'IngestionsCreateAsyncOrchestrator (Orchestrator)' started. IsReplay: False. Input: (12560 bytes). State: Started. RuntimeStatus: Running. HubName: SeuratAPIDurableTasks. AppName: func-srt-dev-aue. SlotName: Production. ExtensionVersion: 2.13.2. SequenceNumber: 110. TaskEventId: -1
15/05/2024, 5:05:56.551 pm Executing 'Functions.IngestionsCreateAsyncOrchestrator' (Reason='(null)', Id=21415206-bfc1-446e-8e0c-287674404a50)
15/05/2024, 5:05:56.557 pm Executed 'Functions.IngestionsCreateAsyncOrchestrator' (Succeeded, Id=21415206-bfc1-446e-8e0c-287674404a50, Duration=5ms)
15/05/2024, 5:05:56.572 pm DrainMode mode enabled
15/05/2024, 5:05:56.572 pm Calling StopAsync on the registered listeners
15/05/2024, 5:05:56.590 pm Stopping the listener 'Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableTaskListener' for function 'IngestionsCreateAsyncOrchestrator'
15/05/2024, 5:05:56.590 pm Stopped the listener 'Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableTaskListener' for function 'IngestionsCreateAsyncOrchestrator'
... (I can supply more logs if needed)
15/05/2024, 5:05:56.888 pm Exception: Grpc.Core.RpcException: Status(StatusCode="Unknown", Detail="Exception was thrown by handler.") at Microsoft.DurableTask.Client.Grpc.GrpcDurableTaskClient.ScheduleNewOrchestrationInstanceAsync(TaskName orchestratorName, Object input, StartOrchestrationOptions options, CancellationToken cancellation)
15/05/2024, 5:05:56.888 pm Failed to start IngestionsCreateAsyncOrchestrator with orchestrationInstanceId=ingestions.nE0Z_35yWYM.b3e5a332-fae2-4046-9a5b-0e3eed3ba873
This seems to me to be similar to #2454, as explained in this comment.
Hi @jason-daly: can you share which Durable Functions packages you're using? As suggested in the thread you linked, that interaction between the scale controller and DF should have been fixed.
You also mentioned that you've been encountering this error since moving to Netherite 1.5.1. What were you using before?
Also - I think the thread you linked mostly discussed orchestrators being marked as failed when scale controller removes a VM from the app. In this case, it appears to me that the client function is failing, not the orchestrator. Do I have that right?
If so - does your client function have a cancellation token that you can inspect during these shutdowns? I suspect the cancellation token has been "set"/ fired.
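To illustrate what I mean, here's a minimal sketch, assuming an HTTP-triggered starter in the .NET isolated worker model (the class/function names and the request handling are invented for the example, not taken from your app). The worker can inject a CancellationToken that is signalled when the host starts draining, and you can both log its state and pass it through to the scheduling call:

```csharp
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;
using Microsoft.Extensions.Logging;

public static class IngestionsCreateAsyncClient
{
    // Hypothetical HTTP starter; names are illustrative only.
    [Function("IngestionsCreateAsync")]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        [DurableClient] DurableTaskClient client,
        FunctionContext context,
        CancellationToken cancellationToken) // signalled when the host begins shutting down
    {
        ILogger logger = context.GetLogger("IngestionsCreateAsync");

        // Check whether the host is already draining before scheduling.
        logger.LogInformation(
            "Cancellation requested before scheduling: {IsCancelled}",
            cancellationToken.IsCancellationRequested);

        string input = await req.ReadAsStringAsync() ?? string.Empty;

        // Pass the token through so a shutdown surfaces as a cancellation
        // rather than an opaque gRPC failure.
        string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
            "IngestionsCreateAsyncOrchestrator", input, cancellationToken);

        HttpResponseData response = req.CreateResponse(HttpStatusCode.Accepted);
        await response.WriteStringAsync(instanceId);
        return response;
    }
}
```

If that log shows the token has already fired at the time of the failure, that would line up with the "Client request was cancelled because host is shutting down" message in your trace.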
Ah, yes, you are correct. The exception occurs in the call to ScheduleNewOrchestrationInstanceAsync on the DurableTaskClient instance.
In the latest edit of this code, a CancellationToken was supplied, and I still encountered the RpcException, although the underlying error logged by the framework changed.
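For reference, here's a simplified sketch of the call pattern (illustrative only, not our actual code; the helper shape and logging are made up), passing the token into ScheduleNewOrchestrationInstanceAsync and catching cancellation separately from other gRPC failures:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Core;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;
using Microsoft.Extensions.Logging;

public static class OrchestrationStarter
{
    // Illustrative helper: schedules the orchestrator with an explicit instance ID
    // and distinguishes local cancellation (host shutdown) from other gRPC failures.
    public static async Task<string?> TryStartAsync(
        DurableTaskClient client,
        string orchestrationInstanceId,
        object input,
        ILogger logger,
        CancellationToken cancellationToken)
    {
        try
        {
            return await client.ScheduleNewOrchestrationInstanceAsync(
                "IngestionsCreateAsyncOrchestrator",
                input,
                new StartOrchestrationOptions(InstanceId: orchestrationInstanceId),
                cancellationToken);
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // The supplied token fired, i.e. the host is draining / shutting down.
            logger.LogWarning("Start of {InstanceId} cancelled by host shutdown", orchestrationInstanceId);
            return null;
        }
        catch (RpcException ex)
        {
            // This is the failure mode reported above: the gRPC handler on the host side threw.
            logger.LogError(ex,
                "Failed to start IngestionsCreateAsyncOrchestrator with orchestrationInstanceId={InstanceId}",
                orchestrationInstanceId);
            return null;
        }
    }
}
```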
Prior to Netherite, we had been using the default storage provider, Azure Storage, and had not encountered these failures.
Here's the package list:
<PackageReference Include="AutoMapper" Version="13.0.1" />
<PackageReference Include="Azure.Identity" Version="1.11.3" />
<PackageReference Include="Azure.Messaging.EventGrid" Version="4.24.0" />
<PackageReference Include="Azure.Storage.Files.DataLake" Version="12.18.0" />
<PackageReference Include="Azure.Storage.Queues" Version="12.18.0" />
<PackageReference Include="Gridify" Version="2.14.1" />
<PackageReference Include="Microsoft.Azure.Core.NewtonsoftJson" Version="2.0.0" />
<PackageReference Include="Microsoft.Azure.Cosmos" Version="3.40.0-preview.1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.CosmosDB" Version="4.8.1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.3" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask.Netherite" Version="1.5.1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.EventGrid" Version="3.4.1" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.1.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Storage.Queues" Version="5.4.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.0" />
<PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.2" OutputItemType="Analyzer" />
<PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.22.0" />
<PackageReference Include="Microsoft.Extensions.Diagnostics.HealthChecks.Abstractions" Version="8.0.5" />
<PackageReference Include="Microsoft.IdentityModel.Tokens" Version="7.5.2" />
<PackageReference Include="Microsoft.IO.RecyclableMemoryStream" Version="3.0.0" />
<PackageReference Include="System.Linq.Async" Version="6.0.1" />
<PackageReference Include="System.Net.Http" Version="4.3.4" />
<PackageReference Include="System.Text.RegularExpressions" Version="4.3.1" />
BTW, I have backed out Netherite now, and gone back to Azure Storage. If you are able to figure out the issue and get it resolved, I can give Netherite another try.
@jason-daly: are you able to provide me with your app name and the time (in UTC) when this issue occurred in Azure? I'd need that to be able to debug this further.
Sure. The function app name is: func-srt-dev-aue (in Australia East region); logs are with Application Insights 'appi-srt-dev-aue'.
The most recent occurrence was at 2024-05-16T02:58:06.6515245Z. See operation_id 27ab279dcd3d61ccdd43934ec76717fe and orchestration instanceId 'FeaturesServiceIntegration.nE0Z_35yWYM.82da1a0b-2812-4fee-9ca0-314369d386af.38d7f15b-99a5-4553-8a02-17557f50c1c8'.
To be honest, I kind of gave up on this already and reverted Netherite from our code. When I did this I also cleaned up the resources that it used, so this might hamper your troubleshooting efforts a bit... If you are unable to figure out what was going wrong from the logs, I can temporarily reinstate Netherite in this app, so we can repeat the behavior. Let me know if you need this.
Thanks. No worries, we don't need your app to be active for us to review our telemetry; it's historical, not live :)