microsoft / durabletask-netherite

A new engine for Durable Functions. https://microsoft.github.io/durabletask-netherite

Event Hub Client Timeouts when exceeding Event Hub limits #272

Closed: UMCPGrad closed this issue 6 months ago

UMCPGrad commented 1 year ago

When testing our Durable Function app against production-level load, we get the following timeout error (with the Netherite Event Hubs namespace initially set to 1 Throughput Unit, with auto-inflate enabled up to 40 Throughput Units):

    Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: EventHubStart ---> System.TimeoutException: Client request timed out.
       at DurableTask.Netherite.Client.CreateTaskOrchestrationAsync(UInt32 partitionId, TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:line 469
       at DurableTask.Netherite.NetheriteOrchestrationService.DurableTask.Core.IOrchestrationServiceClient.CreateTaskOrchestrationAsync(TaskMessage creationMessage, OrchestrationStatus[] dedupeStatuses) in /_/src/DurableTask.Netherite/OrchestrationService/NetheriteOrchestrationService.cs:line 575
       at DurableTask.Core.TaskHubClient.InternalCreateOrchestrationInstanceWithRaisedEventAsync(String orchestrationName, String orchestrationVersion, String orchestrationInstanceId, Object orchestrationInput, IDictionary`2 orchestrationTags, OrchestrationStatus[] dedupeStatuses, String eventName, Object eventData, Nullable`1 startAt) in /_/src/DurableTask.Core/TaskHubClient.cs:line 614
       at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.Microsoft.Azure.WebJobs.Extensions.DurableTask.IDurableOrchestrationClient.StartNewAsync[T](String orchestratorFunctionName, String instanceId, T input) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 210
       at RedScope.Collectors.EventCollector.Orchestration.Trigger.EventHubStart(EventData[] events, IDurableOrchestrationClient starter) in C:\__w\1\s\src\RedScope.Collectors.EventCollector\Orchestration\Trigger.cs:line 63
       at Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync(TReflected instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs:line 20
       at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
       at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 523
       at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
       --- End of inner exception stack trace ---
       at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
       at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in C:\projects\azure-webjobs-sdk-rqm4t\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108

However, when the Throughput Units were initially set to the maximum (40 Throughput Units), the Durable Function ran without any exceptions.

We also noticed that once the client timeout exceptions started occurring, they kept occurring, even after manually scaling up to 40 Throughput Units.
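
For reference, this is roughly the shape of our trigger (a simplified sketch, not the actual code; the hub name, connection setting, and orchestrator name below are placeholders, and the binding is simplified to string[] even though the real function binds to EventData[] as shown in the stack trace):

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class TriggerSketch
{
    [FunctionName("EventHubStart")]
    public static async Task EventHubStart(
        [EventHubTrigger("source-hub", Connection = "SourceEventHubConnection")] string[] events,
        [DurableClient] IDurableOrchestrationClient starter)
    {
        // StartNewAsync goes through the Netherite client under this backend;
        // this call is where the "Client request timed out" exception surfaces.
        await starter.StartNewAsync("ProcessEventsOrchestrator", input: events);
    }
}
```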

sebastianburckhardt commented 1 year ago

Most likely this is a problem with the throughput of the EH senders. We have had several reports of this (e.g. https://github.com/microsoft/durabletask-netherite/issues/268), and I am working on a change that may help improve this throughput.

However, it would be good to double check whether this is true for your specific case. Do you have an app name and timestamp that I can look at?

UMCPGrad commented 1 year ago

I've actually been testing this in the AirGap Cloud (USNat), where we're able to run against actual production load.

Just to give you a bit of background, our production service has been running for a little more than a year. It's a relatively simple Durable Function app with an Event Hub trigger: it grabs events off the event hub, does a little bit of processing on them, and then sends them to both a Log Analytics workspace and to storage accounts.
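
To make that flow concrete, here is a simplified sketch of the orchestration (illustrative names, not our actual code): it takes a batch of events and fans out to one activity for Log Analytics and one for storage.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ProcessEventsSketch
{
    [FunctionName("ProcessEventsOrchestrator")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        var events = context.GetInput<string[]>();

        // Fan out to both sinks in parallel; the activity names are hypothetical.
        Task logAnalytics = context.CallActivityAsync("SendToLogAnalytics", events);
        Task storage = context.CallActivityAsync("SendToStorage", events);
        await Task.WhenAll(logAnalytics, storage);
    }

    // Hypothetical activity stubs standing in for the real sinks.
    [FunctionName("SendToLogAnalytics")]
    public static Task SendToLogAnalytics([ActivityTrigger] string[] events) => Task.CompletedTask;

    [FunctionName("SendToStorage")]
    public static Task SendToStorage([ActivityTrigger] string[] events) => Task.CompletedTask;
}
```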

We've been using the default Azure Storage backend provider since we developed the service, and it's been pretty stable for about a year.

The main finding from these past few weeks of testing in USNat is that the Netherite Event Hub needs to have sufficient Throughput Units assigned to it. However, I also noticed that when running the same number of instances that we currently run in production (8 instances), the Netherite version performs well for about 4-5 hours, then it starts to slow down, and eventually the timeouts happen.

For example, I compared how long requests take in our Event Hub trigger function (storage backend vs. Netherite). For some reason, the trigger function requests take longer and longer under Netherite, until it slows to the point of producing timeouts.

I think this may be due to the load on the Event Hub that we are subscribing to, and I'm working to make sure we can reproduce the same load in our test environment so that our test Function App ("rs-ec-eastus-eh00-neth") shows the same behavior.

sebastianburckhardt commented 1 year ago

Traces showed that EH is delivering a corrupted packet that does not deserialize correctly. This sends everything into a tailspin: it keeps trying to recover, but never gets anywhere because it just receives the same corrupt packet again.
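
Schematically, the failure mode looks like the loop below (this is only an illustration, not the actual Netherite receive path; the receive, deserialize, and process steps are placeholders): recovery restarts from the last checkpoint, which still points at the corrupt packet, so no progress is ever made.

```csharp
using System;
using System.Runtime.Serialization;
using System.Threading.Tasks;

public class PoisonPacketLoopSketch
{
    long checkpoint;

    public async Task RunAsync()
    {
        while (true)
        {
            (long offset, byte[] packet) = await ReceiveAtAsync(checkpoint);
            try
            {
                object evt = Deserialize(packet);
                await ProcessAsync(evt);
                checkpoint = offset + 1;   // advance only after successful processing
            }
            catch (SerializationException)
            {
                // Recovery re-subscribes at the old checkpoint, which still points
                // at the same corrupt packet, so the loop never makes progress.
            }
        }
    }

    // Placeholders standing in for the real receive / deserialize / process steps.
    Task<(long offset, byte[] packet)> ReceiveAtAsync(long offset) =>
        Task.FromResult((offset, Array.Empty<byte>()));

    object Deserialize(byte[] packet) =>
        throw new SerializationException("corrupt packet");

    Task ProcessAsync(object evt) => Task.CompletedTask;
}
```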

There is a good chance this is a bug in the EH sender code. Since I am just about to replace this code anyway (with the solution to #268), I am prioritizing the fix for #268, which should then also fix this case (and hopefully improve performance as well).

sebastianburckhardt commented 1 year ago

The latest release, 1.4.0, now addresses this issue. It contains a blob-batching optimization (#275) that replaces most of the EH sender code (including the code in the snippet you included). The idea is to detect cases where the data volume or the number of messages is not small, and then store those messages in Azure blobs, passing only the blob reference through EH, instead of sending all the data through EH, which has been the bottleneck.
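
Schematically, the idea looks like this (a sketch of the approach described above, not the actual #275 code; the thresholds and helper methods are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class BatchSenderSketch
{
    // Hypothetical thresholds; the real cutoffs in #275 may differ.
    const int MaxInlineMessages = 100;
    const int MaxInlineBytes = 128 * 1024;

    public async Task SendBatchAsync(IReadOnlyList<byte[]> messages)
    {
        int totalBytes = messages.Sum(m => m.Length);

        if (messages.Count <= MaxInlineMessages && totalBytes <= MaxInlineBytes)
        {
            // Small batch: send the payload directly through Event Hubs.
            await SendThroughEventHubsAsync(Flatten(messages));
        }
        else
        {
            // Large batch: park the payload in a blob and send only the blob
            // reference through Event Hubs, so EH throughput no longer limits
            // the data volume.
            Uri blobUri = await UploadToBlobAsync(Flatten(messages));
            await SendThroughEventHubsAsync(EncodeReference(blobUri));
        }
    }

    static byte[] Flatten(IReadOnlyList<byte[]> messages)
        => messages.SelectMany(m => m).ToArray();

    static byte[] EncodeReference(Uri blobUri)
        => System.Text.Encoding.UTF8.GetBytes(blobUri.ToString());

    // Placeholders standing in for the real Event Hubs and Blob SDK calls.
    Task SendThroughEventHubsAsync(byte[] payload) => Task.CompletedTask;
    Task<Uri> UploadToBlobAsync(byte[] payload) =>
        Task.FromResult(new Uri("https://example.blob.core.windows.net/batches/placeholder"));
}
```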

I think there is a good chance this will fix the issue you observed here. If not, let me know and I will take another look.

lilyjma commented 1 year ago

Hi @UMCPGrad, did you get a chance to try the latest release Sebastian mentioned? We'd love to know whether that resolved the problem you saw. Thanks!