microsoft / durabletask-netherite

A new engine for Durable Functions. https://microsoft.github.io/durabletask-netherite

Activity is executed twice under heavy load #304

Closed shibayan closed 1 year ago

shibayan commented 1 year ago

In the combined Functions Premium and Netherite environment, when a very large number of specific Activities were executed, we observed a rare case where an Activity was executed twice with the same data.

I had thought that Netherite's implementation of Event Hub in combination with FASTER would make it unlikely that such multiple execution of Activity would occur, but it seems that it is occurring more than I had expected.

Are there any recommended settings to prevent such multiple executions?

davidmrdavid commented 1 year ago

Hi @shibayan, thanks for reaching out. It's a bit difficult to diagnose this issue without having more context. Would you be able to provide us with an example instanceID that experienced this issue, the activity name that got doubly executed, the app name on Azure, and the timerange in UTC where this issue can be observed? From there, we should be able to assist better. Thanks!

shibayan commented 1 year ago

@davidmrdavid Thanks for the reply. Since this issue occurred in a production environment, I would like to exchange App Service information and other information via email. Can you provide me with an email address?

sebastianburckhardt commented 1 year ago

> I had thought that Netherite's implementation of Event Hub in combination with FASTER would make it unlikely that such multiple execution of Activity would occur, but it seems that it is occurring more than I had expected.

That is not surprising. Duplicate activity executions are considered a normal occurrence. The most common cause is a partition being moved for load balancing. The partition manager does not try to avoid this: it will not hesitate to move a partition to a different host even if the move creates duplicate activity executions.

davidmrdavid commented 1 year ago

Thanks @shibayan.

Agreed, this isn't entirely surprising. Duplicate activity executions are allowed by the framework (that's why activities need to be idempotent); they're just not supposed to happen very often.
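As a rough illustration of the idempotency requirement, here is a minimal Python sketch. `PaymentStore`, `charge_once`, `charge_activity`, and the `order_id` key are hypothetical names invented for this example, not part of Durable Functions or Netherite. The idea is that the activity derives an idempotency key from its input, so a duplicate execution with the same input repeats no side effect.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentStore:
    """Hypothetical stand-in for an external system that deduplicates by key."""
    charges: dict = field(default_factory=dict)

    def charge_once(self, idempotency_key: str, amount: int) -> str:
        # A repeated call with the same key returns the original receipt
        # instead of charging a second time.
        if idempotency_key not in self.charges:
            receipt = f"receipt-{len(self.charges) + 1}:{amount}"
            self.charges[idempotency_key] = receipt
        return self.charges[idempotency_key]


def charge_activity(store: PaymentStore, payload: dict) -> str:
    """Hypothetical activity body. The key comes from the activity input,
    so a duplicate execution carries exactly the same key."""
    return store.charge_once(payload["order_id"], payload["amount"])


store = PaymentStore()
payload = {"order_id": "order-42", "amount": 100}

first = charge_activity(store, payload)
second = charge_activity(store, payload)  # simulated duplicate execution
```

In this sketch both calls return the same receipt and only one charge is recorded. A real activity would need the external system itself (database, payment API, queue) to enforce the uniqueness check.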

If you would still like me to take a peek, you can DM me on Twitter (https://twitter.com/davidjustodavid) and I'll send you my work email there (just trying to avoid making it public on GitHub). But yeah, unless this is happening extremely frequently, this may very well be part of normal operation for the framework.

shibayan commented 1 year ago

@sebastianburckhardt @davidmrdavid Thank you for your replies. I understand that multiple execution of Activities is acceptable behavior by design.

This behavior may not be well known among Durable Functions users. The documentation says that non-deterministic operations should be performed in Activities, so misunderstandings like mine are probably common.

> Depending on which language you use, a built-in API for generating deterministic GUIDs or UUIDs may be available. Otherwise, use an activity function to return a randomly generated GUID or UUID.

https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-code-constraints?tabs=csharp#guids-and-uuids
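To make "deterministic GUID" concrete, one common way to get an identifier that is stable across re-execution is a name-based UUID (version 5) from Python's standard `uuid` module. The sketch below is only an illustration and not the framework's built-in API: the `NAMESPACE` value, the function name `deterministic_guid`, and the choice of instance ID plus a sequence number as input are all assumptions.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def deterministic_guid(instance_id: str, sequence: int) -> uuid.UUID:
    """Derive the same UUID every time for the same (instance_id, sequence) pair."""
    return uuid.uuid5(NAMESPACE, f"{instance_id}:{sequence}")


guid_a = deterministic_guid("instance-1", 0)
guid_b = deterministic_guid("instance-1", 0)  # same inputs, e.g. on re-execution
guid_c = deterministic_guid("instance-1", 1)  # next identifier in the same instance
```

Here `guid_a` and `guid_b` are equal while `guid_c` differs, so code that runs twice with the same inputs sees the same identifier both times.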

davidmrdavid commented 1 year ago

Thanks @shibayan. I'm trying to understand how to make this guidance clearer. You're correct that Activities are where non-deterministic operations should be performed. At the same time, activities may execute more than once. In that case, as soon as one execution of the activity has its output durably stored, the activity should no longer execute. Therefore, even though the activity may execute more than once, only one of its results will be acknowledged by the orchestrator, so from the perspective of the workflow itself it is as if the activity had executed only once.

Does that make sense? Do you have any suggestions on how to rephrase our guidance so that this can be made clearer? Thanks!
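The "only one result is acknowledged" behavior can be pictured with a toy model. This is a simplified sketch, not Netherite's actual storage logic: `OrchestrationHistory`, `try_complete`, and `random_activity` are hypothetical names, and an in-memory dict stands in for durable storage.

```python
import random


class OrchestrationHistory:
    """Toy model of a history that records at most one result per task."""

    def __init__(self):
        self._results = {}

    def try_complete(self, task_id: int, result: str) -> bool:
        # The first stored result wins; later duplicates are discarded.
        if task_id in self._results:
            return False
        self._results[task_id] = result
        return True

    def result_of(self, task_id: int) -> str:
        return self._results[task_id]


def random_activity(rng: random.Random) -> str:
    """Hypothetical non-deterministic activity: each execution may differ."""
    return f"value-{rng.randint(0, 10**6)}"


rng = random.Random(0)
history = OrchestrationHistory()

# The same task (id 0) is executed twice, e.g. after a partition move.
outputs = [random_activity(rng) for _ in range(2)]
accepted = [history.try_complete(0, out) for out in outputs]
```

Only the first completion is accepted, so the orchestrator observes `outputs[0]` no matter how many times the activity body ran. Any external side effects of the second run still happen, which is why the activity itself needs to be idempotent.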

shibayan commented 1 year ago

@davidmrdavid It makes sense. My customer really likes Durable Functions/Netherite and uses them extensively, so I wanted to make the behavior clear.

The question was successfully resolved. I would like to provide feedback on the documentation separately. Thanks!