microsoft / durabletask-netherite

A new engine for Durable Functions. https://microsoft.github.io/durabletask-netherite

Entities become unresponsive under load #339

Open greg-zund opened 8 months ago

greg-zund commented 8 months ago

We are using durable functions in Azure with Netherite and an elastic premium plan (EP2). We are using a setup with only entity functions and no orchestrators. Each entity has a list of work items it needs to process and an operation to trigger the processing of one task in the list. If the list is not empty after the operation has finished, the entity signals itself to run the operation again.

Pseudocode:

class Worker {
  private List<WorkItem> workItems

  public AddWork(items) {
    workItems.append(items)
  }

  public Calculate() {
    if (!workItems.Empty) {
      var wo = workItems.dequeue()
      doWork(wo) // side effect: write to db
      if (!workItems.Empty) {
        ctx.Signal(myself, "Calculate")
      }
    }
  }
}
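
For concreteness, the same pattern expressed as an in-process class-based entity would look roughly like this (a minimal sketch; WorkItem and DoWork stand in for our real types and database logic):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

public class WorkItem { /* placeholder for our custom work item type */ }

[JsonObject(MemberSerialization.OptIn)]
public class Worker
{
    // the backlog of work items is the only serialized entity state
    [JsonProperty("workItems")]
    public Queue<WorkItem> WorkItems { get; set; } = new Queue<WorkItem>();

    public void AddWork(List<WorkItem> items)
    {
        foreach (var item in items)
        {
            this.WorkItems.Enqueue(item);
        }
    }

    public void Calculate()
    {
        if (this.WorkItems.Count > 0)
        {
            var wo = this.WorkItems.Dequeue();
            DoWork(wo); // side effect: write to db

            if (this.WorkItems.Count > 0)
            {
                // signal this same entity to process the next item in a new operation
                Entity.Current.SignalEntity(Entity.Current.EntityId, nameof(Calculate));
            }
        }
    }

    private void DoWork(WorkItem wo) { /* reads from and writes to the database */ }

    [FunctionName(nameof(Worker))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Worker>();
}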

The workers are created and initially signaled by another function:

Pseudocode:

for(i = 1 to n) {
 client.Signal(worker+i, "AddWork", getWorkForI(i))
 client.Signal(worker+i, "Calculate")
}
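
A sketch of that dispatcher with the in-process client bindings (StartWorkers, GetWorkForI, and the HTTP trigger are placeholders, not our exact code):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class StartWorkers
{
    [FunctionName("StartWorkers")]
    public static async Task Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
        [DurableClient] IDurableEntityClient client)
    {
        int n = 100; // placeholder for the actual number of workers

        for (int i = 1; i <= n; i++)
        {
            var entityId = new EntityId(nameof(Worker), "worker" + i);

            // queue up the work, then kick off the self-perpetuating Calculate loop
            await client.SignalEntityAsync(entityId, "AddWork", GetWorkForI(i));
            await client.SignalEntityAsync(entityId, "Calculate");
        }
    }

    private static List<WorkItem> GetWorkForI(int i)
    {
        return new List<WorkItem>(); // placeholder: load the work for worker i
    }
}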

The problem we are facing is that this setup runs fine for some time, but then entities start becoming "stuck" (they stop doing the calculations) and a query to ListEntitiesAsync times out. The only way to revive the durable functions is to restart the function app in Azure. We see some storage exceptions in the logs, but nothing really meaningful (to us). We don't see this problem without Netherite (although it should be noted that we don't have the exact same system deployed with durable functions backed by Azure Storage).

Is there a good way to debug this kind of problem when the durable runtime becomes unresponsive, or does someone see an obvious problem with the setup we are using?

davidmrdavid commented 8 months ago

Hi @greg-zund:

Can you tell me a bit more about how much "load" is put on these Entities?

Here are a few data points that should help us get a pulse on this:

(1) How large are the inputs sent to the Entities? As with Orchestrators, inputs and outputs to DF APIs should be kept small to avoid performance issues in the long run. I've seen plenty of cases where folks use Entities as a kind of "database replacement" where, over time, their Entities may get stuck (i.e. slowed down greatly) due to the cost of de-/serializing large states repeatedly.

(2) How big does your workItem backlog tend to get? It would be good to log this, since the backlog is part of your Entity state, which means that if an Entity accumulates a large backlog, its processing will slow down due to the cost of de-/serializing that state repeatedly (see the logging sketch after this list).

(3) How many signals are coming to the same Entity instanceID at any point in time?
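
For (2), a minimal way to surface the backlog size is to log it at the start of each Calculate operation. A sketch, assuming a class-based entity that receives an ILogger via DispatchAsync:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

[JsonObject(MemberSerialization.OptIn)]
public class Worker
{
    [JsonIgnore]
    private readonly ILogger log;

    public Worker(ILogger log) => this.log = log;

    [JsonProperty("workItems")]
    public Queue<WorkItem> WorkItems { get; set; } = new Queue<WorkItem>();

    public void Calculate()
    {
        // emit the backlog size so it can be charted and alerted on
        this.log.LogInformation(
            "Entity {entityId} backlog size: {backlogSize}",
            Entity.Current.EntityId, this.WorkItems.Count);

        // ... process one item and re-signal, as in the original pseudocode ...
    }

    [FunctionName(nameof(Worker))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx, ILogger log)
        => ctx.DispatchAsync<Worker>(log);
}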

Is there a good way to debug this kind of problem when the durable runtime becomes unresponsive, or does someone see an obvious problem with the setup we are using?

I think I'll be able to answer this more definitively after getting your thoughts on my questions above. Thanks!

greg-zund commented 8 months ago

Hi @davidmrdavid, thank you for your prompt response. Here is some feedback from our side:

(1) The state is a list of custom objects, each with 3 long properties (max 10000 elements; normally and on average fewer than 10 elements). The entities have a lot of properties with the [JsonIgnore] attribute; I assume these do not hurt.

(2) We have several metrics for the backlog and for the addition and removal of work items. However, in the case we are looking at, it seems that some entities have stopped doing work altogether. As mentioned above, 10000 elements in the backlog is the max I have seen, but potentially this could be higher if the delay in processing never ends.

(3) For one entity, there are several signals (the exact point in time is hard to determine).

I hope we can get this working; any help is appreciated.

greg-zund commented 8 months ago

Also, I was under the impression that the "keep it small" advice referred to inputs and outputs, not to the entity state?

greg-zund commented 8 months ago

Another addition: the doWork part of the Calculate function above also reads from the DB, so potentially there are many open requests to the database.

greg-zund commented 7 months ago

@davidmrdavid any new ideas?

alexjebens commented 4 months ago

Have you looked at https://microsoft.github.io/durabletask-netherite/#/ptable? In particular, look for timestamps that are not within 2 minutes; that is indicative of a "stuck" partition. If so, you may be suffering from the same issue I have been having in #342. We have seen this occur approximately every 1-2 weeks. The only way we found to fix it was to reset the state, which of course deleted any open data. We have been using v1.5.0 (and 1.5.1) for over a week now and have not seen it reoccur yet.
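
If it helps, here is a quick way to scan that table for stale rows (a sketch using the Azure.Data.Tables SDK; the connection string and table name are whatever the ptable page describes for your task hub):

using System;
using Azure.Data.Tables;

class PartitionCheck
{
    // Flags partitions whose table row has not been updated for more than 2 minutes.
    // Assumes each row's Timestamp reflects the partition's last status update,
    // as described on the ptable page linked above.
    static void FindStuckPartitions(string connectionString, string tableName)
    {
        var table = new TableClient(connectionString, tableName);
        var cutoff = DateTimeOffset.UtcNow.AddMinutes(-2);

        foreach (TableEntity row in table.Query<TableEntity>())
        {
            if (row.Timestamp < cutoff)
            {
                Console.WriteLine($"Possibly stuck partition {row.RowKey}: last update {row.Timestamp}");
            }
        }
    }
}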

davidmrdavid commented 4 months ago

My apologies here, this thread fell off our radar. I'll respond to the remaining questions I see, in case it's still helpful.

Also, I was under the impression that the "keep it small" advice referred to inputs and outputs, not to the entity state?

The docs have recently been updated to reflect that entity state should also be kept small; sorry it wasn't there before: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-best-practice-reference#keep-entity-data-small
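
To make that concrete: one common way to keep entity state small is to store only lightweight references (for example, work item ids) in the entity and load the full payload from the database inside the operation. A hypothetical sketch, not specific to your code:

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Newtonsoft.Json;

[JsonObject(MemberSerialization.OptIn)]
public class LeanWorker
{
    // only the ids are serialized as entity state; payloads live in the database
    [JsonProperty("workItemIds")]
    public Queue<long> WorkItemIds { get; set; } = new Queue<long>();

    public void AddWork(List<long> ids)
    {
        foreach (long id in ids)
        {
            this.WorkItemIds.Enqueue(id);
        }
    }

    public async Task Calculate()
    {
        if (this.WorkItemIds.Count > 0)
        {
            long id = this.WorkItemIds.Dequeue();
            var item = await LoadWorkItemAsync(id); // hypothetical DB read
            await DoWorkAsync(item);                // side effect: write to db

            if (this.WorkItemIds.Count > 0)
            {
                Entity.Current.SignalEntity(Entity.Current.EntityId, nameof(Calculate));
            }
        }
    }

    // placeholders for the actual data-access code
    private Task<object> LoadWorkItemAsync(long id) => Task.FromResult<object>(null);
    private Task DoWorkAsync(object item) => Task.CompletedTask;

    [FunctionName(nameof(LeanWorker))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<LeanWorker>();
}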

The only way to revive the durable functions is to restart the function app in Azure.

The fact that a restart helps suggests this may not be a stuck partition in the same sense as what you experienced, @alexjebens. In the issue you linked, I don't expect a restart to help, since that was a permanent error de-serializing the partition payload.

Have you looked at https://microsoft.github.io/durabletask-netherite/#/ptable? In particular, look for timestamps that are not within 2 minutes; that is indicative of a "stuck" partition.

This is a great tip. I'm hoping to document it soon, and to throw better errors by leveraging that check in the clients here: https://github.com/microsoft/durabletask-netherite/pull/387

We have been using v1.5.0 (and 1.5.1) for over a week now and have not seen it reoccur yet.

This is great to hear. Please keep us posted, @alexjebens, if it reoccurs. We did encounter one more FASTER corruption bug, which we're fixing here (https://github.com/microsoft/durabletask-netherite/pull/395), so on the next Netherite release you should be getting that fix automatically.

@greg-zund - again, apologies for this thread falling off our radar. Did you work around this issue, or is it still present? Please let me know; I can try to engage more team members here.