Closed gha-zund closed 9 months ago
There's a lot to sift through here. I'll respond to a couple things you mentioned:
The interesting thing is: why should this even be retried? The log says "OperationCanceledException". Why retry a cancellation? (Is there some cancellation-exception handling missing?)
I'm not aware of any case in which Azure Functions or the Durable Task Framework handles OperationCanceledException from user code. We only handle that exception if it's raised by our internal code. I don't think you should expect the underlying framework to handle it any differently, because we can't make any assumptions about what caused it if it's not code that we control. For that reason, retrying the operation is the "right behavior" per the current design. If you don't want the operation to be retried, then you should not be rethrowing your SQL exception at all.
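To illustrate the point above, here is a minimal C# sketch (names and helpers are hypothetical, not from this issue) of an activity that handles the SQL error itself instead of rethrowing, so the invocation completes "successfully" and the framework has no reason to retry it:

```csharp
using System;
using System.Data.SqlClient;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class UpsertCalculatorData // hypothetical activity
{
    [FunctionName("UpsertCalculatorData")]
    public static async Task<bool> Run(
        [ActivityTrigger] string input, ILogger log)
    {
        try
        {
            await UpsertAsync(input); // app-specific SQL work (assumed helper)
            return true;
        }
        catch (SqlException ex)
        {
            // Swallowing the exception means the activity completes without
            // error, so the Durable Task Framework will NOT retry it.
            // Rethrowing (or wrapping it in OperationCanceledException) makes
            // the work item look failed/aborted and triggers a retry.
            log.LogWarning(ex, "SQL error handled in place; no retry will occur.");
            return false;
        }
    }

    private static Task UpsertAsync(string input) => Task.CompletedTask; // placeholder
}
```

This is only a sketch of the trade-off the maintainer describes: handle the exception and report a result, or rethrow and accept the framework's retry behavior.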
I suspect the other errors you're seeing are related to the 100% CPU usage.
We assume that the issue (heavy load on the DB, from the commands made by entities) has something to do with the mssql provider, since we have never experienced this behavior with the same application using the azureStorage provider or Netherite.
It's certainly possible that the high CPU usage by your app code could negatively impact the behavior of the durabletask-mssql provider code, or vice versa. If you're maxing out on vCores, would it make sense to separate these into two different databases?
Hi @cgillum
The databases are already separated; they are both serverless Azure SQL databases, even on different server instances. The durable task database seems to be pretty relaxed, according to the metrics.
We already experimented with scaling out the affected database. As we increased the maximum vCore count (to 10), the CPU metric dropped to a normal range. However, the available resources were not fully utilized, so we returned to the previous configuration (max 6 vCores). Then it happened again...
You could say this means we simply have to scale out and everything is fine, but we are kind of startled. On our production system we have a maximum of 14 vCores with the azureStorage provider, but the activity of the software there is nearly ten times that of our test system. The other thing is: with the same version of the software and the same infrastructure, but with the azureStorage provider instead of mssql, we have never experienced such metrics on the database.
I opened the issue in hope for an explanation :)
@gha-zund if the databases are separate, and the only difference is that you're using a different storage provider (MSSQL vs. Azure Storage) then another possible explanation is that there's a difference in behavior in terms of orchestration and activity throughput and/or concurrency that's causing the vCore usage difference. For example, it could be that you're running more orchestrations and activities per second compared to your previous configuration, which could explain why your app database vCore usage has increased.
If you want to throttle orchestration execution, you can adjust the concurrency settings, reducing them until you're able to maintain an acceptable range of database vCore usage for your app. This can be configured in your host.json file via the maxConcurrentActivityFunctions and maxConcurrentOrchestratorFunctions settings. See here for the host.json configuration reference.
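For reference, a host.json fragment showing where those two throttles live (the values 4 here are illustrative only; tune them against your own vCore metrics):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentActivityFunctions": 4,
      "maxConcurrentOrchestratorFunctions": 4
    }
  }
}
```

Both settings are per host instance, so the effective concurrency also scales with the number of function app instances.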
Thanks for your input!
We already played a lot with the concurrency throttles. It seems we have not yet found a good combination of those settings and scaling.
Based on your answers, I understand that there is no misbehavior or malfunction caused by the mssql provider. Great to hear :)
We encountered a problem with our application when using durable entities with mssql provider.
We have several durable entities which we use to synchronize some data processing work. Those entity operations do some CPU work but also make several I/O calls, for example inserting or deleting data in a SQL database (a different one than the one we use for the durable task hub). The metrics of that database show that CPU is at 100% for hours and that the database scaled out to the configured maximum vCore count. That causes pretty high cost. Query Performance Insight in the Azure portal gives us the hint that the load comes from a SQL statement which is executed by an entity operation.
Although we did not find an actual explanation for this behavior, it feels like the application is caught in a loop somehow. That is just an assumption for now.
In the log of the function app, we found many exceptions and traces related to cancellation scenarios. We consider that normal, since our function app scales out and in all the time (which is desired). Note: the UseGracefulShutdown setting of the durable task extension is set to "true".
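For context, a host.json fragment (illustrative) showing where the graceful-shutdown setting mentioned above is configured in the Durable Functions extension:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "useGracefulShutdown": true
    }
  }
}
```

With this enabled, the host tries to let in-flight orchestrator and entity work drain during shutdown rather than aborting it immediately, which fits the frequent scale-in/scale-out described here.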
However, there was one trace which caught my attention. It says "execution will be aborted and will be retried". Is this the hint to a never-ending retry loop? Is there a waiting period between the retries? The interesting thing is: why should this even be retried? The log says "OperationCanceledException". Why retry a cancellation? (Is there some cancellation-exception handling missing?)
Here is the log I'm referring to:
Please note that "@calculator@G300L251734" is the ID of an entity in our application. Obviously, there was a SQL error response from the database. We have a catch clause which checks for this SQL error and wraps the SqlException in an OperationCanceledException (shouldn't we do this?). OperationCanceledException (and its derivatives like TaskCanceledException) are never caught in our application, since we want the function runtime to handle them (to retry later, after scale-in, on another function app instance).
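As a sketch of the wrapping pattern described above (method and predicate names are hypothetical), the catch clause looks roughly like this; per the maintainer's reply, the framework makes no special allowance for an OperationCanceledException raised by user code, so the work item is simply retried:

```csharp
using System;
using System.Data.SqlClient;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public static class SqlHelper // hypothetical wrapper around the entity's SQL work
{
    public static async Task ExecuteAsync(SqlCommand command, ILogger log)
    {
        try
        {
            await command.ExecuteNonQueryAsync();
        }
        catch (SqlException ex) when (IsKnownSqlError(ex)) // assumed predicate
        {
            log.LogError(ex, "SQL error; surfacing as cancellation.");
            // Rethrowing as OperationCanceledException is what this issue
            // describes: from the framework's point of view this is just a
            // failed/aborted work item, so it gets retried like any other.
            throw new OperationCanceledException("SQL operation aborted.", ex);
        }
    }

    private static bool IsKnownSqlError(SqlException ex) => true; // placeholder
}
```

If the goal is "retry later on another instance", this works, but as noted earlier in the thread there is nothing cancellation-specific about how the framework treats it.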
Together with this one we also find logs like this:
or:
and many scale recommendations, e.g.:
as well as errors from the scale monitor:
In the exception table in Application Insights I found (as expected) lots of TaskCanceledException occurrences. From the SqlOrchestrationService:
(no clue how to get the stack trace as a simple string from application insights)
and of course the OperationCanceledException from our code (we log it and then re-throw it).
One last thing: we assume that the issue (heavy load on the DB, from the commands made by entities) has something to do with the mssql provider, since we have never experienced this behavior with the same application using the azureStorage provider or Netherite.