microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

Azure container app not scaling down when not in use #1239

Closed sam-bradshaw-wcmc closed 3 weeks ago

sam-bradshaw-wcmc commented 1 month ago


Issue description

I have an Azure Function deployed as a container app that is failing to scale down when not in use, resulting in much higher than expected costs. The app is triggered by messages arriving on a queue in an Azure Storage account.

The scale settings for the container app are a minimum of 0 replicas and a maximum of 2 replicas.

The number of replicas appears to be stuck at the maximum replica count of 2, even though the app has only received a handful of messages over the same period.

I have a scale rule, created automatically by the deployment, that scales up when the number of messages on the queue reaches 5. My understanding is that the app should scale up to 2 replicas when the queue reaches 5 messages and, by default, scale back down to 0 once it has finished processing them. See the scale behavior section here: https://learn.microsoft.com/en-us/azure/container-apps/scale-app?pivots=azure-resource-manager#scale-behavior.
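For reference, the automatically generated scale configuration on the container app resource ends up looking roughly like the sketch below (the rule name, queue name, and secret name are placeholders rather than my exact values):

```json
"scale": {
  "minReplicas": 0,
  "maxReplicas": 2,
  "rules": [
    {
      "name": "queue-scale-rule",
      "azureQueue": {
        "queueName": "myqueue",
        "queueLength": 5,
        "auth": [
          {
            "secretRef": "queue-connection-string",
            "triggerParameter": "connection"
          }
        ]
      }
    }
  ]
}
```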

What do I need to do to make sure my container app scales down to 0 replicas when it is not being used? I am not sure whether this is a bug in Azure or in my configuration.

I am assuming this lack of scaling down explains why my cloud costs have been much higher than expected.

Expected behavior

The number of container app replicas should scale down to 0 when the app is idle.

Actual behavior

The number of container app replicas is stuck at the maximum replica count.

Screenshots
(Four screenshots attached, captured 2024-07-25.)

Additional context

My app is deployed using an ARM template that defines a managed container apps environment, a storage account containing a queue, and a function app configured to pull a Docker image from an Azure Container Registry (the container app resource in Azure is created automatically by this deployment). If it is helpful I can include the relevant parts of the ARM template. The app is set up similarly to the steps outlined in this guide: https://learn.microsoft.com/en-us/azure/azure-functions/functions-deploy-container-apps?tabs=acr%2Cbash&pivots=programming-language-python. This GitHub issue was also helpful for getting it working initially: https://github.com/MicrosoftDocs/azure-docs/issues/36505
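Roughly speaking, the template declares resources along these lines (an illustrative skeleton with placeholder parameter names and only the properties relevant here, not my exact template):

```json
"resources": [
  {
    "type": "Microsoft.App/managedEnvironments",
    "apiVersion": "2023-05-01",
    "name": "[parameters('environmentName')]",
    "location": "[parameters('location')]"
  },
  {
    "type": "Microsoft.Storage/storageAccounts",
    "apiVersion": "2023-01-01",
    "name": "[parameters('storageAccountName')]",
    "location": "[parameters('location')]",
    "kind": "StorageV2",
    "sku": { "name": "Standard_LRS" }
  },
  {
    "type": "Microsoft.Storage/storageAccounts/queueServices/queues",
    "apiVersion": "2023-01-01",
    "name": "[format('{0}/default/{1}', parameters('storageAccountName'), parameters('queueName'))]",
    "dependsOn": [
      "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
    ]
  },
  {
    "type": "Microsoft.Web/sites",
    "apiVersion": "2023-01-01",
    "kind": "functionapp,linux,container,azurecontainerapps",
    "name": "[parameters('functionAppName')]",
    "location": "[parameters('location')]",
    "properties": {
      "managedEnvironmentId": "[resourceId('Microsoft.App/managedEnvironments', parameters('environmentName'))]"
    }
  }
]
```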

sam-bradshaw-wcmc commented 1 month ago

I think I have resolved the issue. It was related to the way poison messages are handled.

My app defines two functions that are triggered by messages arriving on queues: one processes messages on the original queue, and the other processes messages on a '-poison' queue. When a message fails to be processed within the max dequeue count, it is automatically sent to the poison queue (see the poison messages section here). If the poison queue does not exist when a message is first sent to it, it is created automatically, which is why I had not explicitly defined the poison queue in my ARM template.

However, the scale rules that are generated automatically for my app include a rule for the '-poison' queue. If the '-poison' queue does not exist (which was the case when the app was first deployed), this appears to break the scaling, and the app never scales down.
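For context, the dequeue limit that sends messages to the poison queue comes from the Functions host settings; a minimal host.json sketch (5 is the default maxDequeueCount):

```json
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxDequeueCount": 5
    }
  }
}
```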

The solution was to explicitly define the '-poison' queue in my ARM template.
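Concretely, adding a resource along these lines next to the existing queue definition fixed it for me (a sketch; 'myqueue' and the parameter names are placeholders for my actual values):

```json
{
  "type": "Microsoft.Storage/storageAccounts/queueServices/queues",
  "apiVersion": "2023-01-01",
  "name": "[format('{0}/default/{1}-poison', parameters('storageAccountName'), 'myqueue')]",
  "dependsOn": [
    "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccountName'))]"
  ]
}
```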

I think this should still be addressed as a bug in the KEDA scaling: if a scale rule cannot connect to a queue (in this case because the queue did not exist), surely it should scale down by default rather than keeping the maximum number of replicas running (and therefore causing additional costs for the customer).

anthonychu commented 1 month ago

Can you please create an issue at https://github.com/kedacore/keda/issues to see if this is the intended behavior or a bug? If it is intended, it looks like the storage queue scaler page might be missing a note stating that the queue must exist (the Service Bus scaler docs have this note).

JorTurFer commented 1 month ago

Hello, I'm Jorge from KEDA 😄 KEDA already supports the feature you are requesting via the fallback. Using KEDA, you can define the desired fallback and the workload will be scaled to 0. We deliberately decided not to modify the current replica count of the workload when there is no information from the upstream, because there are cases where going to 0 is best, cases where going to max is best, and cases where doing nothing is best. With those options in mind, we developed the fallback feature so that users have the power to enforce whichever behavior they want. For example, I personally use KEDA with the Prometheus scaler to scale HTTP applications, and scaling to 0 on Prometheus issues would be crazy; in our case we prefer scaling to max replicas to protect the user experience.
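For anyone reading this who manages KEDA directly (outside ACA), fallback is configured on the ScaledObject, roughly like the sketch below; the names, trigger metadata, and replica counts are illustrative only, and the fallback replicas value reflects the scale-to-zero behavior described above:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-processor        # placeholder name
spec:
  scaleTargetRef:
    name: queue-processor      # placeholder workload name
  minReplicaCount: 0
  maxReplicaCount: 2
  fallback:
    failureThreshold: 3        # consecutive scaler errors before the fallback applies
    replicas: 0                # replica count enforced while the scaler is failing
  triggers:
    - type: azure-queue
      metadata:
        queueName: myqueue     # placeholder queue name
        queueLength: "5"
      authenticationRef:
        name: queue-auth       # placeholder TriggerAuthentication
```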

IDK if ACA supports this feature somehow or not 🤷

anthonychu commented 1 month ago

@JorTurFer Thanks for the info. I think the current behavior when a scaler has an error is the ideal default for the reasons you outlined. It's not a user-configurable setting in ACA today.

@sam-bradshaw-wcmc Sounds like this confirms that the Azure Storage queue scaler does require the queue to be present to function correctly. Are you deploying function apps to ACA using these instructions? Perhaps we can add a note in docs that all trigger sources need to exist before the app is deployed. @raorugan

microsoft-github-policy-service[bot] commented 1 month ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.