microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps

Crashed Container App Job Causes Infinite Loop in KEDA Scaler #1234

Open robrennie opened 1 month ago

robrennie commented 1 month ago

Issue description

When following the steps to retrieve a message from the queue and process it in a Container App Job per #1216, an infinite loop is created in the KEDA scaler if the app job crashes without deleting the queue message. This loop racks up usage charges and results in over-billing.

In this case, I don't know why my container app crashed: there were no system logs, and the console logs just stopped without error. It appears Azure infrastructure work was being done, as this has worked before. Sometimes jobs just get torn down as "Failed" with no system logs; the console log looks fine up to a point and then simply stops.

Steps to reproduce

  1. Create an app job that runs for a while and then crashes on purpose (so it never deletes the queue message).
  2. Per the steps in #1216, the container should (a minimal sketch of this pattern follows the list):
    • Pull a message from the queue
    • Set the visibility timeout to something longer than what is needed to process the message
    • Crash (or allow Azure to crash your job in the evening PST).
  3. At this point, the KEDA scaler will see the message remaining in the queue and constantly try to start a container.
  4. When the container starts, it sees no message in the queue (due to visibility timeout) and thus stops.
  5. Infinite loop back to 3 above.
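
For reference, a minimal sketch of the receive-and-crash pattern from step 2 in C#, assuming Azure.Storage.Queues, a .NET 6+ top-level program with implicit usings, and an illustrative queue name and connection-string environment variable:

```csharp
using Azure.Storage.Queues;

var queue = new QueueClient(
    Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING"), // assumed env var
    "work-items");                                                   // assumed queue name

// Hide the message for longer than the job should ever need.
var received = await queue.ReceiveMessagesAsync(
    maxMessages: 1,
    visibilityTimeout: TimeSpan.FromHours(2));

if (received.Value.Length > 0)
{
    // Simulate the crash: the message is never deleted, so it stays invisible
    // for the full two hours while KEDA keeps seeing a queue length of 1 and
    // starting new executions that immediately find nothing to do.
    Environment.FailFast("simulated crash");
}
```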

I can understand the argument that you should catch any exception (or panic, in my case) and delete the queue message, but in this case the Azure Container Apps infrastructure is crashing the containers, and there are no system logs to explain why. So that's not an option.

I did add a queue message expiration (time-to-live), and that does appear to stop the infinite loop, but I'm sure many users don't set one. Regardless, for however long this situation lasts, Azure customers are over-charged for CPU time while the KEDA scaler constantly tries to service an "invisible" queue message.
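
For anyone else hitting this, a hedged sketch of setting that expiration when the message is enqueued, assuming Azure.Storage.Queues (the queue name and the 12-hour TTL are illustrative):

```csharp
using Azure.Storage.Queues;

var queue = new QueueClient(
    Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING"), "work-items");

// If the job crashes and never deletes the message, the queue discards it
// after 12 hours instead of leaving KEDA to chase it indefinitely.
await queue.SendMessageAsync("job-parameters", timeToLive: TimeSpan.FromHours(12));
```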

If the KEDA scaler is trying to match the number of container app jobs running to the number of visible or invisible queue messages, then this is a design flaw that appears to be intractable.

anthonychu commented 1 month ago

This is the default behavior for most queues, to guarantee that a message that fails in processing is not lost. I don't believe there are any built-in deadletter capabilities in Azure Storage queues. You can implement your own by first checking the message's dequeue count when you receive it. If it's greater than a certain number, instead of processing it, you can delete it and optionally send it to a separate deadletter/poison message queue.
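
A hedged sketch of that dequeue-count check in C# with Azure.Storage.Queues; the queue names and retry threshold are illustrative:

```csharp
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

var connectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING");
var queue = new QueueClient(connectionString, "work-items");              // assumed name
var poisonQueue = new QueueClient(connectionString, "work-items-poison"); // assumed name
await poisonQueue.CreateIfNotExistsAsync();

QueueMessage[] messages = await queue.ReceiveMessagesAsync(
    maxMessages: 1, visibilityTimeout: TimeSpan.FromHours(2));

foreach (QueueMessage message in messages)
{
    if (message.DequeueCount > 3) // threshold is arbitrary
    {
        // The message has already failed several times: treat it as poison
        // instead of letting it trigger yet another execution.
        await poisonQueue.SendMessageAsync(message.MessageText);
        await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        continue;
    }

    ProcessMessage(message.MessageText);
    await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
}

void ProcessMessage(string text) { /* the job's real work goes here */ }
```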

Azure Service Bus has a built-in deadletter queue.

We're investigating the cause of the job's crashes in #1235.

robrennie commented 1 month ago

@anthonychu sorry, I'm having this issue again. I've come to the conclusion that using queues and a KEDA scaler for long-running jobs is just a bad idea. With all due respect, your explanation misunderstands the problem:

It's not that the queue is trying to make sure a message is processed; it's the KEDA scaler spinning up a container to process a message that is invisible because of a prior crashed container. The KEDA scaler will keep starting containers, which will find no message, shut down, and then rinse and repeat. This will continue until the visibility timeout clears and a container finally processes the message (in my case deleting it due to the dequeueCount).

I understand why this happens, but it's a design flaw in the Azure Container Apps job architecture, plain and simple. When the KEDA scaler starts these useless jobs (mine run for about 10 s) over and over until the visibility timeout clears, Microsoft is overcharging the customer. Useless jobs running over and over is effectively equivalent to the job running at its full CPU allocation for the entire time until the visibility timeout clears! This is really bad: consider a job that may need 6 hours to run and fails in the first hour! The customer is overcharged by 5 hours!

If the queue realized the message had failed to be processed, it would clear the visibility timeout and let something else process it. But the queue has no idea the message processing failed; the KEDA scaler, however, does.

We are going to have to switch to starting these jobs manually via the management API, unfortunately, which is one heck of an ugly API (the authentication), and basically rewrite the entire job management piece.

I wish you guys had a simpler way to start jobs, like other Azure APIs that just need a SAS key. Can you think of any other way to start 20 containers and, if one fails, just let it fail?

P.S. Even stranger, my "Replica retry limit" is set to zero, yet it's being ignored. Perhaps this just needs to get fixed.

anthonychu commented 4 weeks ago

@robrennie Yes, that makes sense. Agreed, that behavior is not great. Looks like KEDA added a queueLengthStrategy configuration to address this. You will be able to set it to visibleonly.

This property is in KEDA 2.15, which was just released. It usually takes at least a few weeks of testing and deployment before a new version is available on Container Apps.
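
Once that rolls out, a job's azure-queue scale rule should presumably just need the new metadata property. A hedged sketch of the rule fragment, assuming the metadata is passed straight through to KEDA's azure-queue scaler (rule and secret names are illustrative):

```yaml
rules:
  - name: queue-rule
    type: azure-queue
    metadata:
      queueName: work-items
      queueLength: "1"
      queueLengthStrategy: visibleonly   # new in KEDA 2.15; the default also counts invisible messages
    auth:
      - secretRef: queue-connection-string
        triggerParameter: connection
```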

A temporary workaround is to use a short visibility timeout and then have a thread in the job that periodically extends it in a loop.
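
A hedged sketch of that renewal loop in C# with Azure.Storage.Queues (queue name, lease length, and renewal interval are illustrative; note that each UpdateMessageAsync call returns a new pop receipt that must replace the old one):

```csharp
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

var queue = new QueueClient(
    Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING"), "work-items");

QueueMessage[] messages = await queue.ReceiveMessagesAsync(
    maxMessages: 1, visibilityTimeout: TimeSpan.FromMinutes(2)); // short lease
if (messages.Length == 0) return;

QueueMessage message = messages[0];
string popReceipt = message.PopReceipt;
using var cts = new CancellationTokenSource();

// Renewal loop: keep the message invisible only while this execution is
// actually alive. If the job crashes, the short timeout expires within
// minutes and the message becomes visible again.
var renewal = Task.Run(async () =>
{
    try
    {
        while (true)
        {
            await Task.Delay(TimeSpan.FromMinutes(1), cts.Token);
            UpdateReceipt receipt = await queue.UpdateMessageAsync(
                message.MessageId, popReceipt,
                visibilityTimeout: TimeSpan.FromMinutes(2), cancellationToken: cts.Token);
            popReceipt = receipt.PopReceipt; // each update invalidates the old receipt
        }
    }
    catch (OperationCanceledException) { /* work finished; stop renewing */ }
});

await DoTheLongRunningWork(message.MessageText);
cts.Cancel();
await renewal;
await queue.DeleteMessageAsync(message.MessageId, popReceipt);

async Task DoTheLongRunningWork(string payload) => await Task.Delay(1000); // stand-in for real work
```

This keeps the window during which the scaler churns on an invisible message down to roughly the lease length rather than the full processing time.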

For manually starting jobs, what language are you using? You should be able to get a token using the Azure Identity SDK and managed identity.
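
For C#, a hedged sketch of that suggestion: DefaultAzureCredential (which picks up a managed identity when one is available) to get an ARM token, then a POST to the job's start endpoint. The subscription, resource group, and job names are placeholders, and the api-version may need adjusting:

```csharp
using System.Net.Http.Headers;
using Azure.Core;
using Azure.Identity;

var credential = new DefaultAzureCredential(); // uses managed identity when available
AccessToken token = await credential.GetTokenAsync(
    new TokenRequestContext(new[] { "https://management.azure.com/.default" }), default);

// Placeholder identifiers; substitute your own values.
var startUrl =
    "https://management.azure.com/subscriptions/<subscription-id>" +
    "/resourceGroups/<resource-group>/providers/Microsoft.App/jobs/<job-name>" +
    "/start?api-version=2023-05-01";

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("Bearer", token.Token);

// An empty body starts the job with its configured template.
var response = await http.PostAsync(startUrl, content: null);
response.EnsureSuccessStatusCode();
```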

robrennie commented 4 weeks ago

@anthonychu I've rewritten the C# code to start the jobs (which are written in Rust) manually via the management API and removed the queue altogether. Perhaps, in my use case (starting exactly n containers at the same time with slightly different starting parameters, and not needing retries/restarts of containers), I was barking up the wrong tree using queues.

Now I write the startup parameters to an Azure Table, then manually start the job, passing the partition/row keys to the container via environment variables. I was able to delete about 30% of my code dealing with this, so that's always good. No more infinite loops, and things work much more as expected.
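
A hedged sketch of that pattern: Azure.Data.Tables for the parameter row, then the same start endpoint as in the earlier sketch with a template override carrying the keys. Table, container, and variable names are illustrative, and the template/env-override shape is an assumption worth checking against the management API docs:

```csharp
using System.Net.Http.Headers;
using System.Net.Http.Json;
using Azure.Core;
using Azure.Data.Tables;
using Azure.Identity;

// 1. Persist the run's parameters as a table row.
var table = new TableClient(
    Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING"), "JobRuns");
await table.CreateIfNotExistsAsync();

var partitionKey = "batch-001";
var rowKey = Guid.NewGuid().ToString();
await table.AddEntityAsync(new TableEntity(partitionKey, rowKey)
{
    ["InputPath"] = "https://example.blob.core.windows.net/input/file-01.csv"
});

// 2. Start the job, handing the container only the keys it needs to look
//    the parameters back up.
AccessToken token = await new DefaultAzureCredential().GetTokenAsync(
    new TokenRequestContext(new[] { "https://management.azure.com/.default" }), default);
var startUrl =
    "https://management.azure.com/subscriptions/<subscription-id>" +
    "/resourceGroups/<resource-group>/providers/Microsoft.App/jobs/<job-name>" +
    "/start?api-version=2023-05-01";

var template = new
{
    containers = new[]
    {
        new
        {
            name = "worker",                               // assumed container name
            image = "myregistry.azurecr.io/worker:latest", // assumed image
            env = new[]
            {
                new { name = "RUN_PARTITION_KEY", value = partitionKey },
                new { name = "RUN_ROW_KEY", value = rowKey }
            }
        }
    }
};

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token.Token);
var response = await http.PostAsJsonAsync(startUrl, template);
response.EnsureSuccessStatusCode();
```

The container can then read RUN_PARTITION_KEY/RUN_ROW_KEY from its environment and fetch the row with TableClient.GetEntityAsync.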

The management API requires a bunch more config: subscription IDs, client IDs, setting up RBAC for the client on a subscription (and it's unclear which roles are needed for containers), etc. Accessing Azure Storage, by contrast, just requires a connection string and a container/table/queue name. It would be nice if there were an analogous, easier way to start manual jobs.

Thanks!

robrennie commented 4 weeks ago

@anthonychu Actually, pondering the KEDA fix with queueLengthStrategy, I think it might make things worse. Simple example:

  1. A queue message scales up one job that loads external data into a database, expected to run for at least 6 hours, in the middle of the night (downtime).
  2. The job crashes (e.g., due to an unexpected infrastructure event).
  3. An IT person comes in, sees the job crashed, but forgets to manually delete the now-invisible message from the queue.
  4. KEDA scales the job back up 8+ hours later once the message becomes visible again, starts the data load in the middle of working hours, brings down the customer, and potentially creates a data corruption issue.

I'm tellin' ya, using a queue to start long-running jobs is simply hammering a square peg into a round hole. The word "scale" doesn't jibe with long-running jobs. Scale implies, to me at least, a continuous workload that increases or decreases.

Also, it's strange that I can use an HTTP request to scale up a container app (non-job), yet I have to go through the management API to start a manual job.

I think Azure Container Apps jobs are missing: 1) an easier way to start them without the KEDA scaler, 2) Event Grid events (e.g., status changes in particular), and 3) real-time memory/CPU visibility.

anthonychu commented 4 weeks ago

For this scenario you outlined, where you want to trigger a job during specific times, I agree a queue is probably not the solution.

Queues are useful for ensuring at-least-once, asynchronous processing of events.

I also agree that "scale" isn't the best term to describe how job executions are triggered by events. It's what KEDA uses and we wanted to remain consistent.