jason-berk-k1x opened 1 month ago
same here but only since enabling NAT GW
I opened a support ticket.... didn't hear anything from anyone for over a week. They finally got back to me on 5/24, and after a 90 minute phone call they weren't able to offer any help or insight into what this error even means. They said they were going to escalate my ticket 🤷
ran the query today just to have something to compare to:
so in 90 days I've gotten 25,361 failures, which is roughly 281 a day
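A quick sanity check on the numbers above (just the arithmetic, nothing Azure-specific):

```python
# Failure rate from the query window reported above.
failures = 25_361   # AssigningReplicaFailed events over the window
days = 90           # length of the query window

per_day = failures / days
print(f"{per_day:.1f} failures/day")  # ≈ 281.8
```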
We're having the same problem. We're using the D16 workload profile, and we're seeing 50% of job executions encounter this error, which adds over 2 minutes to runs that normally take about 20 seconds.
my jobs are configured to timeout after 15 minutes and I poll the service bus queue every 10 seconds.
my current theory is that we run into issues mostly when, for whatever reason, there's no capacity on the nodes that Azure manages and we have to wait for them to add more nodes to the pool....then pull the image (which is severely bloated at almost 9 GiB) onto that new hardware....then start running the job. Also, mind you, the 15 minute clock starts ticking when the message hits the queue and KEDA tries to start a job.... NOT when the job actually starts running
I've seen the exact same processing take three minutes when it runs on a node pool that already has pulled the image and has capacity to run a job.....then take 12 to 13 minutes when everything is starting cold
the problem is, if you set Min Executions to 0 there is no node to schedule the container on. When the job starts, a node has to be initialized, but that takes more time than the timeout. On the next run, however, it will work - until, after some time, the node is destroyed again and the pool scales back down to 0.
You can avoid the issue by setting the minimum node count to at least 1, although a minimum of 3 is recommended. This does not happen on the consumption plan because there the jobs run on shared nodes, which are always running.
> This does not happen on the consumption plan because there the jobs run on shared nodes, they are always running.
right, those consumption plan nodes are always running.....but they might not have capacity to run your job...and now your job is just waiting for new nodes to be added to that node pool....or maybe for capacity to become available. If new nodes are provisioned, then the job will likely need to pull the image fresh....adding more delay
as an aside, I'm also struggling with the math of the dedicated plans. For example, if I create a new container app environment, and define a D4 profile, I can run my job on either:
on the consumption plan with max resources of 4 cores and 8GiB of RAM, at less than one cent / minute my 5 minute job will cost me $0.05
if I use the D4 profile instead......the price now depends on how you have the min/max nodes configured. If I set the min to zero to emulate the consumption plan, then I don't feel like I'm really gaining anything because I still have to tolerate a cold node start if my jobs are intermittent and bursty (which they are). Even after the cold start, if I'm comparing apples to apples.....I'd be using the entire 4 cores of the D4, meaning I could only run one job per node. I understand I could set my max nodes to 25 and potentially have 25 nodes each running a single job...which is an experiment I'm about to go run.
All that said, if I just keep one D4 node running, that would run me about $225 / month and the vast majority of that time, NOTHING would be running on that node.
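To put that comparison in numbers (using the figures from this thread - the per-minute consumption rate and the ~$225/month D4 price are the commenter's estimates, not official Azure pricing):

```python
# Rough per-job cost comparison, consumption plan vs. an always-on D4 node.
# Rates below are the thread's back-of-the-envelope figures.
consumption_per_min = 0.01   # $/min for a 4 vCPU / 8 GiB job (upper bound)
job_minutes = 5

cost_per_job = consumption_per_min * job_minutes   # ~$0.05 per run
d4_always_on = 225.0                               # $/month for one idle D4 node

# How many 5-minute runs per month before the dedicated node pays off.
break_even_jobs = d4_always_on / cost_per_job
print(f"consumption: ${cost_per_job:.2f}/job; "
      f"break-even vs always-on D4: {break_even_jobs:.0f} jobs/month")
```

So unless the jobs run thousands of times a month, the idle D4 node is mostly paying for nothing, which matches the complaint above.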
I don't see how I could get anywhere near 4 cores and 8 GiB per job, with capacity for 25 concurrent jobs (the app env has a soft limit of 100 cores), at a price comparable to the consumption plan..... especially when
> The first 180,000 vCPU-seconds, 360,000 GiB-seconds, and 2 million requests each month are free. Beyond that, you pay for what you use on a per-second basis based on the number of vCPU-s and GiB-s your applications are allocated.
Issue description
Looking at my execution history, I see a ton of jobs that failed. In an attempt to understand why they failed, I ran this query on my logs:
got back this result:
what is `AssigningReplicaFailed`? I can't find any documentation as to why this is happening or what I can do to fix the issue.

Steps to reproduce
set up a container app job that is triggered by KEDA whenever a queue has messages
Expected behavior
some sort of documentation around these error Reasons
Actual behavior
I'm dead in the water with no idea why I'm getting this error so often