Closed BeChaRem closed 6 months ago
I am facing the same issue. Was this addressed already?
@BeChaRem we investigated the behavior and can see that an internal process intermittently goes into a race condition, because of which the job execution completion is delayed. For the execution that you mentioned, we confirmed that it was successful: the job container exited successfully at 10:47, but because the internal process continued to run until 11:13, the job execution did not complete. We are adding logs to get more details about the race condition and root-cause the issue. We will update this thread once the fix has been deployed.
Also, just wanted to check: is this behavior consistently reproducible, or do you only see it intermittently?
Thank you for the update! It is reproducible almost all the time. When I first got this issue, I changed the "replica-retry-limit" to zero. It then worked nicely for two attempts. I made some updates to our processing (nothing related to the queue), and now it is failing continuously again. There are no exceptions whatsoever in the console or system logs. The last step of the code executes successfully, and the system log shows the container terminated.
@anandanthony Hi, Thanks for the update.
It is reproducible. Since it's doing a container-to-container copy, a specific source container will always trigger the timeout. I investigated further and Managed Identity doesn't seem to be the cause. I changed all my code to remove the --mi-system-assigned param and authenticate using SAS and a connection string instead, and the timeout still occurs.
System Log of a run without any managed identity:
Last few attempts worked fine. Will keep a watch and update.
SOLUTION: IHostApplicationLifetime.StopApplication(); I had migrated BackgroundService ACA to ACA Job and needed to specify exit explicitly upon completion.
Original issue: I am experiencing this same issue with my job, which finished running in 10 minutes (600 seconds) but ACA continued to list it as running. Replica-retry = 0, Replica-timeout = 1800, after which the job is listed as failed.
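For anyone else migrating a BackgroundService to an ACA Job, the fix described above can be sketched roughly as follows. This is a minimal sketch, assuming the .NET generic host; the `JobWorker` class and its `DoWorkAsync` method are hypothetical placeholders for illustration and are not code from this thread.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Build a generic host that runs a single hosted worker, then wait for shutdown.
using IHost host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services => services.AddHostedService<JobWorker>())
    .Build();

await host.RunAsync();

// Hypothetical worker: runs the job once, then asks the host to stop.
public sealed class JobWorker : BackgroundService
{
    private readonly IHostApplicationLifetime _lifetime;

    public JobWorker(IHostApplicationLifetime lifetime) => _lifetime = lifetime;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            await DoWorkAsync(stoppingToken);
        }
        finally
        {
            // Without this call the host keeps running after the work is done,
            // so the process (and therefore the container) never exits.
            _lifetime.StopApplication();
        }
    }

    // Placeholder for the real processing; assumed for illustration only.
    private static Task DoWorkAsync(CancellationToken ct) =>
        Task.Delay(TimeSpan.FromSeconds(1), ct);
}
```

This only applies to apps built on the generic host. A plain console app already exits when its entry point returns, so the explicit stop is not needed there.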
@gcrockenberg I tried your solution but it does not solve the issue for me. I'm running a console application with the Cocona framework.
@rajan1962 could you please send an email to acasupport@microsoft.com with your job details and image?
@BeChaRem can you please send an email to acasupport@microsoft.com with your job details?
I have not seen the problem recur for the past week.
@igelineau I'm not familiar with this framework. Can you please share a link?
@anthonychu https://github.com/mayuki/Cocona However, I just checked and I'm not using the .NET generic host integration, so it makes sense that the previous workaround did not work. It probably just behaves like a normal CLI app anyway.
Just tried it again today, same issue. The container will exit, but the job will keep running and reach the timeout, then fail. Here are the last system logs:
I get no other insightful log in the Console logs.
@igelineau @rajan1962 @BeChaRem We have identified the issue. The root cause is a timeout in one of the downstream components. We have a hotfix rolling out and it should be available in all regions by the end of next week.
The hotfix should be rolled out. Please let us know if you're still seeing issues.
@anthonychu I still appear to be seeing this in some of my jobs that have started in the last hour. Happy to provide any info you need to be able to look into our specific jobs :)
@ttq-ak Can you please email the specifics of the job executions that are experiencing this to acasupport [at] microsoft.com? Thanks!
Issue description
A container app job that lasts more than ~4 minutes never completes, even after the container exits.
I have a container that performs an AzCopy copy operation between two Azure blob containers using a managed identity. Copies that last less than 4 minutes work fine, but when they go over the 4-minute range, the job never completes until the timeout.
Steps to reproduce
Expected behavior: The container job stops after the container has exited.
Actual behavior: The container job lasts until the timeout.
Screenshots
In this screenshot, you can see that jobs in the 1 to 3 minute range completed without issue. However, as the size of the container and the time to copy it increase, I hit a spot in the 3 to 4 minute range where the job never stops before hitting the --replica-timeout 3600 I have set.
System log of a run that correctly completed: Console log of that same run:
System log of a run that ran until the timeout: Console log of that same run:
As you can see, the container exited at 10:47, however the job ran until 11:13.
Additional context