microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

Long running Container Apps Job never completes #1071

Closed BeChaRem closed 6 months ago

BeChaRem commented 7 months ago


Issue description

A Container Apps job that lasts more than ~4 minutes never completes, even after the container exits.

I have a container that runs an AzCopy copy operation between two Azure Blob Storage containers using a managed identity. Copies that last less than 4 minutes work fine, but when they go over the 4-minute range, the job never completes until the timeout.

Steps to reproduce

  1. Create a Container Apps job with the Event trigger type
  2. Trigger the event
  3. The job doesn't stop after the container exits

Expected behavior: the container app job stops after the container exits.

Actual behavior: the container app job runs until the timeout.

Screenshots
In this screenshot, you can see that jobs in the 1 to 3 minute range completed without issue. However, as the size of the container to copy and the copy time increase, I hit a point in the 3 to 4 minute range where the job never stops before hitting the --replica-timeout 3600 I have set. [screenshot]

System log of a run that correctly completed: [screenshot] Console log of that same run: [screenshot]

System log of a run that ran until the timeout: [screenshot] Console log of that same run: [screenshot]

As you can see, the container exited at 10:47; however, the job ran until 11:13.

Additional context

az containerapp job create `
  --name my-container-app `
  --resource-group dev-RealityDataServices-eus-rg `
  --environment my-container-apps `
  --image testrealitycontainer.azurecr.io/realitydataservice-azcopyprocessor:latest `
  --cpu 2.0 `
  --memory 4.0 `
  --trigger-type "Event" `
  --scale-rule-name queue `
  --replica-retry-limit "1" `
  --replica-timeout 3600 `
  --mi-system-assigned `
  --replica-completion-count "1" `
  --registry-identity system `
  --registry-server testrealitycontainer.azurecr.io `
  --scale-rule-type azure-servicebus `
  --scale-rule-metadata "queueName=clone-realitydata" "namespace=reality-dev-eus-sb-01" `
  --scale-rule-auth "connection=connection-string-secret" `
  --secrets "connection-string-secret=a-secret" `
  --env-vars "AZCOPYPATH=/usr/local/bin/azcopy"

rajan1962 commented 7 months ago

I am facing the same issue. Was this addressed already?

anandanthony commented 7 months ago

@BeChaRem we investigated the behavior and can see that there is an internal process which intermittently goes into a race condition, because of which the completion of the job execution is delayed. For the execution that you mentioned, we checked that it was successful: the job container exited successfully at 10:47, but since the internal process continued to run until 11:13, the job execution did not complete. We are adding logs to get more details about the race condition and to root-cause the issue. We will update this thread once the fix has been deployed.

Also, just wanted to check: is this behavior consistently reproducible, or do you face it intermittently?

rajan1962 commented 7 months ago

Thank you for the update! It is reproducible almost all the time. When I first hit this issue, I changed "replica-retry-limit" to zero, and for two attempts it worked nicely. I then made some updates to our processing (nothing related to the queue), and again it fails continuously. There are no exceptions whatsoever in the console or system logs; the last step of the code executes successfully, and the system log shows the container terminated.

BeChaRem commented 7 months ago

@anandanthony Hi, Thanks for the update.

It is reproducible. Since it's doing a container-to-container copy, a specific source container will always trigger the timeout. I did further investigation, and managed identity doesn't seem to be the cause: I changed my code to remove the --mi-system-assigned parameter and authenticate using a SAS and connection string instead, and the timeout still occurs.

System log of a run without any managed identity: [screenshot]

rajan1962 commented 7 months ago

Last few attempts worked fine. Will keep a watch and update.

gcrockenberg commented 7 months ago

SOLUTION: IHostApplicationLifetime.StopApplication(); I had migrated a BackgroundService from an ACA container app to an ACA Job and needed to trigger the exit explicitly upon completion.
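
As an illustration of that workaround (a minimal sketch, not taken from the comment above; the CopyWorker name and DoWorkAsync body are placeholders), a hosted BackgroundService can request host shutdown once its one-shot work is done, so the container process exits and the job execution can complete:

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Hypothetical worker: runs its one-shot job, then asks the host to shut down
// so the container exits and the ACA Job execution can be marked complete.
public class CopyWorker : BackgroundService
{
    private readonly IHostApplicationLifetime _lifetime;

    public CopyWorker(IHostApplicationLifetime lifetime) => _lifetime = lifetime;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            await DoWorkAsync(stoppingToken); // placeholder for the actual job logic
        }
        finally
        {
            // Without this, the generic host keeps the process alive after the
            // work finishes, and the execution runs until --replica-timeout.
            _lifetime.StopApplication();
        }
    }

    private Task DoWorkAsync(CancellationToken token) => Task.CompletedTask; // stub
}

public static class Program
{
    public static async Task Main(string[] args) =>
        await Host.CreateDefaultBuilder(args)
            .ConfigureServices(services => services.AddHostedService<CopyWorker>())
            .Build()
            .RunAsync();
}

Once the host stops, the container's entrypoint process exits, which is what the job execution is waiting on.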

Original issue: I was experiencing this same issue with a job that finished running in 10 minutes (600 seconds) but which ACA continued to list as running. With replica-retry = 0 and replica-timeout = 1800, the job is listed as failed once the timeout is reached.

[screenshots]

igelineau commented 7 months ago

@gcrockenberg I tried your solution but it does not solve the issue for me. I'm running a console application with the Cocona framework.

lihaMSFT commented 7 months ago

@rajan1962 could you please send an email to acasupport@microsoft.com with your job details and container image?

lihaMSFT commented 7 months ago

@BeChaRem can you please send an email to acasupport@microsoft.com with your job details?

rajan1962 commented 7 months ago

I have not seen the problem recur for the past week.

anthonychu commented 7 months ago

> I tried your solution but it does not solve the issue for me. I'm running a console application with the Cocona framework.

@igelineau I'm not familiar with this framework. Can you please share a link?

igelineau commented 7 months ago

@anthonychu https://github.com/mayuki/Cocona However, I just checked and I'm not using the .NET generic host integration, so it makes sense that the previous workaround did not work. It probably just behaves like a normal CLI app anyway.
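
For contrast with the generic-host workaround above (again just an illustrative sketch, not taken from the thread; the work inside Main is a placeholder), a plain console app, which is effectively how a non-host CLI app behaves, already terminates when Main returns, so there is no extra shutdown call to make and the job hanging despite the exit points at the platform side:

using System;
using System.Threading.Tasks;

public static class Program
{
    public static async Task<int> Main(string[] args)
    {
        // Placeholder for the actual CLI work (e.g. a command handler).
        await Task.Delay(TimeSpan.FromSeconds(1));

        // When Main returns, the process (and therefore the container) exits
        // with this code; no generic host is left running that would need an
        // explicit StopApplication() call.
        return 0;
    }
}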

igelineau commented 7 months ago

Just tried it again today, same issue. The container will exit, but the job will keep running and reach the timeout, then fail. Here are the last system logs: [screenshot]

I get no other insightful log in the Console logs.

lihaMSFT commented 7 months ago

@igelineau @rajan1962 @BeChaRem We have identified the issue. The root cause is a timeout in one of the downstream components. A hotfix is rolling out and should be available in all regions by the end of next week.

anthonychu commented 7 months ago

The hotfix should be rolled out. Please let us know if you're still seeing issues.

ttq-ak commented 7 months ago

@anthonychu I still appear to be seeing this in some of my jobs that have started in the last hour. Happy to provide any info you need to be able to look into our specific jobs :)

[screenshot]

anthonychu commented 6 months ago

@ttq-ak Can you please email the specifics of the job executions that are experiencing this to acasupport [at] microsoft.com? Thanks!