Running container apps jobs for Azure DevOps Self hosted agents, jobs end before completing

funkel1989 commented 2 months ago

This issue is a: (mark with an x)

[x] bug report -> please search issues before submitting
[ ] documentation issue or request
[ ] regression (a behavior that used to work and stopped in a new release)

Issue description

On average, 1 out of every 15 jobs stops, and Azure DevOps loses its connection to the agent before the job has completed.

The system logs report "Job was active longer than specified deadline", but these jobs run for only 2-3 minutes, while others with an identical configuration run for more than 10 minutes without a problem. Re-running the job solves the problem, even if the re-run takes longer than the failed run did.

In my run.sh I am using the --once flag, so the jobs close after every task and a new job instance is created (if my understanding of how this works is correct). I am seeing in the logs, though, that job execution is greater than 30 minutes, which would make sense if it's closing. I'm having trouble understanding how to fix this.

Here is a snippet from my run.sh

# Configure the Azure Pipelines agent
echo "Configuring Azure Pipelines agent..."
./config.sh --unattended \
    --agent "${AZP_AGENT_NAME:-$(hostname)}" \
    --url "$AZP_URL" \
    --auth PAT \
    --token "$(cat "$AZP_TOKEN_FILE")" \
    --pool "${AZP_POOL:-Default}" \
    --work "${AZP_WORK:-_work}" \
    --replace \
    --acceptTeeEula

# Run the Azure Pipelines agent
echo "Running Azure Pipelines agent..."
./run.sh "$@" --once &

wait $!
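
One way to narrow this down is to have the entrypoint log how long the agent ran and how it exited, so an agent that stopped on its own can be told apart from a container that was stopped from outside. The following is a minimal sketch, not a confirmed fix: `run_logged` is a hypothetical helper name, and it assumes the platform signals the container with SIGTERM before stopping it, which this thread does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the agent): run a command in the background,
# forward SIGTERM/SIGINT to it, and log its run time and exit status.
run_logged() {
    local start end status child
    start=$(date +%s)

    "$@" &
    child=$!

    # Assumption: the platform sends SIGTERM before stopping the container.
    trap "echo 'Termination signal received; forwarding to PID $child'; kill -TERM $child 2>/dev/null" TERM INT

    status=0
    wait "$child" || status=$?
    # A trapped signal interrupts wait early; wait again until the child is gone.
    while kill -0 "$child" 2>/dev/null; do
        status=0
        wait "$child" || status=$?
    done

    trap - TERM INT
    end=$(date +%s)
    echo "Command exited with status $status after $((end - start))s"
    return "$status"
}

# In the entrypoint this would replace the "./run.sh ... &" and "wait $!" pair:
#   run_logged ./run.sh "$@" --once
```

With this in place, the container's console log would show whether the agent process exited by itself (and with which status) or whether a termination signal arrived first.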

Steps to reproduce

Jobs Execute
Win, or in this case lose

Expected behavior

All started long-running jobs should complete.

Actual behavior

Jobs are randomly ending early.

vinisoto commented 1 month ago

@funkel1989 - thanks for reaching out. There is an issue in which, when some container jobs finish execution, you see the message "Container container-name was terminated with exit code 0", but a "Successful Delete" message doesn't appear until after the timeout period has been exceeded.

What happens here is that the container job finished successfully (hence you see that Azure DevOps stops hearing from the agent), but the container app job object is not removed. This doesn't have a functional impact, but it results in misleading logs.
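
To check whether a given execution matches the pattern described above, its system log can be searched for the quoted messages. This is a rough sketch, assuming the log has been saved to a local text file and that the message texts match those quoted in this thread. `classify_execution_log` and its output labels are hypothetical, and the actual log format may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a saved system log using the messages quoted in this thread.
classify_execution_log() {
    local log_file=$1
    local exited_ok=no deadline=no

    # Message reported when the container finished on its own.
    if grep -q 'was terminated with exit code 0' "$log_file"; then
        exited_ok=yes
    fi
    # Message reported when the execution outlived its deadline.
    if grep -q 'Job was active longer than specified deadline' "$log_file"; then
        deadline=yes
    fi

    if [ "$deadline" = yes ] && [ "$exited_ok" = yes ]; then
        # Container exited cleanly, deadline message logged later: the misleading-log case.
        echo "clean-exit-late-cleanup"
    elif [ "$deadline" = yes ]; then
        # Deadline message without a clean exit: not the pattern described above.
        echo "deadline-without-clean-exit"
    else
        echo "no-deadline-message"
    fi
}

# Usage (the file name is a placeholder):
#   classify_execution_log ./execution-system.log
```

A result of "deadline-without-clean-exit" would suggest the execution does not fit the cleanup-delay explanation and is worth reporting separately.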

We are testing a fix and will begin rolling it out in the next few weeks.

Does this address your question? Feel free to comment otherwise.

funkel1989 commented 1 month ago

@vinisoto I don't believe this is the same issue that we are seeing. Azure is reporting messages that the container is no longer responding, and then the Azure DevOps pipeline task just times out.

CezaryKlus commented 1 month ago

Experiencing the same issue

vinisoto commented 1 month ago

@funkel1989, @CezaryKlus - can you please send an email to acasupport at microsoft dot com?

Please include your subscription ID, environment name, container app job name, and a timestamp of when you saw this behavior. To speed up the process, please also include an execution/replica name if possible.

vinisoto commented 1 week ago

@funkel1989 @CezaryKlus - we deployed fixes related to this issue. Please feel free to open a new issue here and a support request if you continue to see similar issues.

microsoft / azure-container-apps