microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

Running container apps jobs for Azure DevOps Self hosted agents, jobs end before completing #1292

Open funkel1989 opened 1 week ago

funkel1989 commented 1 week ago


Issue description

On average, 1 out of every 15 jobs stops, and Azure DevOps loses its connection to the agent before the job has completed.

The system logs report "Job was active longer than specified deadline", but these jobs run for only 2-3 minutes, while others with an identical configuration run for more than 10 minutes without a problem. Re-running the job solves the problem, even if the rerun takes longer than the failed attempt did.

In my run.sh I am using the --once flag so the jobs close after every task and a new job instance is created (if my understanding of how this works is correct?). However, I am seeing in the logs that job execution is greater than 30 minutes, which would make sense if it's closing. I'm having trouble understanding how to fix this.

Here is a snippet from my run.sh

# Configure the Azure Pipelines agent
echo "Configuring Azure Pipelines agent..."
./config.sh --unattended \
    --agent "${AZP_AGENT_NAME:-$(hostname)}" \
    --url "$AZP_URL" \
    --auth PAT \
    --token "$(cat "$AZP_TOKEN_FILE")" \
    --pool "${AZP_POOL:-Default}" \
    --work "${AZP_WORK:-_work}" \
    --replace \
    --acceptTeeEula

# Run the Azure Pipelines agent
echo "Running Azure Pipelines agent..."
./run.sh "$@" --once &

wait $!
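
A minimal sketch of how the last two lines of the snippet could be extended so that the container's own output records when the agent started and stopped and with what exit status. That output can then be compared against the platform's system logs. The helper name `run_and_report` and the log line format are hypothetical and not part of the original script.

```bash
# Hypothetical helper (not part of the original run.sh): run a command in the
# background, wait for it, and log its start time, end time, and exit status.
run_and_report() {
    local started ended status pid
    started=$(date -u +%Y-%m-%dT%H:%M:%SZ)

    "$@" &
    pid=$!

    # wait returns the exit status of the process it waited for;
    # "|| status=$?" keeps this working even if the caller uses "set -e".
    status=0
    wait "$pid" || status=$?

    ended=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    echo "agent started=${started} ended=${ended} exit_status=${status}"
    return "$status"
}

# Possible use in place of the last two lines of the snippet above:
#   run_and_report ./run.sh "$@" --once
#   exit $?
```

If the logged end time is close to the moment Azure DevOps loses the agent and the exit status is 0, the agent process itself exited; if no line is logged at all, the container was stopped from outside before the agent returned.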

Steps to reproduce

  1. Jobs Execute
  2. Win, or in this case lose

Expected behavior

All started long-running jobs should complete.

Actual behavior

Jobs are randomly ending early.


vinisoto commented 5 days ago

@funkel1989 - thanks for reaching out. There is an issue in which some container jobs finish execution and you see the message "Container container-name was terminated with exit code 0", but the "Successful Delete" message doesn't appear until after the timeout period has been exceeded.

What happens here is that the container job finishes successfully (hence Azure DevOps stops hearing from the agent), but the container app job object is not removed. This has no functional impact, but it results in misleading logs.

We are testing a fix and will begin rolling it out in the next few weeks.

Does this address your question? Feel free to comment if it doesn't.
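
One way to check whether a given execution matches the behaviour described above is to look for both messages in that execution's system logs. The sketch below assumes the log text for a single execution has already been exported and is piped in on stdin. The function name `classify_execution` and the three result labels are hypothetical; only the two message strings come from this thread.

```bash
# Hypothetical helper: classify one execution's system log text read from stdin.
# Assumes the input contains the log lines of a single job execution.
classify_execution() {
    local logs deadline clean_exit
    logs=$(cat)
    deadline='Job was active longer than specified deadline'
    clean_exit='was terminated with exit code 0'

    if printf '%s\n' "$logs" | grep -qF -- "$deadline"; then
        if printf '%s\n' "$logs" | grep -qF -- "$clean_exit"; then
            # Container exited cleanly and the deadline message also appears:
            # consistent with the delayed-cleanup issue described in this thread.
            echo "clean-exit-then-deadline"
        else
            # Deadline message with no clean exit recorded: the container may
            # really have been stopped at the timeout.
            echo "deadline-without-clean-exit"
        fi
    else
        echo "no-deadline-message"
    fi
}
```

An execution that reports `deadline-without-clean-exit` after only a few minutes of run time would not be explained by the cleanup issue alone and would be worth reporting back here with its logs.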