microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps

Container was terminated with exit code '137' #534

Open shavgath opened 1 year ago

shavgath commented 1 year ago

We're using Container Apps as Azure DevOps agents and have recently started seeing agents drop or hang midway through jobs. When looking at the logs I can see the error below:

"Container was terminated with exit code '137'"

I'm not sure where this is coming from, since it was working perfectly fine and no changes were made. This happens so frequently that I can't run any pipelines and am having to look at other solutions.

ahmelsayed commented 1 year ago

137 is an out-of-memory error. It happens when the container's memory usage exceeds the limit set on it. The default is 1GB; you can try increasing it.
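For example, a minimal sketch with the Azure CLI (app and resource group names are placeholders, and the memory value has to be paired with a supported CPU value):

```bash
# Raise the memory limit for the container app running the agent.
# Resource names below are placeholders.
az containerapp update \
  --name my-devops-agent \
  --resource-group my-rg \
  --cpu 1.0 \
  --memory 2.0Gi
```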

shavgath commented 1 year ago

Yes and we can also see the following in the logs: Container exceeded its local ephemeral storage limit "1Gi"

However, it looks like you cannot override the ephemeral storage value in the schema, as it fails to apply. What other options are there to increase disk space? I'm not able to find this in the Microsoft documentation for Container Apps.

SophCarp commented 1 year ago

Hi @shavgath could you describe the steps to reproduce this issue and the results you expected? Thanks!

shavgath commented 1 year ago

Hey @SophCarp:

It's odd: we've been using ACAs as Azure DevOps agents for several months now and they've been working perfectly fine, and I've never seen those errors/warnings in the logs before. Has anything changed in the backend?

howang-ms commented 1 year ago

@shavgath, the ephemeral storage limit in Container Apps is 1Gi, and it is not customizable. If your job requires more storage, the suggestion is to mount an Azure Files share into your container: https://learn.microsoft.com/en-us/azure/container-apps/storage-mounts-azure-files?tabs=bash
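A rough sketch of the steps with the Azure CLI (storage account, share, app, and environment names are placeholders):

```bash
# 1. Register the Azure Files share with the Container Apps environment.
az containerapp env storage set \
  --name my-aca-env \
  --resource-group my-rg \
  --storage-name agent-work \
  --azure-file-account-name mystorageaccount \
  --azure-file-account-key "$STORAGE_KEY" \
  --azure-file-share-name agent-work \
  --access-mode ReadWrite

# 2. Export the app definition, add an AzureFile volume plus a volumeMount
#    (e.g. at /opt/devops-agent/_work) under the template, then re-apply it.
az containerapp show -n my-devops-agent -g my-rg -o yaml > app.yaml
# ... edit app.yaml: template.volumes and containers[].volumeMounts ...
az containerapp update -n my-devops-agent -g my-rg --yaml app.yaml
```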

vincentspaa commented 1 year ago

@howang-ms, per the recommendation from the docs, we also started mounting an Azure Storage file share inside our Azure build agent container. We decided to mount the file share at the _work directory inside the agent's installation directory.

Initially we ran into a permission issue where, during the very first build step in a pipeline, the agent would crash and claim that it didn't have permission to access the very same .js file it had just downloaded. Thanks to the detailed logs under _diag we were able to figure out that it was caused by a bug in the ZipFile implementation (fixed by this PR: https://github.com/dotnet/runtime/pull/56370). We then upgraded to version 3.x of the build agent to make use of that fix. After upgrading the build agent, we ran into the next issue:

error: chmod on /opt/devops-agent/_work/2/s/.git/config.lock failed: Operation not permitted
fatal: could not set 'core.filemode' to 'false'
##[error]Unable to use git.exe init repository under /opt/devops-agent/_work/2/s, 'git init' failed with exit code: 128

This happens during the "checkout" step of the Azure Pipeline. We've set the Azure Files mount point to /opt/devops-agent/_work, so the config.lock mentioned in the error is placed on the file share. There seems to be no way for us to prevent git init from attempting to change permissions on config.lock, which fails because of a limitation of the way Azure Files shares are mounted: https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/could-not-change-permissions-azure-files

We're fairly stuck at this point, since manually mounting the file share by running mount -t cifs -o ... with a different uid and gid only gives us: mount: /mnt/workdirectory: permission denied. We've run that command as root, and we've also tried putting it in /etc/fstab, which leads to mount -a throwing the same error.

Do you have any pointers for us?

To be clear, our Azure Build agent ran perfectly fine inside the Azure Container App, without the Azure Files mount. We only switched to mounting the fileshare because we ran into the same error as mentioned by the OP.

anthonychu commented 1 year ago

@vincentspaa Do you know if your issues are related to https://github.com/microsoft/azure-container-apps/issues/520?

vincentspaa commented 1 year ago

@anthonychu Being able to add "uid=1000", "gid=1000" to the mount options (as mentioned in that issue) would most likely fix the aforementioned problem. It's hard to estimate whether that would then let the build agent run off the mounted _work directory without further issues (i.e. due to other Azure Files limitations), but it would definitely help a lot.
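To make that concrete, this is roughly the kind of CIFS mount we'd want the platform to perform on our behalf; the storage account, share, and uid/gid values below are illustrative, and running this by hand inside the container is exactly what currently fails for us:

```bash
# Illustrative only: the mount we'd want, with uid/gid pointing at the
# build agent's user so the files on the share are owned by that user.
mount -t cifs \
  //mystorageaccount.file.core.windows.net/agent-work \
  /opt/devops-agent/_work \
  -o vers=3.0,username=mystorageaccount,password="$STORAGE_KEY",uid=1000,gid=1000,dir_mode=0777,file_mode=0777,serverino
```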

Josverl commented 1 year ago

@vincentspaa

> This happens during the "checkout" step of the Azure Pipeline. We've set the Azure Files mount point to /opt/devops-agent/_work, so the config.lock mentioned in the error is placed on the file share. There seems to be no way for us to prevent git init from attempting to change permissions on config.lock, which fails because of a limitation of the way Azure Files shares are mounted: ...
>
> Do you have any pointers for us?

Possibly you could work around this by having git store the .git folder on local storage while the rest of the repo is on the mounted storage.

With git init you can use --separate-git-dir=<git-dir>, or set $GIT_DIR, to have git store the .git folder elsewhere; see the git documentation and the sketch below.
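A rough sketch of the idea (paths are illustrative; whether the DevOps agent can be made to run these steps is a separate question):

```bash
# Keep the working tree on the Azure Files mount, but store the .git
# metadata on local ephemeral storage. Paths are illustrative.
git init --separate-git-dir=/tmp/repo-meta.git /opt/devops-agent/_work/2/s

# Alternatively, point git at the external metadata directory via the environment:
export GIT_DIR=/tmp/repo-meta.git
export GIT_WORK_TREE=/opt/devops-agent/_work/2/s
git status
```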

vincentspaa commented 1 year ago

@Josverl Thank you for the suggestion. That's not really something we have control over; we only get to control which branch a given DevOps pipeline triggers on. From there, DevOps makes sure to run the appropriate git commands. And even if we were to move the .git directory outside of the file share, we would:

anthonychu commented 1 year ago

We are increasing the amount of ephemeral storage. More details will be shared later this month when the changes have been applied.

SophCarp commented 1 year ago

You can keep track of the ephemeral storage increase in issue #599.

thisispaulsmith commented 11 months ago

@shavgath Did you resolve this?

Seeing the same issue randomly during pipeline runs

dsczltch commented 6 days ago

Hi @anthonychu, we also encounter this issue randomly in a production workload. A pod has 2.5 GB of allocated RAM, yet exit code 137 is produced while the pod is only using 400 MB: "Container 'prd-xxx-ca' was terminated with exit code '137'". We also believe it happened during Azure Container Apps maintenance, because all pods in our Azure Container Apps environment were restarted. There was no notification about this maintenance even though we subscribed to the Azure Service Health alert information.