microsoft / azure-pipelines-agent


[Question]: Using azure pipelines agent with rootless docker #4297

Open · ktdharan opened this issue 1 year ago

ktdharan commented 1 year ago

Describe your question

It looks like we are running into the same issue as reported in https://github.com/microsoft/azure-pipelines-agent/issues/3312

When using rootless Docker, we run into the following error when running the pipeline:

"##[error]Unhandled: EACCES: permission denied, open '/__w/_temp/.taskkey'"

It does seem to be a permissions issue with the workspace that is created by Azure DevOps.

Do we have any alternative suggestions? Installing the agent as a service with ./svc.sh install doesn't work; we get the following error:

"/usr/bin/docker version --format '{{.Server.APIVersion}}' Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? "

Our understanding is that the rootless Docker daemon listens on a different Unix socket, where 100 would be the UID of the user: unix:///run/user/100/docker.sock
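
For reference, Docker clients can be pointed at the rootless daemon's per-user socket via the standard DOCKER_HOST variable (a sketch only; the agent service would need this in its environment, e.g. via the agent's .env file):

# Rootless Docker listens on a per-user socket, not /var/run/docker.sock
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
/usr/bin/docker version --format '{{.Server.APIVersion}}'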

Versions

3.218.0/Linux

Environment type (Please select at least one environment where you face this issue)

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Operation system

Red Hat 8.7

Version control system

Azure DevOps

Azure DevOps Server Version (if applicable)

No response

ivanduplenskikh commented 1 year ago

Hi @ktdharan thanks for reporting! We are working on more prioritized issues at the moment, but will get back to this one soon.

ktdharan commented 1 year ago

@ivanduplenskikh -> Can you let us know whether rootless Docker is supported by the pipelines agent? If not, we would like to know whether support is on the future roadmap.

pixdrift commented 1 year ago

Thanks for posting this @ktdharan. I have seen the same error message on RHEL 8 with agent version 3.220.2, although using the rootless Podman that ships with RHEL (with the Docker compatibility packages).

@ktdharan can you confirm whether you have SELinux enabled and enforcing in your configuration?

pixdrift commented 1 year ago

An additional note: there seems to be a permission issue with the initial creation of the temp directory when the agent first executes, so it is worth validating the permissions on the directories it creates.

The /__w/_temp directory this error refers to is mounted into the container from <agent_install_directory>/_work/_temp on the host when the container is launched. Validate the permissions on the host directory <agent_install_directory>/_work/_temp to confirm that the container process has read access.
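
For example, something like the following on the host shows the ownership and mode of the mounted directory (a sketch; substitute your actual agent install path):

# Check who owns the directory that gets bind-mounted to /__w/_temp
ls -ld <agent_install_directory>/_work/_temp
stat -c '%U:%G %a' <agent_install_directory>/_work/_temp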

-edit-

Interestingly, there is a specific mount of a .taskkey file in the docker create command in 'Initialize Containers', which mounts <agent_install_directory>/_work/.taskkey to /__w/.taskkey, but the error refers to the .taskkey file at /__w/_temp/.taskkey. Is this a bug in the agent? Has the .taskkey file moved into _temp at some point without the container create command being updated?

phaneendrakotha commented 1 year ago

Yes, we have similar issues, and our containers run with rootless Podman. Interestingly, it happens intermittently and changes the permissions of the _temp directory under the agent to "S" (shared). Either we have to fix the permissions manually or recycle the agent to overcome the issue. We have hundreds of agents running, and doing this manually is not an easy task for us. Even though we automated this workaround, it impacts our user experience because builds get cancelled during restarts. Please advise on any possible solution.

ktdharan commented 1 year ago

@phaneendrakotha -> When you say "we have to give permissions manually", you mean on the path, right? What permissions did you set to solve the problem?

pixdrift commented 1 year ago

I have been working through multiple issues when using Podman, and had some success solving this specific issue by changing permissions on the following directory:

/<agent_install_location>/_work/_temp

It appears this needs to be accessible to:

  1. The user the ADO agent is running as for writing the taskkey
  2. The user on the host that the root user in the container is mapped to in /etc/subuid

Note: In our experience, a consistent working configuration is to make these the same user (i.e. the user that root in the container maps to on the host is the same user configured to run the ADO agent).

If the user and group are correct, 770 should resolve the issue on the _temp directory. You will need to do this for every instance of the temp directory if you are running multiple container agents for parallel execution. I don't have a full answer for what is creating/setting/changing the permissions here, so this 'solution' is more of a workaround at this point.
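
A sketch of the workaround described above, assuming the agent is installed under /home/adoagent/agent and runs as the (hypothetical) user adoagent:

# Give the agent user (and the group the container user maps to) access to _temp
sudo chown adoagent:adoagent /home/adoagent/agent/_work/_temp
sudo chmod 770 /home/adoagent/agent/_work/_temp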

This solution may also be dependent on the user mapping method you are using in containers.conf; I have had different issues using nomap, keep-id and auto.
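
For reference, a mapping mode can be tried per invocation before committing it to containers.conf (the --userns flag and the userns key are standard Podman options; which mode behaves depends on your environment):

# Try a mapping mode for a single container
podman run --rm --userns=keep-id ubi8 id
# The same default can be set in ~/.config/containers/containers.conf:
#   [containers]
#   userns = "keep-id"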

pixdrift commented 1 year ago

Working through all the permutations of issues running rootless Docker or Podman, I think the majority of them could be solved if a configuration knob were added to disable the user creation scripts during the container initialisation step, leaving user management as an exercise for the developer.

https://github.com/microsoft/azure-pipelines-agent/blob/d7e0704fb4cb4bc6361710df8e6963847774b27c/src/Agent.Worker/ContainerOperationProvider.cs#L546

These steps don't really provide much benefit in rootless container configurations. Perhaps the assumptions made here about running as 'root' (especially in rootless environments that map to unprivileged users) need to be revisited?

agerchev commented 10 months ago

Hello,

We use the on-premises version of Azure DevOps Server 2022, with agent version 2.217.2 running on Red Hat 8 with rootless Podman.

I changed the permissions on the _temp directory. Now our problem is that the dotnet pack command gives us: Access to the path '/__w/38/s/src/...../obj' is denied

The mounted folder __w is owned by root in the container. I assume this is because of the default value of --userns="", as pointed out in https://www.redhat.com/sysadmin/rootless-podman-user-namespace-modes, so the user inside the container (..._azpcontainer) does not have permission to write there.

I cannot use the keep-id option, because when the agent initializes the container it does not have permission to complete the tasks mentioned in #4332.

If I execute the following command as user bldagent (uid=1000):

podman run -i -u=0 -v "/home/bldagent/_work/38":"/__w/38" registry/build-image ls -lan /__w/38/

the result is this:

drwxr-xr-x 9     0     0 4096 Oct 25 13:18 .
drwxr-xr-x 3     0     0 4096 Oct 25 16:36 ..
drwxr-xr-x 2     0     0 4096 Oct 25 07:08 TestResults
drwxr-xr-x 2     0     0 4096 Oct 25 07:08 a
drwxr-xr-x 2     0     0 4096 Oct 25 07:08 b
drwxr-xr-x 5     0     0 4096 Oct 25 07:08 s
drwxrwxr-x 2     0     0 4096 Oct 25 12:40 test
drwxrwxrwx 3     0     0 4096 Oct 25 12:45 test2

If I execute the same command as root, the result is:

drwxr-xr-x 9 1000 1000 4096 Oct 25 13:18 .
drwxr-xr-x 3    0    0 4096 Oct 25 16:35 ..
drwxr-xr-x 2 1000 1000 4096 Oct 25 07:08 TestResults
drwxr-xr-x 2 1000 1000 4096 Oct 25 07:08 a
drwxr-xr-x 2 1000 1000 4096 Oct 25 07:08 b
drwxr-xr-x 5 1000 1000 4096 Oct 25 07:08 s
drwxrwxr-x 2 1000 1000 4096 Oct 25 12:40 test
drwxrwxrwx 3 1000 1000 4096 Oct 25 12:45 test2

which is the UID of bldagent on the host.

Are we missing something in the configuration? @pixdrift, did you have any success with this, or do we have to run the agent as root?

pixdrift commented 10 months ago

I cannot use the keep-id option, because when the agent initializes the container it does not have permission to complete the tasks mentioned in #4332.

Hi @agerchev, I do have it working, but I am the first to admit it's pretty hacky due to all the different scripts/assumptions made by the ADO agent.

To get this to work I had to use keep-id in the containers.conf file, and you are correct that this breaks the ADO container initialisation scripts. The only way I have found to work around it is the following options in the container resource block of the pipeline YAML file:

resources:
  containers:
  - container: my-container
    image: your.image.host/image-name:tag
    options: --user 0:0

In the container image itself, I specify the user I actually want the container to run as (with a UID matching the ADO agent installation user).
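
A minimal sketch of baking such a user into the image (the UID 1000, the user name builder, and the image tag are hypothetical; match the UID to your agent user):

cat > Containerfile <<'EOF'
FROM registry.access.redhat.com/ubi8/ubi
# Create a user whose UID matches the ADO agent user on the host
RUN useradd -u 1000 -m builder
USER builder
EOF
podman build -t my-build-image -f Containerfile .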

Another approach to work around the failing ADO initialisation script was to replace the commands ADO uses to configure the initial user, etc. with dummy commands that return 0, so they don't fail when executed as a non-root user, but I never worked all the way through this approach to validate it.
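
A sketch of that stub idea (hedged: exactly which commands the initialisation step invokes depends on the agent version, so verify against your agent's logs before relying on this):

# In the image, shadow the user-management commands with no-ops so init "succeeds"
for cmd in useradd groupadd usermod; do
    printf '#!/bin/sh\nexit 0\n' > "/usr/local/bin/$cmd"
    chmod +x "/usr/local/bin/$cmd"
done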

I hope that helps, and apologies if it's not clear; it has taken a lot of iterating through solutions to get a working combination of configuration, both on the host and in the container.

agerchev commented 10 months ago

@pixdrift, thank you, it worked. Are there any plans for a cleaner solution?

pixdrift commented 10 months ago

Sorry, no idea @agerchev, I am just an admin working around the problem. I am trying to avoid having to fork and maintain a custom copy/build of the agent that removes all the container initialisation.

There is some additional discussion on this over in this issue https://github.com/microsoft/azure-pipelines-agent/issues/4332

@joshchngs has an alternative workaround for the same issues in his first post; that solution doesn't require the --user configuration, but it needs minor modification to the container image (which you are probably already doing to make it work).

Ultimately, I hope that the agent can be updated to make these scripts optional in the agent/container initialisation through configuration or a 'configuration knob' as it's called in the code.

moqmar commented 10 months ago

I have another workaround that works really well and doesn't require any modifications to the pipelines, and I think I now understand the underlying issue:

  1. The host user has UID 1001 in my case.
  2. It also has an entry in /etc/subuid, stating that sub-UIDs shall start at 165536.
  3. By default, the agent starts the container (more or less; it's a bit more complicated than that) with --user=$(id -u) (i.e. --user=1001). That normally ensures the container runs with the same permissions as the agent, so the agent can e.g. delete files created from the container afterwards (otherwise they'd be owned by root).
  4. A container started with --user=0 (the default) is basically "root" within the sub-UID range (i.e. has the host UID 165536), and more or less directly maps to UID 1001 on the host.
  5. As the container is now started with --user=1001, this doesn't happen; instead, we're now user 166537 (sub-UID range start + 1001). That user doesn't have (especially write) access to most files, and for example can't modify the working directory in a pipeline either. (A quick probe to check this mapping is sketched after this list.)
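
A quick probe to see how container UIDs map to host UIDs (a sketch; works on any rootless or remapped setup):

# Show the sub-UID ranges configured for the current user
grep "^$(id -un):" /etc/subuid
# Print the UID mapping table of the container's user namespace
docker run --rm alpine cat /proc/self/uid_map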

This means that the issue can be solved by setting an ACL for the host UID the container user actually maps to (the UID the agent "thinks" is correct):

MISTAKEN_UID=$(($(grep -F "$(id -un):" /etc/subuid | cut -d: -f2) + $(id -u)))
setfacl -m default:user:$MISTAKEN_UID:rwx \
        -m default:group:$MISTAKEN_UID:rwx \
        -m user:$MISTAKEN_UID:rwx \
        -m group:$MISTAKEN_UID:rwx \
        --recursive work

With the "default" ACLs, this now also means that new files will have the ACLs set, which means that this should be a permanent solution. I am not using SELinux, so I have no idea if something additional is required there, but for me this works really well now. Of course now, everything is executable inside the container, but I don't know ACLs well enough to solve that.
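
If the blanket execute bit is a concern, a possible refinement (untested in this thread) is setfacl's capital X, which grants execute only on directories and on files that are already executable:

MISTAKEN_UID=$(($(grep -F "$(id -un):" /etc/subuid | cut -d: -f2) + $(id -u)))
setfacl -m default:user:$MISTAKEN_UID:rwX \
        -m default:group:$MISTAKEN_UID:rwX \
        -m user:$MISTAKEN_UID:rwX \
        -m group:$MISTAKEN_UID:rwX \
        --recursive work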

pixdrift commented 9 months ago

I am not clear on how this solves the specific problems with the startup/initialisation script and the user creation / sudo steps, @moqmar, but I'm keen to understand it better. I had the same or similar issues when implementing various iterations of subuid configuration, so I'm interested in this specific combination.

Are you using 'keep-id' in the containers.conf file?

What host OS (distro) are you using? And are you using Docker or Podman?

moqmar commented 9 months ago

I'm using Docker on Debian 12 and didn't touch a containers.conf file. The startup script then has access to all the files it wants to access, as the files are accessible by the user within the container (which has a sub-UID). It is possible that --user=0 is still required on some systems, such as when using Podman, but that wasn't the case for me; the default user should have the correct UID.

SerDroide commented 9 months ago

I'm facing a similar issue using rootless Podman with SELinux enabled in enforcing mode. /etc/subuid and /etc/subgid are correctly configured. We are using Red Hat 8.8 for the host and ubi8 as the base image for the container.

In my opinion, a valid solution would be to allow mounting the volumes with the :z/:Z SELinux options:

without:

$ podman run --rm -it -v "/home/azure_agent/myagent/_work:/__w" my-container-job-image:latest ls /__w

ls: cannot open directory '/__w': Permission denied

with :Z:

$ podman run --rm -it -v "/home/azure_agent/myagent/_work:/__w:Z" my-container-job-image:latest ls -l /__w

drwxr-xr-x.  6 root root   52 Nov  7 05:09 10
drwxr-xr-x.  6 root root   52 Nov 17 10:42 11
drwxr-xr-x.  6 root root   52 Oct 31 14:25 6
drwxr-xr-x.  6 root root   52 Oct 31 13:38 7
drwxr-xr-x.  8 root root  102 Oct 27 13:03 8
drwxr-xr-x.  6 root root   52 Oct 28 08:33 9
drwxr-xr-x.  4 root root   81 May 21 05:01 SourceRootMapping
drwxr-xr-x. 13 root root 4096 Nov 16 17:30 _tasks
drwxr-xr-x.  2 root root    6 Nov 17 10:43 _temp
drwxr-xr-x.  2 root root    6 Apr 13  2023 _tool
drwxr-xr-x.  3 root root   20 Oct 26 08:57 context
drwxr-xr-x.  3 root root   26 Apr 13  2023 node_modules

showing that the directory is correctly mapped to the root user.

Maybe adding a new option here to support SELinux would be enough.

These are similar issues that I found: