k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

KubeletInUserNamespace is not set in unprivileged LXD containers when k3s is run as root #4249

Closed - itoffshore closed this issue 1 year ago

itoffshore commented 3 years ago

Additional context / logs:

E1019 02:54:37.458733     215 container_manager_linux.go:456] "Updating kernel flag failed (Hint: enable KubeletInUserNamespace feature flag to ignore the error)" err="open /proc/sys/kernel/panic_on_oops: permission denied" flag="kernel/panic_on_oops"
E1019 02:54:37.458850     215 container_manager_linux.go:456] "Updating kernel flag failed (Hint: enable KubeletInUserNamespace feature flag to ignore the error)" err="open /proc/sys/vm/overcommit_memory: permission denied" flag="vm/overcommit_memory"
E1019 02:54:37.459114     215 kubelet.go:1423] "Failed to start ContainerManager" err="[open /proc/sys/kernel/panic_on_oops: permission denied, open /proc/sys/vm/overcommit_memory: permission denied]"
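
A quick way to confirm the root cause from inside the container (assuming the procps sysctl utility is available) - these host-owned sysctls cannot be written:

sysctl -w kernel.panic_on_oops=1     # fails with permission denied, matching the kubelet error above
sysctl -w vm.overcommit_memory=1     # likewise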

Trying to run k3s rootless inside unprivileged LXD on zfs is problematic (btrfs gives a similar error):

WARN[0000] The host root filesystem is mounted as "master:258". Setting child propagation to "" is not supported. 

(This causes sandbox creation to fail)

This error disappears when running rootful k3s inside unprivileged LXD, but the service then fails because the KubeletInUserNamespace feature gate is not enabled.

An easy way to check whether we are running inside a container is to read /proc/1/environ, which contains container=lxc inside LXD containers.
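
For example, a quick shell check (PID 1's environment is NUL-separated, & reading it may need root):

tr '\0' '\n' < /proc/1/environ | grep -qx 'container=lxc' && echo "running inside an LXD/LXC container"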

brandond commented 3 years ago

So you're running it as root but in an unprivileged container? How would we detect that in order to enable this feature gate automatically? You're welcome to enable the feature gate yourself if you're running K3s in an odd configuration like this that requires it. I'm honestly not convinced this is something we're doing wrong.

itoffshore commented 3 years ago

Checking /proc/1/environ shows container=lxc inside an LXD container.

How do I manually enable k8s feature gates?

brandond commented 3 years ago

--kubelet-arg=feature-gates=KubeletInUserNamespace=true - same for kube-controller-manager, kube-apiserver, etc.

https://rancher.com/docs/k3s/latest/en/installation/install-options/server-config/#customized-flags
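
If you'd rather not pass these on the command line, the same arguments can also go in the k3s config file (a sketch; assumes the default /etc/rancher/k3s/config.yaml path and a k3s version with config-file support):

cat > /etc/rancher/k3s/config.yaml <<'EOF'
kubelet-arg:
  - "feature-gates=KubeletInUserNamespace=true"
kube-controller-manager-arg:
  - "feature-gates=KubeletInUserNamespace=true"
kube-apiserver-arg:
  - "feature-gates=KubeletInUserNamespace=true"
EOF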

How would you tell that it's an unprivileged container?

itoffshore commented 3 years ago

It's only possible to check whether an LXD container is unprivileged from outside the container. From inside, you can only tell that you are in a container at all.

Running the service on zfs with:

k3s server --snapshotter=fuse-overlayfs --kubelet-arg=feature-gates=KubeletInUserNamespace=true --kube-controller-manager-arg=feature-gates=KubeletInUserNamespace=true --kube-apiserver-arg=feature-gates=KubeletInUserNamespace=true

brings up all of the expected listeners:

tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      1641/k3s server     
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      1641/k3s server     
tcp        0      0 127.0.0.1:6444          0.0.0.0:*               LISTEN      1641/k3s server     
tcp        0      0 127.0.0.1:10256         0.0.0.0:*               LISTEN      1641/k3s server     
tcp        0      0 127.0.0.1:10257         0.0.0.0:*               LISTEN      1641/k3s server     
tcp        0      0 127.0.0.1:10258         0.0.0.0:*               LISTEN      1641/k3s server     
tcp        0      0 127.0.0.1:10259         0.0.0.0:*               LISTEN      1641/k3s server     
tcp6       0      0 :::10250                :::*                    LISTEN      1641/k3s server     
tcp6       0      0 :::10251                :::*                    LISTEN      1641/k3s server     
tcp6       0      0 :::6443                 :::*                    LISTEN      1641/k3s server 

I think this is a problem with the default LXD Linux bridge. Using VXLAN networking with LXD + Open vSwitch is probably required to make unprivileged LXD work fully with k3s.

Many thanks for looking at this issue - now I just have to fix the networking (it looks like my forwarding rules on the host need to be less strict).

brandond commented 3 years ago

Just to be clear - do you think there's anything K3s can do better? It sounds like there's not any way for us to detect unprivileged operation, so users will need to be responsible for setting the feature-gates on their own.

itoffshore commented 3 years ago

I was thinking about this earlier - it would be useful for LXD to create a file somewhere inside containers to indicate unprivileged operation (so software running inside can configure itself accordingly).

I will suggest it as an LXD feature & see what they think.

itoffshore commented 3 years ago

Inside LXD /proc/self/uid_map & /proc/self/gid_map can be checked:

Privileged LXD (root maps to root):

# cat /proc/self/uid_map 
         0          0 4294967295
# cat /proc/self/gid_map 
         0          0 4294967295

Unprivileged LXD (container root maps to an unprivileged host UID range):

# cat /proc/self/gid_map 
         0    1000000 1000000000
# cat /proc/self/uid_map 
         0    1000000 1000000000

These values can also be read by the rootless user:

podman@u2110:~$ cat /proc/self/gid_map 
         0    1000000 1000000000
podman@u2110:~$ cat /proc/self/uid_map 
         0    1000000 1000000000

brandond commented 3 years ago

That just sounds like user and group remapping; is there anything unique that can be used to identify unprivileged operation?

itoffshore commented 3 years ago

If uid 0 maps to anything other than 0, you are in an unprivileged container.
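
A minimal shell sketch of that check (assuming a single mapping line starting with uid 0, as in the outputs above):

host_uid=$(awk '$1 == 0 {print $2; exit}' /proc/self/uid_map)
if [ "$host_uid" != "0" ]; then
    echo "uid 0 is remapped to host uid $host_uid: unprivileged container"
else
    echo "uid 0 maps to host uid 0: privileged container or no user namespace"
fi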

itoffshore commented 3 years ago

Running unprivileged LXD with lvm as the storage driver (which uses ext4 by default) makes the filesystem problems disappear in rootless mode:

[screenshot: k3s-lxd]

In rootless mode I see OOM warnings (one per minute):

container_manager_linux.go:675] "Failed to ensure state" containerName="/k3s" err="failed to apply oom score -999 to PID 30: write /proc/30/oom_score_adj: permission denied"

In rootful mode inside unprivileged LXD (with the user-namespace feature gates enabled) all the ports seem to come up:

[root@u2110 ~]# ns | grep k3s
tcp        0      0 127.0.0.1:6444          0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 127.0.0.1:10256         0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 127.0.0.1:10257         0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 127.0.0.1:10258         0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 127.0.0.1:10259         0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 0.0.0.0:31164           0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 127.0.0.1:10248         0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      9097/k3s server     
tcp        0      0 0.0.0.0:30250           0.0.0.0:*               LISTEN      9097/k3s server     
tcp6       0      0 :::10250                :::*                    LISTEN      9097/k3s server     
tcp6       0      0 :::10251                :::*                    LISTEN      9097/k3s server     
tcp6       0      0 :::6443                 :::*                    LISTEN      9097/k3s server

The flannel.1 & cni0 interfaces both come up in rootful mode - but do not go down when the k3s service is stopped (which makes the container take a long time to stop)

AkihiroSuda commented 3 years ago

I think we can unconditionally set KubeletInUserNamespace feature gate without detecting whether we are in LXD.

(When we are outside userns, the feature gate is safely ignored)

itoffshore commented 3 years ago

OK sounds good ;o)

k3s run as root inside an unprivileged Ubuntu 21.10 LXD container (with nesting enabled) seems to work ok on both zfs & lvm (ext4):

[screenshot: k3s-lxd2]

Using ufw as the host iptables firewall with a libvirt virbr0 bridge works

Using nftables on the host & inside LXD also works in rootful & rootless modes

Debian 11.1 unprivileged LXD containers also work with nftables:

[screenshot: k3s-lxd-debian]

itoffshore commented 3 years ago

The system-upgrade-controller also works in rootful & rootless modes with v0.8.0 - a change to the docs is proposed:

[screenshot: k3s-lxd-upgrade]

[screenshot: k3s-lxd-upgrade-unpriv]

stale[bot] commented 2 years ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

itoffshore commented 2 years ago

No problems in ubuntu-22.04 LXD containers either, with an ext4 zvol mounted for containerd.

itoffshore commented 2 years ago

As of LXD 5.6 (& until kernel 5.19) you also need to add the following to /etc/environment:

LXD_IDMAPPED_MOUNTS_DISABLE=1

for the overlayfs / stargz snapshotters to work on lvm storage volumes. At the moment, with the latest k3s, I use the following in the service script:

ExecStart=/usr/local/bin/k3s \
    server --snapshotter=stargz \
    --kubelet-arg=feature-gates=KubeletInUserNamespace=true \
    --kube-controller-manager-arg=feature-gates=KubeletInUserNamespace=true \
    --kube-apiserver-arg=feature-gates=KubeletInUserNamespace=true \
    --disable=servicelb --cluster-init
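
(After editing the unit, reload & restart - this assumes it is installed as k3s.service:)

systemctl daemon-reload
systemctl restart k3s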

[screenshot: Screenshot_2022-10-02_21-13-15]

stale[bot] commented 1 year ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

dalbani commented 1 year ago

Hi @itoffshore, would you mind sharing the full commands/settings that you used to set up the LXD container and then K3s inside it? I've tried to replicate your commands in an Ubuntu 22.04 container on LXD 5.0.2, but it doesn't seem to work when starting up K3s. For example:

...
"Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting \"proc\" to rootfs at \"/proc\": mount proc:/proc (via /proc/self/fd/6), flags: 0xe: permission denied: unknown" pod="kube-system/helm-install-traefik-crd-zsnr5"
...

Did you use a particular LXD container profile for example?

itoffshore commented 1 year ago

@dalbani - this profile should work using lvm as the LXD backing store (NB: I used nftables for my firewall, so if you use iptables the kernel_modules in your profile may need to be slightly different):

config:
  limits.cpu: "2"
  limits.memory: 2GB
  limits.memory.swap: "false"
  linux.kernel_modules: ip_vs,ip_vs_rr,ip_vs_wrr,ip_vs_sh,nf_tables,netlink_diag,nf_nat,overlay
  raw.lxc: |-
    lxc.apparmor.profile=unconfined
    lxc.mount.auto=proc:rw sys:rw cgroup:rw
    lxc.cgroup.devices.allow=a
    lxc.cap.drop=
  security.nesting: "true"
  security.privileged: "false"
description: K3s LXD profile
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    propagation: shared
    type: disk
name: k3s
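
To apply it, something along these lines should work (the profile file name, image & container name here are hypothetical):

lxc profile create k3s
lxc profile edit k3s < k3s-profile.yaml      # the YAML above
lxc launch ubuntu:22.04 u2204 --profile k3s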

I also made k3s work in unprivileged LXD on zfs by creating an ext4 zvol & mounting it inside the container under /var/lib/rancher (wherever the kubelet runs, it expects an ext4 filesystem) - possibly only the agent subdirectory of /var/lib/rancher needs it?
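
Roughly what that looked like (an untested sketch - pool, volume & container names are hypothetical):

# on the ZFS host: create a zvol, format it ext4 & attach it to the container
zfs create -V 20G rpool/k3s-rancher
mkfs.ext4 /dev/zvol/rpool/k3s-rancher
lxc config device add u2204 rancher disk source=/dev/zvol/rpool/k3s-rancher path=/var/lib/rancher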

You should probably start with lvm until you get it working - also note the service script settings above. Everything seemed to work - I even had the stargz snapshotter working.

I successfully ran k3s under LXD with lvm & zfs on Ubuntu 22.04, & with zfs on Arch Linux (although I expect both storage backends to work on either distro).

dalbani commented 1 year ago

Thanks @itoffshore, I've indeed managed to run K3s within an unprivileged container, using storage from a ZFS pool being "delegated" (zoned=on).

I'm curious, though, what changes were applied to the K3s codebase to allow marking this issue as completed, which happened a couple of weeks ago?

And how does that relate to the so-called "rootless mode" (e.g. commit https://github.com/k3s-io/k3s/commit/6e8284e3d4d3595824ffb5c6fa305a1dd9aa9274)?

itoffshore commented 1 year ago

@dalbani - this is probably why the issue was closed. Thanks for the new rootless note.