rootless-containers / usernetes

Kubernetes without the root privileges
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2033-kubelet-in-userns-aka-rootless
Apache License 2.0
853 stars · 58 forks

Support Rocky Linux 9 and AlmaLinux 9 hosts #301

Closed AkihiroSuda closed 9 months ago

vsoch commented 9 months ago

Do you want me to ping some rocky devs? I think they might be able to provide insight.

AkihiroSuda commented 9 months ago

Do you want me to ping some rocky devs? I think they might be able to provide insight.

Probably not yet, until VXLAN works for me on Rocky

vsoch commented 9 months ago

Probably not yet, until VXLAN works for me on Rocky

Okay I won't! But if they could be of help here (getting it working) let me know and I can.

AkihiroSuda commented 9 months ago

WIP: this seems to somehow make VXLAN functional

(sysctl values are from https://qiita.com/tom7/items/1bc7f4e568b20c306845)

# Execute inside `nsenter -t $(pgrep dockerd) -n -U` before running `make up`

# VRF
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1
sysctl -w net.ipv4.conf.default.rp_filter=0
sysctl -w net.ipv4.conf.all.rp_filter=0

# Inspired by Cumulus
sysctl -w net.ipv4.conf.default.arp_accept=0
sysctl -w net.ipv4.conf.default.arp_announce=2
sysctl -w net.ipv4.conf.default.arp_filter=0
sysctl -w net.ipv4.conf.default.arp_ignore=1
sysctl -w net.ipv4.conf.default.arp_notify=1

vsoch commented 9 months ago

Woot! So just to clarify - if I run this on the host nodes (not in containers) right before make up, this should work?

I can try this tonight (after you confirm the above!). It would be so great to get this working on Rocky because our networking is good there, but we haven't figured out Ubuntu yet.

AkihiroSuda commented 9 months ago

It turns out that net.ipv4.conf.default.rp_filter is set to 1 (strict) on Rocky 9.

This has to be 0 (disabled) or 2 (loose) in the rootless dockerd's network namespace. (Setting this value for the node container isn't enough).

This value may still remain 1 on the host.
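The rule above (0 or 2 in the rootless dockerd's namespace, while the host may stay at 1) can be sketched as a tiny preflight-style check. This is a hypothetical helper, not the actual `check-preflight.sh` code:

```shell
# Hypothetical sketch of the rp_filter validation described above:
# the value must be 0 (disabled) or 2 (loose); 1 (strict) breaks VXLAN here.
rp_filter_ok() {
  case "$1" in
    0|2) return 0 ;;   # acceptable
    *)   return 1 ;;   # strict (1) or unexpected value
  esac
}

# Read the value from the rootless dockerd's network namespace, not the host:
#   nsenter -t $(pgrep dockerd) -n -U -- sysctl -n net.ipv4.conf.default.rp_filter
```

The key point is where the value is read: `nsenter` into the dockerd process's namespaces, because the host-level value is irrelevant to the node containers.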

AkihiroSuda commented 9 months ago

Now this is ready for testing.

vsoch commented 9 months ago

Excellent! So should I test this branch as it is now, no changes to my rocky base images, or do we need further changes?

AkihiroSuda commented 9 months ago

Excellent! So should I test this branch as it is now, no changes to my rocky base images, or do we need further changes?

No further change is expected to be needed

vsoch commented 9 months ago

Awesome! My rocky image is building now and I should be able to bring up a testing cluster after dinner. Will send you an update when I do! 🎉

AkihiroSuda commented 9 months ago

Confirmed that this works on AlmaLinux 9.2 too, of course

vsoch commented 8 months ago

hey @AkihiroSuda! Congrats on your award today, you and your contributions are amazing and we so appreciate you!

I was running into some issues (related to this one, but on ubuntu) and wanted to post what I learned for some future person. First, I was still getting a dbus error with the make up command:

cat: /sys/fs/cgroup/user.slice/user-501043911.slice/user@501043911.service/cgroup.controllers: No such file or directory
Failed to connect to bus: No such file or directory
[INFO] systemd not detected, dockerd-rootless.sh needs to be started manually:

And the fix was to rebuild my base image with apt-get upgrade added to update the kernel (that worked!). Then I was getting an error about net.ipv4.conf.default.rp_filter, specifically that it was still 1, even though the rootless init script did create this file to set it to 2:

$ cat /etc/sysctl.d/99-usernetes.conf 
net.ipv4.conf.default.rp_filter = 2

I had already run ./init-host/init-host.rootless.sh, and here was the full error:

[INFO] Detected container engine type: docker
[WARNING] systemd lingering is not enabled. Run `sudo loginctl enable-linger $(whoami)` to enable it, otherwise Kubernetes will exit on logging out.
[WARNING] Kernel module "ip6_tables" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "ip6table_nat" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "iptable_nat" does not seem loaded? (negligible if built-in to the kernel)
[ERROR] sysctl value "net.ipv4.conf.default.rp_filter" must be 0 (disabled) or 2 (loose) in the container engine's network namespace
make: *** [Makefile:60: check-preflight] Error 1

(sidenote) no matter how many times I run this, I always see this warning and I haven't figured out why that's the case yet:

[WARNING] systemd lingering is not enabled. Run `sudo loginctl enable-linger $(whoami)` to enable it, otherwise Kubernetes will exit on logging out.
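As an aside on that warning: the lingering state can be inspected directly via `loginctl show-user`, which prints a `Linger=yes` or `Linger=no` property line. A hypothetical sketch of the check behind the warning:

```shell
# Hypothetical sketch: decide whether to print the lingering warning,
# based on the property line that `loginctl show-user ... --property=Linger` prints.
linger_enabled() {
  [ "$1" = "Linger=yes" ]
}

# Usage on a real host:
#   linger_enabled "$(loginctl show-user "$(whoami)" --property=Linger)" \
#     || echo "run: sudo loginctl enable-linger $(whoami)"
```

If `sudo loginctl enable-linger $(whoami)` has been run but the warning persists, comparing this property before and after is a quick way to see whether the setting actually took effect.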

But I determined that it was still set to 1 on my host:

$ grep [01] /proc/sys/net/ipv4/conf/*/rp_filter|egrep "default|all"
/proc/sys/net/ipv4/conf/all/rp_filter:1
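(For rp_filter specifically, the kernel documents that the maximum of `conf/all/rp_filter` and the per-interface value is used, so a 1 under `all` keeps strict mode in effect even when `default` is already 2. A hypothetical sysctl.d fragment covering both keys — file name and path are illustrative:)

```shell
# Hypothetical fragment: the kernel uses max(all, per-interface) for rp_filter,
# so `all` must also be relaxed, not just `default`.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-rp-filter-loose.conf
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
EOF
sudo sysctl --system
```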

So I did:

$ sudo vim /etc/sysctl.conf
vsochat_gmail_com@usernetes-compute-001:/opt/usernetes$ sudo sysctl -p
net.ipv4.conf.default.rp_filter = 2

(changing it to 2) and restarted docker:

systemctl --user restart docker.service

And then the make up worked! But I wonder why that wasn't fixed to start? Now I have a control plane!

NAMESPACE      NAME                                                READY   STATUS    RESTARTS   AGE
kube-flannel   kube-flannel-ds-7wstg                               1/1     Running   0          23m
kube-system    coredns-5dd5756b68-ccwtd                            1/1     Running   0          23m
kube-system    coredns-5dd5756b68-m7c7v                            1/1     Running   0          23m
kube-system    etcd-u7s-usernetes-compute-001                      1/1     Running   0          23m
kube-system    kube-apiserver-u7s-usernetes-compute-001            1/1     Running   0          23m
kube-system    kube-controller-manager-u7s-usernetes-compute-001   1/1     Running   0          23m
kube-system    kube-proxy-gzxg8                                    1/1     Running   0          23m
kube-system    kube-scheduler-u7s-usernetes-compute-001            1/1     Running   0          23m

For the worker node, my power went out and I didn't get to test it fully, but when I ran the script to bring up the worker it seemed to hang:

./Makefile.d/check-preflight.sh
[INFO] Detected container engine type: docker
[WARNING] systemd lingering is not enabled. Run `sudo loginctl enable-linger $(whoami)` to enable it, otherwise Kubernetes will exit on logging out.
[WARNING] Kernel module "ip6_tables" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "ip6table_nat" does not seem loaded? (negligible if built-in to the kernel)
[WARNING] Kernel module "iptable_nat" does not seem loaded? (negligible if built-in to the kernel)
docker compose up --build -d
[+] Building 0.2s (9/9) FINISHED                                 docker:default
 => [node internal] load build definition from Dockerfile                  0.0s
 => => transferring dockerfile: 809B                                       0.0s
 => [node internal] load .dockerignore                                     0.0s
 => => transferring context: 75B                                           0.0s
 => [node internal] load metadata for docker.io/kindest/node:v1.28.0       0.2s
 => [node 1/4] FROM docker.io/kindest/node:v1.28.0@sha256:b7a4cad12c197af  0.0s
 => [node internal] load build context                                     0.0s
 => => transferring context: 84B                                           0.0s
 => CACHED [node 2/4] RUN arch="$(uname -m | sed -e s/x86_64/amd64/ -e s/  0.0s
 => CACHED [node 3/4] RUN apt-get update && apt-get install -y --no-insta  0.0s
 => CACHED [node 4/4] ADD Dockerfile.d/u7s-entrypoint.sh /                 0.0s
 => [node] exporting to image                                              0.0s
 => => exporting layers                                                    0.0s
 => => writing image sha256:ef1a52ff46bc2c33546f1db882bb04667aecb3e532c5b  0.0s
 => => naming to docker.io/library/usernetes-node                          0.0s
[+] Running 1/0
 ✔ Container usernetes-node-1  Running                                     0.0s 
docker compose exec -e U7S_HOST_IP=10.10.0.3 -e U7S_NODE_NAME=u7s-usernetes-compute-003 -e U7S_NODE_SUBNET=10.100.5.0/24 node sh -euc '$(cat /usernetes/join-command)'
[preflight] Running pre-flight checks
    [WARNING SystemVerification]: missing optional cgroups: hugetlb

I think the above was running make -C /opt/usernetes up kubeadm-join with the copied over join-command. But I didn't see the node with kubectl get nodes. What should I try to debug next? I had to bring my cluster down from my phone when my power went off in case it was an all day thing and I was burning cloud monies. :laughing:

AkihiroSuda commented 8 months ago

Congrats on your award today, you and your contributions are amazing and we so appreciate you!

Thank you

But I wonder why that wasn't fixed to start?

Because the sysctl value of the dockerd process is propagated to the container.
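That propagation can be observed (assuming rootless docker is running) by comparing the value in the dockerd network namespace with what a fresh container sees; these are illustrative inspection commands, not part of the usernetes scripts:

```shell
# Value in the rootless dockerd network namespace:
nsenter -t "$(pgrep dockerd)" -n -U -- sysctl -n net.ipv4.conf.default.rp_filter
# Value seen by a new container: matches the above, not the host's value.
docker run --rm busybox sysctl -n net.ipv4.conf.default.rp_filter
```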

But I didn't see the node with kubectl get nodes. What should I try to debug next?

Any error from kubeadm-join?

I had to bring my cluster down from my phone when my power went off in case it was an all day thing and I was burning cloud monies. 😆

I'd suggest using local VMs for experiments

e.g., with https://lima-vm.io/ :

limactl start --network=lima:user-v2 --name=vm0 template://rockylinux-9
limactl start --network=lima:user-v2 --name=vm1 template://rockylinux-9

vsoch commented 8 months ago

oh neat - I am not familiar with this tool. I'll try this out after a meeting / later this evening and give you an update!

vsoch commented 8 months ago

Okay I installed lima and QEMU and created two rocky VMs - and I don't know enough basics to even get a ping working from one VM to the other. I do see there are templates:

(screenshot: list of available Lima templates)

And namely some for k8s and k3s - is there any reason there isn't a template for usernetes? Is it that a template == one VM? It seems like if one person has stepped through this process of using Lima (and knows how to do it), it would be logical to provide a template for a control plane and then N workers for someone else to easily deploy.

vsoch commented 8 months ago

Any error from kubeadm-join?

Will bring up a cluster now and look into this! I've been working for months on these terraform (now OpenTofu) templates and it feels daunting to start from scratch with a VM tool I've never used before. I'm hoping I'm close with the tofu configs on GCP to have something working more quickly.

vsoch commented 8 months ago

okay here is the error from kubeadm-join:

docker compose exec -e U7S_HOST_IP=10.10.0.5 -e U7S_NODE_NAME=u7s-usernetes-compute-002 -e U7S_NODE_SUBNET=10.100.153.0/24 node sh -euc '$(cat /usernetes/join-command)'
[preflight] Running pre-flight checks
    [WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase preflight: [preflight] Some fatal errors occurred:
    [ERROR CRI]: container runtime is not running: output: time="2023-11-09T04:50:45Z" level=fatal msg="validate service connection: validate CRI v1 runtime API for endpoint \"unix:///var/run/containerd/containerd.sock\": rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\""
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
make: *** [Makefile:112: kubeadm-join] Error 1
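The fatal error above means kubeadm could not reach containerd's socket inside the node container. A minimal (hypothetical) guard before running the join command — the socket path is the one from the error message:

```shell
# Hypothetical pre-join guard: kubeadm's CRI preflight check fails when
# this socket is missing, as in the error above.
CRI_SOCK="${CRI_SOCK:-/var/run/containerd/containerd.sock}"

cri_socket_ready() {
  # -S: true if the path exists and is a socket
  [ -S "$1" ]
}

if ! cri_socket_ready "$CRI_SOCK"; then
  echo "containerd socket not ready at $CRI_SOCK; check containerd's logs in the node container" >&2
fi
```

A failure here usually points at containerd never starting (or crashing) inside the node container, which is worth checking before retrying the join.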

vsoch commented 8 months ago

If I shell in (or just run again from the outside) it hangs here:

[preflight] Running pre-flight checks
    [WARNING SystemVerification]: missing optional cgroups: hugetlb

vsoch commented 8 months ago

For the control plane (which appears to work), here is what I see in make logs:

Nov 09 04:56:43 u7s-usernetes-compute-001 kubelet[1013]: E1109 04:56:43.260899    1013 container_manager_linux.go:509] "Failed to ensure process in container with oom score" err="failed to apply oom score -999 to PID 1013: write /proc/1013/oom_score_adj: permission denied"

And the worker node (hanging) I see:

Nov 09 04:50:45 u7s-usernetes-compute-002 containerd[181]: time="2023-11-09T04:50:45.432440353Z" level=warning msg="The image docker.io/kindest/local-path-helper:v20230510-486859a6 is not unpacked."
Nov 09 04:50:45 u7s-usernetes-compute-002 systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Nov 09 04:50:45 u7s-usernetes-compute-002 systemd[1]: Finished Update UTMP about System Runlevel Changes.
Nov 09 04:50:45 u7s-usernetes-compute-002 systemd[1]: Startup finished in 199ms.
Nov 09 04:50:45 u7s-usernetes-compute-002 containerd[181]: time="2023-11-09T04:50:45.444757902Z" level=info msg="Start event monitor"
Nov 09 04:50:45 u7s-usernetes-compute-002 containerd[181]: time="2023-11-09T04:50:45.444784147Z" level=info msg="Start snapshots syncer"
Nov 09 04:50:45 u7s-usernetes-compute-002 containerd[181]: time="2023-11-09T04:50:45.444793703Z" level=info msg="Start cni network conf syncer for default"
Nov 09 04:50:45 u7s-usernetes-compute-002 containerd[181]: time="2023-11-09T04:50:45.444799589Z" level=info msg="Start streaming server"

But I don't see the node is registered:

$ kubectl get nodes
NAME                        STATUS   ROLES           AGE   VERSION
u7s-usernetes-compute-001   Ready    control-plane   11m   v1.28.0

This did work once for me, when it was in the middle of development! I wish I knew what changed :/ I could try going back to rocky since that works now, but I had thought ubuntu was a more sound option.

vsoch commented 8 months ago

The hanging terminal finally timed out:

    [WARNING SystemVerification]: missing optional cgroups: hugetlb
error execution phase preflight: couldn't validate the identity of the API Server: Get "https://10.10.0.3:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
To see the stack trace of this error execute with --v=5 or higher
root@u7s-usernetes-compute-002:/usernetes#
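For future debugging, the two distinct join failures seen in this thread can be told apart from the kubeadm output: a missing containerd socket (CRI down in the node container) versus a timeout reaching the API server (worker cannot reach the control plane on port 6443, e.g. because VXLAN isn't working). A hypothetical triage helper:

```shell
# Hypothetical triage of the two kubeadm-join failures seen in this thread.
join_failure_kind() {
  case "$1" in
    *containerd.sock*)  echo "cri-down" ;;               # containerd not running in the node container
    *"Client.Timeout"*) echo "apiserver-unreachable" ;;  # worker cannot reach the control plane
    *)                  echo "unknown" ;;
  esac
}
```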