nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

Sysbox installation on Rancher managed cluster failed #380

Closed pwurbs closed 3 years ago

pwurbs commented 3 years ago

I tried to install Sysbox in a K8s cluster following the user guide.

So the Sysbox requirements should be fulfilled.

RBAC and RuntimeClass have been successfully deployed, but there are issues with the sysbox-deploy-k8s DaemonSet: its Pod is continuously crashing. This is the log line before the crash: Job for kubelet-config-helper.service failed because the control process exited with error code. See "systemctl status kubelet-config-helper.service" and "journalctl -xe" for details.

This is the result of "systemctl status kubelet-config-helper.service":

kubelet-config-helper.service - Kubelet config service
     Loaded: loaded (/lib/systemd/system/kubelet-config-helper.service; static; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2021-08-11 09:34:05 UTC; 1min 3s ago
    Process: 98727 ExecStart=/bin/sh -c /usr/local/bin/kubelet-config-helper.sh (code=exited, status=1/FAILURE)
   Main PID: 98727 (code=exited, status=1/FAILURE)
Aug 11 09:34:05 rancher02-testsysbox systemd[1]: Starting Kubelet config service...
Aug 11 09:34:05 rancher02-testsysbox sh[98756]: Usage: grep [OPTION]... PATTERNS [FILE]...
Aug 11 09:34:05 rancher02-testsysbox sh[98756]: Try 'grep --help' for more information.
Aug 11 09:34:05 rancher02-testsysbox sh[98755]: Unit kubelet.service could not be found.
Aug 11 09:34:05 rancher02-testsysbox sh[98728]: Soft-linking dockershim socket to CRI-O socket on the host ...
Aug 11 09:34:05 rancher02-testsysbox sh[98777]: cp: cannot stat '/etc/default/kubelet': No such file or directory
Aug 11 09:34:05 rancher02-testsysbox systemd[1]: kubelet-config-helper.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 09:34:05 rancher02-testsysbox systemd[1]: kubelet-config-helper.service: Failed with result 'exit-code'.
Aug 11 09:34:05 rancher02-testsysbox systemd[1]: Failed to start Kubelet config service.
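
The last few log lines suggest the helper script expects a systemd-managed kubelet on the host (a kubelet.service unit plus /etc/default/kubelet), neither of which exists on an RKE node, where the kubelet runs as a Docker container (as discussed further down in this thread). A minimal sketch of how one could check which situation a node is in (a hypothetical check, not part of the actual helper script):

```shell
# Does this node have a host-managed (systemd) kubelet?
# On RKE nodes the answer is no: the kubelet runs inside Docker.
if systemctl list-unit-files 2>/dev/null | grep -q '^kubelet\.service'; then
  kubelet_mode="host-managed (kubelet.service unit found)"
else
  kubelet_mode="not host-managed (e.g. RKE runs the kubelet inside Docker)"
fi
echo "kubelet: $kubelet_mode"
```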

The cluster has been created in Rancher using the option "Create a new Kubernetes cluster", based on existing nodes. So the single node was prepared and imported to create the new (downstream) cluster. The cluster config exported from Rancher is attached: cluster-config.txt

rodnymolina commented 3 years ago

Thanks for filing this issue @pwurbs!

The Sysbox pods feature has not been validated / tested on Rancher yet. I'll take a look at this one tomorrow.

rodnymolina commented 3 years ago

I was able to reproduce the issue by deploying a cluster directly through rke -- I had too many issues trying to import pre-existing nodes into Rancher. Even though the setup may not be exactly the same as the one originally described, there shouldn't be any relevant differences for us, as Rancher internally relies on rke too.

There are various issues at play here:

@pwurbs, how does rke2 sound to you? Is an rke-to-rke2 migration already part of your roadmap?

pwurbs commented 3 years ago

@rodnymolina Thx for the analysis. So I understand that Sysbox currently can't be deployed on a Rancher-managed K8s cluster (RKE-based), right? Unfortunately, we don't currently intend to move to RKE2. Would it be a workaround to install Sysbox using the host installation procedure instead of deploying it via the K8s manifests?

rodnymolina commented 3 years ago

Would it be a workaround to install Sysbox using the host installation procedure instead of deploying it using the K8S manifests?

Installing Sysbox through the traditional package won't help here, as Rancher (and its provisioning tools: rke, rke2, k3s) won't be aware of its existence on the remote hosts. That integration step is precisely what the 'sysbox-deploy-k8s' daemonset is for.

Having said that, there may be an alternative approach that we are currently investigating to make this all work. Please stay tuned.

rodnymolina commented 3 years ago

In the end we were able to make it work (see details below). RKE can now deploy Sysbox-powered pods in a cluster. The changes have been pushed to the latest sysbox-deploy-k8s installer, which deploys both CRI-O and Sysbox on the desired K8s nodes.

In terms of implementation, we went for the following approach:

As mentioned above, RKE relies heavily on Docker to create both the K8s control plane and its data plane. The former components (i.e. kubelet, kube-proxy and nginx-proxy) are spawned as Docker containers, whereas the latter ones (e.g. CNI pods and all user workloads) are created as pods through the dockershim interface.
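
On such a node, this split is easy to observe: the control-plane pieces show up as ordinary Docker containers. A quick check (a sketch; container names assumed from RKE's defaults, and harmless on nodes without Docker):

```shell
# List RKE control-plane components running as plain Docker containers.
# Falls back to a message if Docker isn't available or nothing matches.
rke_components=$(docker ps --format '{{.Names}}' 2>/dev/null \
  | grep -E 'kubelet|kube-proxy|nginx-proxy' \
  || echo "docker not available here (or no RKE components on this node)")
echo "$rke_components"
```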

As we don't want to (and can't) change RKE, we still rely on Docker to create the basic control-plane components. However, we have switched all the data-plane components from dockershim to CRI-O.

As is usually the case, we have incorporated all the required configuration steps into the sysbox-deploy-k8s daemonset. All that is required is the execution of the following steps -- the K8s nodes' reconfiguration process shouldn't take more than a minute:

kubectl label nodes <node-name> sysbox-install=yes
kubectl apply -f https://raw.githubusercontent.com/nestybox/sysbox/master/sysbox-k8s-manifests/rbac/sysbox-deploy-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/nestybox/sysbox/master/sysbox-k8s-manifests/daemonset/sysbox-deploy-k8s.yaml
kubectl apply -f https://raw.githubusercontent.com/nestybox/sysbox/master/sysbox-k8s-manifests/runtime-class/sysbox-runtimeclass.yaml
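
Once the daemonset has finished, pods select the Sysbox runtime through the RuntimeClass installed above. A minimal sketch of such a pod spec (the runtime-class name comes from sysbox-runtimeclass.yaml; the image is assumed to be Nestybox's systemd+docker test image), which can be piped to `kubectl apply -f -`:

```shell
# Print a minimal pod spec that requests the Sysbox runtime via its
# RuntimeClass; apply with:  echo "$pod_spec" | kubectl apply -f -
pod_spec='apiVersion: v1
kind: Pod
metadata:
  name: syscont-test
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: syscont-test
    image: registry.nestybox.com/nestybox/ubuntu-bionic-systemd-docker'
echo "$pod_spec"
```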

Refer to our k8s installation guide for more details.

pwurbs commented 3 years ago

I could now successfully deploy Sysbox on a Rancher-managed (RKE) cluster node using the K8s manifest files. I used Ubuntu 20.04 (latest), Docker 20.x and Kubernetes v1.20.10. The test pod according to https://github.com/nestybox/sysbox/blob/master/docs/user-guide/install-k8s.md#pod-deployment could be deployed successfully (without any privileged mode). Within that container I could successfully pull and start an nginx container. So far everything is fine, thank you.

Then I successfully started a pod with the docker:dind image (docker:19.03.15-dind-alpine3.13). Trying "docker pull nginx" in this container results in this error: failed to register layer: Error processing tar file(exit status 1): replaceDirWithOverlayOpaque("/docker-entrypoint.d") failed: createDirWithOverlayOpaque("/rdwoo655593762") failed: failed to rmdir /rdwoo655593762/m/d: remove /rdwoo655593762/m/d: operation not permitted

This is the Docker version info from within the container:

Server: Docker Engine - Community
 Engine:
  Version:          19.03.15
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       99e3ed8
  Built:            Sat Jan 30 03:18:13 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.9
  GitCommit:        ea765aba0d05254012b0b9e595e995c09186427f
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

These versions differ a bit from your ubuntu-bionic-systemd-docker image. I am not sure if this issue is K8s / RKE related; I only wanted to let you know...

ctalledo commented 3 years ago

Hi @pwurbs,

Glad you were able to install Sysbox on your RKE nodes (great work by @rodnymolina to enable this).

Regarding the latest problem you reported:

failed to register layer: Error processing tar file(exit status 1): replaceDirWithOverlayOpaque("/docker-entrypoint.d") failed: createDirWithOverlayOpaque("/rdwoo655593762") failed: failed to rmdir /rdwoo655593762/m/d: remove /rdwoo655593762/m/d: operation not permitted

This looks very similar to issue #254, where the problem showed up when the inner Docker uses slightly older versions.

However, in that issue we reported that the problem occurs when the inner Docker has version < 19.03, but in your case the inner Docker has version 19.03.

Could you retry with a docker dind image using Docker 20+ please?
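
For example, only the image tag needs to change in the dind pod spec (a sketch; `docker:20.10-dind` is assumed to be an available dind tag on Docker Hub, and `sysbox-runc` is the RuntimeClass installed earlier):

```shell
# Print a dind pod spec on a newer inner Docker, for the retry;
# apply with:  echo "$dind_spec" | kubectl apply -f -
dind_spec='apiVersion: v1
kind: Pod
metadata:
  name: dind-test
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: dind
    image: docker:20.10-dind'
echo "$dind_spec"
```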

I am not sure if this issue is K8s / RKE related; I only wanted to let you know...

I don't believe so. Thus, it makes sense for us to move this discussion to issue #254. I'll copy your prior comment and this current one over to that issue so we can continue the discussion there, and I'll close this one.