k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

k3s --disable-agent flag never starts kube-scheduler in newer k3s versions #5118

Closed FabianKramm closed 2 years ago

FabianKramm commented 2 years ago

Environmental Info: K3s Version:

k3s version v1.22.6+k3s1 (https://github.com/k3s-io/k3s/commit/3228d9cb9a4727d48f60de4f1ab472f7c50df904)
go version go1.16.10

Node(s) CPU architecture, OS, and Version:

Linux test-0 5.10.76-linuxkit #1 SMP Mon Nov 8 10:21:19 UTC 2021 x86_64 GNU/Linux

Cluster Configuration:

container k3s, single server, no agents

Describe the bug: Hello! Thanks again for the great project! This is a problem related to using --disable-agent with k3s.

PR #4345 changed the startup logic so that kube-scheduler is only started once the nodeConfig in the embedded executor is set. With --disable-agent the agent never starts, so nodeConfig is never set, which in turn means kube-scheduler never starts.

You can see the problematic code at: https://github.com/k3s-io/k3s/blob/feb6feeaeccc857a5744ef10efd82b18e8790e78/pkg/daemons/executor/embed.go#L113-L124

And the agent bootstrap that is skipped at: https://github.com/k3s-io/k3s/blob/bb856c67dcd9063ca68691a3321a458f2663d71d/pkg/cli/server/server.go#L451-L455

I know --disable-agent is an unsupported flag, but since we rely on it for correct functionality in vcluster, I hope you could consider fixing this: it worked before, and the fix would in my opinion be a minimal, non-invasive change to k3s. If you decide to go forward with this, I'm happy to submit a PR that fixes it as well.

Steps To Reproduce: Start the server with the --disable-agent flag and observe that kube-scheduler is never started.

Expected behavior: kube-scheduler starts up even when --disable-agent is set.

Actual behavior: kube-scheduler never starts.

brandond commented 2 years ago

Hmm, in your vcluster use case you're using the default scheduler, but do not ever have any nodes? How does that work exactly - wouldn't that leave the scheduler without any nodes to schedule to?

FabianKramm commented 2 years ago

@brandond thanks for the reply! Currently we use the scheduler of the underlying host cluster to decide where a pod should be scheduled on and then sync back the node into the virtual k3s cluster.

In an effort to allow users to taint and label nodes within the virtual cluster and move vcluster closer to the behaviour of a real Kubernetes cluster on the scheduling features, we actually want to enable the scheduler inside the virtual k3s cluster, let it decide on which node a pod should be scheduled and then create the pod in the underlying host cluster bound to the scheduled node already.

This works because we sync the nodes from the host cluster into the virtual one by creating the node objects there without actually installing a separate kubelet or kube-proxy on them, which is why we don't need the k3s agent at all. We only need the control-plane part (kube-apiserver, storage, controller-manager, and scheduler), which is virtualized completely in vcluster and works like in a normal Kubernetes cluster. The workloads are then executed on the host cluster nodes, where we create pods that map to pods in the virtual cluster.

brandond commented 2 years ago

This works because we sync the nodes from the host cluster into the virtual one by creating the node objects in there without actually installing a separate kubelet or kube proxy on them

In that case, I'm not sure we need to change anything - kube-scheduler (in its current state) should start up as soon as an untainted node is sync'd into the virtual K3s cluster.

FabianKramm commented 2 years ago

@brandond but this condition will never be true if you use --disable-agent: the agent is never started, so this will never be non-nil: https://github.com/k3s-io/k3s/blob/feb6feeaeccc857a5744ef10efd82b18e8790e78/pkg/daemons/executor/embed.go#L113-L116

brandond commented 2 years ago

Ahh, I see. Sorry, I'd missed that part; I thought it was just waiting on a node to show up.

brandond commented 2 years ago

Can you see if that PR fixes it for you?

FabianKramm commented 2 years ago

@brandond thanks a lot for the quick PR! It works for me when running k3s outside a container, but if I run k3s inside an unprivileged docker container I get the following error:

INFO[0091] Waiting to retrieve agent configuration; server is not ready: "overlayfs" snapshotter cannot be enabled for "/data/agent/containerd", try using "fuse-overlayfs" or "native": failed to mount overlay: operation not permitted

It seems the problem is this part, which is now executed to retrieve the agent node config: https://github.com/k3s-io/k3s/blob/bb856c67dcd9063ca68691a3321a458f2663d71d/pkg/agent/config/config.go#L440-L459

I'm no expert here, but wouldn't it be much easier to use the kube-scheduler kubeconfig instead of initializing the whole agent config? Or is the node kubeconfig required?

I thought of something like this:

func (e *Embedded) Scheduler(ctx context.Context, disableCCM bool, apiReady <-chan struct{}, args []string) error {
    command := sapp.NewSchedulerCommand()
    command.SetArgs(args)

    go func() {
        <-apiReady
        // If we're running the embedded cloud controller, wait for it to untaint at least one
        // node (usually, the local node) before starting the scheduler to ensure that it
        // finds a node that is ready to run pods during its initial scheduling loop.
        if !disableCCM {
            kubeconfig := ""
            for _, arg := range args {
                if strings.HasPrefix(arg, "--kubeconfig=") {
                    kubeconfig = strings.TrimPrefix(arg, "--kubeconfig=")
                    break
                }
            }
            if kubeconfig != "" {
                if err := waitForUntaintedNode(ctx, kubeconfig); err != nil {
                    logrus.Fatalf("failed to wait for untainted node: %v", err)
                }
            }
        }
        defer func() {
            if err := recover(); err != nil {
                logrus.Fatalf("scheduler panic: %v", err)
            }
        }()
        logrus.Fatalf("scheduler exited: %v", command.ExecuteContext(ctx))
    }()

    return nil
}
brandond commented 2 years ago

I was hoping to avoid having to pass that in explicitly, since the nodeConfig already has all the various bits of information we need filled in properly, provided we bootstrap the executor before using it.

brandond commented 2 years ago

Should be sorted now; I am able to run the server in an unprivileged container. Even rootless should work, if you give it a writable path for $HOME:

docker run --rm -it --user 1000:1000 -e HOME=/tmp/k3s rancher/k3s server --disable-agent --token=token --rootless
FabianKramm commented 2 years ago

@brandond just verified it and it works perfectly now, thanks so much for the quick fix!

ShylajaDevadiga commented 2 years ago

Validated on v1.23.5-rc1+k3s1. Installed node1 passing --disable-agent.

$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
kube-system   helm-install-traefik-crd-xjkjh            0/1     Pending   0          7m17s
kube-system   helm-install-traefik-4nrvz                0/1     Pending   0          7m17s
kube-system   local-path-provisioner-6c79684f77-c97zn   0/1     Pending   0          7m17s
kube-system   metrics-server-7cd5fcb6b7-gcjdr           0/1     Pending   0          7m17s
kube-system   coredns-d76bd69b-p86fc                    0/1     Pending   0          7m17s

Joined an agent node

$ kubectl get nodes
NAME               STATUS   ROLES    AGE     VERSION
ip-172-31-15-177   Ready    <none>   6m44s   v1.23.5-rc1+k3s1
$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE
kube-system   coredns-d76bd69b-p86fc                    1/1     Running     0          12m
kube-system   local-path-provisioner-6c79684f77-c97zn   1/1     Running     0          12m
kube-system   helm-install-traefik-crd-xjkjh            0/1     Completed   0          12m
kube-system   helm-install-traefik-4nrvz                0/1     Completed   1          12m
kube-system   svclb-traefik-fxbth                       2/2     Running     0          4m13s
kube-system   metrics-server-7cd5fcb6b7-gcjdr           1/1     Running     0          12m
kube-system   traefik-58b759688b-zjx4j                  1/1     Running     0          4m13s

Metrics server fails to fetch metrics in the above setup as shared in the issue https://github.com/k3s-io/k3s/issues/5330