nestybox / sysbox-pkgr


Support k3s-agent deployment #128

Open · jvassev opened 5 months ago

jvassev commented 5 months ago

I discovered a few more missing pieces and added them too. There is a strange issue when pods get rescheduled on CRI-O, where I occasionally see:

level=error err="listen tcp :9100: bind: address already in use"

Simple pod recreation solves it, which is why I'm adding a sleep 20 before restarting k3s-agent.

Is there a smarter way to solve this?
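For reference, the workaround amounts to roughly the following (a sketch only; the exact placement inside the helper script differs):

systemctl stop k3s-agent
# ... reconfigure the runtime / kubelet ...
sleep 20    # give the rescheduled pods time to release their host ports
systemctl start k3s-agent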

ctalledo commented 5 months ago

> I discovered a few more missing pieces and added them too. There is a strange issue when pods get rescheduled on CRI-O, where I occasionally see:
>
> level=error err="listen tcp :9100: bind: address already in use"
>
> Simple pod recreation solves it, which is why I'm adding a sleep 20 before restarting k3s-agent.
>
> Is there a smarter way to solve this?

I don't know, but it's certainly not ideal.

Is it because the k3s agent is not fully stopped after systemctl stop k3s-agent? If that's the case, then we could try looping until it is. Do you know which agent is using TCP port 9100?
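For the port that shows up in the error above, such a loop could look like this (illustration only; 9100 is just the port from that log line, and a real fix would have to cover whatever ports the rescheduled pods happen to use):

# wait up to ~60s for the old pod to release the port, instead of a fixed sleep 20
for i in $(seq 1 60); do
  ss -ltn | grep -q ':9100 ' || break
  sleep 1
done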

jvassev commented 5 months ago

In my case it was node-exporter, but it happens with other pods like calico-node too. I'm sure systemctl stop k3s-agent blocks until the process is down. Maybe the containerd-managed pods need to get wiped out too? I see this in do_config_kubelet_docker_systemd: https://github.com/nestybox/sysbox-pkgr/blob/master/k8s/scripts/kubelet-config-helper.sh#L1396-L1401
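If wiping the leftover containerd-managed pods is indeed what's needed, a sketch of that could look as follows (assumptions: crictl is available on the node and the k3s containerd socket lives at /run/k3s/containerd/containerd.sock):

export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock
# stop and remove every pod sandbox the old runtime left behind
for pod in $(crictl pods -q); do
  crictl stopp "$pod"
  crictl rmp "$pod"
done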

jvassev commented 5 months ago

With that latest change I think the pod running the kubelet-config-helper.sh script is stopped because of the call to clean_runtime_state "$runtime", and the final systemctl start k3s-agent never gets a chance to run. Starting it manually fixes the node.

jvassev commented 5 months ago

After some more debugging I noticed that it just takes too long to kill the old *.slice units, so the latest change stops them in parallel.
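In shell terms the parallel variant amounts to something like this (the kubepods* unit pattern is an assumption about how the pod slices are named on the node):

# stop the leftover pod slices concurrently instead of one at a time
for slice in $(systemctl list-units --no-legend --type=slice 'kubepods*' | awk '{print $1}'); do
  systemctl stop "$slice" &
done
wait    # block until every stop has finished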

ctalledo commented 5 months ago

> With that latest change I think the pod running the kubelet-config-helper.sh script is stopped because of the call to clean_runtime_state "$runtime", and the final systemctl start k3s-agent never gets a chance to run. Starting it manually fixes the node.

Mmm ... not sure about this. The kubelet-config-helper.sh does not run within a pod; it runs directly on the host (i.e., k8s node) as a systemd unit. The systemd unit is created and then started by the sysbox-deploy-k8s.sh script running inside the sysbox-deploy-k8s pod.

Thus the call to clean_runtime_state should not affect the execution of kubelet-config-helper.sh. Maybe something else is going on?
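One way to double-check this on the node (the unit name below is just a guess for illustration; substitute whatever unit sysbox-deploy-k8s.sh actually creates):

# see whether the helper unit is still running, or why it stopped
systemctl status kubelet-config-helper.service
journalctl -u kubelet-config-helper.service --no-pager | tail -n 50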

ctalledo commented 5 months ago

Hi @jvassev, thanks again for the contribution.

Where is this PR at? Is it ready for merging, or are you still debugging/testing it?

Thanks!