kubewharf / godel-scheduler

a unified scheduler for online and offline tasks
Apache License 2.0

The "godel-demo-labels" cluster and "godel-demo-default" cluster created by kind in Quick Start cannot coexist. #51

Closed slipegg closed 1 month ago

slipegg commented 1 month ago

Hello~ I recently tried to run some of the examples in Quick Start, but the "godel-demo-labels" and "godel-demo-default" clusters do not seem to be able to coexist. If I use kind to create one cluster and then create the other, errors occur and the second creation fails. If I delete the first cluster and then create the second, there is no problem.

I'm not sure whether this is an issue that needs to be addressed. If not, could we consider adding a hint to Quick Start?

Environment

My running environment is as follows (meeting the version requirements given in Quick Start):

Problems that occur when installing the "godel-demo-labels" cluster after installing the "godel-demo-default" cluster

Creating cluster "godel-demo-labels" ...
 ✓ Ensuring node image (kindest/node:v1.21.1) 🖼
 ✓ Preparing nodes 📦 📦 📦 📦 📦 📦 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
 ✗ Joining worker nodes 🚜
Deleted nodes: ["godel-demo-labels-worker5" "godel-demo-labels-control-plane" "godel-demo-labels-worker4" "godel-demo-labels-worker3" "godel-demo-labels-worker6" "godel-demo-labels-worker" "godel-demo-labels-worker2"]
ERROR: failed to create cluster: failed to join node with kubeadm: command "docker exec --privileged godel-demo-labels-worker6 kubeadm join --config /kind/kubeadm.conf --skip-phases=preflight --v=6" failed with error: exit status 1
Command Output: I0522 07:39:10.021678     216 join.go:395] [preflight] found NodeName empty; using OS hostname as NodeName
I0522 07:39:10.021732     216 joinconfiguration.go:74] loading configuration from "/kind/kubeadm.conf"
I0522 07:39:10.023306     216 controlplaneprepare.go:211] [download-certs] Skipping certs download
I0522 07:39:10.023332     216 join.go:465] [preflight] Discovering cluster-info
I0522 07:39:10.023382     216 token.go:78] [discovery] Created cluster-info discovery client, requesting info from "godel-demo-labels-control-plane:6443"
I0522 07:39:10.048625     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 24 milliseconds
I0522 07:39:10.049448     216 token.go:221] [discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "abcdef", will try again
I0522 07:39:15.959644     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 2 milliseconds
I0522 07:39:15.959860     216 token.go:221] [discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "abcdef", will try again
I0522 07:39:22.383210     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 10 milliseconds
I0522 07:39:22.383362     216 token.go:221] [discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "abcdef", will try again
I0522 07:39:28.389410     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s 200 OK in 6 milliseconds
...
I0522 07:40:06.164226     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/nodes/godel-demo-labels-worker6?timeout=10s 404 Not Found in 1 milliseconds
I0522 07:40:06.664261     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/nodes/godel-demo-labels-worker6?timeout=10s 404 Not Found in 1 milliseconds
I0522 07:40:07.164477     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/nodes/godel-demo-labels-worker6?timeout=10s 404 Not Found in 1 milliseconds
I0522 07:40:07.664581     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/nodes/godel-demo-labels-worker6?timeout=10s 404 Not Found in 1 milliseconds
I0522 07:40:08.164830     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/nodes/godel-demo-labels-worker6?timeout=10s 404 Not Found in 1 milliseconds
I0522 07:40:08.664483     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/nodes/godel-demo-labels-worker6?timeout=10s 404 Not Found in 1 milliseconds
I0522 07:40:09.164724     216 round_trippers.go:454] GET https://godel-demo-labels-control-plane:6443/api/v1/nodes/godel-demo-labels-worker6?timeout=10s 404 Not Found in 1 milliseconds
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

Problems that occur when installing the "godel-demo-default" cluster after installing the "godel-demo-labels" cluster

Creating cluster "godel-demo-default" ...
 ✓ Ensuring node image (kindest/node:v1.21.1) 🖼
 ✓ Preparing nodes 📦 📦
 ✓ Writing configuration 📜
 ✗ Starting control-plane 🕹️
Deleted nodes: ["godel-demo-default-worker" "godel-demo-default-control-plane"]
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged godel-demo-default-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0522 07:55:56.694109     189 initconfiguration.go:246] loading configuration from "/kind/kubeadm.conf"
[config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta2, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.21.1
I0522 07:55:56.782474     189 certs.go:110] creating a new certificate authority for ca
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
I0522 07:55:56.929762     189 certs.go:487] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [godel-demo-default-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.18.0.14 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0522 07:55:57.494677     189 certs.go:110] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0522 07:55:57.753423     189 certs.go:487] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0522 07:55:57.814377     189 certs.go:110] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0522 07:55:57.948703     189 certs.go:487] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [godel-demo-default-control-plane localhost] and IPs [172.18.0.14 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [godel-demo-default-control-plane localhost] and IPs [172.18.0.14 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0522 07:55:58.720578     189 certs.go:76] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I0522 07:55:59.142258     189 kubeconfig.go:101] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0522 07:55:59.258969     189 kubeconfig.go:101] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0522 07:55:59.377400     189 kubeconfig.go:101] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0522 07:55:59.789019     189 kubeconfig.go:101] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0522 07:56:00.034177     189 kubelet.go:63] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0522 07:56:00.200412     189 manifests.go:96] [control-plane] getting StaticPodSpecs
I0522 07:56:00.200893     189 certs.go:487] validating certificate period for CA certificate
I0522 07:56:00.200979     189 manifests.go:109] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0522 07:56:00.200990     189 manifests.go:109] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0522 07:56:00.200993     189 manifests.go:109] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0522 07:56:00.200997     189 manifests.go:109] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0522 07:56:00.201001     189 manifests.go:109] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0522 07:56:00.208696     189 manifests.go:126] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0522 07:56:00.208722     189 manifests.go:96] [control-plane] getting StaticPodSpecs
I0522 07:56:00.208998     189 manifests.go:109] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0522 07:56:00.209011     189 manifests.go:109] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0522 07:56:00.209016     189 manifests.go:109] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0522 07:56:00.209020     189 manifests.go:109] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0522 07:56:00.209025     189 manifests.go:109] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0522 07:56:00.209029     189 manifests.go:109] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0522 07:56:00.209033     189 manifests.go:109] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I0522 07:56:00.209891     189 manifests.go:126] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0522 07:56:00.209913     189 manifests.go:96] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0522 07:56:00.210165     189 manifests.go:109] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0522 07:56:00.210693     189 manifests.go:126] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0522 07:56:00.211475     189 local.go:74] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0522 07:56:00.211497     189 waitcontrolplane.go:87] [wait-control-plane] Waiting for the API server to be healthy
I0522 07:56:00.212272     189 loader.go:372] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0522 07:56:00.214071     189 round_trippers.go:454] GET https://godel-demo-default-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0522 07:56:00.715008     189 round_trippers.go:454] GET https://godel-demo-default-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0522 07:56:01.215989     189 round_trippers.go:454] GET https://godel-demo-default-control-plane:6443/healthz?timeout=10s  in 1 milliseconds
...
I0522 07:56:39.215169     189 round_trippers.go:454] GET https://godel-demo-default-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0522 07:56:39.716828     189 round_trippers.go:454] GET https://godel-demo-default-control-plane:6443/healthz?timeout=10s  in 1 milliseconds
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.

NickrenREN commented 1 month ago

cc @yuey002 PTAL when you get a chance, thanks

yuey002 commented 1 month ago

@slipegg Thanks for submitting the issue.

I tried but could not reproduce the issue as described. Could you share your configured Docker resource limits? If they are low, you could try increasing resources such as CPU and memory, and then try creating the cluster again. Or, if other kind clusters are running in your environment, you could try deleting them to release some resources.
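For anyone checking the same things, here is a minimal sketch of how the resource check and cleanup could be done from a shell. The function names (`bytes_to_gib`, `report_docker_resources`, `list_kind_clusters`) are hypothetical helpers made up for this example; only `docker info --format`, `kind get clusters`, and `kind delete cluster --name` are real commands. It assumes `docker` and `kind` are on the `PATH` and reports an error otherwise.

```shell
#!/bin/sh
# Hypothetical helper: convert a byte count to whole GiB (integer division).
bytes_to_gib() {
    echo $(( $1 / 1073741824 ))
}

# Hypothetical helper: print the CPU count and total memory visible to the
# Docker daemon. On Docker Desktop this reflects the configured VM limits.
report_docker_resources() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found" >&2
        return 1
    fi
    cpus=$(docker info --format '{{.NCPU}}') || return 1
    mem_bytes=$(docker info --format '{{.MemTotal}}') || return 1
    echo "Docker: ${cpus} CPUs, $(bytes_to_gib "$mem_bytes") GiB memory"
}

# Hypothetical helper: list the kind clusters that currently exist.
list_kind_clusters() {
    if ! command -v kind >/dev/null 2>&1; then
        echo "kind not found" >&2
        return 1
    fi
    kind get clusters
}

# Failures are tolerated so the script still finishes when a tool is missing.
report_docker_resources || true
list_kind_clusters || true

# To release resources, delete a cluster you no longer need, for example:
#   kind delete cluster --name <cluster-name>
```

If the listing shows clusters left over from earlier experiments, deleting them before creating both demo clusters may be enough. How much CPU and memory the two clusters need together is not stated in this thread.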

slipegg commented 1 month ago

I'm sorry, it does appear to be an issue caused by insufficient resources on my server. I just cleaned up the other clusters created by kind and tried again, and now I can create both clusters at the same time. I will close this issue.

Thank you very much for your help! 💕