k3d-io / k3d

Little helper to run CNCF's k3s in Docker
https://k3d.io/
MIT License

[BUG] Issue to create multi-servers cluster #1169

Open ygouzerh opened 2 years ago

ygouzerh commented 2 years ago

What did you do

I was trying to create a cluster using k3d cluster create --server 3 --trace cluster-dev, but unfortunately the cluster could not be created.

However, when running k3d cluster create cluster-dev, it works well.

Screenshots or terminal output

Output of k3d cluster create --server 3 --trace cluster-dev:

TRAC[0004] NodeWaitForLogMessage: Node 'k3d-cluster-dev-01-server-1' waiting for log message 'k3s is up and running' since '2022-10-10 09:39:33 +0000 UTC'
TRAC[0026] Found target message `k3s is up and running` in log line `time="2022-10-10T09:39:55Z" level=info msg="k3s is up and running"`
DEBU[0026] Finished waiting for log message 'k3s is up and running' from node 'k3d-cluster-dev-01-server-1'
DEBU[0026] DOCKER_SOCK=/var/run/docker.sock
DEBU[0026] No fix enabled.
DEBU[0026] Node k3d-cluster-dev-01-server-2 Start Time: 2022-10-10 17:39:55.809368828 +0800 HKT m=+27.000659477
TRAC[0026] Starting node 'k3d-cluster-dev-01-server-2'
INFO[0027] Starting Node 'k3d-cluster-dev-01-server-2'
DEBU[0027] Truncated 2022-10-10 09:39:56.228619637 +0000 UTC to 2022-10-10 09:39:56 +0000 UTC
DEBU[0027] Waiting for node k3d-cluster-dev-01-server-2 to get ready (Log: 'k3s is up and running')
TRAC[0027] NodeWaitForLogMessage: Node 'k3d-cluster-dev-01-server-2' waiting for log message 'k3s is up and running' since '2022-10-10 09:39:56 +0000 UTC'
TRAC[0038] Non-fatal last log line in node k3d-cluster-dev-01-server-2: E1010 09:40:07.511134       7 kubelet.go:1361] "Failed to start cAdvisor" err="inotify_init: too many open files"
ERRO[0039] Failed Cluster Start: Failed to start server k3d-cluster-dev-01-server-2: Node k3d-cluster-dev-01-server-2 failed to get ready: error waiting for log line `k3s is up and running` from node 'k3d-cluster-dev-01-server-2': stopped returning log lines: node k3d-cluster-dev-01-server-2 is running=true in status=restarting
ERRO[0039] Failed to create cluster >>> Rolling Back
INFO[0039] Deleting cluster 'cluster-dev-01'
TRAC[0039] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-cluster-dev-01-serverlb
TRAC[0039] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-cluster-dev-01-server-2
TRAC[0039] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-cluster-dev-01-server-1
TRAC[0039] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-cluster-dev-01-server-0
DEBU[0039] Cluster Details: &{Name:cluster-dev-01 Network:{Name:k3d-cluster-dev-01 ID:6e49fed5c2b54b614e8696551137ce1ed1facd9a0175d655efc355a319e1abe0 External:false IPAM:{IPPrefix:192.168.16.0/20 IPsUsed:[] Managed:false} Members:[]} Token:ZskOYcBRTblCVDColJeM Nodes:[0xc0006029c0 0xc000602b60 0xc000602d00 0xc000602ea0] InitNode:0xc000602b60 ExternalDatastore:<nil> KubeAPI:0xc0001b7340 ServerLoadBalancer:0xc000489010 ImageVolume:k3d-cluster-dev-01-images Volumes:[k3d-cluster-dev-01-images k3d-cluster-dev-01-images]}
DEBU[0039] Deleting node k3d-cluster-dev-01-serverlb ...
TRAC[0039] [Docker] Deleted Container k3d-cluster-dev-01-serverlb
DEBU[0039] Deleting node k3d-cluster-dev-01-server-0 ...
TRAC[0040] [Docker] Deleted Container k3d-cluster-dev-01-server-0
DEBU[0040] Deleting node k3d-cluster-dev-01-server-1 ...
TRAC[0040] [Docker] Deleted Container k3d-cluster-dev-01-server-1
DEBU[0040] Deleting node k3d-cluster-dev-01-server-2 ...
TRAC[0040] [Docker] Deleted Container k3d-cluster-dev-01-server-2
INFO[0040] Deleting cluster network 'k3d-cluster-dev-01'
INFO[0040] Deleting 2 attached volumes...
DEBU[0040] Deleting volume k3d-cluster-dev-01-images...
DEBU[0040] Deleting volume k3d-cluster-dev-01-images...
WARN[0040] Failed to delete volume 'k3d-cluster-dev-01-images' of cluster 'cluster-dev-01': failed to find volume 'k3d-cluster-dev-01-images': Error: No such volume: k3d-cluster-dev-01-images -> Try to delete it manually
FATA[0040] Cluster creation FAILED, all changes have been rolled back!

Output of k3d cluster create cluster-dev:

INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-cluster-dev-01'
INFO[0000] Created image volume k3d-cluster-dev-01-images
INFO[0000] Starting new tools node...
INFO[0000] Starting Node 'k3d-cluster-dev-01-tools'
INFO[0001] Creating node 'k3d-cluster-dev-01-server-0'
INFO[0001] Creating LoadBalancer 'k3d-cluster-dev-01-serverlb'
INFO[0001] Using the k3d-tools node to gather environment information
INFO[0001] HostIP: using network gateway 192.168.48.1 address
INFO[0001] Starting cluster 'cluster-dev-01'
INFO[0001] Starting servers...
INFO[0001] Starting Node 'k3d-cluster-dev-01-server-0'
INFO[0006] All agents already running.
INFO[0006] Starting helpers...
INFO[0006] Starting Node 'k3d-cluster-dev-01-serverlb'
INFO[0013] Injecting records for hostAliases (incl. host.k3d.internal) and for 2 network members into CoreDNS configmap...
INFO[0015] Cluster 'cluster-dev-01' created successfully!
INFO[0016] You can now use it like this:
kubectl cluster-info

However, deleting this cluster throws the following warning:

WARN[0001] Failed to delete volume 'k3d-cluster-dev-01-images' of cluster 'cluster-dev-01': failed to find volume 'k3d-cluster-dev-01-images': Error: No such volume: k3d-cluster-dev-01-images -> Try to delete it manually

But the volume seems to have been created correctly:

Output of docker volume ls | grep k3d

...
local     k3d-cluster-dev-01-images
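
Since the warning says to delete the volume manually, the cleanup is a single docker command (the volume name is taken from the docker volume ls output above):

```shell
# Remove the leftover image volume by hand, as the k3d warning suggests.
docker volume rm k3d-cluster-dev-01-images
```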

Which OS & Architecture

arch: x86_64
cgroupdriver: cgroupfs
cgroupversion: "1"
endpoint: /var/run/docker.sock
filesystem: xfs
name: docker
os: CentOS Linux 8
ostype: linux
version: 20.10.17

Which version of k3d

k3d version v5.4.6
k3s version v1.24.4-k3s1 (default)

Which version of docker

Output of docker version:

Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:05:12 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:03:33 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
ygouzerh commented 2 years ago

Actually I found the issue: we need to pass --wait, as k3d was terminating the command too early.

You can close this ticket whenever you want; it might be worth mentioning this in the documentation or increasing the default timeout.

EDIT: False alarm, the issue is still pending.

ygouzerh commented 2 years ago

Actually, I realized that one server keeps restarting, with the following logs:

E1010 10:44:51.326517       7 dynamic_cafile_content.go:166] "Failed to watch CA file, will retry later" err="error creating fsnotify watcher: too many open files"
time="2022-10-10T10:44:51Z" level=info msg="Stopped tunnel to 127.0.0.1:6443"
time="2022-10-10T10:44:51Z" level=info msg="Connecting to proxy" url="wss://192.168.128.3:6443/v1-k3s/connect"
time="2022-10-10T10:44:51Z" level=info msg="Connecting to proxy" url="wss://192.168.128.2:6443/v1-k3s/connect"
time="2022-10-10T10:44:51Z" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:6443/v1-k3s/connect"
time="2022-10-10T10:44:51Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
W1010 10:44:51.335516       7 lease.go:234] Resetting endpoints for master service "kubernetes" to [192.168.128.2 192.168.128.3 192.168.128.4]
time="2022-10-10T10:44:51Z" level=info msg="Connecting to proxy" url="wss://192.168.128.4:6443/v1-k3s/connect"
time="2022-10-10T10:44:51Z" level=info msg="Handling backend connection request [k3d-cryptoblk-kentro-01-server-2]"
time="2022-10-10T10:44:51Z" level=info msg="Handling backend connection request [k3d-cryptoblk-kentro-01-server-1]"
time="2022-10-10T10:44:51Z" level=info msg="Handling backend connection request [k3d-cryptoblk-kentro-01-server-0]"
W1010 10:44:51.393843       7 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
I1010 10:44:51.394249       7 server.go:644] "--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /"
I1010 10:44:51.394691       7 container_manager_linux.go:262] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
I1010 10:44:51.394760       7 container_manager_linux.go:267] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName:/k3s SystemCgroupsName: KubeletCgroupsName:/k3s KubeletOOMScoreAdj:-999 ContainerRuntime: CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:false CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
I1010 10:44:51.394781       7 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
I1010 10:44:51.394793       7 container_manager_linux.go:302] "Creating device plugin manager" devicePluginEnabled=true
I1010 10:44:51.394836       7 state_mem.go:36] "Initialized new in-memory state store"
I1010 10:44:51.399001       7 kubelet.go:376] "Attempting to sync node with API server"
I1010 10:44:51.399037       7 kubelet.go:267] "Adding static pod path" path="/var/lib/rancher/k3s/agent/pod-manifests"
I1010 10:44:51.399069       7 kubelet.go:278] "Adding apiserver pod source"
I1010 10:44:51.399094       7 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
E1010 10:44:51.399163       7 file_linux.go:61] "Unable to read config path" err="unable to create inotify: too many open files" path="/var/lib/rancher/k3s/agent/pod-manifests"
I1010 10:44:51.399927       7 kuberuntime_manager.go:239] "Container runtime initialized" containerRuntime="containerd" version="v1.6.6-k3s1" apiVersion="v1"
E1010 10:44:51.400113       7 plugins.go:604] "Error initializing dynamic plugin prober" err="error initializing watcher: too many open files"
I1010 10:44:51.400303       7 server.go:1177] "Started kubelet"
I1010 10:44:51.400402       7 server.go:150] "Starting to listen" address="0.0.0.0" port=10250
E1010 10:44:51.400566       7 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs"
E1010 10:44:51.400595       7 kubelet.go:1298] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
I1010 10:44:51.401182       7 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
I1010 10:44:51.401261       7 volume_manager.go:289] "Starting Kubelet Volume Manager"
I1010 10:44:51.401312       7 desired_state_of_world_populator.go:145] "Desired state populator starts to run"
I1010 10:44:51.401999       7 server.go:410] "Adding debug handlers to kubelet server"
E1010 10:44:51.404186       7 manager.go:279] Registration of the raw container factory failed: inotify_init: too many open files
E1010 10:44:51.404245       7 kubelet.go:1361] "Failed to start cAdvisor" err="inotify_init: too many open files"
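
The "too many open files" errors above come from the host's inotify limits, which are kernel-wide and shared by every container. A quick way to check the current values before deciding whether they need raising (standard /proc paths on Linux):

```shell
# Inspect the host-wide inotify limits that the kubelet errors point at.
# All k3d server containers share these with the host kernel.
cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches
# Per-process open-file limit for the current shell:
ulimit -n
```

With three servers, each running its own kubelet and cAdvisor, the default limits (often 128 instances on older distributions) are exhausted much faster than with a single server, which is why the single-server cluster starts fine.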
ygouzerh commented 2 years ago

I found out in the GitHub Actions workflows that you bypass this error when running the E2E tests, using:

sudo prlimit --pid $$ --nofile=1048576:1048576
sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360

I added that inside the script executing my k3d create ..., and it works! (It doesn't work if I execute it before running the script.)

I tried multiple times and it still works. Good news!

As before, you can close this ticket whenever you prefer, in case you want to add this to the documentation.
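
Note that sysctl settings applied on the command line are lost on reboot. To make the same workaround permanent, the values can be persisted via sysctl.d (the file name below is illustrative; any *.conf under /etc/sysctl.d works):

```shell
# Persist the inotify limits from the workaround above across reboots.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-k3d-inotify.conf
fs.inotify.max_user_instances = 1280
fs.inotify.max_user_watches = 655360
EOF
sudo sysctl --system   # reload all sysctl configuration files
```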