k3d-io / k3d

Little helper to run CNCF's k3s in Docker
https://k3d.io/
MIT License
5.36k stars 456 forks

[BUG] k3d hangs while adding node to remote cluster #797

Open JohnCalin opened 2 years ago

JohnCalin commented 2 years ago

What did you do

What did you expect to happen

A new agent node would be created on machine 2, joined to mycluster on machine 1

Screenshots or terminal output

INFO[0000] Adding 1 node(s) to the remote cluster 'https://x.x.x.x:yyyy'...
INFO[0001] Starting Node 'k3d-myagent-0'

k3d did not complete or exit. Left it for several hours without change.

Which OS & Architecture

Machine 1: CentOS
Machine 2: Windows with Docker running on WSL

Which version of k3d

Which version of docker

Server: Docker Engine - Community
 Engine:
  Version: 20.10.9
  API version: 1.41 (minimum version 1.12)
  Go version: go1.16.8
  Git commit: 79ea9d3
  Built: Mon Oct 4 16:06:37 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.4.11
  GitCommit: 5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version: 1.0.2
  GitCommit: v1.0.2-0-g52b36a2
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

$ docker info
Client:
 Context: default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.6.3-docker)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 11
 Server Version: 20.10.9
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 5b46e404f6b9f661a205e28d59c982d3634148f8
 runc version: v1.0.2-0-g52b36a2
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.42.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.51GiB
 Name: XXXXXXXX
 ID: IWAK:X5YS:NSUD:R676:IPX6:BZKJ:X4HF:5HIM:EPPN:GTUE:D5CG:Y7UW
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

Machine 2:

H:\>docker version
Client:
 Cloud integration: 1.0.17
 Version: 20.10.8
 API version: 1.41
 Go version: go1.16.6
 Git commit: 3967b7d
 Built: Fri Jul 30 19:58:50 2021
 OS/Arch: windows/amd64
 Context: default
 Experimental: true

Server: Docker Engine - Community
 Engine:
  Version: 20.10.8
  API version: 1.41 (minimum version 1.12)
  Go version: go1.16.6
  Git commit: 75249d8
  Built: Fri Jul 30 19:52:31 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.4.9
  GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version: 1.0.1
  GitCommit: v1.0.1-0-g4144b63
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

H:\>docker info
Client:
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Build with BuildKit (Docker Inc., v0.6.3)
  compose: Docker Compose (Docker Inc., v2.0.0)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.8
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e25210fe30a0a703442421b0f60afac609f950a3
 runc version: v1.0.1-0-g4144b63
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.16.3-microsoft-standard-WSL2
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 9.329GiB
 Name: docker-desktop
 ID: 4W5S:OTDS:E3IP:JRZN:4TFZ:FNCF:J2EF:SU3W:LC7A:OLXD:6N3D:ZVEM
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support

iwilltry42 commented 2 years ago

Hi @JohnCalin , thanks for opening this issue! Sorry, I just found the time to go through the backlog today... Can you please paste the logs with --trace set in the k3d command, so we can debug where it's hanging? Also, please paste the output of docker logs k3d-myagent-0, as I assume it's not starting and that's why k3d is not returning (waiting for the "up and running" or "registered" log line).
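
For reference, that boils down to something like the following (node and cluster address taken from the log above; a remote cluster usually also needs its token passed along):

# re-run the node add with trace-level logging
k3d node create myagent --cluster https://x.x.x.x:yyyy --token <cluster-token> --trace

# then capture the agent container's logs
docker logs k3d-myagent-0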

There is not much testing around adding nodes to a remote cluster, so it's still considered an experimental feature / use-case.

habibutsu commented 1 year ago
k3d version v5.4.9
k3s version v1.25.7-k3s1 (default)

Same issue here, the agent hangs.

Logs from remote agent:

time="2023-03-28T12:17:09Z" level=info msg="Starting k3s agent v1.25.7+k3s1 (f7c20e23)"
time="2023-03-28T12:17:09Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [10.0.20.111:6500]"
time="2023-03-28T12:17:09Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
The connection to the server localhost:8080 was refused - did you specify the right host or port?
time="2023-03-28T12:17:10Z" level=info msg="Module overlay was already loaded"
time="2023-03-28T12:17:10Z" level=info msg="Module nf_conntrack was already loaded"
time="2023-03-28T12:17:10Z" level=info msg="Module br_netfilter was already loaded"
time="2023-03-28T12:17:10Z" level=info msg="Module iptable_nat was already loaded"
time="2023-03-28T12:17:10Z" level=info msg="Module iptable_filter was already loaded"
time="2023-03-28T12:17:10Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400"
time="2023-03-28T12:17:10Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600"
time="2023-03-28T12:17:10Z" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
time="2023-03-28T12:17:10Z" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd"
time="2023-03-28T12:17:11Z" level=info msg="containerd is now running"
time="2023-03-28T12:17:11Z" level=info msg="Getting list of apiserver endpoints from server"
time="2023-03-28T12:17:11Z" level=info msg="Updating load balancer k3s-agent-load-balancer default server address -> 172.23.0.2:6443"
time="2023-03-28T12:17:11Z" level=info msg="Updating load balancer k3s-agent-load-balancer server addresses -> [172.23.0.2:6443]"
time="2023-03-28T12:17:11Z" level=info msg="Connecting to proxy" url="wss://172.23.0.2:6443/v1-k3s/connect"
time="2023-03-28T12:17:11Z" level=info msg="Running kubelet --address=0.0.0.0 --allowed-unsafe-sysctls=net.ipv4.ip_forward,net.ipv6.conf.all.forwarding --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=k3d-bb2-0 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --kubelet-cgroups=/k3s --node-labels= --pod-infra-container-image=rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --runtime-cgroups=/k3s --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key"
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
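
Judging by the CA warning above, the agent may need the full token from the server for Cluster CA validation. A minimal sketch of fetching it, assuming a cluster named 'mycluster' and the standard k3s location:

# read the full node token from the server container
docker exec k3d-mycluster-server-0 cat /var/lib/rancher/k3s/server/node-token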
raminqaf commented 1 year ago

@habibutsu I have the same issue on this version

k3d --version
k3d version v5.4.9
k3s version v1.25.7-k3s1 (default)

Running macOS 13.1 on an M1 Pro.

claywd commented 1 year ago

Can confirm this is still an issue even if you specify the number of agents at create time.

Running in WSL2 Ubuntu 22.04 on Windows 11, with Rancher Desktop running on the Windows side.

[screenshot: cluster status showing servers and agents as ready]

The screenshot above shows the cluster reporting servers and agents as ready, while k3d is still just hanging there.

Log output:

❯ /root/.k1/kubefirst/tools/k3d cluster create kubefirst --image rancher/k3s:v1.26.3-k3s1 --agents 1 --registry-create k3d-kubefirst-registry --volume /root/.k1/kubefirst/minio-storage:/var/lib/rancher/k3s/storage@all --port 443:443@loadbalancer
INFO[0000] portmapping '443:443' targets the loadbalancer: defaulting to [servers:*:proxy agents:*:proxy]
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-kubefirst'
INFO[0000] Created image volume k3d-kubefirst-images
INFO[0000] Creating node 'k3d-kubefirst-registry'
INFO[0000] Successfully created registry 'k3d-kubefirst-registry'
INFO[0000] Starting new tools node...
INFO[0000] Starting Node 'k3d-kubefirst-tools'
INFO[0001] Creating node 'k3d-kubefirst-server-0'
INFO[0001] Creating node 'k3d-kubefirst-agent-0'
INFO[0001] Creating LoadBalancer 'k3d-kubefirst-serverlb'
INFO[0001] Using the k3d-tools node to gather environment information
INFO[0001] HostIP: using network gateway 172.30.0.1 address
INFO[0001] Starting cluster 'kubefirst'
INFO[0001] Starting servers...
INFO[0001] Starting Node 'k3d-kubefirst-server-0'
INFO[0005] Starting agents...
INFO[0005] Starting Node 'k3d-kubefirst-agent-0'
^C
❯ k3d cluster delete kubefirst
ERRO[0000] error getting loadbalancer config from k3d-kubefirst-serverlb: runtime failed to read loadbalancer config '/etc/confd/values.yaml' from node 'k3d-kubefirst-serverlb': Error response from daemon: Could not find the file /etc/confd/values.yaml in container 874d0d1b120593e5e7cfcd852e5536d1230ff50a3b7b4fe7181d02a7db1d304d: file not found
INFO[0000] Deleting cluster 'kubefirst'
INFO[0001] Deleting cluster network 'k3d-kubefirst'
INFO[0001] Deleting 1 attached volumes...
INFO[0001] Removing cluster details from default kubeconfig...
INFO[0001] Removing standalone kubeconfig file (if there is one)...
INFO[0001] Successfully deleted cluster kubefirst!
❯ /root/.k1/kubefirst/tools/k3d cluster create kubefirst --image rancher/k3s:v1.26.3-k3s1 --agents 1 --registry-create k3d-kubefirst-registry --volume /root/.k1/kubefirst/minio-storage:/var/lib/rancher/k3s/storage@all --port 443:443@loadbalancer --verbose
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] Runtime Info:
&{Name:docker Endpoint:/var/run/docker.sock Version:20.10.21 OSType:linux OS:Rancher Desktop WSL Distribution Arch:x86_64 CgroupVersion:1 CgroupDriver:cgroupfs Filesystem:extfs InfoName:LaptopStudio}
DEBU[0000] Additional CLI Configuration:
cli:
  api-port: ""
  env: []
  k3s-node-labels: []
  k3sargs: []
  ports:
  - 443:443@loadbalancer
  registries:
    create: k3d-kubefirst-registry
  runtime-labels: []
  volumes:
  - /root/.k1/kubefirst/minio-storage:/var/lib/rancher/k3s/storage@all
hostaliases: []
DEBU[0000] Configuration:
agents: 1
image: rancher/k3s:v1.26.3-k3s1
network: ""
options:
  k3d:
    disableimagevolume: false
    disableloadbalancer: false
    disablerollback: false
    loadbalancer:
      configoverrides: []
    timeout: 0s
    wait: true
  kubeconfig:
    switchcurrentcontext: true
    updatedefaultkubeconfig: true
  runtime:
    agentsmemory: ""
    gpurequest: ""
    hostpidmode: false
    serversmemory: ""
registries:
  config: ""
  use: []
servers: 1
subnet: ""
token: ""
DEBU[0000] ========== Simple Config ==========
{TypeMeta:{Kind:Simple APIVersion:k3d.io/v1alpha4} ObjectMeta:{Name:} Servers:1 Agents:1 ExposeAPI:{Host: HostIP: HostPort:} Image:rancher/k3s:v1.26.3-k3s1 Network: Subnet: ClusterToken: Volumes:[] Ports:[] Options:{K3dOptions:{Wait:true Timeout:0s DisableLoadbalancer:false DisableImageVolume:false NoRollback:false NodeHookActions:[] Loadbalancer:{ConfigOverrides:[]}} K3sOptions:{ExtraArgs:[] NodeLabels:[]} KubeconfigOptions:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true} Runtime:{GPURequest: ServersMemory: AgentsMemory: HostPidMode:false Labels:[]}} Env:[] Registries:{Use:[] Create:<nil> Config:} HostAliases:[]}
==========================
DEBU[0000] ========== Merged Simple Config ==========
{TypeMeta:{Kind:Simple APIVersion:k3d.io/v1alpha4} ObjectMeta:{Name:} Servers:1 Agents:1 ExposeAPI:{Host: HostIP: HostPort:38873} Image:rancher/k3s:v1.26.3-k3s1 Network: Subnet: ClusterToken: Volumes:[{Volume:/root/.k1/kubefirst/minio-storage:/var/lib/rancher/k3s/storage NodeFilters:[all]}] Ports:[{Port:443:443 NodeFilters:[loadbalancer]}] Options:{K3dOptions:{Wait:true Timeout:0s DisableLoadbalancer:false DisableImageVolume:false NoRollback:false NodeHookActions:[] Loadbalancer:{ConfigOverrides:[]}} K3sOptions:{ExtraArgs:[] NodeLabels:[]} KubeconfigOptions:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true} Runtime:{GPURequest: ServersMemory: AgentsMemory: HostPidMode:false Labels:[]}} Env:[] Registries:{Use:[] Create:0xc00029a000 Config:} HostAliases:[]}
==========================
INFO[0000] portmapping '443:443' targets the loadbalancer: defaulting to [servers:*:proxy agents:*:proxy]
DEBU[0000] generated loadbalancer config:
ports:
  443.tcp:
  - k3d-kubefirst-server-0
  - k3d-kubefirst-agent-0
  6443.tcp:
  - k3d-kubefirst-server-0
settings:
  workerConnections: 1024
DEBU[0000] Port Exposure Mapping didn't specify hostPort, choosing one randomly...
DEBU[0000] Got free port for Port Exposure: '44361'
DEBU[0000] ===== Merged Cluster Config =====
&{TypeMeta:{Kind: APIVersion:} Cluster:{Name:kubefirst Network:{Name:k3d-kubefirst ID: External:false IPAM:{IPPrefix:zero IPPrefix IPsUsed:[] Managed:false} Members:[]} Token: Nodes:[0xc0005a2ea0 0xc0005a3040 0xc0005a31e0] InitNode:<nil> ExternalDatastore:<nil> KubeAPI:0xc0006a1940 ServerLoadBalancer:0xc000226cd0 ImageVolume: Volumes:[]} ClusterCreateOpts:{DisableImageVolume:false WaitForServer:true Timeout:0s DisableLoadBalancer:false GPURequest: ServersMemory: AgentsMemory: NodeHooks:[] GlobalLabels:map[app:k3d] GlobalEnv:[] HostAliases:[] Registries:{Create:0xc0000341e0 Use:[] Config:<nil>}} KubeconfigOpts:{UpdateDefaultKubeconfig:true SwitchCurrentContext:true}}
===== ===== =====
DEBU[0000] '--kubeconfig-update-default set: enabling wait-for-server
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-kubefirst'
INFO[0000] Created image volume k3d-kubefirst-images
INFO[0000] Creating node 'k3d-kubefirst-registry'
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] Created container k3d-kubefirst-registry (ID: 475022ae93fd23e915bd1a6dfa43430ae0d222be955523d78d10af1dafae0a91)
INFO[0000] Successfully created registry 'k3d-kubefirst-registry'
DEBU[0000] no netlabel present on container /k3d-kubefirst-registry
DEBU[0000] failed to get IP for container /k3d-kubefirst-registry as we couldn't find the cluster network
DEBU[0000] no netlabel present on container /k3d-kubefirst-registry
DEBU[0000] failed to get IP for container /k3d-kubefirst-registry as we couldn't find the cluster network
INFO[0000] Starting new tools node...
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock
DEBU[0000] [Docker] DockerHost: '' ()
DEBU[0000] Created container k3d-kubefirst-tools (ID: 38efb0c614af4276f9f94d10fcc8666b042944f1d2e82125fa54dcbd4101169a)
DEBU[0000] Node k3d-kubefirst-tools Start Time: 2023-05-23 21:58:29.528190044 -0500 CDT m=+0.121665357
INFO[0000] Starting Node 'k3d-kubefirst-tools'
DEBU[0000] Truncated 2023-05-24 02:58:29.886200305 +0000 UTC to 2023-05-24 02:58:29 +0000 UTC
INFO[0001] Creating node 'k3d-kubefirst-server-0'
DEBU[0001] DOCKER_SOCK=/var/run/docker.sock
DEBU[0001] Created container k3d-kubefirst-server-0 (ID: fd0b01cbe3da49c68974a19597c1287e92eb893beffbd57d9a5156d2d1b4c036)
DEBU[0001] Created node 'k3d-kubefirst-server-0'
INFO[0001] Creating node 'k3d-kubefirst-agent-0'
DEBU[0001] DOCKER_SOCK=/var/run/docker.sock
DEBU[0001] Created container k3d-kubefirst-agent-0 (ID: 174b7d47cdbb0d9ee0d5f7aa7d67e0f9730e34ad14734320af1be88e113c6aa9)
DEBU[0001] Created node 'k3d-kubefirst-agent-0'
INFO[0001] Creating LoadBalancer 'k3d-kubefirst-serverlb'
DEBU[0001] DOCKER_SOCK=/var/run/docker.sock
DEBU[0001] Created container k3d-kubefirst-serverlb (ID: 83710d2413ac2324e165666c1e80c77de2749b1d6baaa70d73f9f87041c51b70)
DEBU[0001] Created loadbalancer 'k3d-kubefirst-serverlb'
DEBU[0001] DOCKER_SOCK=/var/run/docker.sock
INFO[0001] Using the k3d-tools node to gather environment information
DEBU[0001] no netlabel present on container /k3d-kubefirst-tools
DEBU[0001] failed to get IP for container /k3d-kubefirst-tools as we couldn't find the cluster network
DEBU[0001] Deleting node k3d-kubefirst-tools ...
DEBU[0001] DOCKER_SOCK=/var/run/docker.sock
INFO[0001] HostIP: using network gateway 172.31.0.1 address
INFO[0001] Starting cluster 'kubefirst'
INFO[0001] Starting servers...
DEBU[0001] DOCKER_SOCK=/var/run/docker.sock
DEBU[0001] No fix enabled.
DEBU[0001] Node k3d-kubefirst-server-0 Start Time: 2023-05-23 21:58:31.006247299 -0500 CDT m=+1.599722627
INFO[0001] Starting Node 'k3d-kubefirst-server-0'
DEBU[0002] Truncated 2023-05-24 02:58:31.414452587 +0000 UTC to 2023-05-24 02:58:31 +0000 UTC
DEBU[0002] Waiting for node k3d-kubefirst-server-0 to get ready (Log: 'k3s is up and running')
DEBU[0005] Finished waiting for log message 'k3s is up and running' from node 'k3d-kubefirst-server-0'
INFO[0005] Starting agents...
DEBU[0005] DOCKER_SOCK=/var/run/docker.sock
DEBU[0005] No fix enabled.
DEBU[0005] Node k3d-kubefirst-agent-0 Start Time: 2023-05-23 21:58:34.766487856 -0500 CDT m=+5.359963170
INFO[0005] Starting Node 'k3d-kubefirst-agent-0'
DEBU[0005] Truncated 2023-05-24 02:58:35.207079775 +0000 UTC to 2023-05-24 02:58:35 +0000 UTC
DEBU[0005] Waiting for node k3d-kubefirst-agent-0 to get ready (Log: 'Successfully registered node')
2fxprogeeme commented 9 months ago

Same issue on this version:

k3d --version
k3d version v5.6.0
k3s version v1.27.4-k3s1 (default)

In addition, I use podman 4.6.2 in rootless mode on Linux Mint (Ubuntu). Creating a cluster without specifying --agents works; if I specify --agents, cluster creation hangs after the agents are created. The same happens when adding an agent node to a successfully created cluster.

Is there a chance that this will work in the future?
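
For reference, the k3d docs suggest pointing k3d at the podman socket for rootless setups; a minimal sketch, assuming the user-level podman socket is available:

# enable the rootless podman socket
systemctl --user enable --now podman.socket

# let k3d talk to podman via the Docker-compatible socket
DOCKER_HOST=unix://$XDG_RUNTIME_DIR/podman/podman.sock k3d cluster create mycluster --agents 1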

luxor37 commented 8 months ago

Same issue as @habibutsu on a Mac M1 running macOS 14.2.1:

k3d version v5.6.0
k3s version v1.27.5-k3s1 (default)

noamsan commented 6 months ago

In case this helps anyone: I was using a non-default default-runtime in my /etc/docker/daemon.json; switching back to the default runc (by removing my previously set override) fixed the issue for me.
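
For illustration, this is the kind of override meant here (the runtime name is just a hypothetical example); deleting the key falls back to the default runc:

$ cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia"
}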

amirshabanics commented 5 months ago

I had this issue when I installed k3d via Homebrew. I think it's because the Homebrew-installed k3d didn't have proper access to Docker, so running k3d with sudo worked for me.
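
The usual alternative to running with sudo is to give your user access to the Docker socket (standard Docker setup, nothing k3d-specific):

sudo usermod -aG docker $USER
# log out and back in for the group change to take effect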

ycherrak commented 1 month ago

I had the same issue. It turns out that, in my case, it was related to a systemd bridge network config. I removed the following options from my /etc/systemd/network/10-bridge.network and everything went smoothly after that:

[Match]
Name=veth*

[Network]
IPForward=yes

My guess is that IP forwarding may not have been working as expected. I use both systemd-networkd and NetworkManager.
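
A quick way to check whether forwarding is actually enabled (the same sysctls the kubelet args in the agent log above reference):

sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding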

jpadmin commented 1 month ago

We were using k3d with 1 server and 2 agents in a GCP Debian VM and hit pretty much the same state: startup does not get past 'Starting agents'.

k3d cluster start qa-cluster
INFO[0000] Using the k3d-tools node to gather environment information 
INFO[0000] Starting new tools node...                   
INFO[0000] Starting node 'k3d-qa-cluster-tools' 
INFO[0000] HostIP: using network gateway 172.18.0.1 address 
INFO[0000] Starting cluster 'qa-cluster'      
INFO[0000] Starting servers...                          
INFO[0000] Starting node 'k3d-qa-cluster-server-0' 
INFO[0004] Starting agents...                           
INFO[0004] Starting node 'k3d-qa-cluster-agent-1' 
INFO[0005] Starting node 'k3d-qa-cluster-agent-0' 

The Docker logs are the same as above:

The connection to the server localhost:8080 was refused - did you specify the right host or port?
E0814 12:02:42.253177    7093 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0814 12:02:42.253751    7093 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0814 12:02:42.255270    7093 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0814 12:02:42.255786    7093 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
E0814 12:02:45.344156    7103 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0814 12:02:45.344663    7103 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0814 12:02:45.346155    7103 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E0814 12:02:45.346630    7103 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
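
Those localhost:8080 errors appear to come from kubectl being invoked without a kubeconfig inside the container. To check node registration from the server side instead, something like this (using the standard k3s kubeconfig path):

docker exec k3d-qa-cluster-server-0 kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get nodes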