k3d-io / k3d

Little helper to run CNCF's k3s in Docker
https://k3d.io/
MIT License

[BUG] `host.k3d.internal` breaks on system reboot #1221

Open mdshack opened 1 year ago

mdshack commented 1 year ago

What did you do

What did you expect to happen

Expect host.k3d.internal:5000 to be reachable on machine restart

Screenshots or terminal output


Successful wget

Connecting to host.k3d.internal:5000 (172.20.0.1:5000)
saving to 'index.html'
'index.html' saved

Unsuccessful wget

wget: bad address 'host.k3d.internal:5000'

Which OS & Architecture

arch: x86_64
cgroupdriver: cgroupfs
cgroupversion: "1"
endpoint: /var/run/docker.sock
filesystem: extfs
name: docker
os: Ubuntu 20.04.5 LTS
ostype: linux
version: 20.10.23

Which version of k3d

k3d version v5.4.6
k3s version v1.24.4-k3s1 (default)

Which version of docker

Server: Docker Engine - Community
 Engine:
  Version: 20.10.23
  API version: 1.41 (minimum version 1.12)
  Go version: go1.18.10
  Git commit: 6051f14
  Built: Thu Jan 19 17:34:14 2023
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.6.15
  GitCommit: 5b842e528e99d4d4c1686467debf2bd4b88ecd86
 runc:
  Version: 1.1.4
  GitCommit: v1.1.4-0-g5fd4c4d
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

Client:
 Context: default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.10.0-docker)
  compose: Docker Compose (Docker Inc., v2.15.1)
  scan: Docker Scan (Docker Inc., v0.23.0)

Server:
 Containers: 11
  Running: 6
  Paused: 0
  Stopped: 5
 Images: 56
 Server Version: 20.10.23
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 5b842e528e99d4d4c1686467debf2bd4b88ecd86
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.15.0-60-generic
 Operating System: Ubuntu 20.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.42GiB
 Name: ....
 ID: ...
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: ...
 Registry: ...
 Labels:
 Experimental: false
 Insecure Registries:
  ...:5001
  ....:5002
  localhost:32000
  ....:5003
  ...:5000
  127.0.0.0/8
 Registry Mirrors:
  ...:5001/
 Live Restore Enabled: false

jtele2 commented 1 year ago

I am having the same problem. Was working fine up until about a week ago.

bruciebruce commented 1 year ago

I have also been hit by this issue. The cluster is up on AWS. It builds fine with all pods running, but when the EC2 instance is rebooted most of the pods enter a crash loop.

ubuntu@ip-10-1-1-102:~$ k3d runtime-info
arch: x86_64
cgroupdriver: cgroupfs
cgroupversion: "1"
endpoint: /var/run/docker.sock
filesystem: extfs
name: docker
os: Ubuntu 20.04.5 LTS
ostype: linux
version: 20.10.12

ubuntu@ip-10-1-1-102:~$ k3d version
k3d version v5.4.7
k3s version v1.25.6-k3s1 (default)

ubuntu@ip-10-1-1-102:~$ docker info
Client:
 Context: default
 Debug Mode: false

Server:
 Containers: 5
  Running: 5
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version:
 runc version:
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.15.0-1028-aws
 Operating System: Ubuntu 20.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 61.81GiB
 Name: ip-10-1-1-102
 ID: KB2D:4ZRM:F5IX:G6ZP:JVCV:ORCW:D3FO:GG4J:N5RH:ZJVR:J2QL:TAPO
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false


bruciebruce commented 1 year ago

Trying to use Kind instead.

CormacLennon commented 1 year ago

Has anyone found a workaround for this yet? It's a pain stopping and starting the cluster to get this back. Is there any way we can query the k3d instance to find out what the IP ought to be?

DanHorrocksBurgess commented 1 year ago

> Has anyone found a workaround for this yet? It's a pain stopping and starting the cluster to get this back. Is there any way we can query the k3d instance to find out what the IP ought to be?

Did you happen to find a workaround for this?

We're experiencing the same problem; our developers have to stop and start the cluster every time they reboot their machines to fix DNS resolution.
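In the meantime, the stop/start cycle developers are running by hand can be wrapped in a small helper. This is a sketch: the cluster name is a placeholder you would replace with your own (see `k3d cluster list`), and it just chains the documented `k3d cluster stop` and `k3d cluster start` subcommands.

```shell
# Sketch of the manual workaround: stop and restart a k3d cluster so
# that host.k3d.internal (and registry) entries get re-injected.
# Pass your cluster name as the first argument.
restart_k3d_cluster() {
  k3d cluster stop "$1" && k3d cluster start "$1"
}
```

Usage after a reboot would be e.g. `restart_k3d_cluster mycluster`.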

CormacLennon commented 1 year ago

> Has anyone found a workaround for this yet? It's a pain stopping and starting the cluster to get this back. Is there any way we can query the k3d instance to find out what the IP ought to be?
>
> Did you happen to find a workaround for this?
>
> We're experiencing the same problem; our developers have to stop and start the cluster every time they reboot their machines to fix DNS resolution.

I wrote a PowerShell function for our developers to run that fixes the issue:

function Repair-ClusterCoreDns()
{
    # Look up the current container IPs, which change after a Docker restart
    $servero  = docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' k3d-energy-server-0
    $serverlb = docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' k3d-energy-serverlb
    $registry = docker inspect --format='{{with index .NetworkSettings.Networks \"k3d-energy\"}}{{.IPAddress}}{{end}}' k3d-myregistry.localhost
    $hostK3dInternal = Get-HostK3dInternal

    # Build the NodeHosts value as a YAML literal block (the "|" starts it)
    $ips = "|
    $hostK3dInternal host.k3d.internal
    $servero k3d-energy-server-0
    $serverlb k3d-energy-serverlb
    $registry k3d-myregistry.localhost
"
    $patch = 'data:
  NodeHosts: ' + $ips
    Write-Output "Adding the following entries to the coredns ConfigMap:"
    Write-Output $patch
    kubectl patch configmap/coredns -n kube-system --type merge --patch $patch
}
Set-Alias -Name fixdns -Value Repair-ClusterCoreDns -Force
Export-ModuleMember -Function Repair-ClusterCoreDns -Alias fixdns

function Get-HostK3dInternal()
{
    $hostIp = ""
    # Resolve host.k3d.internal from inside the k3d tools container
    $dnsEntries = docker exec k3d-energy-tools /bin/sh -c "getent ahostsv4 host.k3d.internal"

    foreach ($dnsEntry in $dnsEntries) {
        # getent columns: <address> <socket type> <canonical name>
        $chunks = $dnsEntry.Split(" ") | Where-Object { $_ }

        if ($chunks[2] -eq "host.k3d.internal") {
            $hostIp = $chunks[0]
        }
    }

    if ($hostIp -eq "") {
        Write-Host 'FAILURE: Could not resolve host.k3d.internal. Please ensure the k3d-energy-tools container is running.'
    }
    return $hostIp
}

It's not perfect, but it works. The important line for figuring out what the host.k3d.internal IP should be is

docker exec k3d-energy-tools /bin/sh -c "getent ahostsv4 host.k3d.internal"

which I only figured out by reading the source. As it's undocumented it is liable to change, but it works for now.
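For Linux or macOS users, a rough bash equivalent of that lookup might look like the following. This is a sketch: the container name `k3d-energy-tools` comes from the snippet above and will differ per cluster, and the awk filter simply picks the line whose third `getent` column is the hostname.

```shell
# Bash sketch of Get-HostK3dInternal: ask the k3d tools container to
# resolve host.k3d.internal and extract the IPv4 address.
# getent ahostsv4 lines look like: "<address> <type> <canonical name>"
resolve_host_k3d_internal() {
  docker exec "${1:-k3d-energy-tools}" /bin/sh -c \
    "getent ahostsv4 host.k3d.internal" |
    awk '$3 == "host.k3d.internal" { print $1; exit }'
}
```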

Lanchez commented 11 months ago

This is also a problem with local registries, and it can easily be reproduced. Local registries break when accessed from the cluster after a cluster restart:

  1. Create cluster with a local registry
  2. Check that coredns ConfigMap has proper entries
  3. Restart docker daemon
  4. Local registries break when accessed from the cluster, and the coredns ConfigMap is missing the entries
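To re-apply the missing entries after step 4, a helper like the one below can rebuild the same kind of merge patch that the PowerShell function earlier in the thread constructs. This is a sketch; the IP/host pairs you pass in (and the example values in the usage line) are your own cluster's, obtained e.g. via `docker inspect`.

```shell
# Sketch: build a merge patch restoring NodeHosts entries in the
# coredns ConfigMap. Each argument is one "<ip> <hostname>" pair.
build_nodehosts_patch() {
  # "|" starts a YAML literal block; entries are indented beneath it
  printf 'data:\n  NodeHosts: |\n'
  for entry in "$@"; do
    printf '    %s\n' "$entry"
  done
}
```

Usage would be something like `kubectl patch configmap/coredns -n kube-system --type merge --patch "$(build_nodehosts_patch '172.20.0.1 host.k3d.internal')"`.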