[FEATURE] k3d IPAM (to prevent etcd failures) #550

mindkeep commented 3 years ago

Feature Request

Original Bug Report

What did you do

I was toying with system resiliency around restarting servers and agents, and found that the Init server didn't come back after the following sequence:

% cat break-k3d.sh
set -x
k3d cluster create --servers 3 --agents 3 test-k3d
docker stop k3d-test-k3d-server-0
docker stop k3d-test-k3d-agent-1
docker start k3d-test-k3d-agent-1
docker start k3d-test-k3d-server-0
sleep 120
docker logs k3d-test-k3d-server-0

where we wind up with this message repeated:

time="2021-04-06T21:20:30.180484636Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3d-test-k3d-server-1-78ae1551= k3d-test-k3d-server-0-3e94d5e1= k3d-test-k3d-server-2-31d30714=], expect: k3d-test-k3d-server-0-3e94d5e1="

As it turns out, the IP of k3d-test-k3d-server-0 changes across the restart (presumably swapping with k3d-test-k3d-agent-1). Is there any way to lock this down a bit so docker doesn't accidentally flip things around?
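One way to confirm such a swap is to snapshot the node addresses before and after the restart sequence and compare them. The sketch below is not part of k3d. The helper names `snapshot_ips` and `changed_ips` are made up for illustration, and the network name `k3d-test-k3d` is an assumption based on the `k3d-<cluster>` naming seen in k3d's own log output.

```shell
# Print "<container-name> <ip/prefix>" for every running container on a Docker network.
snapshot_ips() {
  docker network inspect "$1" \
    --format '{{range .Containers}}{{println .Name .IPv4Address}}{{end}}' | sort
}

# Given two snapshot files, print every node whose address differs between them.
changed_ips() {
  awk 'NR == FNR { before[$1] = $2; next }
       ($1 in before) && before[$1] != $2 { print $1 ": " before[$1] " -> " $2 }' "$1" "$2"
}

# Usage (network name is an assumption):
#   snapshot_ips k3d-test-k3d > before.txt
#   ... run the stop/start sequence from break-k3d.sh ...
#   snapshot_ips k3d-test-k3d > after.txt
#   changed_ips before.txt after.txt
```

Only running containers show up in `docker network inspect`, so both snapshots should be taken while all nodes are up.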

What did you expect to happen

I expected the IPs to be maintained...

Which OS & Architecture

Which version of k3d

Which version of docker

Server:
 Engine:
  Version:          19.03.15
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.15.7
  Git commit:       420b1d3625
  Built:            Thu Feb 11 18:14:30 2021
  OS/Arch:          linux/amd64

% docker info
Client:
 Debug Mode: false

Server:
 Containers: 8
  Running: 7
  Paused: 0
  Stopped: 1
 Images: 3
 Server Version: 19.03.15
 Storage Driver: overlay2
 Kernel Version: 5.4.97-gentoo
 Operating System: Gentoo/Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.62GiB

iwilltry42 commented 3 years ago

Hi @mindkeep , thanks for opening this issue! You're right there, we saw this issue already in the context of #262 (https://github.com/rancher/k3d/issues/262#issuecomment-810105498). I'll investigate a bit on what we can do there :+1: (FWIW: old related moby/docker issue: https://github.com/moby/moby/issues/2801)

iwilltry42 commented 3 years ago


If there's no flag for k3s/etcd that we could use to work around this, we'll have to check how feasible it is to manage static IPs for k3d nodes (i.e. similar to using the --ip flag in docker).
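For reference, static addressing with plain Docker looks roughly like the sketch below. Docker only accepts `--ip` on a user-defined network that was created with an explicit subnet. The helper names, the `/24` layout, and the "+2" offset are illustrative assumptions, not anything k3d does.

```shell
# Build the address for node number $2 inside the /24 prefix $1 (e.g. "172.28.0").
# Offset by 2 because .0 is the network address and .1 is typically the bridge gateway.
node_ip() {
  echo "$1.$(( $2 + 2 ))"
}

# Create the network with an explicit subnet if needed, then start a container
# with a fixed address. All arguments are placeholders chosen by the caller.
run_with_static_ip() {
  net=$1
  prefix=$2
  index=$3
  name=$4
  image=$5
  docker network inspect "$net" >/dev/null 2>&1 ||
    docker network create --subnet "$prefix.0/24" "$net" >/dev/null
  docker run -d --name "$name" --network "$net" \
    --ip "$(node_ip "$prefix" "$index")" "$image"
}

# Usage (hypothetical names):
#   run_with_static_ip demo-net 172.28.0 0 demo-node-0 busybox
```

A container started this way keeps its address across stop/start, which is the property the etcd members need.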

mindkeep commented 3 years ago

Worth noting, if I shut down both containers and restart k3d-test-k3d-server-0 first, we get the original IP ordering and everything comes back up.
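That workaround can be scripted. The sketch below assumes that Docker hands out addresses in start order, which matches the observation above but is not guaranteed. `start_order` and `restart_in_order` are hypothetical helper names, and the `name=` filter is a substring match, so it may pick up containers from similarly named clusters.

```shell
# Read node container names on stdin and print them with servers first (by index),
# then agents (by index), then everything else (e.g. the load balancer).
start_order() {
  awk -F- '
    /-server-[0-9]+$/ { print 0, $NF, $0; next }
    /-agent-[0-9]+$/  { print 1, $NF, $0; next }
    NF                { print 2, 0, $0 }' |
    sort -k1,1n -k2,2n | cut -d" " -f3-
}

# Stop every node of cluster $1, then start them again in the order above.
restart_in_order() {
  nodes=$(docker ps -a --filter "name=k3d-$1-" --format '{{.Names}}' | start_order)
  for n in $nodes; do docker stop "$n" >/dev/null; done
  for n in $nodes; do docker start "$n" >/dev/null; done
}

# Usage:
#   restart_in_order test-k3d
```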

So it's purely a docker IP ordering problem.

Given that we already spin up a new docker network, could we explore assigning more static IPs within that network?

iwilltry42 commented 3 years ago

> So it's purely a docker IP ordering problem.


> could we explore assigning more static IPs

Implementing some IPAM in k3d would be the only possible solution, yep :thinking:

Let's turn this into a feature request :+1:

iwilltry42 commented 3 years ago

Hi @mindkeep, while I'm still trying to figure out the best way to cope with this, I already did some work on this. Please give https://github.com/rancher/k3d/releases/tag/v4.5.0-dev.0 a try and start your cluster with --subnet auto or --subnet (or whatever subnet it should use). With --subnet auto, k3d will create a fake docker network to get a subnet auto-assigned by docker that it can use. Also please make use of the --trace flag for verbose logs. Awaiting your feedback :)
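The "fake network" idea can be illustrated with plain Docker commands. This is only a sketch of the concept, not k3d's actual implementation; `probe_free_subnet` and the temporary network name are made up for this example.

```shell
# Ask Docker for a free subnet by creating a throwaway network,
# reading the subnet Docker picked for it, and removing the network again.
probe_free_subnet() {
  tmp_net="subnet-probe-$$"
  docker network create "$tmp_net" >/dev/null || return 1
  docker network inspect "$tmp_net" \
    --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
  docker network rm "$tmp_net" >/dev/null
}

# Usage:
#   subnet=$(probe_free_subnet)
#   echo "Docker would currently hand out: $subnet"
```

Nothing reserves the subnet after the probe network is removed, so another network could claim it before it is reused. With `--subnet auto`, k3d takes care of this step itself.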

bukowa commented 3 years ago


First try

$ k3d2 cluster create test25 --servers=3 --subnet=auto
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-test25' (3ad95731bfce7029df8a7f7540daaa02c2fb305a10e24d531e0ed6e6756d08b2)
INFO[0000] Created volume 'k3d-test25-images'
INFO[0000] Creating initializing server node
INFO[0000] Creating node 'k3d-test25-server-0'
INFO[0002] Pulling image 'docker.io/rancher/k3s:v1.20.5-k3s1'
INFO[0011] Creating node 'k3d-test25-server-1'
INFO[0012] Creating node 'k3d-test25-server-2'
INFO[0012] Creating LoadBalancer 'k3d-test25-serverlb'
INFO[0014] Pulling image 'docker.io/rancher/k3d-proxy:v4.5.0-dev.0'
INFO[0019] Starting cluster 'test25'
INFO[0019] Starting the initializing server...
INFO[0019] Starting Node 'k3d-test25-server-0'
INFO[0020] Starting servers...
INFO[0020] Starting Node 'k3d-test25-server-1'
ERRO[0020] Failed to start node 'k3d-test25-server-1'
ERRO[0020] Failed Cluster Start: Failed to start server k3d-test25-server-1: Error response from daemon: Address already in use
ERRO[0020] Failed to create cluster >>> Rolling Back
INFO[0020] Deleting cluster 'test25'
INFO[0020] Deleted k3d-test25-server-0
INFO[0020] Deleted k3d-test25-server-1
INFO[0020] Deleted k3d-test25-server-2
INFO[0020] Deleted k3d-test25-serverlb
INFO[0020] Deleting cluster network 'k3d-test25'
INFO[0021] Deleting image volume 'k3d-test25-images'
FATA[0021] Cluster creation FAILED, all changes have been rolled back!
$ k3d2 cluster create test25 --servers=3 --subnet=
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-test25' (f7900949f4b380af681d7a9f1b39992b971cbbffc94bd57ee254696dc9812fbb)
INFO[0000] Created volume 'k3d-test25-images'
INFO[0000] Creating initializing server node
INFO[0000] Creating node 'k3d-test25-server-0'
INFO[0001] Creating node 'k3d-test25-server-1'
INFO[0002] Creating node 'k3d-test25-server-2'
INFO[0002] Creating LoadBalancer 'k3d-test25-serverlb'
INFO[0002] Starting cluster 'test25'
INFO[0002] Starting the initializing server...
INFO[0002] Starting Node 'k3d-test25-server-0'
INFO[0003] Starting servers...
INFO[0003] Starting Node 'k3d-test25-server-1'
ERRO[0003] Failed to start node 'k3d-test25-server-1'
ERRO[0003] Failed Cluster Start: Failed to start server k3d-test25-server-1: Error response from daemon: Address already in use
ERRO[0003] Failed to create cluster >>> Rolling Back
INFO[0003] Deleting cluster 'test25'
INFO[0003] Deleted k3d-test25-server-0
INFO[0003] Deleted k3d-test25-server-1
INFO[0003] Deleted k3d-test25-server-2
INFO[0003] Deleted k3d-test25-serverlb
INFO[0003] Deleting cluster network 'k3d-test25'
INFO[0003] Deleting image volume 'k3d-test25-images'
FATA[0003] Cluster creation FAILED, all changes have been rolled back!
iwilltry42 commented 3 years ago

@bukowa ... well.. crap :grin: Forgot to add a case before building. :roll_eyes:

bukowa commented 3 years ago


iwilltry42 commented 3 years ago

@bukowa can you inspect those node containers? And can you check if there's an overlapping docker network (docker network ls & docker network inspect)?
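Checking for overlapping networks by hand gets tedious with several clusters, so here is a minimal sketch that automates it. The helper names `list_subnets` and `overlapping_subnets` are made up for illustration, and only IPv4 subnets are considered.

```shell
# Print "<network-name> <subnet>" for every Docker network that has an IPAM subnet.
list_subnets() {
  docker network ls -q | while read -r id; do
    docker network inspect "$id" \
      --format '{{$n := .Name}}{{range .IPAM.Config}}{{println $n .Subnet}}{{end}}'
  done
}

# Read "<name> <cidr>" lines on stdin and report every pair of overlapping IPv4 ranges.
overlapping_subnets() {
  awk '
    function ip2int(ip,   o) {
      split(ip, o, ".")
      return ((o[1] * 256 + o[2]) * 256 + o[3]) * 256 + o[4]
    }
    NF == 2 && $2 ~ /^[0-9.]+\/[0-9]+$/ {
      split($2, c, "/")
      size = 2 ^ (32 - c[2])                 # number of addresses in the block
      base = ip2int(c[1]); base -= base % size
      n++; name[n] = $1; lo[n] = base; hi[n] = base + size - 1
    }
    END {
      for (i = 1; i <= n; i++)
        for (j = i + 1; j <= n; j++)
          if (lo[i] <= hi[j] && lo[j] <= hi[i]) print name[i] " overlaps " name[j]
    }'
}

# Usage:
#   list_subnets | overlapping_subnets
```

No output means no two networks share any part of their address range.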

bukowa commented 3 years ago

@iwilltry42 no overlapping networks, and all nodes started just now after another docker restart. Strange :) Before that, they were stuck in an exit loop.

iwilltry42 commented 3 years ago

@bukowa should be fixed here: https://github.com/rancher/k3d/releases/tag/v4.5.0-dev.1

bukowa commented 3 years ago

@iwilltry42 I made a small GitHub Action that may be useful for testing this: https://github.com/bukowa/k3dtest/runs/2352095712?check_suite_focus=true

iwilltry42 commented 3 years ago

Thanks for that @bukowa , I forked it and added kubectl to check that output. It seems to work, right? E.g. https://github.com/iwilltry42/k3dtest/runs/2352325264

bukowa commented 3 years ago

@iwilltry42 seems to work but not on my machine ^^

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   25 minutes ago   Restarting (1) 39 seconds ago                                                                                 k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   25 minutes ago   Exited (255) 23 minutes ago                                                                                   k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   25 minutes ago   Up 21 minutes                                                                                                 k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   25 minutes ago   Exited (255) 23 minutes ago                                                                                   k3d-test55-server-0

server0.txt server1.txt server2.txt serverlb.txt

bukowa commented 3 years ago

@iwilltry42 I think I've got something, because I still have old clusters running:

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   33 minutes ago   Up 20 seconds                  80/tcp,>6443/tcp                                               k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   33 minutes ago   Up 21 seconds                                                                                                k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   33 minutes ago   Up 20 seconds                                                                                                k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   33 minutes ago   Up 20 seconds                                                                                                k3d-test55-server-0
4e8e8de805e9   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   2 weeks ago      Exited (255) 23 seconds ago>32080/tcp,>32443/tcp,>6443/tcp   k3d-wpenv-server-0
afee5e598565   rancher/k3d-proxy:v4.3.0         "/bin/sh -c nginx-pr…"   4 weeks ago      Restarting (1) 2 seconds ago                                                                                 k3d-k3s-default-serverlb
0bdf54e2bf41   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   4 weeks ago      Exited (255) 23 seconds ago                                                                                  k3d-k3s-default-server-0

Now, each time I restart docker, it looks like the set of running nodes shifts. Take a look:

$ docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
b436e5a0149c   bridge            bridge    local
e05bb6c1420e   host              host      local
56bb9cb725d6   k3d-k3s-default   bridge    local
ed22e183cae8   k3d-test55        bridge    local
5312bd5f60bd   k3d-wpenv         bridge    local
2208d8a45078   none              null      local
iwilltry42 commented 3 years ago

@bukowa , I am like absolutely lost there :thinking: The network settings look fine to me and I cannot think of any way that the clusters could interfere.. :thinking:

bukowa commented 3 years ago

@iwilltry42 ok, maybe this finding can become useful when someone encounters a similar issue

renepardon commented 3 years ago

I tried to do k3d cluster stop dev ; k3d cluster start dev after I received the error from issue #262

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes)

Now it hangs here:

INFO[0000] Stopping cluster 'dev'                       
INFO[0000] Starting cluster 'dev'                       
INFO[0000] Starting the initializing server...          
INFO[0000] Starting Node 'k3d-dev-server-0'             
INFO[0001] Starting servers...                          
INFO[0001] Starting Node 'k3d-dev-server-1'             

The cluster was initially created with:

k3d registry create registry.localhost --port 5000
k3d cluster create dev \
    --k3s-server-arg "--no-deploy=traefik" \
    --registry-use k3d-registry.localhost:5000 \
    --port 80:80@loadbalancer \
    --port 443:443@loadbalancer \
    --api-port 6443 --servers 3 --agents 3
iwilltry42 commented 3 years ago

Hi @renepardon , can you please share the logs of the k3d-dev-server-0 and k3d-dev-server-1 containers? Also, which version of k3d and k3s are you using and what is your environment?

renepardon commented 3 years ago


k3d version v4.4.3
k3s version v1.20.6-k3s1 (default)

But I can't find the log files. Neither in my home directory, nor in /var/log

iwilltry42 commented 3 years ago

@renepardon , if you were to search for actual log files, they'd be in /var/lib/docker/containers/<ID>/<ID>.log. But you can just do e.g. docker logs k3d-dev-server-0 to get them :+1: Also, please try k3d v4.4.4 and paste your kernel version (uname -r).
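To see quickly which nodes are affected, the logs can be scanned for the etcd membership error quoted earlier in this thread. The helper name `etcd_membership_errors` is made up for this sketch. The node names in the usage comment follow the `k3d-<cluster>-server-<n>` pattern seen above and assume a three-server cluster named `dev`.

```shell
# Count how often a node's log contains the etcd membership error.
# docker logs writes container output to both stdout and stderr, hence 2>&1.
etcd_membership_errors() {
  docker logs "$1" 2>&1 | grep -c 'not a member of the etcd cluster'
}

# Usage:
#   for n in k3d-dev-server-0 k3d-dev-server-1 k3d-dev-server-2; do
#     echo "$n: $(etcd_membership_errors "$n")"
#   done
```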

renepardon commented 3 years ago


The k3d-dev-server-0 log repeats endlessly like this:

time="2021-06-07T06:42:26.583579237Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3d-dev-server-1-b8022d18= k3d-dev-server-2-bb40fdf0= k3d-dev-server-0-f3a8f84f=], expect: k3d-dev-server-0-f3a8f84f="
time="2021-06-07T06:42:26.881852734Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622098330: notBefore=2021-05-27 06:52:10 +0000 UTC notAfter=2022-06-07 06:42:26 +0000 UTC"
time="2021-06-07T06:42:26.882548039Z" level=info msg="Cluster-Http-Server 2021/06/07 06:42:26 http: TLS handshake error from remote error: tls: bad certificate"
time="2021-06-07T06:42:26.883032016Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622098330: notBefore=2021-05-27 06:52:10 +0000 UTC notAfter=2022-06-07 06:42:26 +0000 UTC"
time="2021-06-07T06:42:27.350483832Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622098330: notBefore=2021-05-27 06:52:10 +0000 UTC notAfter=2022-06-07 06:42:27 +0000 UTC"
time="2021-06-07T06:42:27.351461555Z" level=info msg="Cluster-Http-Server 2021/06/07 06:42:27 http: TLS handshake error from remote error: tls: bad certificate"

and k3d-dev-server-1:

iwilltry42 commented 3 years ago

@renepardon , I'll follow up on the similar issue in https://github.com/rancher/k3d/issues/619 :+1:

moio commented 1 year ago

@iwilltry42 I can still reproduce that IPs are occasionally shuffled on restart on k3d 5.4.9 - k3s v1.24.12+k3s1 (especially after a host reboot).

Does it help if I open a new issue?

p-se commented 2 months ago

> @iwilltry42 I can still reproduce that IPs are occasionally shuffled on restart on k3d 5.4.9 - k3s v1.24.12+k3s1 (especially after a host reboot).
>
> Does it help if I open a new issue?

I'm experiencing the same issue. After rebooting the host, the cluster becomes irrecoverably lost. k3d v5.6.3 - k3s v1.28.8-k3s1.