k3d-io / k3d

Little helper to run CNCF's k3s in Docker
https://k3d.io/
MIT License
5.35k stars · 456 forks

[FEATURE] k3d IPAM (to prevent etcd failures) #550

Closed mindkeep closed 3 years ago

mindkeep commented 3 years ago

Feature Request


Original Bug Report

What did you do

I was toying with system resiliency around restarting servers and agents, and found that the Init server didn't come back after the following sequence:

% cat break-k3d.sh
#!/bin/bash
set -x
k3d cluster create --servers 3 --agents 3 test-k3d
docker stop k3d-test-k3d-server-0
docker stop k3d-test-k3d-agent-1
docker start k3d-test-k3d-agent-1
docker start k3d-test-k3d-server-0
sleep 120
docker logs k3d-test-k3d-server-0
%

where we wind up with this message repeated:

time="2021-04-06T21:20:30.180484636Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3d-test-k3d-server-1-78ae1551=https://172.20.0.3:2380 k3d-test-k3d-server-0-3e94d5e1=https://172.20.0.2:2380 k3d-test-k3d-server-2-31d30714=https://172.20.0.4:2380], expect: k3d-test-k3d-server-0-3e94d5e1=172.20.0.6"

As it turns out the IP for k3d-test-k3d-server-0 moves from 172.20.0.2 to 172.20.0.6 (presumably swapping with k3d-test-k3d-agent-1). Is there any way to lock this down a bit so docker doesn't accidentally flip things around?
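The failure mode in that log line can be sketched in a few lines: etcd keeps a peer URL on record for each member, and k3s refuses to start when the node's current address no longer matches it. The helper below is purely illustrative (it is not k3s code); the names and IPs are taken from the log message above.

```python
def is_member_mismatch(found: dict, expect_name: str, expect_peer: str) -> bool:
    """True when the peer URL etcd has on record for this node differs
    from the address the node is now advertising."""
    return found.get(expect_name) != expect_peer

# Peer URLs recorded by etcd before the restart...
found = {
    "k3d-test-k3d-server-0-3e94d5e1": "https://172.20.0.2:2380",
    "k3d-test-k3d-server-1-78ae1551": "https://172.20.0.3:2380",
    "k3d-test-k3d-server-2-31d30714": "https://172.20.0.4:2380",
}
# ...but after the restart, server-0 came back on 172.20.0.6:
print(is_member_mismatch(found, "k3d-test-k3d-server-0-3e94d5e1",
                         "https://172.20.0.6:2380"))  # → True: startup fails
```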

What did you expect to happen

I expected the IPs to be maintained...

Which OS & Architecture

Which version of k3d

Which version of docker

Server:
 Engine:
  Version: 19.03.15
  API version: 1.40 (minimum version 1.12)
  Go version: go1.15.7
  Git commit: 420b1d3625
  Built: Thu Feb 11 18:14:30 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.3.9
  GitCommit: ea765ab
 runc:
  Version: 1.0.0-rc10
  GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683b971d9c3ef73f284f176672c44b448662

% docker info
Client:
 Debug Mode: false

Server:
 Containers: 8
  Running: 7
  Paused: 0
  Stopped: 1
 Images: 3
 Server Version: 19.03.15
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ea765ab
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683b971d9c3ef73f284f176672c44b448662
 Security Options: seccomp
  Profile: default
 Kernel Version: 5.4.97-gentoo
 Operating System: Gentoo/Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 15.62GiB
 Name: crossbow
 ID: BXTV:WYUT:UXM3:IW4V:4AUA:JU6Q:ZYSE:APB4:JWOZ:UTUS:ZKRK:SEDG
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries: 127.0.0.0/8
 Live Restore Enabled: false

iwilltry42 commented 3 years ago

Hi @mindkeep , thanks for opening this issue! You're right there, we saw this issue already in the context of #262 (https://github.com/rancher/k3d/issues/262#issuecomment-810105498). I'll investigate a bit on what we can do there :+1: (FWIW: old related moby/docker issue: https://github.com/moby/moby/issues/2801)

iwilltry42 commented 3 years ago


If there's no flag for k3s/etcd that we could use to work around this, we'll have to check how feasible it is to manage static IPs for k3d nodes (i.e. similar to using the --ip flag in docker).

mindkeep commented 3 years ago

Worth noting, if I shut down both containers and restart k3d-test-k3d-server-0 first, we get the original IP ordering and everything comes back up.

So it's purely a docker IP ordering problem.

Given that we already spin up a new docker network, could we explore assigning more static IPs within that network?

iwilltry42 commented 3 years ago

So it's purely a docker IP ordering problem.

Yep..

could we explore assigning more static IPs

Implementing some IPAM in k3d would be the only possible solution, yep :thinking:

Let's turn this into a feature request :+1:
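The feature request boils down to k3d assigning each node a fixed IP instead of relying on docker's start-order allocation. A minimal sketch of deterministic per-node assignment (illustrative only, not the actual k3d implementation; node names are made up):

```python
import ipaddress

def assign_node_ips(subnet: str, nodes: list) -> dict:
    """Deterministically map each node name to a fixed IP in the subnet,
    skipping the first host address, which docker uses for the gateway."""
    hosts = ipaddress.ip_network(subnet).hosts()
    next(hosts)  # reserve the gateway (e.g. 172.45.0.1)
    return {node: str(next(hosts)) for node in nodes}

ips = assign_node_ips("172.45.0.0/16", ["server-0", "server-1", "server-2"])
print(ips["server-0"])  # → 172.45.0.2, regardless of container start order
```

Passing such a fixed address via docker's `--ip` option (which requires a user-defined network with a known subnet) is what keeps etcd's recorded peer URLs valid across restarts.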

iwilltry42 commented 3 years ago

Hi @mindkeep, while I'm still trying to figure out the best way to cope with this, I already did some work on this. Please give https://github.com/rancher/k3d/releases/tag/v4.5.0-dev.0 a try and start your cluster with --subnet auto or --subnet 172.45.0.0/16 (or whatever subnet it should use). With --subnet auto, k3d will create a fake docker network to get a subnet auto-assigned by docker that it can use. Also please make use of the --trace flag for verbose logs. Awaiting your feedback :)
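With `--subnet auto`, k3d lets docker pick a free subnet by creating a throwaway network. The same idea can be sketched directly: choose the first candidate /16 that does not overlap any subnet an existing docker network already uses. This is only an illustration of the selection logic, not k3d's code; the candidate range is an assumption.

```python
import ipaddress

def pick_free_subnet(used: list) -> str:
    """Return the first 172.x.0.0/16 block that does not overlap any
    subnet already claimed by an existing docker network."""
    used_nets = [ipaddress.ip_network(s) for s in used]
    for i in range(18, 32):  # assumed candidate range, like docker's bridge pools
        cand = ipaddress.ip_network(f"172.{i}.0.0/16")
        if not any(cand.overlaps(u) for u in used_nets):
            return str(cand)
    raise RuntimeError("no free subnet found")

# e.g. with the three k3d networks that appear later in this thread:
print(pick_free_subnet(["172.18.0.0/16", "172.19.0.0/16", "172.20.0.0/16"]))
# → 172.21.0.0/16
```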

bukowa commented 3 years ago

@iwilltry42

First try

$ k3d2 cluster create test25 --servers=3 --subnet=auto
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-test25' (3ad95731bfce7029df8a7f7540daaa02c2fb305a10e24d531e0ed6e6756d08b2)
INFO[0000] Created volume 'k3d-test25-images'
INFO[0000] Creating initializing server node
INFO[0000] Creating node 'k3d-test25-server-0'
INFO[0002] Pulling image 'docker.io/rancher/k3s:v1.20.5-k3s1'
INFO[0011] Creating node 'k3d-test25-server-1'
INFO[0012] Creating node 'k3d-test25-server-2'
INFO[0012] Creating LoadBalancer 'k3d-test25-serverlb'
INFO[0014] Pulling image 'docker.io/rancher/k3d-proxy:v4.5.0-dev.0'
INFO[0019] Starting cluster 'test25'
INFO[0019] Starting the initializing server...
INFO[0019] Starting Node 'k3d-test25-server-0'
INFO[0020] Starting servers...
INFO[0020] Starting Node 'k3d-test25-server-1'
ERRO[0020] Failed to start node 'k3d-test25-server-1'
ERRO[0020] Failed Cluster Start: Failed to start server k3d-test25-server-1: Error response from daemon: Address already in use
ERRO[0020] Failed to create cluster >>> Rolling Back
INFO[0020] Deleting cluster 'test25'
INFO[0020] Deleted k3d-test25-server-0
INFO[0020] Deleted k3d-test25-server-1
INFO[0020] Deleted k3d-test25-server-2
INFO[0020] Deleted k3d-test25-serverlb
INFO[0020] Deleting cluster network 'k3d-test25'
INFO[0021] Deleting image volume 'k3d-test25-images'
FATA[0021] Cluster creation FAILED, all changes have been rolled back!
$ k3d2 cluster create test25 --servers=3 --subnet=172.45.0.0/16
INFO[0000] Prep: Network
INFO[0000] Created network 'k3d-test25' (f7900949f4b380af681d7a9f1b39992b971cbbffc94bd57ee254696dc9812fbb)
INFO[0000] Created volume 'k3d-test25-images'
INFO[0000] Creating initializing server node
INFO[0000] Creating node 'k3d-test25-server-0'
INFO[0001] Creating node 'k3d-test25-server-1'
INFO[0002] Creating node 'k3d-test25-server-2'
INFO[0002] Creating LoadBalancer 'k3d-test25-serverlb'
INFO[0002] Starting cluster 'test25'
INFO[0002] Starting the initializing server...
INFO[0002] Starting Node 'k3d-test25-server-0'
INFO[0003] Starting servers...
INFO[0003] Starting Node 'k3d-test25-server-1'
ERRO[0003] Failed to start node 'k3d-test25-server-1'
ERRO[0003] Failed Cluster Start: Failed to start server k3d-test25-server-1: Error response from daemon: Address already in use
ERRO[0003] Failed to create cluster >>> Rolling Back
INFO[0003] Deleting cluster 'test25'
INFO[0003] Deleted k3d-test25-server-0
INFO[0003] Deleted k3d-test25-server-1
INFO[0003] Deleted k3d-test25-server-2
INFO[0003] Deleted k3d-test25-serverlb
INFO[0003] Deleting cluster network 'k3d-test25'
INFO[0003] Deleting image volume 'k3d-test25-images'
FATA[0003] Cluster creation FAILED, all changes have been rolled back!
iwilltry42 commented 3 years ago

@bukowa ... well.. crap :grin: Forgot to add a case before building. :roll_eyes:

bukowa commented 3 years ago

@iwilltry42

After restarting docker and deleting that cluster, my old clusters with 1 server won't boot up :(

E0414 11:40:13.438977 7 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: dial tcp 10.43.239.9:443: connect: no route to host, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0414 11:40:13.439010 7 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
E0414 11:40:13.440809 7 available_controller.go:508] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
I0414 11:40:14.147032 7 range_allocator.go:82] Sending events to api server.
I0414 11:40:14.147149 7 range_allocator.go:110] No Service CIDR provided. Skipping filtering out service addresses.
I0414 11:40:14.147175 7 range_allocator.go:116] No Secondary Service CIDR provided. Skipping filtering out secondary service addresses.
I0414 11:40:14.147195 7 controllermanager.go:554] Started "nodeipam"
I0414 11:40:14.147300 7 node_ipam_controller.go:159] Starting ipam controller
I0414 11:40:14.147315 7 shared_informer.go:240] Waiting for caches to sync for node
I0414 11:40:14.149398 7 controllermanager.go:554] Started "persistentvolume-expander"
I0414 11:40:14.149475 7 expand_controller.go:310] Starting expand controller
I0414 11:40:14.149482 7 shared_informer.go:240] Waiting for caches to sync for expand
I0414 11:40:14.151273 7 controllermanager.go:554] Started "clusterrole-aggregation"
I0414 11:40:14.151297 7 clusterroleaggregation_controller.go:149] Starting ClusterRoleAggregator
I0414 11:40:14.151304 7 shared_informer.go:240] Waiting for caches to sync for ClusterRoleAggregator
I0414 11:40:14.152995 7 controllermanager.go:554] Started "job"
I0414 11:40:14.153102 7 job_controller.go:148] Starting job controller
I0414 11:40:14.153113 7 shared_informer.go:240] Waiting for caches to sync for job
I0414 11:40:14.154074 7 shared_informer.go:240] Waiting for caches to sync for resource quota
W0414 11:40:14.156670 7 actual_state_of_world.go:534] Failed to update statusUpdateNeeded field in actual state of world: Failed to set statusUpdateNeeded to needed true, because nodeName="k3d-k3s-default-server-0" does not exist
I0414 11:40:14.158994 7 shared_informer.go:247] Caches are synced for persistent volume
E0414 11:40:14.160100 7 memcache.go:196] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
I0414 11:40:14.161068 7 shared_informer.go:247] Caches are synced for attach detach
I0414 11:40:14.165053 7 shared_informer.go:247] Caches are synced for TTL
I0414 11:40:14.166524 7 shared_informer.go:247] Caches are synced for PVC protection
I0414 11:40:14.169107 7 shared_informer.go:247] Caches are synced for endpoint
I0414 11:40:14.181276 7 shared_informer.go:247] Caches are synced for HPA
I0414 11:40:14.187013 7 shared_informer.go:247] Caches are synced for PV protection
I0414 11:40:14.187210 7 event.go:291] "Event occurred" object="lens-metrics/kube-state-metrics" kind="Endpoints" apiVersion="v1" type="Warning" reason="FailedToUpdateEndpoint" message="Failed to update endpoint lens-metrics/kube-state-metrics: Operation cannot be fulfilled on endpoints \"kube-state-metrics\": the object has been modified; please apply your changes to the latest version and try again"
I0414 11:40:14.195398 7 shared_informer.go:247] Caches are synced for ReplicationController
I0414 11:40:14.198810 7 shared_informer.go:247] Caches are synced for GC
I0414 11:40:14.218147 7 shared_informer.go:247] Caches are synced for namespace
I0414 11:40:14.218332 7 shared_informer.go:247] Caches are synced for endpoint_slice_mirroring
I0414 11:40:14.221166 7 shared_informer.go:247] Caches are synced for disruption
I0414 11:40:14.221254 7 disruption.go:339] Sending events to api server.
I0414 11:40:14.224425 7 shared_informer.go:247] Caches are synced for certificate-csrsigning-kubelet-serving
I0414 11:40:14.224557 7 shared_informer.go:247] Caches are synced for certificate-csrsigning-kubelet-client
I0414 11:40:14.224609 7 shared_informer.go:247] Caches are synced for certificate-csrsigning-legacy-unknown
I0414 11:40:14.224726 7 shared_informer.go:247] Caches are synced for certificate-csrsigning-kube-apiserver-client
I0414 11:40:14.226641 7 shared_informer.go:247] Caches are synced for endpoint_slice
I0414 11:40:14.228199 7 shared_informer.go:247] Caches are synced for service account
I0414 11:40:14.230801 7 shared_informer.go:247] Caches are synced for ReplicaSet
I0414 11:40:14.234906 7 shared_informer.go:247] Caches are synced for certificate-csrapproving
I0414 11:40:14.234977 7 shared_informer.go:247] Caches are synced for taint
I0414 11:40:14.235281 7 node_lifecycle_controller.go:1429] Initializing eviction metric for zone:
W0414 11:40:14.235413 7 node_lifecycle_controller.go:1044] Missing timestamp for Node k3d-k3s-default-server-0. Assuming now as a timestamp.
I0414 11:40:14.238699 7 taint_manager.go:187] Starting NoExecuteTaintManager
I0414 11:40:14.238856 7 event.go:291] "Event occurred" object="k3d-k3s-default-server-0" kind="Node" apiVersion="v1" type="Normal" reason="RegisteredNode" message="Node k3d-k3s-default-server-0 event: Registered Node k3d-k3s-default-server-0 in Controller"
I0414 11:40:14.239237 7 node_lifecycle_controller.go:1245] Controller detected that zone is now in state Normal.
I0414 11:40:14.247401 7 shared_informer.go:247] Caches are synced for node
I0414 11:40:14.247602 7 range_allocator.go:172] Starting range CIDR allocator
I0414 11:40:14.247663 7 shared_informer.go:240] Waiting for caches to sync for cidrallocator
I0414 11:40:14.247703 7 shared_informer.go:247] Caches are synced for cidrallocator
I0414 11:40:14.249530 7 shared_informer.go:247] Caches are synced for deployment
I0414 11:40:14.250710 7 shared_informer.go:247] Caches are synced for expand
I0414 11:40:14.251616 7 shared_informer.go:247] Caches are synced for ClusterRoleAggregator
I0414 11:40:14.272632 7 shared_informer.go:247] Caches are synced for daemon sets
I0414 11:40:14.284690 7 shared_informer.go:247] Caches are synced for stateful set
I0414 11:40:14.392178 7 shared_informer.go:247] Caches are synced for crt configmap
I0414 11:40:14.453240 7 shared_informer.go:247] Caches are synced for job
I0414 11:40:14.454227 7 shared_informer.go:247] Caches are synced for resource quota
I0414 11:40:14.504974 7 shared_informer.go:247] Caches are synced for resource quota
E0414 11:40:14.804091 7 memcache.go:101] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
I0414 11:40:14.806192 7 shared_informer.go:240] Waiting for caches to sync for garbage collector
I0414 11:40:14.906439 7 shared_informer.go:247] Caches are synced for garbage collector
I0414 11:40:14.949133 7 shared_informer.go:247] Caches are synced for garbage collector
I0414 11:40:14.949164 7 garbagecollector.go:151] Garbage collector: all resource monitors have synced. Proceeding to collect garbage
E0414 11:40:16.558960 7 available_controller.go:508] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.43.239.9:443/apis/metrics.k8s.io/v1beta1: Get "https://10.43.239.9:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.43.239.9:443: connect: no route to host
W0414 11:40:17.065119 7 lease.go:233] Resetting endpoints for master service "kubernetes" to [172.18.0.3]
time="2021-04-14T11:40:17.067400700Z" level=info msg="Stopped tunnel to 172.18.0.2:6443"
time="2021-04-14T11:40:17.067465300Z" level=info msg="Proxy done" err="context canceled" url="wss://172.18.0.2:6443/v1-k3s/connect"
time="2021-04-14T11:40:17.067664600Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
I0414 11:44:52.803254 7 network_policy_controller.go:157] Shutting down network policies full sync goroutine
time="2021-04-14T11:44:52.805001100Z" level=info msg="Shutting down batch/v1, Kind=Job workers"
time="2021-04-14T11:44:52.805117200Z" level=info msg="Shutting down helm.cattle.io/v1, Kind=HelmChartConfig workers"
time="2021-04-14T11:44:52.805641500Z" level=info msg="Shutting down /v1, Kind=Endpoints workers"
time="2021-04-14T11:44:52.807819300Z" level=info msg="Shutting down /v1, Kind=Node workers"
iwilltry42 commented 3 years ago

@bukowa can you inspect those node containers? And can you check if there's an overlapping docker network (docker network ls & docker network inspect)?

bukowa commented 3 years ago

@iwilltry42 no overlapping networks + all nodes started right now after another docker restart, strange :) before that they were in an exit loop

iwilltry42 commented 3 years ago

@bukowa should be fixed here: https://github.com/rancher/k3d/releases/tag/v4.5.0-dev.1

bukowa commented 3 years ago

@iwilltry42 I made a small GitHub Action that may be useful to test this https://github.com/bukowa/k3dtest/runs/2352095712?check_suite_focus=true

iwilltry42 commented 3 years ago

Thanks for that @bukowa , I forked it and added kubectl to check that output. It seems to work, right? E.g. https://github.com/iwilltry42/k3dtest/runs/2352325264

bukowa commented 3 years ago

@iwilltry42 seems to work but not on my machine ^^

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   25 minutes ago   Restarting (1) 39 seconds ago                                                                                 k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   25 minutes ago   Exited (255) 23 minutes ago                                                                                   k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   25 minutes ago   Up 21 minutes                                                                                                 k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   25 minutes ago   Exited (255) 23 minutes ago                                                                                   k3d-test55-server-0

server0.txt server1.txt server2.txt serverlb.txt

bukowa commented 3 years ago

@iwilltry42 I think I've got something, because I still have old clusters running:

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   33 minutes ago   Up 20 seconds                  80/tcp, 0.0.0.0:52232->6443/tcp                                               k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   33 minutes ago   Up 21 seconds                                                                                                k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   33 minutes ago   Up 20 seconds                                                                                                k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   33 minutes ago   Up 20 seconds                                                                                                k3d-test55-server-0
4e8e8de805e9   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   2 weeks ago      Exited (255) 23 seconds ago    0.0.0.0:32080->32080/tcp, 0.0.0.0:32443->32443/tcp, 0.0.0.0:65336->6443/tcp   k3d-wpenv-server-0
afee5e598565   rancher/k3d-proxy:v4.3.0         "/bin/sh -c nginx-pr…"   4 weeks ago      Restarting (1) 2 seconds ago                                                                                 k3d-k3s-default-serverlb
0bdf54e2bf41   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   4 weeks ago      Exited (255) 23 seconds ago                                                                                  k3d-k3s-default-server-0

now each time I restart docker, it looks like the alive nodes are shifting; take a look:

restart 1:

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   36 minutes ago   Up 3 seconds                  80/tcp, 0.0.0.0:52232->6443/tcp                                               k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   36 minutes ago   Exited (255) 47 seconds ago                                                                                 k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   36 minutes ago   Exited (255) 47 seconds ago                                                                                 k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   36 minutes ago   Exited (255) 47 seconds ago                                                                                 k3d-test55-server-0
4e8e8de805e9   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   2 weeks ago      Up 45 seconds                 0.0.0.0:32080->32080/tcp, 0.0.0.0:32443->32443/tcp, 0.0.0.0:65336->6443/tcp   k3d-wpenv-server-0
afee5e598565   rancher/k3d-proxy:v4.3.0         "/bin/sh -c nginx-pr…"   4 weeks ago      Up 45 seconds                 80/tcp, 0.0.0.0:50670->6443/tcp                                               k3d-k3s-default-serverlb
0bdf54e2bf41   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   4 weeks ago      Up 45 seconds                                                                                               k3d-k3s-default-server-0

restart 2:

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   37 minutes ago   Up 1 second                       80/tcp, 0.0.0.0:52232->6443/tcp                                               k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   37 minutes ago   Up 23 seconds                                                                                                   k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   37 minutes ago   Up 23 seconds                                                                                                   k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   37 minutes ago   Exited (255) About a minute ago                                                                                 k3d-test55-server-0
4e8e8de805e9   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   2 weeks ago      Exited (255) 25 seconds ago       0.0.0.0:32080->32080/tcp, 0.0.0.0:32443->32443/tcp, 0.0.0.0:65336->6443/tcp   k3d-wpenv-server-0
afee5e598565   rancher/k3d-proxy:v4.3.0         "/bin/sh -c nginx-pr…"   4 weeks ago      Up 2 seconds                      80/tcp, 0.0.0.0:50670->6443/tcp                                               k3d-k3s-default-serverlb
0bdf54e2bf41   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   4 weeks ago      Exited (255) 25 seconds ago                                                                                     k3d-k3s-default-server-0

restart 3:

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   38 minutes ago   Up 2 seconds                  80/tcp, 0.0.0.0:52232->6443/tcp                                               k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   38 minutes ago   Exited (255) 26 seconds ago                                                                                 k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   38 minutes ago   Exited (255) 26 seconds ago                                                                                 k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   38 minutes ago   Exited (255) 2 minutes ago                                                                                  k3d-test55-server-0
4e8e8de805e9   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   2 weeks ago      Up 25 seconds                 0.0.0.0:32080->32080/tcp, 0.0.0.0:32443->32443/tcp, 0.0.0.0:65336->6443/tcp   k3d-wpenv-server-0
afee5e598565   rancher/k3d-proxy:v4.3.0         "/bin/sh -c nginx-pr…"   4 weeks ago      Up 24 seconds                 80/tcp, 0.0.0.0:50670->6443/tcp                                               k3d-k3s-default-serverlb
0bdf54e2bf41   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   4 weeks ago      Up 23 seconds                                                                                               k3d-k3s-default-server-0

restart 4:

ff192959bbdd   rancher/k3d-proxy:v4.5.0-dev.1   "/bin/sh -c nginx-pr…"   40 minutes ago   Restarting (1) 7 seconds ago                                                                                    k3d-test55-serverlb
db451387ad46   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   40 minutes ago   Exited (255) 3 minutes ago                                                                                      k3d-test55-server-2
10ec4e821999   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --t…"   40 minutes ago   Up About a minute                                                                                               k3d-test55-server-1
ff7e15fb60bb   rancher/k3s:v1.20.5-k3s1         "/bin/k3s server --c…"   40 minutes ago   Up About a minute                                                                                               k3d-test55-server-0
4e8e8de805e9   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   2 weeks ago      Exited (255) About a minute ago   0.0.0.0:32080->32080/tcp, 0.0.0.0:32443->32443/tcp, 0.0.0.0:65336->6443/tcp   k3d-wpenv-server-0
afee5e598565   rancher/k3d-proxy:v4.3.0         "/bin/sh -c nginx-pr…"   4 weeks ago      Restarting (1) 6 seconds ago                                                                                    k3d-k3s-default-serverlb
0bdf54e2bf41   rancher/k3s:v1.20.4-k3s1         "/bin/k3s server --t…"   4 weeks ago      Exited (255) About a minute ago                                                                                 k3d-k3s-default-server-0
$ docker network ls
NETWORK ID     NAME              DRIVER    SCOPE
b436e5a0149c   bridge            bridge    local
e05bb6c1420e   host              host      local
56bb9cb725d6   k3d-k3s-default   bridge    local
ed22e183cae8   k3d-test55        bridge    local
5312bd5f60bd   k3d-wpenv         bridge    local
2208d8a45078   none              null      local
$ docker network inspect k3d-k3s-default
[
    {
        "Name": "k3d-k3s-default",
        "Id": "56bb9cb725d6eeb944f529cd964814845d5e5142029f83e06bfc85d584beb4aa",
        "Created": "2021-03-12T00:27:05.3874863Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "afee5e59856529cb6e304a423b93a79ff59b7439e4c4c8a5fc7b5be0a3f4c5b6": {
                "Name": "k3d-k3s-default-serverlb",
                "EndpointID": "7a9659b83a4c9d76b291a4461abeea0e8411af01cef9f2c85efae474aa221a38",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {
            "app": "k3d"
        }
    }
]
$ docker network inspect k3d-test55
[
    {
        "Name": "k3d-test55",
        "Id": "ed22e183cae8b7f29442f03cf95b81dd3ba12523e355aa8b9acfb9af84b281f6",
        "Created": "2021-04-15T11:45:01.0438683Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.20.0.0/16",
                    "Gateway": "172.20.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "10ec4e821999abf5a2c7a501cecafd8f13daea4842a9873f96985d36a8d7ef3f": {
                "Name": "k3d-test55-server-1",
                "EndpointID": "a130a1d83af61109579915e2b32fe898e1417e8ad65cd1eb563a578eda2d82ad",
                "MacAddress": "02:42:ac:14:00:03",
                "IPv4Address": "172.20.0.3/16",
                "IPv6Address": ""
            },
            "ff192959bbdd6fa5bd35e7fb3ba850b2156760c3cd6765159d311141cda3b885": {
                "Name": "k3d-test55-serverlb",
                "EndpointID": "a05b53563a77e22ba2a79f0560c01c260833cd4cdf9f1f59b270434a3814f42a",
                "MacAddress": "02:42:ac:14:00:04",
                "IPv4Address": "172.20.0.4/16",
                "IPv6Address": ""
            },
            "ff7e15fb60bb195de04aac993dc7456abcc3ea2adacd78a7bb4ee62ce278e098": {
                "Name": "k3d-test55-server-0",
                "EndpointID": "f84d9ba79007959197f4b5af7f1873f7a1e85da5f5aa23840963b9e45aea2f3b",
                "MacAddress": "02:42:ac:14:00:02",
                "IPv4Address": "172.20.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {
            "app": "k3d"
        }
    }
]
$ docker network inspect k3d-wpenv
[
    {
        "Name": "k3d-wpenv",
        "Id": "5312bd5f60bd523d046f919bf42c2b25c75463cc851aeb64e3bfff9b52583a11",
        "Created": "2021-03-28T22:57:14.5128859Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.19.0.0/16",
                    "Gateway": "172.19.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {
            "app": "k3d"
        }
    }
]
iwilltry42 commented 3 years ago

@bukowa , I am absolutely lost there :thinking: The network settings look fine to me and I cannot think of any way that the clusters could interfere... :thinking:

bukowa commented 3 years ago

@iwilltry42 ok, maybe this finding can become useful when someone encounters a similar issue

renepardon commented 3 years ago

I tried to do k3d cluster stop dev ; k3d cluster start dev after I received the error from issue #262

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes)

Now it hangs here:

INFO[0000] Stopping cluster 'dev'                       
INFO[0000] Starting cluster 'dev'                       
INFO[0000] Starting the initializing server...          
INFO[0000] Starting Node 'k3d-dev-server-0'             
INFO[0001] Starting servers...                          
INFO[0001] Starting Node 'k3d-dev-server-1'             

The cluster was initially created with:

k3d registry create registry.localhost --port 5000
k3d cluster create dev \
    --k3s-server-arg "--no-deploy=traefik" \
    --registry-use k3d-registry.localhost:5000 \
    --port 80:80@loadbalancer \
    --port 443:443@loadbalancer \
    --api-port 6443 --servers 3 --agents 3
iwilltry42 commented 3 years ago

Hi @renepardon , can you please share the logs of the k3d-dev-server-0 and k3d-dev-server-1 containers? Also, which version of k3d and k3s are you using and what is your environment?

renepardon commented 3 years ago

Hello Thorsten,

I work on Ubuntu 20.04 and have the latest version of k3d. I had to delete and recreate the cluster to be able to continue with my workload. Next time this issue appears I will post the logs.

renepardon commented 3 years ago

@iwilltry42

k3d version v4.4.3 k3s version v1.20.6-k3s1 (default)

But I can't find the log files, neither in my home directory nor in /var/log.

iwilltry42 commented 3 years ago

@renepardon , if you were to search for actual log files, they'd be in /var/lib/docker/containers/<ID>/<ID>.log. But you can just do e.g. docker logs k3d-dev-server-0 to get them :+1: Also, please try k3d v4.4.4 and paste your kernel version (uname -r).
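The suggestion above as a copy-pasteable sketch (container name taken from this thread; flags per the Docker CLI):

```shell
# Stream the node container's logs; --tail limits the initial backlog to the last 100 lines.
docker logs --tail 100 -f k3d-dev-server-0
```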

renepardon commented 3 years ago

@iwilltry42

The k3d-dev-server-0 log repeats infinitely like this:

time="2021-06-07T06:42:26.583579237Z" level=info msg="Failed to test data store connection: this server is a not a member of the etcd cluster. Found [k3d-dev-server-1-b8022d18=https://172.21.0.4:2380 k3d-dev-server-2-bb40fdf0=https://172.21.0.5:2380 k3d-dev-server-0-f3a8f84f=https://172.21.0.3:2380], expect: k3d-dev-server-0-f3a8f84f=172.21.0.7"
time="2021-06-07T06:42:26.881852734Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622098330: notBefore=2021-05-27 06:52:10 +0000 UTC notAfter=2022-06-07 06:42:26 +0000 UTC"
time="2021-06-07T06:42:26.882548039Z" level=info msg="Cluster-Http-Server 2021/06/07 06:42:26 http: TLS handshake error from 172.21.0.6:48878: remote error: tls: bad certificate"
time="2021-06-07T06:42:26.883032016Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622098330: notBefore=2021-05-27 06:52:10 +0000 UTC notAfter=2022-06-07 06:42:26 +0000 UTC"
time="2021-06-07T06:42:27.350483832Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622098330: notBefore=2021-05-27 06:52:10 +0000 UTC notAfter=2022-06-07 06:42:27 +0000 UTC"
time="2021-06-07T06:42:27.351461555Z" level=info msg="Cluster-Http-Server 2021/06/07 06:42:27 http: TLS handshake error from 172.21.0.2:35352: remote error: tls: bad certificate"

and k3d-dev-server-1:

{"level":"info","ts":"2021-06-07T06:43:10.559Z","caller":"mvcc/index.go:189","msg":"compact tree index","revision":427102}
{"level":"info","ts":"2021-06-07T06:43:10.572Z","caller":"mvcc/kvstore_compaction.go:56","msg":"finished scheduled compaction","compact-revision":427102,"took":"12.938268ms"}
W0607 06:43:13.421471       8 handler_proxy.go:102] no RequestInfo found in the context
E0607 06:43:13.421512       8 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
, Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
I0607 06:43:13.421518       8 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
{"level":"warn","ts":"2021-06-07T06:43:13.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:13.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:18.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:18.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:23.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:23.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:28.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:28.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:33.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:33.504Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:38.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:38.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
W0607 06:43:43.357254       8 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
{"level":"warn","ts":"2021-06-07T06:43:43.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:43.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:48.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:48.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:53.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:53.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:58.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:43:58.506Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:44:03.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:44:03.506Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:44:08.505Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-06-07T06:44:08.506Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"be99668240cb5dc7","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
iwilltry42 commented 3 years ago

@renepardon , I'll follow up on the similar issue in https://github.com/rancher/k3d/issues/619 :+1:

moio commented 1 year ago

@iwilltry42 I can still reproduce that IPs are occasionally shuffled on restart on k3d 5.4.9 - k3s v1.24.12+k3s1 (especially after a host reboot).

Does it help if I open a new issue?

p-se commented 2 months ago

> @iwilltry42 I can still reproduce that IPs are occasionally shuffled on restart on k3d 5.4.9 - k3s v1.24.12+k3s1 (especially after a host reboot).
>
> Does it help if I open a new issue?

I'm experiencing the same issue. After rebooting the host, the cluster becomes irrecoverably lost. k3d v5.6.3 - k3s v1.28.8-k3s1.
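For readers hitting this today: k3d v5 implemented the IPAM feature this issue asked for — when the cluster is created on a fixed subnet, k3d can keep node IPs stable across restarts. A hedged sketch (flag per the k3d v5 docs; verify against your version with `k3d cluster create --help`, and note the 172.28.0.0/16 subnet is an arbitrary example):

```shell
# Create the cluster with an explicit subnet so node IPs can be managed statically.
k3d cluster create test-k3d --servers 3 --agents 3 --subnet 172.28.0.0/16
```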