Closed: jgresty closed this issue 3 years ago.
Also check mine: https://github.com/rancher/k3d/issues/619
I guess we have the same problem.
If you try to do this on another server, will it work? Mine worked fine on two other servers, though. I am not sure what made it go wrong.
Hi @jgresty , thanks for opening this issue!
I couldn't do any real research right now, but could this be related to the issue with newer kernel versions, as in https://github.com/k3s-io/k3s/pull/3337? :thinking:
Can you give it a try with a newer version of k3s (--image flag)?
Hi @jgresty , any update on this?
Looking through the server logs, it seems like my previous comment can be disregarded (which makes sense, since you're using the workaround flags already).
I was thinking of a clock skew between Docker and your host machine (which would explain why k3d doesn't catch the logs of the serverlb properly), but that looks OK as well.
We could make the log message that we're waiting for a little less specific by only using start worker process (dropping the es suffix), as those lines will only follow after the original log line :thinking: But then the time difference may still be too small.
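For illustration, here is a minimal Go sketch of that kind of log wait, assuming the Docker Go SDK (github.com/docker/docker); the function name, container name, and error handling are hypothetical, and this is not k3d's actual implementation:

// Hypothetical sketch: stream a container's logs starting at `since`
// and return once a line contains `needle`. Not k3d's real code.
package main

import (
	"bufio"
	"context"
	"fmt"
	"log"
	"strconv"
	"strings"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func waitForLogMessage(ctx context.Context, cli *client.Client, containerID, needle string, since time.Time) error {
	logs, err := cli.ContainerLogs(ctx, containerID, types.ContainerLogsOptions{
		ShowStdout: true,
		ShowStderr: true,
		Follow:     true,
		// The "since" filter is where the timing discussed here matters:
		// start even one second too late and the line is never seen.
		Since: strconv.FormatInt(since.Unix(), 10),
	})
	if err != nil {
		return err
	}
	defer logs.Close()

	// For brevity this ignores Docker's stream multiplexing (stdcopy).
	scanner := bufio.NewScanner(logs)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), needle) {
			return nil
		}
	}
	return fmt.Errorf("log stream ended before %q appeared", needle)
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	// "start worker process" is the less specific match suggested above.
	if err := waitForLogMessage(context.Background(), cli, "k3d-k3s-default-serverlb", "start worker process", time.Now()); err != nil {
		log.Fatal(err)
	}
	fmt.Println("loadbalancer is ready")
}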
FWIW: Can you connect to the cluster while k3d is still waiting? E.g. run k3d kubeconfig get $CLUSTER or k3d kubeconfig merge $CLUSTER to get the kubeconfig in a different terminal and then try to connect with kubectl.
Sorry, I forgot about this issue; I worked around it by running in a Fedora VM, but obviously that isn't ideal...
Tried to specify the latest image anyway (v1.21.1-k3s1) but that didn't help.
I can connect to the cluster with kubectl after running
k3d kubeconfig merge k3s-default --kubeconfig-switch-context --kubeconfig-merge-default
➤ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"archive", BuildDate:"2021-05-14T14:09:09Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1+k3s1", GitCommit:"75dba57f9b1de3ec0403b148c52c348e1dee2a5e", GitTreeState:"clean", BuildDate:"2021-05-21T16:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
That's just such a weird issue :thinking: Would you be open to testing a debug release for this? I'd just add some more logs etc. to it to see where k3d fails to read the proper log messages :thinking:
Sure, I'll be happy to assist
Before we dig deeper: can you still reproduce this with the latest k3d release, v4.4.5? (I guess so, but it's worth a try :grimacing:)
Can still reproduce with
k3d version v4.4.5
k3s version v1.21.1-k3s1 (default)
I can also reproduce this on v4.4.5 with k3s v1.21.1-k3s1 - how can I assist?
Name resolution is not an issue anymore, but I am getting this error now, every 30 seconds:
proxier.go:1612] "Failed to execute iptables-restore" err="exit status 2 (ip6tables-restore v1.8.5 (nf_tables): unknown option \"--random-fully\"\nError occurred at line: 18\nTry `ip6tables-restore -h' or 'ip6tables-restore --help' for more information.\n)"
I0614 12:39:41.039647 8 proxier.go:859] "Sync failed" retryingTime="30s"
@tglunde , you have exactly the same issue? What are your system specs?
I unfortunately have no way of debugging this properly, as I don't have problems on any of my 5 test servers (all with different OS - Kernel - Docker version combinations). :confounded: But I also cannot exactly replicate your environment @jgresty :thinking: Anyone here who could provide me with a VM/Vagrant image to reproduce a failing system?
@jgresty & @tglunde
So I created a debug release to at least try to get a clue about what's happening here.
Would be great, if you could give it a try: https://drive.google.com/drive/folders/1JLlq6IUUn3OV_Mm7sQMjx8kg7epW-wxv?usp=sharing
Download the release matching your setup and try it with the following command:
K3D_LOG_NODE_WAIT_LOGS=loadbalancer k3d cluster create debug --trace
This should generate logs where the loadbalancer logs are included in the output, to ensure that there's no clock skew involved (though, to be safe, I also dropped subsecond precision from the wait command).
Not sure whether this is exactly the same, but it is similar. Now when I start the debug version you provided, everything seems to go smoothly, without error. Please find the log file attached. Anything else I can do? If you like, we could also do a Zoom session. log.txt
@tglunde , if it's working now, then I suspect a rounding issue when determining the starting point from which Docker should watch the logs... Let's wait and see what others observe with the debug build. Btw, I mistyped the env var in my previous comment and fixed that now.
Can I just use the debug version in the meantime, or is there another workaround I could apply?
@tglunde , feel free to use the debug version in the meantime, until there's another positive reply for it (then I'll create a real release from it). The debug version is not much different from a normal release. In fact, I only changed like 5 lines (https://github.com/rancher/k3d/commit/97451e11cb6444c6a38174d5a6e6e6ef5286aaf9).
@iwilltry42 Cool. thank you so much for your efforts - highly appreciated!
Let's wait what others observe with the debug build.
I can also confirm that the debug build posted worked perfectly on my setup (5.12.10-arch1-1)
Well this is strange .. I read the code change and only this line matters, doesn't it?
startTime = ts.Truncate(time.Second)
Let's wait what others observe with the debug build.
I can also confirm that the debug build posted worked perfectly on my setup (5.12.10-arch1-1)
@jgresty Awesome!
Well this is strange .. I read the code change and only this line matters, doesn't it?
startTime = ts.Truncate(time.Second)
@nguyenvulong
Exactly. The fact that it works now confirms my assumption that this is caused by some rounding issue of the since timestamp used to start checking for logs.
E.g. the original post shows this log line from k3d:
TRAC[0006] NodeWaitForLogMessage: Node 'k3d-k3s-default-serverlb' waiting for log message 'start worker processes' since '2021-06-03 13:20:56.510628562 +0000 UTC'
This shows the timestamp 2021-06-03 13:20:56.510628562 +0000 UTC, which would be rounded to 2021-06-03 13:20:57 +0000 UTC, so the process would start to look for the log line after it already appeared: 2021/06/03 13:20:56 [notice] 14#14: start worker processes.
The "fix" startTime = ts.Truncate(time.Second)
just drops the subsecond precision, so it cannot start watching later than that. This could introduce an edge case when the loadbalancer is in a restart loop and has that log just in that few ms that are now included in the log check.
However, this shouldn't happen, as the log won't show up if the setup is in a broken state that causes the restart loop.
Additionally, we also check for the Restarting status.
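To illustrate the difference with the timestamp from the original trace output, here is a standalone Go sketch (not the actual k3d code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// The timestamp logged in the original trace output.
	ts, _ := time.Parse(time.RFC3339Nano, "2021-06-03T13:20:56.510628562Z")

	// Rounding to the nearest second jumps past the second in which nginx
	// logged "start worker processes", so a "since" filter derived from it
	// starts watching after that line was already written.
	fmt.Println(ts.Round(time.Second)) // 2021-06-03 13:20:57 +0000 UTC

	// Truncating only drops the subsecond part, so the watcher can never
	// start later than the second in which the log line appeared.
	fmt.Println(ts.Truncate(time.Second)) // 2021-06-03 13:20:56 +0000 UTC
}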
Is this considered resolved? Stale? I think I'm running into that issue.
Linux 5.13.0-39-generic x86_64
Docker version 20.10.14
k3d version v5.4.0
k3s version v1.22.7-k3s1 (default)
Running the following command
k3d cluster create mycluster -p "8081:80@loadbalancer" --agents 2 --trace
Output (last lines before it hangs forever)
TRAC[0006] Starting node 'k3d-mycluster3-agent-1'
TRAC[0006] Starting node 'k3d-mycluster3-agent-0'
INFO[0006] Starting Node 'k3d-mycluster3-agent-1'
INFO[0006] Starting Node 'k3d-mycluster3-agent-0'
DEBU[0006] Truncated 2022-03-30 11:53:23.010548487 +0000 UTC to 2022-03-30 11:53:23 +0000 UTC
DEBU[0006] Waiting for node k3d-mycluster3-agent-0 to get ready (Log: 'Successfully registered node')
TRAC[0006] NodeWaitForLogMessage: Node 'k3d-mycluster3-agent-0' waiting for log message 'Successfully registered node' since '2022-03-30 11:53:23 +0000 UTC'
DEBU[0006] Truncated 2022-03-30 11:53:23.115008808 +0000 UTC to 2022-03-30 11:53:23 +0000 UTC
DEBU[0006] Waiting for node k3d-mycluster3-agent-1 to get ready (Log: 'Successfully registered node')
TRAC[0006] NodeWaitForLogMessage: Node 'k3d-mycluster3-agent-1' waiting for log message 'Successfully registered node' since '2022-03-30 11:53:23 +0000 UTC'
docker ps
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e671a034e568 rancher/k3s:v1.22.7-k3s1 "/bin/k3s agent" 46 seconds ago Up 38 seconds k3d-mycluster-agent-1
894448f38f42 rancher/k3s:v1.22.7-k3s1 "/bin/k3s agent" 46 seconds ago Up 38 seconds k3d-mycluster-agent-0
1e2edaf1319e rancher/k3s:v1.22.7-k3s1 "/bin/k3s server --t…" 47 seconds ago Up 44 seconds k3d-mycluster-server-0
c6a3f5dc38f5 registry:2 "/entrypoint.sh /etc…" 23 hours ago Up 2 hours 0.0.0.0:5000->5000/tcp k3d-registry.localhost
k3d node list
$ k3d node list
NAME ROLE CLUSTER STATUS
k3d-mycluster-agent-0 agent mycluster running
k3d-mycluster-agent-1 agent mycluster running
k3d-mycluster-server-0 server mycluster running
k3d-mycluster-serverlb loadbalancer mycluster created
k3d-registry.localhost registry running
server container error logs
$ docker logs k3d-mycluster-server-0 2>&1 | grep -E '^E'
E0330 12:07:45.339923 7 controller.go:151] Unable to perform initial Kubernetes service initialization: Service "kubernetes" is invalid: spec.clusterIPs: Invalid value: []string{"10.43.0.1"}: failed to allocate IP 10.43.0.1: cannot allocate resources of type serviceipallocations at this time
E0330 12:07:45.341161 7 controller.go:156] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/172.22.0.2, ResourceVersion: 0, AdditionalErrorMsg:
E0330 12:07:51.639144 7 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0330 12:07:53.355881 7 controllermanager.go:419] unable to get all supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:07:59.876489 7 namespaced_resources_deleter.go:161] unable to get all supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:08:01.230181 7 resource_quota_controller.go:162] initial discovery check failure, continuing and counting on future sync update: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:08:01.236734 7 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:08:02.340606 7 memcache.go:196] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:08:02.347614 7 memcache.go:101] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:08:04.032992 7 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0330 12:08:32.639175 7 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:09:02.652890 7 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:09:04.034107 7 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
E0330 12:09:32.665580 7 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:10:02.683043 7 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:10:32.697176 7 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:11:02.708975 7 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0330 12:11:04.034324 7 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
Server logs go like this in an endless loop:
time="2022-03-30T12:17:08Z" level=info msg="Waiting for control-plane node k3d-mycluster-server-0 startup: nodes \"k3d-mycluster-server-0\" not found"
time="2022-03-30T12:17:09Z" level=info msg="Waiting for control-plane node k3d-mycluster-server-0 startup: nodes \"k3d-mycluster-server-0\" not found"
time="2022-03-30T12:17:09Z" level=info msg="certificate CN=k3d-mycluster-agent-0 signed by CN=k3s-server-ca@1648642062: notBefore=2022-03-30 12:07:42 +0000 UTC notAfter=2023-03-30 12:17:09 +0000 UTC"
time="2022-03-30T12:17:09Z" level=info msg="certificate CN=k3d-mycluster-agent-1 signed by CN=k3s-server-ca@1648642062: notBefore=2022-03-30 12:07:42 +0000 UTC notAfter=2023-03-30 12:17:09 +0000 UTC"
time="2022-03-30T12:17:09Z" level=info msg="certificate CN=system:node:k3d-mycluster-agent-0,O=system:nodes signed by CN=k3s-client-ca@1648642062: notBefore=2022-03-30 12:07:42 +0000 UTC notAfter=2023-03-30 12:17:09 +0000 UTC"
time="2022-03-30T12:17:09Z" level=info msg="certificate CN=system:node:k3d-mycluster-agent-1,O=system:nodes signed by CN=k3s-client-ca@1648642062: notBefore=2022-03-30 12:07:42 +0000 UTC notAfter=2023-03-30 12:17:09 +0000 UTC"
time="2022-03-30T12:17:10Z" level=info msg="Waiting for control-plane node k3d-mycluster-server-0 startup: nodes \"k3d-mycluster-server-0\" not found"
time="2022-03-30T12:17:11Z" level=info msg="Waiting for control-plane node k3d-mycluster-server-0 startup: nodes \"k3d-mycluster-server-0\" not found"
time="2022-03-30T12:17:12Z" level=info msg="Waiting for control-plane node k3d-mycluster-server-0 startup: nodes \"k3d-mycluster-server-0\" not found"
time="2022-03-30T12:17:12Z" level=info msg="certificate CN=k3d-mycluster-server-0 signed by CN=k3s-server-ca@1648642062: notBefore=2022-03-30 12:07:42 +0000 UTC notAfter=2023-03-30 12:17:12 +0000 UTC"
time="2022-03-30T12:17:12Z" level=info msg="certificate CN=system:node:k3d-mycluster-server-0,O=system:nodes signed by CN=k3s-client-ca@1648642062: notBefore=2022-03-30 12:07:42 +0000 UTC notAfter=2023-03-30 12:17:12 +0000 UTC"
time="2022-03-30T12:17:12Z" level=info msg="Waiting to retrieve agent configuration; server is not ready: \"overlayfs\" snapshotter cannot be enabled for \"/var/lib/rancher/k3s/agent/containerd\", try using \"fuse-overlayfs\" or \"native\": failed to mount overlay: invalid argument"
Interestingly, it seems to run without issues if I don't create agents (i.e. run the k3d cluster create mycluster command without the --agents flag). After that, I can communicate with the cluster using kubectl. Attempting to add agent nodes afterwards with k3d node create agent -c mycluster --role agent causes the command to hang forever.
Judging by the logs, it seems my issue may not have been related despite the similarity. For me, it was that k3s won't run on ZFS without workarounds.
Possibly related to https://github.com/rancher/k3d/issues/592 , same output from cluster create command but different logs from containers.
Cluster created using the workarounds described in the FAQ and skipping Traefik:
k3d cluster create --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0" --image rancher/k3s:v1.20.6-k3s1 -v /dev/mapper:/dev/mapper --k3s-server-arg "--no-deploy=traefik" --trace
Creating cluster halted on
docker logs k3d-k3s-default-serverlb
docker logs k3d-k3s-default-server-0
Versions
docker