loft-sh / vcluster

vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it offers better multi-tenancy and isolation than regular namespaces.
https://www.vcluster.com
Apache License 2.0
6.3k stars 402 forks

Pods in a vcluster unable to resolve DNS addresses #2094

Open ddl-pjohnson opened 1 month ago

ddl-pjohnson commented 1 month ago

What happened?

vcluster installs correctly and most pods start, but some fail to resolve DNS names, both internal and external. Even the CoreDNS pod sometimes fails to start because it can't connect to the Kubernetes API.

Sometimes pods start working if I repeatedly delete them to force a restart, particularly the CoreDNS pods.
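Because the failures are intermittent, a repeated lookup probe run from inside an affected pod can help quantify them. The following is only a minimal sketch, not part of the original report: the `probe` function and its parameters are hypothetical, `kubernetes.default.svc.cluster.local` assumes the default `cluster.local` cluster domain, and `www.vcluster.com` is used merely as a convenient external name.

```python
import socket
import time


def probe(hostnames, attempts=5, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve each hostname `attempts` times and count failed lookups.

    `resolver` is injectable so the function can be exercised without
    network access; by default it uses the pod's configured resolver.
    """
    failures = {name: 0 for name in hostnames}
    for i in range(attempts):
        for name in hostnames:
            try:
                resolver(name, None)
            except socket.gaierror:
                # Raised when the name cannot be resolved.
                failures[name] += 1
        if i < attempts - 1:
            time.sleep(delay)
    return failures


if __name__ == "__main__":
    attempts = 5
    names = [
        "kube-dns.kube-system.svc.cluster.local",  # name from the nginx error
        "kubernetes.default.svc.cluster.local",    # assumes default cluster domain
        "www.vcluster.com",                        # arbitrary external name
    ]
    for name, failed in probe(names, attempts=attempts).items():
        print(f"{name}: {failed}/{attempts} lookups failed")
```

If internal and external names fail at the same rate, the problem is more likely in reaching the DNS service at all than in any particular record.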

The host cluster has Calico installed, and we've run into similar-looking DNS issues before; see https://github.com/projectcalico/calico/issues/4955 for what I think was happening in that case.

Example errors from pods:

Nginx server:

2024/08/22 23:24:07 [emerg] 1#0: host not found in resolver "kube-dns.kube-system.svc.cluster.local" in /opt/nginx/con
nginx: [emerg] host not found in resolver "kube-dns.kube-system.svc.cluster.local" in /opt/nginx/conf/nginx.conf:13

Fluentd:

2024-08-22 23:22:28 +0000 [info]: Received graceful stop
W, [2024-08-22T23:22:46.715401 #14]  WARN -- #<Bunny::Session:0x1338 fluentd@XXXXXXXX, vhost=/, addresses=[XXXXXXX]>: Could not establish TCP connecti
2024-08-22 23:22:46 +0000 [error]: #0 unexpected error error_class=Bunny::TCPConnectionFailedForAllHosts error="Could not establish TCP connection to any of the configured hosts"
  2024-08-22 23:22:46 +0000 [error]: #0 /usr/lib/ruby/gems/3.2.0/gems/bunny-2.14.4/lib/bunny/session.rb:338:in `rescue in start'

What did you expect to happen?

Pods should be able to resolve and connect to both internal and external addresses without errors.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately, there isn't a public way of deploying this. I'll see what I can do to recreate it externally.

Anything else we need to know?

No response

Host cluster Kubernetes version

```console
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.12-eks-2f46c53
WARNING: version difference between client (1.30) and server (1.28) exceeds the supported minor version skew of +/-1
```

vcluster version

```console
vcluster version 0.20.0
```

VCluster Config

```
# Using default config
vcluster create \
  test-efs \
  --upgrade \
  -n test-efs \
  --update-current=false \
  --connect=false \
  --switch-context=false \
  --context "$HOST_CLUSTER_CONTEXT" \
  --kube-config-context-name "$VCLUSTER_CLUSTER_CONTEXT"
```

deniseschannon commented 3 weeks ago

Do you have an example of which pods are working and which are not? Is there a pattern to them?
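One way to look for such a pattern is to record, for each pod, which host node it runs on and whether DNS lookups succeed, then compare failure rates per node. The sketch below is illustrative only: the `failure_rate_by_node` helper and the sample records are hypothetical, not data from this issue, and grouping by node is just one plausible axis (suggested by the earlier Calico problem, which may or may not apply here).

```python
from collections import defaultdict


def failure_rate_by_node(results):
    """Return {node: fraction of pods with failed DNS} from (pod, node, ok) tuples."""
    totals = defaultdict(lambda: [0, 0])  # node -> [failed, total]
    for _pod, node, ok in results:
        totals[node][1] += 1
        if not ok:
            totals[node][0] += 1
    return {node: failed / total for node, (failed, total) in totals.items()}


if __name__ == "__main__":
    # Hypothetical observations; replace with real pod/node/result data.
    observations = [
        ("nginx-a", "node-1", False),
        ("fluentd-a", "node-1", False),
        ("coredns-a", "node-2", True),
        ("nginx-b", "node-2", True),
    ]
    for node, rate in sorted(failure_rate_by_node(observations).items()):
        print(f"{node}: {rate:.0%} of pods failing DNS")
```

A failure rate concentrated on a few nodes would point toward the host network layer, while failures spread evenly across nodes would point elsewhere, for example at the CoreDNS pods inside the vcluster.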