mzaferyahsi commented 10 months ago

### What happened?

When I installed a fresh vcluster using the vcluster-k8s Helm chart, I noticed that my main Pi-hole DNS server (outside the cluster) was being queried for etcd node names. I believe this is because CoreDNS cannot resolve these names and therefore forwards the requests to the upstream server.

Logs from Pi-hole:

```
Dec  7 23:04:53 dnsmasq[206427]: reply arelon-etcd-17 is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-17.arelon-etcd-headless.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-17.arelon-etcd-headless.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-17.arelon-etcd-headless from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-17.arelon-etcd-headless is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-17.arelon-etcd-headless.vcluster-arelon from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-17.arelon-etcd-headless.vcluster-arelon is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-17.arelon-etcd-headless.vcluster-arelon from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-17.arelon-etcd-headless.vcluster-arelon is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-18.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-18.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-18 from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: forwarded arelon-etcd-18 to 10.32.0.1
Dec  7 23:04:53 dnsmasq[206427]: reply arelon-etcd-18 is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-18.arelon-etcd-headless.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.arelon-etcd-headless.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-18.arelon-etcd-headless.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.arelon-etcd-headless.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-18.arelon-etcd-headless from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.arelon-etcd-headless is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-18.arelon-etcd-headless.vcluster-arelon.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.arelon-etcd-headless.vcluster-arelon.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-18.arelon-etcd-headless.vcluster-arelon.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.arelon-etcd-headless.vcluster-arelon.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-18.arelon-etcd-headless.vcluster-arelon from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-18.arelon-etcd-headless.vcluster-arelon is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-19.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-19.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-19.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-19.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-19 from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: forwarded arelon-etcd-19 to 10.32.0.1
Dec  7 23:04:53 dnsmasq[206427]: reply arelon-etcd-19 is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-19.arelon-etcd-headless.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-19.arelon-etcd-headless.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-19.arelon-etcd-headless from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-19.arelon-etcd-headless is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-19.arelon-etcd-headless.vcluster-arelon.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-19.arelon-etcd-headless.vcluster-arelon.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-19.arelon-etcd-headless.vcluster-arelon from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-19.arelon-etcd-headless.vcluster-arelon is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[A] arelon-etcd-2.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-2.alroze.cloud is NXDOMAIN
Dec  7 23:04:53 dnsmasq[206427]: query[AAAA] arelon-etcd-2.alroze.cloud from 10.21.1.12
Dec  7 23:04:53 dnsmasq[206427]: cached arelon-etcd-2.alroze.cloud is NXDOMAIN
```
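The order of the queries in these logs looks consistent with ordinary resolver search-list expansion. The following is a minimal sketch of the glibc-style rule: a name with at least `ndots` dots is tried as-is first, otherwise the `search` suffixes are tried first and the bare name last. `candidateNames` is a hypothetical helper (not part of vcluster or CoreDNS), and the search list in `main` is an assumption: a pod in the `vcluster-arelon` namespace on a node whose own search domain is `alroze.cloud`.

```go
package main

import (
	"fmt"
	"strings"
)

// candidateNames returns the names a glibc-style stub resolver would try, in
// order, for name with the given "options ndots" value and "search" list.
// Hypothetical helper for illustration only.
func candidateNames(name string, ndots int, search []string) []string {
	if strings.HasSuffix(name, ".") {
		// A trailing dot marks the name as absolute: no search expansion.
		return []string{name}
	}
	expanded := make([]string, 0, len(search))
	for _, suffix := range search {
		expanded = append(expanded, name+"."+suffix)
	}
	if strings.Count(name, ".") >= ndots {
		// Enough dots: try the name as-is first, then each search suffix.
		return append([]string{name}, expanded...)
	}
	// Too few dots: try the search suffixes first, the bare name last.
	return append(expanded, name)
}

func main() {
	// Assumed search list; adjust to the pod's actual /etc/resolv.conf.
	search := []string{
		"vcluster-arelon.svc.cluster.local",
		"svc.cluster.local",
		"cluster.local",
		"alroze.cloud",
	}
	for _, n := range candidateNames("arelon-etcd-18.arelon-etcd-headless.vcluster-arelon", 2, search) {
		fmt.Println(n)
	}
}
```

With `ndots: 2`, `arelon-etcd-18.arelon-etcd-headless.vcluster-arelon` is tried as-is before any cluster suffix. Neither that name nor its `alroze.cloud` variant is under `cluster.local`, so CoreDNS would presumably forward both upstream, which matches two of the queries Pi-hole received.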

### What did you expect to happen?

DNS queries not to leak to my Pi-hole.

### How can we reproduce it (as minimally and precisely as possible)?

1. Set up Pi-hole.
2. Deploy a Kubernetes cluster with Kubespray:
   - Use the Pi-hole IPs as the DNS servers
   - Use `ndots: 2`
   - Use `cluster_name: cluster.local`
3. Deploy vcluster-k8s with the following `values.yaml`:
```yaml
# Enable HA mode
enableHA: true

# Scale up syncer replicas
syncer:
  replicas: 3
  extraArgs:
    - "--tls-san=cluster.arelon.xxx.xxx"
    - "--tls-san=10.21.2.10"

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx

# Scale up etcd
etcd:
  replicas: 3

# Scale up controller manager
controller:
  replicas: 3

# Scale up api server
api:
  replicas: 3

# Scale up DNS server
coredns:
  replicas: 3

storage:
  className: longhorn

sync:
  secrets:
    enabled: true
  persistentvolumes:
    enabled: true
  volumesnapshots:
    enabled: false
  serviceaccounts:
    enabled: true
  networkpolicies:
    enabled: true
  pods:
    enabled: true
    # Sync ephemeralContainers to host cluster
    ephemeralContainers: true
    # Sync readiness gates to host cluster
    status: true
```

4. Monitor `/var/log/pihole.log`.
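To quantify the leak while monitoring the Pi-hole log, a small filter could count the `query[...]` lines for etcd names per client. This is a sketch that assumes the dnsmasq line format shown in the logs above (`dnsmasq[PID]: query[TYPE] NAME from CLIENT`); `countQueries` and the `-etcd-` substring filter are illustrative choices, not an existing Pi-hole feature.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// countQueries tallies dnsmasq "query[TYPE] NAME from CLIENT" log lines whose
// queried NAME contains substr, keyed by client address.
func countQueries(lines []string, substr string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		// The message body starts after the "dnsmasq[PID]: " prefix.
		idx := strings.Index(line, "]: ")
		if idx < 0 {
			continue
		}
		fields := strings.Fields(line[idx+3:])
		// Only keep lines shaped like: query[TYPE] NAME from CLIENT
		if len(fields) != 4 || !strings.HasPrefix(fields[0], "query[") || fields[2] != "from" {
			continue
		}
		if strings.Contains(fields[1], substr) {
			counts[fields[3]]++
		}
	}
	return counts
}

func main() {
	// Read log lines from stdin.
	var lines []string
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	for client, n := range countQueries(lines, "-etcd-") {
		fmt.Printf("%s\t%d\n", client, n)
	}
}
```

For example, `go run main.go < pihole.log` on a copy of the log prints one line per client that queried etcd names, which makes it easy to compare before and after a configuration change.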

### Anything else we need to know?

This issue also occurs on vcluster v0.16.4.

### Host cluster Kubernetes version

<details>

```console
$ kubectl version
Server Version: v1.28.3
```

</details>

### Host cluster Kubernetes distribution

```
Self-hosted K8s
```

### vcluster version

```console
$ vcluster --version
0.18.0
```

### vcluster Kubernetes distribution (k3s (default), k8s, k0s)

```
k8s
```

### OS and Arch

```
OS: Ubuntu 22.04.3 LTS
Arch: x64
```

mzaferyahsi commented 10 months ago

It seems that the etcd nodes cannot be resolved via the `xxx-etcd-headless` service. When I create a separate service for each etcd instance, the DNS queries to Pi-hole seem to stop.

FabianKramm commented 10 months ago

@mzaferyahsi thanks for creating this issue! I'm not sure why this is not working for you, but I don't think we should create a separate service for each replica. Maybe we only need to adjust the certificates themselves, and that would already be enough.

mzaferyahsi commented 10 months ago

@FabianKramm indeed, that is another solution. However, the number of etcd SANs still needs to be limited to the actual number of replicas. Therefore, I still recommend passing the number of etcd replicas as a parameter to the syncer deployment. I've just tested adjusting the etcd SANs as below, and it seems to work.

```go
for i := 0; i < etcdReplicaCount; i++ {
	if etcdEmbedded {
		// this is for embedded etcd
		hostname := vClusterName + "-" + strconv.Itoa(i)
		etcdSans = append(etcdSans, hostname, hostname+"."+vClusterName+"-headless", hostname+"."+vClusterName+"-headless"+"."+currentNamespace)
	} else {
		// this is for external etcd
		etcdHostname := etcdService + "-" + strconv.Itoa(i)
		// previously three SANs per replica; now only the namespace-qualified name:
		// etcdSans = append(etcdSans, etcdHostname, etcdHostname+"."+etcdService+"-headless", etcdHostname+"."+etcdService+"-headless"+"."+currentNamespace)
		etcdSans = append(etcdSans, etcdHostname+"."+etcdService+"-headless"+"."+currentNamespace)
	}
}
```
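A self-contained version of the same idea, generating SANs only for the configured number of replicas rather than a fixed upper bound, might look like the following sketch. `etcdSANs` is a hypothetical helper whose parameters mirror the variables in the snippet above (`vClusterName`, `etcdService`, `currentNamespace`, `etcdReplicaCount`, `etcdEmbedded`); it is not the actual vcluster code.

```go
package main

import (
	"fmt"
	"strconv"
)

// etcdSANs builds etcd certificate SANs for exactly `replicas` etcd pods.
// Hypothetical helper modelled on the snippet above.
func etcdSANs(vClusterName, etcdService, namespace string, replicas int, embedded bool) []string {
	var sans []string
	for i := 0; i < replicas; i++ {
		if embedded {
			// Embedded etcd: pod name, headless name, and namespace-qualified name.
			hostname := vClusterName + "-" + strconv.Itoa(i)
			headless := hostname + "." + vClusterName + "-headless"
			sans = append(sans, hostname, headless, headless+"."+namespace)
		} else {
			// External etcd: only the namespace-qualified headless-service name.
			hostname := etcdService + "-" + strconv.Itoa(i)
			sans = append(sans, hostname+"."+etcdService+"-headless."+namespace)
		}
	}
	return sans
}

func main() {
	for _, san := range etcdSANs("arelon", "arelon-etcd", "vcluster-arelon", 3, false) {
		fmt.Println(san)
	}
}
```

Passing the replica count in keeps the certificate limited to hostnames that can actually exist, so nothing downstream is tempted to look up `arelon-etcd-17` and similar names for replicas that were never created.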

Shall I apply the same logic to the embedded etcd as well?

FabianKramm commented 9 months ago

@mzaferyahsi we are doing some refactoring in this area right now, but yes, we can add that once the refactoring is done.

PavelGloba commented 1 week ago

I have the same issue on vcluster 0.19.3: DNS requests for non-existent etcd replicas are sent to the host cluster's DNS server (CoreDNS in my case). The services and configuration generated by the Helm chart for the existing replicas are also incorrect: the domains should end with `svc.cluster.local`, otherwise they do not resolve on the first attempt. As far as I can see, the master branch still contains code that creates configuration for 20 replicas.
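For reference, Kubernetes DNS gives each StatefulSet pod behind its headless service a name of the form `<pod>.<service>.<namespace>.svc.<cluster-domain>`. Such a name has at least five dots, so a resolver tries it as-is first even with the default `ndots: 5`, and it resolves on the first attempt instead of walking the search list. The sketch below builds these names; `etcdPeerFQDNs` is a hypothetical helper, and the StatefulSet, service and namespace names are assumptions taken from the logs in this issue.

```go
package main

import (
	"fmt"
	"strconv"
)

// etcdPeerFQDNs returns the fully qualified DNS name of each etcd pod behind a
// headless service: <pod>.<service>.<namespace>.svc.<clusterDomain>.
// Hypothetical helper for illustration only.
func etcdPeerFQDNs(statefulSet, headlessService, namespace, clusterDomain string, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		pod := statefulSet + "-" + strconv.Itoa(i)
		names = append(names, pod+"."+headlessService+"."+namespace+".svc."+clusterDomain)
	}
	return names
}

func main() {
	for _, name := range etcdPeerFQDNs("arelon-etcd", "arelon-etcd-headless", "vcluster-arelon", "cluster.local", 3) {
		fmt.Println(name)
	}
}
```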

mzaferyahsi commented 1 week ago

Indeed, this hasn't been fixed yet. One workaround I've used is to run the initialization with an older version of vcluster and then upgrade the cluster to the latest one. That way the certificates are generated correctly during initialization and then reused by the upgraded cluster.

PavelGloba commented 1 week ago

I just tried upgrading a fresh 0.17.1 installation to 0.19.3. It didn't alter any etcd certificates, and the DNS problem is still there.

loft-sh / vcluster

etcd DNS query to host DNS #1402