kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

no_proxy fed to the cluster is missing the control plane API dns name #2884

Closed wherka-ama closed 2 years ago

wherka-ama commented 2 years ago

What happened: kind failed to create the cluster in an environment with an http(s) proxy

>kind create cluster
enabling experimental podman provider
Creating cluster "kind" ...
 βœ“ Ensuring node image (kindest/node:v1.24.3) πŸ–Ό 
 βœ“ Preparing nodes πŸ“¦  
 βœ“ Writing configuration πŸ“œ 
 βœ— Starting control-plane πŸ•ΉοΈ 
ERROR: failed to create cluster: failed to init node with kubeadm: command "podman exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0817 10:16:02.075463     148 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
W0817 10:16:02.076834     148 initconfiguration.go:332] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.24.3
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0817 10:16:02.082340     148 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0817 10:16:02.217935     148 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kind-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 10.89.0.14 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0817 10:16:02.444010     148 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0817 10:16:02.524854     148 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0817 10:16:02.713906     148 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0817 10:16:02.897146     148 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kind-control-plane localhost] and IPs [10.89.0.14 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kind-control-plane localhost] and IPs [10.89.0.14 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0817 10:16:03.342608     148 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I0817 10:16:03.504921     148 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0817 10:16:03.569178     148 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0817 10:16:03.621297     148 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0817 10:16:03.872480     148 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0817 10:16:04.133926     148 kubelet.go:65] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0817 10:16:04.260842     148 manifests.go:99] [control-plane] getting StaticPodSpecs
I0817 10:16:04.261098     148 certs.go:522] validating certificate period for CA certificate
I0817 10:16:04.261198     148 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0817 10:16:04.261214     148 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0817 10:16:04.261222     148 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0817 10:16:04.261227     148 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0817 10:16:04.261232     148 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0817 10:16:04.263297     148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0817 10:16:04.263332     148 manifests.go:99] [control-plane] getting StaticPodSpecs
I0817 10:16:04.263514     148 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0817 10:16:04.263528     148 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0817 10:16:04.263532     148 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0817 10:16:04.263536     148 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0817 10:16:04.263540     148 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0817 10:16:04.263543     148 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0817 10:16:04.263547     148 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0817 10:16:04.264167     148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0817 10:16:04.264186     148 manifests.go:99] [control-plane] getting StaticPodSpecs
I0817 10:16:04.264353     148 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0817 10:16:04.264731     148 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0817 10:16:04.265196     148 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0817 10:16:04.265210     148 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0817 10:16:04.265657     148 loader.go:372] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0817 10:16:04.367121     148 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 101 milliseconds
...
I0817 10:16:43.962771     148 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 94 milliseconds
[kubelet-check] Initial timeout of 40s passed.
I0817 10:16:44.465505     148 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 98 milliseconds
...
I0817 10:20:04.548982     148 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 92 milliseconds

Unfortunately, an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
        cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1571
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1571

What you expected to happen: Successful cluster creation

How to reproduce it (as minimally and precisely as possible):

  1. Set up or use an existing http(s) proxy - in this context we used squid/4.15
  2. Set the env variables https_proxy and http_proxy to point at the proxy above
  3. Using podman as the provider (KIND_EXPERIMENTAL_PROVIDER=podman), try to create the cluster

Anything else we need to know?: I've done some extra troubleshooting to recreate the API server health checks performed during cluster boot:

root@test-control-plane:/# curl https://test-control-plane:6443/healthz?timeout=10s
curl: (56) Received HTTP code 403 from proxy after CONNECT

root@test-control-plane:/# curl --verbose https://test-control-plane:6443/healthz?timeout=10s
* Uses proxy env variable no_proxy == 'fc00:f853:ccd:e793::/64,10.89.0.0/24,.xxx(redacted),127.0.0.1,localhost,10.96.0.0/16,10.244.0.0/16,.svc,.svc.cluster,.svc.cluster.local'
* Uses proxy env variable https_proxy == 'http://<redacted>:80'
*   Trying <redacted>:80...
* Connected to <redacted> (<redacted>) port 80 (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to test-control-plane:6443
> CONNECT test-control-plane:6443 HTTP/1.1
> Host: test-control-plane:6443
> User-Agent: curl/7.74.0
> Proxy-Connection: Keep-Alive
> 
< HTTP/1.1 403 Forbidden
< Server: squid/4.15
< Mime-Version: 1.0
< Date: Tue, 16 Aug 2022 15:01:29 GMT
< Content-Type: text/html;charset=utf-8
< Content-Length: 3505
< X-Squid-Error: ERR_ACCESS_DENIED 0
< Vary: Accept-Language
< Content-Language: en
< X-Cache: MISS from <redacted>
< X-Cache-Lookup: NONE from <redacted>:80
< Via: 1.1 <redacted> (squid/4.15)
< Connection: keep-alive
< 
* Received HTTP code 403 from proxy after CONNECT
* CONNECT phase completed!
* Closing connection 0
curl: (56) Received HTTP code 403 from proxy after CONNECT

I've already modified podman's getProxyEnv to add <cluster name>-control-plane to no_proxy, which improves the situation. With that patch, everything worked as expected. I'm happy to propose such an implementation as a PR.

Environment:

wherka-ama commented 2 years ago

Here is the patch that I was referring to in the description:

diff --git a/pkg/cluster/internal/providers/docker/provision.go b/pkg/cluster/internal/providers/docker/provision.go
index 97b05594..aab4a3f4 100644
--- a/pkg/cluster/internal/providers/docker/provision.go
+++ b/pkg/cluster/internal/providers/docker/provision.go
@@ -286,12 +286,12 @@ func getProxyEnv(cfg *config.Cluster, networkName string, nodeNames []string) (m

                noProxyList := append(subnets, envs[common.NOProxy])
                noProxyList = append(noProxyList, nodeNames...)
-               // Add pod and service dns names to no_proxy to allow in cluster
+               // Add pod, service and control plane(API server) dns names to no_proxy to allow in cluster
                // Note: this is best effort based on the default CoreDNS spec
                // https://github.com/kubernetes/dns/blob/master/docs/specification.md
                // Any user created pod/service hostnames, namespaces, custom DNS services
                // are expected to be no-proxied by the user explicitly.
-               noProxyList = append(noProxyList, ".svc", ".svc.cluster", ".svc.cluster.local")
+               noProxyList = append(noProxyList, ".svc", ".svc.cluster", ".svc.cluster.local", strings.Join([]string{cfg.Name, "control-plane"}, "-"))
                noProxyJoined := strings.Join(noProxyList, ",")
                envs[common.NOProxy] = noProxyJoined
                envs[strings.ToLower(common.NOProxy)] = noProxyJoined
diff --git a/pkg/cluster/internal/providers/podman/provision.go b/pkg/cluster/internal/providers/podman/provision.go
index a515324e..68c1a2a0 100644
--- a/pkg/cluster/internal/providers/podman/provision.go
+++ b/pkg/cluster/internal/providers/podman/provision.go
@@ -252,12 +252,12 @@ func getProxyEnv(cfg *config.Cluster, networkName string) (map[string]string, er
                        return nil, err
                }
                noProxyList := append(subnets, envs[common.NOProxy])
-               // Add pod and service dns names to no_proxy to allow in cluster
+               // Add pod, service and control plane(API server) dns names to no_proxy to allow in cluster
                // Note: this is best effort based on the default CoreDNS spec
                // https://github.com/kubernetes/dns/blob/master/docs/specification.md
                // Any user created pod/service hostnames, namespaces, custom DNS services
                // are expected to be no-proxied by the user explicitly.
-               noProxyList = append(noProxyList, ".svc", ".svc.cluster", ".svc.cluster.local")
+               noProxyList = append(noProxyList, ".svc", ".svc.cluster", ".svc.cluster.local", strings.Join([]string{cfg.Name, "control-plane"}, "-"))
                noProxyJoined := strings.Join(noProxyList, ",")
                envs[common.NOProxy] = noProxyJoined
                envs[strings.ToLower(common.NOProxy)] = noProxyJoined

I can create a PR as well if that's a preferred way of solving this problem relatively quickly.
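
For illustration, here is a minimal, runnable sketch (the subnet and pre-existing NO_PROXY values are assumed examples, not taken from this environment) of the no_proxy string the patched code would produce for a default cluster named "kind":

package main

import (
	"fmt"
	"strings"
)

func main() {
	clusterName := "kind" // default cluster name
	noProxyList := []string{
		"10.89.0.0/24",        // container network subnet (example value)
		"127.0.0.1,localhost", // whatever was already in NO_PROXY (example value)
		".svc", ".svc.cluster", ".svc.cluster.local",
		// mirrors the patch above -> "kind-control-plane"
		strings.Join([]string{clusterName, "control-plane"}, "-"),
	}
	fmt.Println(strings.Join(noProxyList, ","))
	// 10.89.0.0/24,127.0.0.1,localhost,.svc,.svc.cluster,.svc.cluster.local,kind-control-plane
}

With kind-control-plane present in no_proxy, the kubeadm health check against https://kind-control-plane:6443 bypasses the proxy instead of hitting the 403 shown above.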

aojea commented 2 years ago

I think we probably should add all the container names and the ~container subnet~ to the NO_PROXY

https://github.com/kubernetes-sigs/kind/blob/5f25ddcbdc84b057e3a48304a4f2295b3feef775/pkg/internal/cluster/providers/provider/common/proxy.go#L49-L54

at least the node names are available in that function

/cc @BenTheElder

EDIT

we are already passing the container subnet in provision.go; it seems only the container names are missing
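
As a rough sketch only (not the final patch, and the helper name is hypothetical), folding the container/node names into the no_proxy assembly mirrors the append pattern the docker provider already uses via its nodeNames parameter in the diff above:

package provision

import "strings"

// buildNoProxyList is a hypothetical helper: node (container) names are added
// alongside the pod/service DNS suffixes so in-cluster requests such as
// https://kind-control-plane:6443 are never sent through the HTTP proxy.
func buildNoProxyList(subnets, nodeNames []string, existing string) string {
	list := append([]string{}, subnets...)
	list = append(list, existing)     // whatever was already in NO_PROXY
	list = append(list, nodeNames...) // e.g. "kind-control-plane", "kind-worker"
	list = append(list, ".svc", ".svc.cluster", ".svc.cluster.local")
	return strings.Join(list, ",")
}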

aojea commented 2 years ago

/assign @wherka-ama please go ahead

wherka-ama commented 2 years ago

@aojea: the basic fix, covering just the control plane, has been added. I will now check how we can add all the container names in a similar fashion.

aojea commented 2 years ago

> @aojea: the basic fix, covering just the control plane, has been added. I will now check how we can add all the container names in a similar fashion.

Better to use an independent PR with the whole fix, please; clusters with multiple control-plane nodes will still fail with the basic fix.

wherka-ama commented 2 years ago

> Better to use an independent PR with the whole fix, please; clusters with multiple control-plane nodes will still fail with the basic fix.

@aojea: alright, I'll push something more sophisticated then (still based on https://github.com/kubernetes-sigs/kind/pull/2885; I'll just push more/improved changes there). I think I know roughly how it should look. I will test it with a multi-node / multi-control-plane scenario to ensure it does what it is supposed to. Thanks for your guidance! Much appreciated.

wherka-ama commented 2 years ago

I've pushed an improved implementation that addresses multi-node environments (@aojea: thanks for pointing that out!) and uses the existing MakeNodeNamer factory, which keeps the whole solution aligned with the rest of the code. I've tested it in single and multi-node configurations, with and without the proxy, and it looks like we have something that works rather nicely.
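
For context, here is a simplified stand-in for what a MakeNodeNamer-style factory does (the exact signature of the real helper is assumed, not quoted from the code): the first node of a role gets "<cluster>-<role>" and later ones get a numeric suffix, producing container names such as kind-control-plane, kind-control-plane2 and kind-worker, which is exactly the list that has to land in no_proxy for multi-node clusters:

package provision

import "fmt"

// nodeNamer is a simplified, hypothetical stand-in for kind's MakeNodeNamer
// factory: it returns a closure that names nodes per role, matching container
// names like kind-control-plane, kind-control-plane2, kind-worker.
func nodeNamer(clusterName string) func(role string) string {
	counts := make(map[string]int)
	return func(role string) string {
		counts[role]++
		if counts[role] == 1 {
			return fmt.Sprintf("%s-%s", clusterName, role)
		}
		return fmt.Sprintf("%s-%s%d", clusterName, role, counts[role])
	}
}

// Iterating over the cluster config's nodes and collecting namer(node.Role)
// yields the full set of names to append to no_proxy.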

See: https://github.com/kubernetes-sigs/kind/pull/2885

@aojea and @BenTheElder : do you mind having a look please?

BTW: I wrote some unit tests for that code. However, there is a bit of a problem with mocking the exec.Command call that is buried within provision.getSubnets. It will take some effort to make it work as a proper unit test, i.e. without running the actual syscalls and without it becoming ugly and incomprehensible. I'm happy to spend more time on that part, but I suggest we tackle it in the next release.
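
As a hypothetical illustration of one possible seam (none of these names are kind's actual API), the exec call could be hidden behind a narrow interface so a unit test injects canned subnets instead of shelling out:

package provision

import "testing"

// subnetLister hides the command execution (e.g. a podman/docker network
// inspect call) behind an interface so tests can substitute fixed output.
type subnetLister interface {
	ListSubnets(networkName string) ([]string, error)
}

// fakeLister satisfies subnetLister without running any commands.
type fakeLister struct{ subnets []string }

func (f fakeLister) ListSubnets(string) ([]string, error) { return f.subnets, nil }

func TestSubnetsWithFakeLister(t *testing.T) {
	var l subnetLister = fakeLister{subnets: []string{"10.89.0.0/24"}}
	got, err := l.ListSubnets("kind")
	if err != nil || len(got) != 1 || got[0] != "10.89.0.0/24" {
		t.Fatalf("unexpected result: %v, %v", got, err)
	}
}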

aojea commented 2 years ago

/close

https://github.com/kubernetes-sigs/kind/pull/2885

k8s-ci-robot commented 2 years ago

@aojea: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/kind/issues/2884#issuecomment-1222033993):

> /close
>
> https://github.com/kubernetes-sigs/kind/pull/2885

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.