kubermatic / kubeone

Kubermatic KubeOne automates cluster operations across all your cloud, on-prem, edge, and IoT environments.
https://kubeone.io
Apache License 2.0

Networking problems after second kubeone run #1379

Closed · exolab closed 3 years ago

exolab commented 3 years ago

I am using kubeone to set up a cluster on Hetzner cloud. After the initial run, things mostly work. There is a problem reaching kube-dns-upstream from pods on the worker nodes, but that can be fixed by doing a rolling restart of coredns.
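
For reference, the rolling restart I mean is the standard kubectl one; the coredns deployment name and kube-system namespace are assumptions based on the stock kubeadm layout that kubeone uses:

kubectl -n kube-system rollout restart deployment coredns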

However, after I run kubeone a second time, canal gets redeployed, which renders the cluster unusable. Every request seems to take 10 seconds (which of course makes debugging a pain).

This is how I execute kubeone on both runs; kubeone.yaml and tf.json are identical each time. I am using kubeone 1.2.2 (I have also tried 1.2.1), and have tried Kubernetes 1.20 as well.

kubeone apply --auto-approve --debug --manifest kubeone.yaml -t tf.json
-- kubeone.yaml
apiVersion: kubeone.io/v1beta1
kind: KubeOneCluster
versions:
  kubernetes: 1.19.11
cloudProvider:
  hetzner: {}
  external: true
containerRuntime:
  containerd: {}
-- tf.json
{
  "kubeone_api": {
    "sensitive": false,
    "type": [
      "object",
      {
        "endpoint": "string"
      }
    ],
    "value": {
      "endpoint": "x.x.x.x"
    }
  },
  "kubeone_hosts": {
    "sensitive": false,
    "type": [
      "object",
      {
        "control_plane": [
          "object",
          {
            "bastion": "string",
            "cloud_provider": "string",
            "cluster_name": "string",
            "network_id": "string",
            "private_address": [
              "tuple",
              [
                "string",
                "string",
                "string"
              ]
            ],
            "public_address": "dynamic",
            "ssh_agent_socket": "string",
            "ssh_port": "number",
            "ssh_private_key_file": "string",
            "ssh_user": "string"
          }
        ]
      }
    ],
    "value": {
      "control_plane": {
        "bastion": "xxx.xxx.xxx.xxx",
        "cloud_provider": "hetzner",
        "cluster_name": "test-001",
        "network_id": "xxx",
        "private_address": [
          "x.x.x.x",
          "x.x.x.x",
          "x.x.x.x"
        ],
        "public_address": null,
        "ssh_agent_socket": "env:SSH_AUTH_SOCK",
        "ssh_port": 22,
        "ssh_private_key_file": "/builds/infrastructure/infrastructure.tmp/TF_VAR_CLUSTER_PRIVATE_KEY",
        "ssh_user": "root"
      }
    }
  },
  "kubeone_workers": {
    "sensitive": false,
    "type": [
      "object",
      {
        "test-001-pool1": [
          "object",
          {
            "providerSpec": [
              "object",
              {
                "cloudProviderSpec": [
                  "object",
                  {
                    "firewall": "string",
                    "image": "string",
                    "labels": [
                      "object",
                      {
                        "test-001-workers": "string"
                      }
                    ],
                    "location": "string",
                    "networks": [
                      "tuple",
                      [
                        "string"
                      ]
                    ],
                    "serverType": "string"
                  }
                ],
                "operatingSystem": "string",
                "operatingSystemSpec": [
                  "object",
                  {
                    "distUpgradeOnBoot": "bool"
                  }
                ],
                "sshPublicKeys": [
                  "tuple",
                  [
                    "string"
                  ]
                ]
              }
            ],
            "replicas": "number"
          }
        ]
      }
    ],
    "value": {
      "test-001-pool1": {
        "providerSpec": {
          "cloudProviderSpec": {
            "image": "ubuntu-20.04",
            "labels": {
              "test-001-workers": "pool1"
            },
            "location": "nbg1",
            "networks": [
              "x"
            ],
            "serverType": "ccx12"
          },
          "operatingSystem": "ubuntu",
          "operatingSystemSpec": {
            "distUpgradeOnBoot": false
          },
          "sshPublicKeys": [
            "ssh-ed25519 xxxxxxx"
          ]
        },
        "replicas": 1
      }
    }
  }
}

Does anyone have an idea what might be causing this? What logs should I look at specifically for debugging?

kron4eg commented 3 years ago

@exolab please try this PR (you will have to compile it): https://github.com/kubermatic/kubeone/pull/1380. It will be released as kubeone v1.2.3.
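
A rough sketch of building that PR locally (the make target is an assumption — use whatever the Makefile in the checkout provides, or plain go build):

git clone https://github.com/kubermatic/kubeone.git
cd kubeone
git fetch origin pull/1380/head:pr-1380 && git checkout pr-1380
make build    # or: go build -o kubeone .   (assumes a working Go toolchain)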

exolab commented 3 years ago

@kron4eg Thank you for the swift reply and PR. We did build kubeone ourselves and have since migrated to the released v1.2.3. However, we still see the same thing happening.

  1. We deploy the cluster.
  2. We start a container on a worker node, log in, and run nslookup google.com just fine; we get fast responses.
  3. We run kubeone a second time. All the canal pods get terminated and restarted.
  4. We try nslookup again and get no response: ;; connection timed out; no servers could be reached
  5. Looking at the node-local-dns logs on the worker node, we see that the kube-dns-upstream pod cannot be reached: [ERROR] plugin/errors: 2 route53.amazonaws.com.cluster.local. A: dial tcp 10.106.52.115:53: i/o timeout

This only happens after the second kubeone run.
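
For completeness, this is roughly how we inspect the DNS path after canal restarts. The kube-dns-upstream service name comes from the node-local-dns setup; the pod labels and container name below are assumptions based on the stock node-local-dns and canal manifests:

kubectl -n kube-system get svc,endpoints kube-dns-upstream
kubectl -n kube-system logs -l k8s-app=node-local-dns
kubectl -n kube-system logs -l k8s-app=canal -c calico-node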

kron4eg commented 3 years ago

> We run kubeone a second time. All the canal pods get terminated and restarted.

are you using kubeone apply each time, or kubeone install?

kron4eg commented 3 years ago

I can't reproduce this issue. Can you please share the deployment used to spawn the pods?

exolab commented 3 years ago

> We run kubeone a second time. All the canal pods get terminated and restarted.

> are you using kubeone apply each time, or kubeone install?

We are using apply in both cases...

exolab commented 3 years ago

> I can't reproduce this issue. Can you please share the deployment used to spawn the pods?

I am not entirely sure what you mean. This is how we are spawning the pod on the worker node:

kubectl run -i --tty dnsutils --image gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 --kubeconfig terraform/credentials/kubeconfig --restart=Never -- sh
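
Inside that pod we then simply run nslookup against an external name and, as a sanity check, a cluster-internal one (the latter is just the default kubernetes service name):

nslookup google.com
nslookup kubernetes.default.svc.cluster.local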

exolab commented 3 years ago

@kron4eg I can confirm that, after deploying a fresh cluster with the most recent patched version, running kubeone apply a second time no longer leads to the problem I had.

Thank you so much for your impressively swift reaction and resolution, @kron4eg!