kubermatic / kubeone

Kubermatic KubeOne automates cluster operations on all your cloud, on-prem, edge, and IoT environments.
https://kubeone.io
Apache License 2.0

Traffic seems to be routed via nodes' public IPs instead of private IPs #1388

Closed namelessvoid closed 5 months ago

namelessvoid commented 3 years ago

What happened:

I have a KubeOne cluster set up at Hetzner via the example Terraform scripts, which include a private network. The only change we made is adding worker pools for a list of datacenters:

variable "datacenters" {
  type = list(string)
  default = ["nbg1", "fsn1"]
}

output "kubeone_workers" {
  description = "Workers definitions, that will be transformed into MachineDeployment object"

  value = {
    for idx, datacenter in var.datacenters: 

    # following outputs will be parsed by kubeone and automatically merged into
    # corresponding (by name) worker definition
    "${var.cluster_name}-pool${idx + 1}" => {
      replicas = var.workers_replicas
      providerSpec = {
        sshPublicKeys   = [file(var.ssh_public_key_file)]
        operatingSystem = var.worker_os
        operatingSystemSpec = {
          distUpgradeOnBoot = false
        }
        cloudProviderSpec = {
          # provider specific fields:
          # see example under `cloudProviderSpec` section at:
          # https://github.com/kubermatic/machine-controller/blob/master/examples/hetzner-machinedeployment.yaml
          serverType = var.worker_type
          location   = datacenter
          image      = var.image
          networks = [
            hcloud_network.net.id
          ]
          # Datacenter (optional)
          # datacenter = ""
          labels = {
            "${var.cluster_name}-workers" = "pool1"
          }
        }
      }
    }
  }
}

The resulting nodes look like this:

# kubectl get nodes -o wide
NAME                             STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP       OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
staging-control-plane-1          Ready    control-plane,master   29d   v1.20.6   192.168.0.3   195.201.XXX.XXX   Ubuntu 20.04.2 LTS   5.4.0-72-generic   docker://19.3.14
staging-control-plane-2          Ready    control-plane,master   29d   v1.20.6   192.168.0.5   162.55.XXX.XXX    Ubuntu 20.04.2 LTS   5.4.0-72-generic   docker://19.3.14
staging-control-plane-3          Ready    control-plane,master   29d   v1.20.6   192.168.0.4   195.201.XXX.XXX   Ubuntu 20.04.2 LTS   5.4.0-72-generic   docker://19.3.14
staging-pool1-5d679cf75-464fm    Ready    <none>                 16h   v1.20.6   192.168.0.9   195.201.XXX.XXX   Ubuntu 20.04.2 LTS   5.4.0-72-generic   docker://19.3.15
staging-pool2-84c786cf67-dxf9p   Ready    <none>                 17h   v1.20.6   192.168.0.7   162.55.XXX.XXX    Ubuntu 20.04.2 LTS   5.4.0-72-generic   docker://19.3.15

When I traceroute a Kubernetes service (e.g. backend.default.svc.cluster.local), I see that the traffic is routed via the public IPs of the nodes instead of the IPs within the private network:

# kubectl run ubuntu --image ubuntu -- sleep infinity
# kubectl exec -it ubuntu -- bash
root@ubuntu:/# traceroute backend.default.svc.cluster.local
traceroute to backend.default.svc.cluster.local (10.110.120.137), 30 hops max, 60 byte packets
 1  static.XXX.XXX.55.162.clients.your-server.de (162.55.XXX.XXX)  0.252 ms  0.052 ms  0.015 ms
 2  172.31.1.1 (172.31.1.1)  11.795 ms  11.217 ms  11.723 ms
 3  11202.your-cloud.host (159.69.96.89)  0.272 ms  0.410 ms  0.379 ms
 4  * * *
 5  spine1.cloud2.fsn1.hetzner.com (213.239.225.41)  0.954 ms  1.292 ms  1.262 ms
 6  core23.fsn1.hetzner.com (213.239.239.137)  3.423 ms core23.fsn1.hetzner.com (213.239.239.125)  2.076 ms core24.fsn1.hetzner.com (213.239.239.133)  4.113 ms
 7  core11.nbg1.hetzner.com (213.239.245.225)  6.156 ms core11.nbg1.hetzner.com (213.239.203.125)  10.360 ms  6.085 ms^C

Where 162.55.XXX.XXX is the public IP of the node. I'd expect the traffic to be sent to 192.168.0.7 instead. I checked on a GKE cluster, and there the traffic does seem to be routed via the private IPs.

As a consequence, if I apply a firewall which blocks access to the nodes' public IPs, the cluster networking becomes non-operational, in the sense that DNS lookups no longer work and services cannot be reached.

What is the expected behavior:

In-cluster traffic should be routed via private IPs, not via public IPs. I should also be able to restrict access to the nodes' public IPs via a firewall while the cluster stays operational.
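As a sanity check, the addresses each node actually registers can be listed with kubectl; in-cluster components should prefer the InternalIP entries. This is a generic diagnostic sketch, not specific to this cluster:

```shell
# Show the address values and types (InternalIP vs ExternalIP) each node registers.
kubectl get nodes -o custom-columns='NAME:.metadata.name,ADDRESSES:.status.addresses[*].address,TYPES:.status.addresses[*].type'
```

In the node table above, both an InternalIP and an ExternalIP are registered, so this alone does not decide which one traffic uses; that depends on the CNI and CCM configuration.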

How to reproduce the issue:

I did not try it with a fresh install, but the steps to reproduce should be:

  1. Install kubeone on Hetzner with the default terraform templates
  2. Create some default pod with a service (nginx should suffice)
  3. Create a second pod (e.g. ubuntu) and traceroute the service created in 2).

Anything else we need to know?

Information about the environment:

KubeOne version (kubeone version): the cluster was created with KubeOne 1.2.1 but was updated to 1.2.2 and then 1.2.3 recently. MachineDeployments have been restarted via https://docs.kubermatic.com/kubeone/master/cheat_sheets/rollout_machinedeployment/
Operating system: Ubuntu 20.04.2 LTS
Provider you're deploying cluster on: Hetzner
Operating system you're deploying on: macOS

Hope you can help me with that! Thank you a lot!

kron4eg commented 3 years ago

I'm not sure if cross-datacenter traffic can be sent over the private IPs. I suppose that question should be directed at Hetzner Cloud themselves.

kron4eg commented 3 years ago

OK, I've tried creating VMs in different DCs and they are able to communicate with each other over the private network.

kron4eg commented 3 years ago

@namelessvoid can you please build kubeone from the latest master and try it?

I'm getting different results

root@ubuntu:/# traceroute 10.244.7.2
traceroute to 10.244.7.2 (10.244.7.2), 30 hops max, 60 byte packets
 1  static.123.164.55.162.clients.your-server.de (162.55.164.123)  0.156 ms  0.066 ms  0.073 ms
 2  10.244.7.0 (10.244.7.0)  4.442 ms  4.234 ms  4.131 ms
 3  10.244.7.2 (10.244.7.2)  4.282 ms  4.042 ms  3.844 ms

where 10.244.7.2 is the overlay IP of the pod running in the other datacenter.

namelessvoid commented 3 years ago

Maybe I'm getting it wrong, but shouldn't the first hop be the private network IP of your node? 162.55.164.123 is the public IP, isn't it? Disclaimer: I'm not too deep into k8s networking 🙈

I'll try the latest master as soon as I can (I'm a bit tied up by releases right now).

namelessvoid commented 3 years ago

Just for completeness, I tried a fresh cluster installed with kubeone 1.2.3 and see these results:

 1  static.170.210.55.162.clients.your-server.de (162.55.210.170)  0.147 ms  0.033 ms  0.024 ms
 2  172.31.1.1 (172.31.1.1)  13.458 ms  13.301 ms  13.008 ms
 3  11685.your-cloud.host (195.201.67.143)  0.607 ms  0.478 ms  0.515 ms

Then I built kubeone from master and retried on another freshly installed cluster:

root@ubuntu:/# traceroute nginx.default.svc.cluster.local
traceroute to nginx.default.svc.cluster.local (10.103.180.121), 30 hops max, 60 byte packets
 1  static.97.89.201.138.clients.your-server.de (138.201.89.97)  0.062 ms  0.027 ms  0.022 ms
 2  172.31.1.1 (172.31.1.1)  14.458 ms  14.358 ms  14.322 ms
 3  12740.your-cloud.host (136.243.181.165)  0.512 ms  0.447 ms  0.395 ms

kubeone version for the self-built one shows

{
  "kubeone": {
    "major": "1",
    "minor": "2",
    "gitVersion": "v1.2.0-rc.0-65-gab496ef",
    "gitCommit": "ab496efdaa222e92f14a1d0cbe63149d57f8cc53",
    "gitTreeState": "",
    "buildDate": "2021-06-22T11:48:09+02:00",
    "goVersion": "go1.16.5",
    "compiler": "gc",
    "platform": "darwin/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "30",
    "gitVersion": "v1.30.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Test setup:

$ kubectl run nginx --image nginx
$ kubectl expose pod nginx --port 80
$ kubectl run ubuntu --image ubuntu -- sleep infinity
$ kubectl exec -it ubuntu -- bash
  # apt update && apt install traceroute -y
  # traceroute nginx.default.svc.cluster.local

namelessvoid commented 3 years ago

Did a third test by installing the cluster from the example terraform files.

Kubeone manifest looks like this:

apiVersion: kubeone.io/v1beta1
kind: KubeOneCluster

versions:
  kubernetes: '1.20.6'

cloudProvider:
  hetzner: {}
  external: true

addons:
  enable: true
  path: "./addons"

For the test, ./addons was empty.

I tried both a cluster with a single worker node and a cluster with two worker nodes. The traceroute results remain the same: traffic is routed via public IPs.

I'm attaching some screenshots of the networking section of the Hetzner Cloud Console. This should be set up correctly, shouldn't it?


I'd be happy about any ideas for further debugging! Thank you a lot! :)

namelessvoid commented 3 years ago

Ok, maybe I found something - sorry for not thinking about this earlier!

When I traceroute the pod IP as you did, @kron4eg, I also see the traffic using the overlay IP:

$ traceroute 10.244.8.36
traceroute to 10.244.8.36 (10.244.8.36), 30 hops max, 60 byte packets
 1  static.XXX.XXX.XXX.162.clients.your-server.de (162.XXX.XXX.XXX)  0.132 ms  0.033 ms  0.021 ms
 2  10.244.8.0 (10.244.8.0)  3.768 ms  3.641 ms  3.543 ms
 3  10-244-8-36.nginx.default.svc.cluster.local (10.244.8.36)  3.622 ms  3.497 ms  3.490 ms

I'm still confused, though, why the public IP shows up in the trace.

But when accessing the service exposing the very same pod, it seems to take the public route again:

$ traceroute 10.109.255.202
traceroute to 10.109.255.202 (10.109.255.202), 30 hops max, 60 byte packets
 1  static.XXX.XXX.XXX.162.clients.your-server.de (162.55.166.14)  0.080 ms  0.039 ms  0.022 ms
 2  172.31.1.1 (172.31.1.1)  10.880 ms  9.905 ms  10.592 ms
 3  11202.your-cloud.host (159.69.96.89)  0.447 ms  0.332 ms  0.320 ms
 4  * * *
 5  spine2.cloud2.fsn1.hetzner.com (213.239.225.45)  1.018 ms spine1.cloud2.fsn1.hetzner.com (213.239.225.41)  0.958 ms spine2.cloud2.fsn1.hetzner.com (213.239.225.45)  1.263 ms
 6  core23.fsn1.hetzner.com (213.239.239.137)  13.665 ms  2.714 ms core24.fsn1.hetzner.com (213.239.239.129)  4.106 ms
 7  core11.nbg1.hetzner.com (213.239.203.125)  7.735 ms core12.nbg1.hetzner.com (213.239.203.121)  10.383 ms core11.nbg1.hetzner.com (213.239.203.125)  16.566 ms
 ...

So maybe some setting for the service overlay is not correct?

@kron4eg Could you maybe retry this on your end to confirm this? Thank you a lot!
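One way to narrow this down (a debugging sketch; the destination IPs are the ones from the traces above) is to ask a node's kernel which route it would pick for each destination. A pod-network IP should resolve to the CNI overlay device, while a service ClusterIP is rewritten by kube-proxy before routing, so its pre-DNAT lookup typically hits the default route:

```shell
# Run on a node. A pod-network destination should go via the overlay interface:
ip route get 10.244.8.36
# A service ClusterIP hits the default route first; kube-proxy then DNATs it to a pod IP:
ip route get 10.109.255.202
```

If the second lookup shows the public default route and source IP, that would match the traceroute behaviour observed here.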

kron4eg commented 3 years ago

I'll try to reproduce

kron4eg commented 3 years ago

@namelessvoid I still can't replicate that behaviour (using master build). Could you please attach your manifests (workloads/services/etc)?

namelessvoid commented 3 years ago

@kron4eg Sorry for the late response, got some stuff in my way in between...

There is nothing special, I believe:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx
  namespace: default
spec:
  containers:
  - image: nginx
    name: nginx
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: nginx
  name: nginx
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: ClusterIP

Lykos153 commented 3 years ago

I can confirm this issue.

kubectl get nodes -o wide
NAME                        STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
t1-control-plane-1          Ready    control-plane,master   80m   v1.21.3   10.8.0.2      188.34.X.X    Ubuntu 20.04.2 LTS   5.4.0-77-generic   containerd://1.4.8
t1-pool1-54f9cd8694-drz4m   Ready    <none>                 77m   v1.21.3   10.8.0.3      162.55.X.X   Ubuntu 20.04.2 LTS   5.4.0-77-generic   containerd://1.4.8

Testing with the manifests @namelessvoid provided in their last post:

root@ubuntu:/# traceroute 10.244.1.2
traceroute to 10.244.1.2 (10.244.1.2), 30 hops max, 60 byte packets
 1  static.103.165.55.162.clients.your-server.de (162.55.X.X)  0.100 ms  0.030 ms  0.065 ms
 2  10-244-1-2.nginx.default.svc.cluster.local (10.244.1.2)  0.223 ms  0.063 ms  0.069 ms

The first hop (162.55.X.X) is the external IP of the node. That should be 10.8.0.3 instead.

EDIT: OK, I suppose it was a false alarm. Pods keep talking to each other even though I'm now blocking all external traffic to the nodes. I'm still confused that the external IP shows up in the traceroute, though.

kron4eg commented 3 years ago

The first hop (162.55.X.X) is the external IP of the node

It's the node's own IP. This IP is the default route for pods.

Lykos153 commented 3 years ago

Can we somehow configure the internal IP to be the node's IP? Yesterday I said

Pods keep talking to each other even though I'm now blocking all external traffic to the nodes.

but that is only true if I use the SDN firewall provided by Hetzner. When I use iptables on the nodes to block all incoming traffic via the interface eth0, the pods can't communicate anymore.

I'd actually like to be able to disable the public interface completely. Is that somehow feasible with kubeone?
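For reference, blocking inbound traffic on the public interface with iptables might look like the sketch below. This is an illustrative assumption, not the exact rules used above; the interface name and SSH port are assumptions, and on Hetzner the private network is usually a separate ens10-style device that stays untouched:

```shell
# Allow replies to connections the node itself initiated.
iptables -A INPUT -i eth0 -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Keep SSH reachable so you don't lock yourself out.
iptables -A INPUT -i eth0 -p tcp --dport 22 -j ACCEPT
# Drop everything else arriving on the public interface.
iptables -A INPUT -i eth0 -j DROP
```

With rules like these, any in-cluster traffic that egresses via the public interface (as in the traceroutes above) gets dropped, which reproduces the breakage described here.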

kron4eg commented 3 years ago

@Lykos153 I suppose it can be achieved by using custom images.

Lykos153 commented 3 years ago

I can now say for sure that DNS traffic is still routed via the public interface. With all incoming public connections blocked, pods can reach each other via IP but not via service hostnames. Also, every request from pods to the internet has a ~5s delay due to the DNS timeout. The cluster is not usable unless I open port 53 on the public network. I'm going to try to get rid of the public interface using a custom image as you suggested. The issue remains, however.

ErwinSteffens commented 3 years ago

Any update here? We have the same issue.

We need to whitelist the public IP ranges as trusted IPs in our ingress to make the PROXY protocol work.

alam0rt commented 2 years ago

Same issue; it makes firewalling horrible. I have manually patched kubeconfigs to use the private IP... Maybe the kubeadm args can be overridden somewhere?

kron4eg commented 2 years ago

@alam0rt did it help?

alam0rt commented 2 years ago

> @alam0rt did it help?

It helps, but it gets overridden on upgrade because the kubeadm config is regenerated.

For the time being I am just adding the public IPs to the rules using:

data "hcloud_servers" "nodes" {
  with_selector = "role=node"
}

locals {
  node_public_ipv4 = [for node in data.hcloud_servers.nodes.servers : join("/", [node.ipv4_address, "32"])]
} 
kron4eg commented 2 years ago

The admin kubeconfig is generated using the value from the terraform output kubeone_api. By default this value is the public IP of the kube-apiserver load balancer. I don't see that hcloud_load_balancer can give you an internal IP.

output "kubeone_api" {
  description = "kube-apiserver LB endpoint"

  value = {
    endpoint = hcloud_load_balancer.load_balancer.ipv4
    apiserver_alternative_names = var.apiserver_alternative_names
  }
}
alam0rt commented 2 years ago

> The admin kubeconfig is generated using the value from the terraform output kubeone_api. By default this value is the public IP of the kube-apiserver load balancer. I don't see that hcloud_load_balancer can give you an internal IP.

output "kubeone_api" {
  description = "kube-apiserver LB endpoint"

  value = {
    endpoint = hcloud_load_balancer.load_balancer.ipv4
    apiserver_alternative_names = var.apiserver_alternative_names
  }
}

There definitely is a private IP that can be used. I'll give it a go soon and see what happens.

alam0rt commented 2 years ago

So, it looks like you can use

output "kubeone_api" {
  value = {
    endpoint = hcloud_load_balancer.load_balancer.network_ip
  }
}

network_ip is defined here: https://github.com/hetznercloud/terraform-provider-hcloud/blob/d6f4207b2b75b76e007bd08602e6dcbfb1740032/internal/loadbalancer/resource.go#L406

but is apparently undocumented!

kron4eg commented 2 years ago

OK, having the INTERNAL IP as the kube-apiserver endpoint means that the kubeconfigs for the whole system will contain that IP, including the admin config. KubeOne will work around that, so it's not an issue (we always tunnel kube-apiserver requests via SSH).

However, your local kubectl might have a problem. But worry not, kubeone proxy to the rescue! kubeone proxy creates a pass-through SSH-tunnel proxy that kubectl can easily leverage with export HTTPS_PROXY=http://....
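A hedged sketch of that workflow (the manifest and state file names are placeholders, and the proxy's listen address is whatever the command prints; check kubeone proxy --help for the exact flags and defaults):

```shell
# Terminal 1: start the SSH-tunnel proxy against your cluster.
kubeone proxy --manifest kubeone.yaml -t tf.json

# Terminal 2: route kubectl through the proxy's listen address.
export HTTPS_PROXY=http://<address printed by kubeone proxy>
kubectl get nodes
```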

alam0rt commented 2 years ago

Speaking of which, is there a good way to regenerate all of the kubeconfigs? I have updated the terraform output and ran kubeone apply --manifest kubeone.yaml -t new.json, but I don't think anything was updated. Maybe I need to force the upgrade?

kron4eg commented 2 years ago

No, I don't think it's possible, at least not under kubeadm. You'd need to create a new cluster.

alam0rt commented 2 years ago

Damn! New cluster it is I guess.

kron4eg commented 2 years ago

I mean, it can be done manually, but it's highly possible to kill your cluster. But if you'd like to try, here's how:

kron4eg commented 2 years ago

But I highly recommend not doing this in the cluster that has anything valuable running under it.

kubermatic-bot commented 2 years ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

xmudrii commented 2 years ago

/remove-lifecycle stale

Docs are still pending.

madalinignisca commented 2 years ago

Question: is the CNI configured and deployed before the Hetzner CCM or after? I'm not sure yet what actually happens, but per their instructions here https://github.com/hetznercloud/hcloud-cloud-controller-manager/blob/main/docs/deploy_with_networks.md I believe that a supported CNI will be handled by the CCM to ensure communication is done through the private interface.

madalinignisca commented 2 years ago

The Hetzner CCM manifest in addons is deployed without networks support, and will not try to make pods use the private network.

xmudrii commented 2 years ago

@madalinignisca The CNI is deployed before the CCM. We'll give this a try, but in the meanwhile, I recommend checking out Cilium if that works for you. Some folks reported more success with Cilium (e.g. https://github.com/kubermatic/kubeone/issues/2219).

kubermatic-bot commented 1 year ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

xmudrii commented 1 year ago

/remove-lifecycle stale
/lifecycle frozen

madalinignisca commented 1 year ago

> @madalinignisca The CNI is deployed before the CCM. We'll give this a try, but in the meanwhile, I recommend checking out Cilium if that works for you. Some folks reported more success with Cilium (e.g. #2219).

I managed to do a core manual setup with kubeadm and got everything working the way I had in mind. I'd love to find the time to try to get the same result with KubeOne. Yes, Cilium was involved, and since I had to dive deep into it, I think I'm never looking at another CNI.

xmudrii commented 5 months ago

This issue should be fixed as of KubeOne 1.7 at least. I'm going to close it, but if you still have the issue, please let us know. /close

kubermatic-bot commented 5 months ago

@xmudrii: Closing this issue.

In response to [this](https://github.com/kubermatic/kubeone/issues/1388#issuecomment-2120649712):

> This issue should be fixed as of KubeOne 1.7 at least. I'm going to close it, but if you still have the issue, please let us know.
> /close