hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Problems with Rancher-deployed clusters #39

Closed vitobotta closed 4 years ago

vitobotta commented 4 years ago

Hi! I am testing this controller with Rancher clusters, and for some reason neither the metrics server nor the Prometheus/Grafana monitoring installed by Rancher seems to work. kubectl top nodes returns error: metrics not available yet even after waiting for a while, and the monitoring API never comes up.

I can't remember the details, but I tested this a bit a few months ago and had similar problems because of the IP addresses. Before installing the controller, kubectl get nodes -owide was showing the IP addresses as internal, while after installing the controller they are shown as external and there is no internal IP. I can't remember how I found out there was a link between this change and the metrics server/API not being available. Am I missing something? I made sure the kubelet is configured with cloud-provider=external.
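For reference, this is roughly how that flag is set in an RKE/Rancher cluster.yml; a minimal sketch, assuming an RKE-provisioned cluster (only the relevant kubelet setting is shown):

```yaml
# cluster.yml (RKE) -- minimal sketch, not the full file
services:
  kubelet:
    extra_args:
      # let an external cloud controller manager (here the hcloud CCM) manage node addresses etc.
      cloud-provider: external
```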

Thanks!

vitobotta commented 4 years ago

Found it. I have this in the metrics server logs:

unable to fully collect metrics: [unable to extract connection information for node "test-000-worker1": node test-000-worker1 had no addresses that matched types [InternalIP], ....

What can I do?

vitobotta commented 4 years ago

I managed to get the metrics server working with the external IPs, but there are other things that do not work without an internal IP, like the cluster monitoring. How are you guys working around these issues when using the hcloud controller manager?
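(For context, getting the metrics server to accept the external IPs comes down to its --kubelet-preferred-address-types flag; a sketch below, assuming the stock metrics-server Deployment, so check the flag against the version you actually run:)

```yaml
# metrics-server Deployment (excerpt) -- sketch of the address-type fallback
containers:
  - name: metrics-server
    args:
      # fall back to ExternalIP/Hostname when no InternalIP is set on the node
      - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
```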

LKaemmerling commented 4 years ago

Do you use the Cloud Controller version with Networks Support? Do you use k3s or the full k8s?

vitobotta commented 4 years ago

Hi @LKaemmerling, I installed it without network support since I am not using a private network. It's full k8s deployed with Rancher. Thanks!

github-actions[bot] commented 4 years ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

vitobotta commented 4 years ago

I am having this problem again with a new cluster and v1.6.0, which I installed to use load balancers.

I am having problems with the Prometheus Operator, which cannot properly set up monitoring because the internal IP is now unset. In a Rancher cluster without this cloud controller installed, the internal IP remains set. https://github.com/rancher/rke/issues/860#issuecomment-419224667 seems to suggest that the IP address is obtained from the cloud provider. Is it trying to get a private IP from Hetzner Cloud, then? I am not using a private network, so in this case the internal IP should be left set to the main IP of the node, as it was before installing the controller.
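To see exactly which address types the node object currently carries, something like this works (the node name is just the example from the error earlier in this thread):

```bash
# List the address types and values set on the node's status
kubectl get node test-000-worker1 \
  -o jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}{end}'
```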

Any idea of how to fix the missing internal IP? Thanks @LKaemmerling

LKaemmerling commented 4 years ago

Sounds like it tries to use the networks. Did you deploy the networks version of the cloud controller or the non-networks version?

vitobotta commented 4 years ago

@LKaemmerling Without the networks version, just the YAML deployment with the token and nothing else.
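(i.e. roughly the standard non-networks install from the README; a sketch, and the exact manifest URL depends on the CCM version you use:)

```bash
# Non-networks install -- sketch based on the README; check the README for the
# manifest matching your CCM version
kubectl -n kube-system create secret generic hcloud \
  --from-literal=token=<Hetzner Cloud API token>
kubectl apply -f https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/latest/download/ccm.yaml
```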

MrSaints commented 4 years ago

We've just set up a brand-new 3-node k3s (1.17) cluster on Hetzner, and hcloud-cloud-controller-manager works perfectly. We have a load balancer set up as well. We are using networks with it.

We initially had issues with the hcloud:// prefix missing (from the nodes' provider IDs), but a complete redeployment (it won't work with just a restart) with --disable-cloud-controller --kubelet-arg cloud-provider=external fixed it.
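For completeness, a sketch of a k3s server install with those flags, assuming the standard get.k3s.io installer (adapt to however you provision your nodes):

```bash
# k3s server install with the flags mentioned above -- sketch, not an official recipe
curl -sfL https://get.k3s.io | sh -s - \
  --disable-cloud-controller \
  --kubelet-arg cloud-provider=external
```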

LKaemmerling commented 4 years ago

Actually, I can't see a problem directly related to our cloud controller within this issue. If you encounter this issue again, feel free to reopen it. Keep in mind that we officially only support k8s, not k3s.

vitobotta commented 4 years ago

@MrSaints Did you have any issues with the metrics server due to the missing internal IP? Are you using the cluster monitoring stack that Rancher installs? Did you deploy k3s manually or with some automation? Single master or HA? Sorry for the many questions, but I would appreciate any info you can provide. I would prefer deploying Kubernetes with Rancher directly because it gives an experience similar to managed services, but if you can suggest an alternative that is at least easy to automate, scale, and maintain, and that is HA (and not complicated at that), I can try it. Thanks in advance! :)

MrSaints commented 4 years ago

Did you have any issues with the metrics server due to the missing internal IP? Are you using the cluster monitoring stack that Rancher installs?

Unfortunately no. But I have encountered said issue on a different cloud provider (AWS) before. There is a startup option which you can provide to fix this (may require a bit of Googling).

Single master or HA?

Single master for the time being as we are using it for light internal workload.

vitobotta commented 4 years ago

Thanks. I am experimenting with a new cluster deployed with Rancher and I am making progress. @LKaemmerling I deployed the cluster with the private network enabled and deployed the cloud controller with network support. The node now has both internal and external IPs and the monitoring works! There are two issues with this that I still need to figure out: 1) the Rancher host has to be in the same private network as the clusters, so both Rancher and all the clusters need to be in the same project (unless it's possible to enable cross-project access to a private network somehow?); 2) the kubeconfig that Rancher generates for clusters deployed this way specifies the private IPs of the hosts, so I cannot use it as-is to connect to the cluster, e.g. from my Mac, but only via the Rancher proxy. I'm now investigating whether there are workarounds for these, but at least it's progress.
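For reference, the networks variant is roughly the standard README install with the network added to the secret; a sketch, where the network name is a placeholder and the manifest URL should be checked against the CCM version you run:

```bash
# Networks-variant install -- sketch based on the README
kubectl -n kube-system create secret generic hcloud \
  --from-literal=token=<Hetzner Cloud API token> \
  --from-literal=network=<private network name or ID>
kubectl apply -f https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/latest/download/ccm-networks.yaml
```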

LKaemmerling commented 4 years ago

Hey,

Cool that you found a solution! For 1), I guess this is needed: the cloud controller manager only knows about one project (one API token == one project), so all your servers need to be within the same project.

For 2): I don't know that much about Rancher, so I can't help you with that. Maybe @mxschmitt has an idea? He wrote the UI driver for Rancher.

vitobotta commented 4 years ago

Hi @LKaemmerling

I found a workaround for 2). I enabled the "authorized endpoint" using a load balancer that talks to the masters via their private IPs. It seems to work. It's not a big deal if I have to keep Rancher and the clusters in the same project, but it would be nice to separate them into different projects.
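The load balancer side of that looks roughly like the following with the hcloud CLI; the name, type, and location are examples, and the masters are assumed to be attached to the same private network:

```bash
# Sketch of the load balancer for the authorized endpoint -- names are examples
hcloud load-balancer create --name rancher-api --type lb11 --location nbg1
hcloud load-balancer attach-to-network rancher-api --network <private network>
hcloud load-balancer add-target rancher-api --server <master-1> --use-private-ip
hcloud load-balancer add-service rancher-api --protocol tcp --listen-port 6443 --destination-port 6443
```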

ItsReddi commented 4 years ago

@vitobotta if you find any solution for 1), it would be very nice. We are in an early testing stage with R2 and Hetzner Cloud at the moment and are stuck here too, since keeping clusters in the same project is not a suitable solution for us. In our case, some clusters don't even belong to the same Hetzner customer.

LKaemmerling commented 4 years ago

@vitobotta if you find any solution for 1), it would be very nice. We are in an early testing stage with R2 and Hetzner Cloud at the moment and are stuck here too, since keeping clusters in the same project is not a suitable solution for us. In our case, some clusters don't even belong to the same Hetzner customer.

From a security perspective, I would recommend separating every customer; otherwise one customer could bring your clusters down.

ItsReddi commented 4 years ago

From a security perspective, I would recommend separating every customer; otherwise one customer could bring your clusters down.

Yes, that is exactly the problem: it is not suitable for us to deploy all clusters in the same project just to be able to use internal networks.

codeagencybe commented 4 years ago

I'm not sure, but would "fleet" bring a solution? It's also a new product from the Rancher family. I have never tested it myself, but if it can leverage the project token/API at the cluster level, it means you can deploy and manage a multi-cluster environment.

https://github.com/rancher/fleet