syself / cluster-api-provider-hetzner

Cluster API Provider Hetzner 🚀 Kubernetes Infrastructure as Software 🔧 Terraform/Kubespray/kOps alternative for running Kubernetes on Hetzner
https://caph.syself.com
Apache License 2.0

Make it possible to use a pre-created private network #762

Open rbjorklin opened 1 year ago

rbjorklin commented 1 year ago

/kind feature

Describe the solution you'd like
I'm attempting to spin up a cluster that is entirely on a private network that has already been created. Unfortunately, I'm met with a uniqueness error:

"error":"failed to reconcile network for HetznerCluster rbjorklin-com/rbjorklin-com: failed to create network: error creating network: name is already used (uniqueness_error)"

Anything else you would like to add:
I have created a network with a single VM inside it. From this VM I'm attempting to bootstrap a cluster with a private-only network. Does this make sense? Do you foresee any issues doing this?

Environment:

rbjorklin commented 1 year ago

I just came across this comment. Are you using Cilium Host Firewall to lock down etcd and other services on your control plane nodes from external access?

simonostendorf commented 1 year ago

I would also be interested in using a pre-created hcloud network, as I would like to create a NAT gateway in the same network beforehand, so that the newly created k8s nodes can use this NAT gateway to reach the internet.

simonostendorf commented 1 year ago

Do you think this could be done by just implementing it in the network.create and network.delete functions? I don't know all the code, but if the create function returned the existing network instead of creating a new one, I think this could work, and the needed changes would be really small.
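
For illustration, a minimal sketch of such an idempotent create, assuming hcloud-go v2's NetworkClient (the function and its wiring are simplified stand-ins, not CAPH's actual code):

package network

import (
	"context"
	"fmt"
	"net"

	"github.com/hetznercloud/hcloud-go/v2/hcloud"
)

// createOrAdoptNetwork returns the existing network with the given name if
// there is one, and only creates a new network otherwise, avoiding the
// uniqueness_error reported above.
func createOrAdoptNetwork(ctx context.Context, client *hcloud.Client, name, cidr string) (*hcloud.Network, error) {
	existing, _, err := client.Network.GetByName(ctx, name)
	if err != nil {
		return nil, fmt.Errorf("failed to look up network %q: %w", name, err)
	}
	if existing != nil {
		return existing, nil // adopt the pre-created network
	}
	_, ipRange, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, fmt.Errorf("invalid CIDR %q: %w", cidr, err)
	}
	network, _, err := client.Network.Create(ctx, hcloud.NetworkCreateOpts{
		Name:    name,
		IPRange: ipRange,
	})
	if err != nil {
		return nil, fmt.Errorf("error creating network: %w", err)
	}
	return network, nil
}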

daper commented 11 months ago

It would be useful for me as well to be able to specify an existing network on a HetznerCluster level.

johannesfrey commented 10 months ago

I also had a use case for exactly this. Before creating the workload cluster with CAPH I pre-created a network (having the CAPH owned label) and a NAT gateway via Terraform. CAPH then adds the machines to this network. The downside of this is that when the Cluster is deleted the network will also be deleted, because of the label.

I hacked together a way to pass in the ID of an existing network, which won't create a network but rather reuse the given one. I also adjusted reconcileDelete so that the network is only deleted when it has the owned label, which required passing the labels into the status of the HetznerCluster. What's still missing is validation that the pre-created network matches the one set in HCloudNetworkSpec. Not sure if this is the right way to go, but perhaps it's a start. If there is interest I could of course create a PR for that!?
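
For illustration, a sketch of such a reconcileDelete guard, again assuming hcloud-go v2 (the label key and value are hypothetical; CAPH's real ownership label may differ):

package network

import (
	"context"
	"fmt"

	"github.com/hetznercloud/hcloud-go/v2/hcloud"
)

// deleteNetworkIfOwned deletes the cluster's network only if CAPH created it,
// leaving pre-created (user-owned) networks untouched on cluster deletion.
func deleteNetworkIfOwned(ctx context.Context, client *hcloud.Client, networkID int64, ownedLabel string) error {
	network, _, err := client.Network.GetByID(ctx, networkID)
	if err != nil {
		return fmt.Errorf("failed to get network %d: %w", networkID, err)
	}
	if network == nil {
		return nil // network is already gone
	}
	if network.Labels[ownedLabel] != "owned" {
		return nil // not owned by CAPH: keep it
	}
	if _, err := client.Network.Delete(ctx, network); err != nil {
		return fmt.Errorf("failed to delete network %d: %w", networkID, err)
	}
	return nil
}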

simonostendorf commented 10 months ago

> If there is interest I could of course create a PR for that!?

I think the best solution would be to add a natGateway configuration to the HetznerCluster resource. With this, the network and the gateway would share the lifecycle of the cluster, and everything would still be managed by the CAPI provider.

I think the option should work something like the API load balancer: you can customize the natGateway, and the provider will create a small server (or multiple, in an HA configuration) that is used as the NAT gateway.

Example:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: HetznerCluster
spec:
  hcloudNetwork:
    cidrBlock: 10.0.0.0/8
    enabled: true
    networkZone: eu-central
    subnetCidrBlock: 10.0.0.0/16
    natGateway:
      enabled: true
      fixedIp: 10.0.255.254

I know that this is outside of the CAPI provider scope, but I think this is the "best" and "cleanest" solution until Hetzner offers NAT gateways as a service (see current job offers hinting that something is coming).

johannesfrey commented 10 months ago

> I think the best solution would be to add a natGateway configuration to the HetznerCluster resource. With this, the network and the gateway would share the lifecycle of the cluster, and everything would still be managed by the CAPI provider. [...] I know that this is outside of the CAPI provider scope, but I think this is the "best" and "cleanest" solution until Hetzner offers NAT gateways as a service.

Yes, this seems to be a much cleaner solution, especially the fact that the network and the gateway will then be tied to the lifecycle of the cluster. (But IIRC at least CAPA also offers the option to "bring your own network" by specifying the ID of an existing one.)

But, as you said, it is also a matter of scope. The only advantage of the naive solution, which just picks up an existing network, is that the code base of CAPH remains small (no extra server that must be configured for IP forwarding, no network routes, etc.), with the disadvantage that the network can then be decoupled from the lifecycle of the cluster.

Another yet undefined aspect is what a NAT gateway that suits all needs would look like. To avoid assuming a type that probably won't fit every use case, this might require making the NAT gateway configurable via cloud-init. For example, in my personal use case the NAT gateway also runs WireGuard, which allows the management cluster (running somewhere else) to connect to the internal IP of the API server without the need for a load balancer.

Or simply don't make the NAT gateway configurable and assume a minimal "best fit"? If there is a need for additional stuff (like the WireGuard setup I just mentioned), users would be responsible for it themselves, outside the scope of CAPH. Which brings us back to the initial "problem" of whether this server should be in the same network as the CAPH machines :slightly_smiling_face:
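
For what it's worth, such a minimal "best fit" could be little more than IP forwarding plus masquerading, e.g. via cloud-init (the interface name and CIDR below are assumptions, matching the example spec above):

#cloud-config
# Minimal NAT gateway sketch: forward traffic from the private network and
# masquerade it out of the public interface. eth0 and 10.0.0.0/8 are assumed.
write_files:
  - path: /etc/sysctl.d/99-ip-forward.conf
    content: |
      net.ipv4.ip_forward = 1
runcmd:
  - sysctl --system
  - iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -o eth0 -j MASQUERADE

The nodes would additionally need a default route in the hcloud network pointing at the gateway's fixed IP (10.0.255.254 in the example above).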

Would be really interesting to hear what others think and to see if there is an actual need for this.

simonostendorf commented 10 months ago

@johannesfrey Maybe one compromise could be that a natGateway spec is added in the provider to keep everything together, but the cloud-init for that server is read (or can be read, with a small default) from a ConfigMap, so that custom user configurations are possible.
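
Hypothetically, that could look like a ConfigMap referenced from the natGateway spec (all names here are illustrative, not an existing API):

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-cluster-nat-gateway-cloud-init  # hypothetical name
data:
  cloud-init: |
    #cloud-config
    # user-supplied additions, e.g. a WireGuard setup
    packages:
      - wireguard

with the natGateway section gaining something like a cloudInitRef field pointing at it.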

simonostendorf commented 10 months ago

I could try to work on it, but it would be my first "real" kubebuilder project and I don't have much time over the next few weeks, especially for writing all the unit and e2e tests (the API changes and the reconcile loop could be done very quickly, I guess).

So it would be interesting to hear from one of the code owners here if such PRs would be merged or if they are "out of scope".

@batistein you have interacted with other PRs in the past discussing networking, what do you think? (sorry for pinging one of you, I just want to know if this discussion is in scope and can be worked on or if there are better solutions)

johannesfrey commented 10 months ago

@simonostendorf thanks for driving this forward.

I would be more than happy to support on this, if you want. The only thing is that I am also gone for the next two and a half weeks and afterwards I'm only able to support in my spare time. But yeah, let's see if there are some other perspectives on this subject before diving in 😊 .

simonostendorf commented 9 months ago

> I would be more than happy to support on this, if you want.

@johannesfrey Would be happy if we could solve the NAT topic.

I forked this repo and extended the API and controller logic. There are some things missing or marked with a TODO for now, but I will implement them in a few days (if I find enough time).

After that, only the Go unit tests (for the natgateway service and the changed hcloud controller) and the e2e tests (creating a cluster with a natgateway) are missing. But I am not familiar enough with Go and the CAPH logic to know what I need to change there.

Maybe you could inspect it and commit something on top of my changes?

I could not find the time to test the changes yet, but I will edit this comment once I have tested them.

You can find my changes inside my fork but I could also create a draft pull request to discuss the topic with others.

lkt82 commented 8 months ago

Hi :) We are evaluating whether the Hetzner provider would help us in our setup, and we can see the benefit in having an option for bringing your own network, for example when you need private connectivity to existing VMs in a private network.

Other Cluster API providers, like the Azure one, also support using pre-existing networks, so it's not uncommon.

batistein commented 8 months ago

@simonostendorf , integrating a NAT gateway or similar network solutions into Hetzner would indeed be a substantial feature. However, based on our experience, it would require a significant amount of effort, potentially spanning several person-months, especially considering the testing phase. We've encountered issues with Hetzner's private networks in the past, which adds to the complexity of such an undertaking.

In our managed Kubernetes services, we've generally avoided using private networks, not just in Hetzner but across other providers as well. We find that, in most cases, a perimeter architecture is no longer a necessity. Instead, we prefer leveraging robust CNI solutions like Cilium. It not only meets our networking challenges effectively but also simplifies the network topology. This simplicity is a considerable advantage, making debugging more straightforward and faster. Additionally, Cilium's feature for external workloads seamlessly accommodates existing VMs. @lkt82

Given these points, while we understand the interest in this feature, I personally don't see enough benefit to offset the considerable effort it would demand for development and integration, especially when current tools have been adept at handling challenges in more lightweight ways.

However, if there's community interest and someone is willing to contribute to such a feature, we are open to providing support where we can. We believe in collaborative solutions, and if there's a clear demand and willingness from contributors, it's something we could explore together, despite the economic and technical challenges it presents. Our focus at Syself, though, continues to be on investing resources in areas like Zero-Trust security, which we see as a more beneficial strategy.

simonostendorf commented 8 months ago

Thanks for the detailed answer @batistein. In the meantime I have also concentrated on solutions with only a public network, or with both a private and a public network, and especially on the ARM architecture. This reduces the complexity of the network, and especially the complexity of the deployment, since no additional NAT gateway has to be deployed.