RFE: Avoid host networking for Ironic

dtantsur commented 5 months ago

Exposing Ironic on the host networking is far from ideal. For instance, if we do so, we're going to expose JSON RPC. There may be other internal traffic that we don't want everyone to see. In theory, only dnsmasq must be on host networking.

So, why does Metal3 use host networking at all?

DNSmasq serves as DHCP and TFTP servers. Both are UDP based and hard to route. DHCP involves broadcasts.
When booting over iPXE, hosts need to download iPXE scripts and kernel/initramfs from an HTTP server. This server is local to the Ironic instance that handles this host. Since the host is not yet a part of the cluster network, it cannot use the cluster service DNS name or IP.
IPA needs to reach back to the Ironic API (any of the running instances, optimally to the one handling the host). Still no cluster networking at this point.

One complication is supporting the bootstrap scenario. While most Metal3 consumers bootstrtap their production clusters by establishing a temporary cluster with Metal3 and then pivoting, OpenShift has an important limitation: the bootstrap cluster only provisions control plane hosts. Thus, it cannot rely on any components that won't come up without workers, including e.g. Ingress.

dtantsur commented 5 months ago

/triage accepted /lifecycle frozen

dtantsur commented 5 months ago

Currently discussed solution: using Ingress with a fall-back to a simple httpd-based proxy (probably derived from OpenShift's ironic-proxy) for edge cases like OpenShift.

The Ironic API part is relatively simple and could even be fixed with a load balancer like MetalLB. Ironic-standalone-operator could even create an IPAddressPool for the provisioning network if it's present (otherwise, just expect the human operator to create one). DNSmasq will refer the booting hosts to the load balanced IP address, which will reach to any Ironic instance.

iPXE configuration is harder. The boot configuration API will be required to make sure Ironic serves the iPXE scripts correctly regardless of whether it handles this host. But we still need to serve kernel/initramfs/ISO images, and these should not be proxied through Ironic.

The issue with images can be handled by using Ingress. Since each Ironic is aware of its host name (and thus the cluster DNS name), it can compose an image URL with a sub-path that refers to the right Ironic. So, the Ironic instance with the name ironic-1 will serve images from http(s)://<ingress IP>/images/ironic-1/..., which will be redirected to http(s)://ironic-1.<service>.<namespace>:6183/....

Open questions:

Can we have Ingress routes without HTTPS? it is required by both iPXE in the general case and virtual media in some rare cases.
Can we have Ingress IPs on control plane nodes? We do not want normal workloads to anyhow cross paths with either the provisioning network or the exposes Ironic API.

lentzi90 commented 5 months ago

Can we have Ingress routes without HTTPS? it is required by both iPXE in the general case and virtual media in some rare cases.

Ingress can handle both HTTP and HTTPS traffic. In general, it is expected that TLS termination happens in the ingress controller though, so the traffic reaching the "backend" would be HTTP. There are solutions to work around this when the traffic needs to be encrypted all the way, but it may be better then to consider LoadBalancers that deal with TCP instead.

Can we have Ingress IPs on control plane nodes? We do not want normal workloads to anyhow cross paths with either the provisioning network or the exposes Ironic API.

I'm not sure I understand the question here but I will try to clarify what I do know. Ingress-controllers are usually exposed through LoadBalancers. The exact implementation of this will differ between clusters, but I would say that it is quite common to exclude control-plane nodes, since you would not normally run the ingress-controller there. Traffic can still be forwarded to any node in the cluster. That said, it is definitely possible to configure things so that the ingress-controller runs on control-plane nodes and that the LoadBalancer targets them.

hardys commented 5 months ago

Can we have Ingress IPs on control plane nodes? We do not want normal workloads to anyhow cross paths with either the provisioning network or the exposes Ironic API.

As @lentzi90 says, in situations where dedicated compute hosts exist the application Ingress endpoint would normally be configured so it cannot connect to the control-plane hosts.

But IIUC the question here is actually can we run an additional ingress endpoint with a special configuration that targets the provisioning network, which I think probably is possible by running an additional Ingress Controller and something like IngressClass - we'd also need to consider how to restrict access to that Ingress endpoint so regular users can't connect to the provisioning network.

dtantsur commented 5 months ago

This sounds like a lot of complexity to me. I start seeing writing our own simple load balancer based on httpd as actually a viable solution.

dtantsur commented 5 months ago

"Fun" addition: I've recently learned that some BMCs severely restrict the URL length for virtual media. So if we start using longer URLs, we may see more issues.

Rozzii commented 2 months ago

Linking @mboukhalfa 's discussion an POCs https://github.com/orgs/metal3-io/discussions/1739

zaneb commented 2 months ago

I think you missed off a key reason why we can't just use a NodePort Service to expose the pod network (as in @mboukhalfa's PoC): node ports are constrained to a particular range (30000-32767) and available to Services on a first-come first-served basis. That means users in any namespace can squat on a port and steal traffic intended for ironic, which is a significant security vulnerability. (For existing deployments it also means requiring all users to change the settings for any external firewall they have to account for the ironic port changing.)

This could perhaps be mitigated by only running ironic on the control plane nodes and never allowing user workloads on those nodes. But OpenShift at least has topologies that allow running user workloads on the control plane, so this would be a non-starter for us. Although actually I think kubeproxy will forward traffic to any node on that port to the Service, so even separating the workloads doesn't help.

If this actually worked it would have made many things sooo much easier. So it is not for want of motivation that we haven't tried it.

I don't believe there is a viable alternative to host networking.

I start seeing writing our own simple load balancer based on httpd as actually a viable solution.

The ironic-proxy that you implmented in OpenShift is exactly that, isn't it?

dtantsur commented 2 months ago

The ironic-proxy that you implmented in OpenShift is exactly that, isn't it?

Yes. Some community members are not fond of using an alternative to an existing solution, but I actually believe you're right.

mboukhalfa commented 2 months ago

@zaneb, good point. We are trying to encourage people to raise concerns in all ways in the discussion https://github.com/orgs/metal3-io/discussions/1739. That's the reason behind having these PoCs. The current showcase is very limited, and it doesn't even consider the dnsmasq case. We foresee that the final design and implementation for Ironic and Metal3 networks will not be easy or quick. It is a long-term process.

Our plan is to start with the following ideas:

NodePort
LoadBalancer services https://github.com/metal3-io/metal3-dev-env/pull/1435
Multus CNI (k8s multi-network)
Possibly a custom LoadBalancer

I would like to get your feedback on the LoadBalancer and Multus use cases in this discussion https://github.com/orgs/metal3-io/discussions/1739. I am not an expert in network security within Kubernetes, so that's something we should document along the way.

Rozzii commented 2 months ago

This is not really frozen since @mboukhalfa is working on investigating this very topic. /remove lifecycle frozen

Rozzii commented 2 months ago

/remove lifecycle-frozen

Rozzii commented 2 months ago

/remove lifecycle frozen

metal3-io / ironic-standalone-operator

RFE: Avoid host networking for Ironic #21