metal3-io / ironic-standalone-operator

WIP Operator to maintain an Ironic deployment for Metal3

RFE: finish Distributed architecture support #3

Open dtantsur opened 9 months ago

dtantsur commented 9 months ago

The fact that we only run 1 Ironic instance is somewhat unfortunate: Ironic has built-in active/active HA that spreads the load evenly across the cluster by assigning each Node to one Ironic instance. Ironic also has a take-over process making sure that Nodes never go orphaned. On the other hand, due to its usage of green threads, each Ironic instance only uses 1 CPU core. Having 3 instances will improve CPU utilization. For instance, CERN manages around 10k Nodes through its 9 Ironic conductors (at some point, at least; not sure about the current state).

An easy first step is to use a DaemonSet for Ironic instead of the Deployment. We will need to drop Inspector because, unlike Ironic, it's not HA-ready; the new inspection implementation won't have this issue. I believe that will give us virtual media deployments without a provisioning network right away. We will need to sort out JSON RPC since all Ironic instances need to talk to each other. If we use the pods' cluster IPs, TLS may be an issue since they're not quite predictable.

DHCP is also a problem. If we run one dnsmasq, we won't be able to direct a Node to the iPXE server of the Ironic instance that handles it (the Ironic API itself is fine: any Ironic will respond correctly for any Node, redirecting the request internally through the RPC). Not without changes to Ironic, at least. If we run several dnsmasq instances, that's still a problem: the request will land on a random one. The networking configuration will also be a challenge.

dtantsur commented 9 months ago

Re TLS: the only option I see now is to create a wildcard certificate for *.<namespace>.pod and use things like 10-89-0-2.test.pod as hostnames for RPC.
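
For illustration, a minimal sketch of deriving such an RPC hostname from a pod IP; the function name and the cluster.local domain are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// rpcHostname turns a pod IP into the corresponding pod DNS name
// (<dashed-ip>.<namespace>.pod.<cluster-domain>), which a wildcard
// certificate for *.<namespace>.pod.<cluster-domain> would cover.
func rpcHostname(podIP, namespace, clusterDomain string) string {
	dashed := strings.ReplaceAll(podIP, ".", "-")
	return fmt.Sprintf("%s.%s.pod.%s", dashed, namespace, clusterDomain)
}

func main() {
	// Prints 10-89-0-2.test.pod.cluster.local
	fmt.Println(rpcHostname("10.89.0.2", "test", "cluster.local"))
}
```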

Meanwhile, chances are very high that the default hostname won't work for RPC since it's just the hostname of the node (because of the host networking).

dtantsur commented 9 months ago

Re dnsmasq: I'm thinking of running one instance, but we could also use some fancy subnet magic to run 3 of them. Do we want to? In OpenShift we probably don't... Without that, iPXE can get weird though.

dtantsur commented 9 months ago

Hostname preparation work https://github.com/metal3-io/ironic-image/pull/449

dtantsur commented 9 months ago

iPXE: we could loop over all control plane nodes in the initial script, trying to load the 2nd stage script. For that, we need to know the IP addresses of all control plane nodes, which is probably doable via the Ironic DaemonSet?
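
To make the idea concrete, a hedged sketch of how the operator could collect those IPs and render a first-stage script. The node label, the boot.ipxe path and port 6180 are assumptions, and the chaining logic is illustrative only:

```go
package main

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// controlPlaneIPs returns the internal IPs of all control plane nodes.
func controlPlaneIPs(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/control-plane",
	})
	if err != nil {
		return nil, err
	}
	var ips []string
	for _, node := range nodes.Items {
		for _, addr := range node.Status.Addresses {
			if addr.Type == corev1.NodeInternalIP {
				ips = append(ips, addr.Address)
			}
		}
	}
	return ips, nil
}

// firstStageScript renders an iPXE script that tries each control plane
// node in turn until one of them serves the second-stage script.
func firstStageScript(ips []string, httpPort int) string {
	var b strings.Builder
	b.WriteString("#!ipxe\n")
	for _, ip := range ips {
		fmt.Fprintf(&b, "chain --autofree http://%s:%d/boot.ipxe ||\n", ip, httpPort)
	}
	b.WriteString("exit 1\n")
	return b.String()
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ips, err := controlPlaneIPs(context.Background(), client)
	if err != nil {
		panic(err)
	}
	fmt.Print(firstStageScript(ips, 6180))
}
```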

lentzi90 commented 9 months ago

I feel like a lot of context is missing here. What is the goal? Which containers are we talking about? Which of them need to be together in one pod and which can be separate? Which need host networking? Have we properly considered more cloud-native alternatives (e.g. LoadBalancers)?

Regarding TLS and JSON RPC, I think it is worth noting that StatefulSets have predictable hostnames without IP addresses so you would not need wildcard certificates. The most cloud-native solution for mTLS is probably a service mesh though. I would not want to make that a requirement, but it could be a good idea to at least take some ideas from that area.

dtantsur commented 9 months ago

@lentzi90 updated the description with a lot of text. Sorry, should have done it from the beginning - I forgot that not everyone is in the performance & scale subteam discussions.

Have we properly considered more cloud-native alternatives (e.g. LoadBalancers)?

I'm not sure what that gives us: Ironic API is load balanced already.

I think it is worth noting that StatefulSets have predictable hostnames without IP addresses

Yeah, I've considered them. I think their limitations may be quite painful in our case.

The most cloud-native solution for mTLS is probably a service mesh though. I would not want to make that a requirement

Me neither... I'm also not sure what it solves for us: Ironic already maintains a list of its peers, we still need to configure TLS properly.

lentzi90 commented 9 months ago

@lentzi90 updated the description with a lot of text. Sorry, should have done it from the beginning - I forgot that not everyone is in the performance & scale subteam discussions.

Thank you! :blush:

Have we properly considered more cloud-native alternatives (e.g. LoadBalancers)?

I'm not sure what that gives us: Ironic API is load balanced already.

Sorry, I should have explained a bit more what I meant. From my perspective, host networking is a bit problematic so I have been thinking about alternatives. In the CI we currently set up a VIP and use keepalived to move it between nodes as needed. A more cloud-native way of doing this would be to use a Service of type LoadBalancer. There are a few implementations that will work on baremetal, e.g. Metallb. The point is that we would get an "external" IP without host networking, which should help with some of the issues.
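
For illustration, a hedged sketch of such a Service built with the Kubernetes Go types; the name, selector and port list are assumptions (the ports are the ones mentioned elsewhere in this thread):

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// ironicLoadBalancer builds a Service of type LoadBalancer in front of the
// Ironic pods, so that an implementation such as MetalLB can assign it an
// "external" IP without host networking.
func ironicLoadBalancer(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "ironic", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			Type:     corev1.ServiceTypeLoadBalancer,
			Selector: map[string]string{"app": "ironic"},
			Ports: []corev1.ServicePort{
				{Name: "api", Port: 6385, TargetPort: intstr.FromInt(6385)},
				{Name: "httpd", Port: 6180, TargetPort: intstr.FromInt(6180)},
				{Name: "httpd-tls", Port: 6183, TargetPort: intstr.FromInt(6183)},
			},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(ironicLoadBalancer("test"), "", "  ")
	fmt.Println(string(out))
}
```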

I think it is worth noting that StatefulSets have predictable hostnames without IP addresses

Yeah, I've considered them. I think their limitations may be quite painful in our case.

Anything in particular? The headless service? I think it would not affect us much in the current configuration at least, since we anyway use the host network. I may easily be missing something though.

The most cloud-native solution for mTLS is probably a service mesh though. I would not want to make that a requirement

Me neither... I'm also not sure what it solves for us: Ironic already maintains a list of its peers, we still need to configure TLS properly.

I was mostly thinking that we could take some ideas for how to configure TLS from them. Most of them work so that they inject a sidecar container that (together with a central controller) set up TLS for all pods. Each pod gets their own unique certificate and the sidecar basically acts as a proxy for the other containers in the pod. All traffic goes through it and it handles TLS for the other containers transparently. This means that the application does not even need to know about it.

Obviously Ironic already handles TLS, but perhaps we can get an idea for how to generate the certificates "on the fly" like with a service mesh.

dtantsur commented 9 months ago

A more cloud-native way of doing this would be to use a Service of type LoadBalancer. There are a few implementations that will work on baremetal, e.g. Metallb.

I don't think we can rely on a LoadBalancer being present, especially when the cluster is bootstrapping itself. (The downside of Kubernetes: not a lot of things can be assumed to be present...)

I've considered a HostPort service, but the unpredictable port is a no-go.

The point is that we would get an "external" IP without host networking, which should help with some of the issues.

It's a good improvement, but I don't think it helps with any of these issues?

Furthermore, I don't think we can use dnsmasq without host networking. Nor support provisioning networks at all.

Anything in particular? The headless service?

The service is easy to create, although I'm not sure what sense it makes for us. I'm worried about using persistent volumes, as well as the limitations around clean up. If you have a more detailed guide on StatefulSets, I'd be happy to read it - the kubernetes docs are notoriously brief.

Most of them work so that they inject a sidecar container that (together with a central controller) sets up TLS for all pods. Each pod gets its own unique certificate and the sidecar basically acts as a proxy for the other containers in the pod. All traffic goes through it and it handles TLS for the other containers transparently. This means that the application does not even need to know about it.

It's interesting, do you have a good write-up on this as well? We already have httpd responsible for TLS. If we could make it generate certificates... I'm not sure how signing will work though. By reaching out to the cert service from inside the container? But who will approve the CSR?

lentzi90 commented 9 months ago

Good discussion! I can see a way forward here!

Let me see if I can get some kind of prototype up to see how/if it works with a LoadBalancer. This is possible to do also on minikube or kind so I don't think bootstrapping is an issue. It is just a matter of moving the IP from the bootstrap cluster to the self managed cluster, just like we do today with keepalived.

I've considered a HostPort service, but the unpredictable port is a no-go.

It should be possible to pick the port. However, there is always the risk of someone else trying to use the same port or it already being in use. :slightly_frowning_face:

I'm worried about using persistent volumes, as well as the limitations around clean up. If you have a more detailed guide on StatefulSets, I'd be happy to read it - the kubernetes docs are notoriously brief.

Unfortunately I don't have a good doc. :slightly_frowning_face: However, I don't think volumes are required. It is just the most common use case for StatefulSets so they appear in all examples... The point about cleanup looks strange in the docs. I think the limitation is just that there are no guarantees for what order the pods are deleted, which is probably not an issue for us. Weird wording in the docs though...

It's interesting, do you have a good write-up on this as well? We already have httpd responsible for TLS. If we could make it generate certificates... I'm not sure how signing will work though. By reaching out to the cert service from inside the container? But who will approve the CSR?

Unfortunately I don't have very deep knowledge on how it works. I think this should answer most questions about how Istio does it though: https://istio.io/latest/docs/concepts/security/

dtantsur commented 9 months ago

Let me see if I can get some kind of prototype up to see how/if it works with a LoadBalancer. This is possible to do also on minikube or kind so I don't think bootstrapping is an issue.

While looking at ingress in OpenShift, I've seen this issue: ingress controllers are deployed on workers, but workers are not up yet when Metal3 is deploying them. Let's make sure we don't end up in this situation.

It should be possible to pick the port.

Only from a strange pre-defined range. We need to be able to tell admins in advance which ports they must enable in the firewalls. Ideally, we should converge to one well-known port (currently, OpenShift uses four: 6385/ironic, 5050/inspector, 6180/httpd, 6183/httpd+tls, with inspector gone soon).

dtantsur commented 9 months ago

Re dnsmasq and iPXE: I think something like https://bugs.launchpad.net/ironic/+bug/2044561 is doable and solves the issue with finding the right Ironic instance. Let us see what the community thinks.

dtantsur commented 9 months ago

Meanwhile, adding support for unix sockets in JSON RPC, if we need that: https://review.opendev.org/c/openstack/ironic-lib/+/901863

dtantsur commented 9 months ago

Re dnsmasq and iPXE: it's possible that Ironic actually has all we need. It now features a way to manage dnsmasq. We could run 3 dnsmasq instances with host-ignore:!known and use this feature to populate host files with the required options only on the appropriate Ironic instance, instead of relying on a static dnsmasq.conf.

This will complicate adding host auto-discovery if we ever decide to support that. The last time this feature was requested, it was rejected.

Another problem: we'd need disjoint DHCP ranges for different dnsmasq instances, otherwise they'll try to allocate the same IP to different nodes. It's a matter of configuration though.

dtantsur commented 9 months ago

Disjoint DHCP ranges can become a problem: how to pass the right subrange into pods? The only hack I can think of is for the operator to list control plane nodes and prepare a mapping nodeIP->subrange. Then the dnsmasq start-up script can pick the right one based on which hostIP it actually got. May get ugly with dual-stack...

An obvious problem: node replacement will always cause a re-deployment.
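
To illustrate the hack, a minimal sketch of the nodeIP -> subrange mapping (IPv4 only, everything confined to the last octet; all names and addresses are made up):

```go
package main

import (
	"fmt"
	"net"
)

// subrange is a contiguous slice of the DHCP range assigned to one node.
type subrange struct{ Start, End net.IP }

// splitRange divides an IPv4 range (within one /24) evenly between the
// given control plane node IPs; any remainder is simply left unused.
func splitRange(start, end net.IP, nodeIPs []string) map[string]subrange {
	first := int(start.To4()[3])
	last := int(end.To4()[3])
	per := (last - first + 1) / len(nodeIPs)
	out := make(map[string]subrange, len(nodeIPs))
	for i, node := range nodeIPs {
		s := make(net.IP, 4)
		e := make(net.IP, 4)
		copy(s, start.To4())
		copy(e, start.To4())
		s[3] = byte(first + i*per)
		e[3] = byte(first + (i+1)*per - 1)
		out[node] = subrange{Start: s, End: e}
	}
	return out
}

func main() {
	m := splitRange(net.ParseIP("192.168.222.100"), net.ParseIP("192.168.222.199"),
		[]string{"192.168.111.20", "192.168.111.21", "192.168.111.22"})
	for node, r := range m {
		// The dnsmasq start-up script would pick its entry based on the host IP it got.
		fmt.Printf("%s -> %s-%s\n", node, r.Start, r.End)
	}
}
```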

Rozzii commented 9 months ago

Disjoint DHCP ranges can become a problem: how to pass the right subrange into pods? The only hack I can think of is for the operator to list control plane nodes and prepare a mapping nodeIP->subrange. Then the dnsmasq start-up script can pick the right one based on which hostIP it actually got. May get ugly with dual-stack...

An obvious problem: node replacement will always cause a re-deployment.

I have not checked the architecture yet, but shouldn't there be a CR for each Ironic instance? What I mean is that I would expect to see IronicDeployment -> IronicInstance, with the actual pod linked to the instance. Then we could add the DHCP range and similar info to these CRs, and eventually that would be translated to environment variables in the pod - that would be the mapping I would expect.

EDIT: Maybe not even an IronicDeployment, just the instance that has this DHCP range info.

dtantsur commented 9 months ago

I would expect to see IronicDeployment -> IronicInstance

There is nothing "Ironic" there: it's just a normal DaemonSet/Deployment resulting in Pods. We discovered at the meeting that we can annotate pods, and the deployment won't undo that. The missing part is how to pass annotations to a pod after it's started. This is a challenge. Maybe we need a sidecar container that manages DHCP ranges and generally manages dnsmasq?
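
For reference, the mechanical part of annotating a running pod is a small merge patch with client-go; the open question above is who applies it and when. The annotation key, pod name and namespace below are purely illustrative:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical annotation carrying the DHCP subrange for this pod;
	// the owning Deployment/DaemonSet will not revert it.
	patch := []byte(`{"metadata":{"annotations":{"metal3.io/dhcp-range":"192.168.222.100,192.168.222.132"}}}`)
	_, err = client.CoreV1().Pods("test").Patch(
		context.Background(), "ironic-abc12", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotation applied")
}
```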

Rozzii commented 9 months ago

This was further discussed at the community meeting and on other channels. As you have mentioned, we have discussed the annotation-based approach for assigning the DHCP ranges.

I am not sure that the best way to do this would be a sidecar. Maybe configuring distributed layer 2 network services requires its own controller; maybe we shouldn't put this on the back of Ironic, neither in a sidecar nor in the existing dnsmasq container.

Please forgive me if the nomenclature is not correct - I hope it is, but I'm not sure; I'd just like to convey the general concept. I feel more and more that we would need a K8s controller that would deploy dnsmasq pods; it could most likely read the Ironic CRs.

Rozzii commented 9 months ago

For the JSON RPC TLS, as I have mentioned at the community meeting, my first idea would be just regular K8s services for each Ironic pod.

dtantsur commented 9 months ago

For the JSON RPC TLS, as I have mentioned at the community meeting, my first idea would be just regular K8s services for each Ironic pod.

That would surely work, but it does not sound like a good kubernetes practice? Why is it better than relying on <IP>.<NAMESPACE>.pod?

dtantsur commented 9 months ago

I feel more and more that we would need a K8s controller that would deploy dnsmasq pods; it could most likely read the Ironic CRs.

I've considered that, but in a non-distributed world, dnsmasq must live inside the metal3 pod.

dtantsur commented 8 months ago

Okay, so for the sake of making progress, here is where we stand with the HA MVP:

lentzi90 commented 8 months ago

I have played a bit with a load balancer instead of host networking here: https://github.com/lentzi90/playground/tree/ironic-loadbalancer#metal3 The issue I have is that the ironic image is trying to be smart about IPs... which does not work great when you have randomly assigned Pod IPs in the cluster network. But this is mainly a configuration issue I would say. There is no issue curling the loadbalancer IP and getting a response from Ironic, but if I start an inspection, things break. Inspector expects to be able to reach Ironic on the IP assigned to the interface that it is using. That is the Pod IP, and this is not in the certificate, so it doesn't accept it.

If you want to try it:

  1. Clone https://github.com/lentzi90/playground/tree/ironic-loadbalancer#metal3 (use the branch ironic-loadbalancer)
  2. Run ./Metal3/dev-setup.sh
  3. Wait for all pods to be up
  4. Curl the APIs:
    1. curl https://192.168.222.200:5050 -k
    2. curl https://192.168.222.200:6385 -k

Is there a way to tell Inspector to just use the IP I give to reach Ironic, without also requiring this to be the IP that is associated with the listening interface? My issue is basically wait_for_interface_or_ip that sets the IRONIC_IP and IRONIC_URL_HOST based on what it finds at the interface. I want to set these variables to 192.168.222.200 no matter what the interface shows.

Edit: I should mention that this is similar to the BMO e2e setup where libvirt takes care of dnsmasq. I.e. there is no dnsmasq container in the Ironic Pod. I can try the same with dnsmasq in the Pod also of course but at this point I don't see how it would be useful.

dtantsur commented 8 months ago

Isn't a loadbalancer optional in kubernetes? If yes, we need a solution for the cases when 1) it is not present, 2) a different implementation than metallb is present.

For OpenShift, I've hacked together a poor man's load balancer based on httpd that is run as a daemonset: https://github.com/openshift/ironic-image/blob/master/ironic-config/apache2-proxy.conf.j2. I'm not proud of it, but it works. Most critically, it works even when the cluster is heavily degraded.

lentzi90 commented 8 months ago

Well, yes loadbalancers are optional and metallb is one implementation that can be used. So I would say that if there is no loadbalancer implementation, then the solution is to add metallb or kube-vip for example. The implementation should not matter since they all implement the same API.

Is there a benefit to hacking together a custom poor man's load balancer just for Ironic, instead of using an off-the-shelf implementation? From what I understand they practically do the same thing. :thinking: It feels backwards to work around the missing loadbalancer in this way to me.

dtantsur commented 8 months ago

Is there a benefit to hacking together a custom poor man's load balancer just for Ironic, instead of using an off-the-shelf implementation?

Do off-the-shelf load balancers operate when the cluster is degraded so much that it has no workers? I was told they don't (maybe the one we use does? not sure).

lentzi90 commented 8 months ago

That would depend on configuration. The default manifests that I used for metallb have tolerations on the daemonset, so it will run on control-plane nodes. However, the controller that is reconciling the API objects does not, so it would not run if all nodes have control-plane taints.

On the other hand, if you have a management cluster in the cloud, or let's say on OpenStack, that provides the load balancer then it is completely unaffected by the cluster state. Or the opposite, I could easily have a cluster with more taints and high priority workload that would evict all of metallb and ironic also so it cannot run at all.

What I'm trying to say is that this all comes down to config. The custom implementation works in a given situation. I bet that we could make metallb or kube-vip work in the same way. And it is possible to break/circumvent any load balancer implementation by configuring the cluster or workload in a bad way.

I'm not sure if it makes sense for Ironic, but what I would generally expect in an operator like this is that it assumes there is a load balancer implementation already in place. Then the operator can work from this. For the cluster admin, this means it is possible to choose how to do it. They can use metallb, a custom thing or maybe even run in a cloud that provides it. Quite flexible and nice.

With the custom implementation "hard coded" it would be a pain to switch to another implementation I guess?

Rozzii commented 7 months ago

I'll just tie in the ironic-image issue here that I think is relevant: https://github.com/metal3-io/ironic-image/issues/468

lentzi90 commented 5 months ago

Adding links from discussions at kubecon. This is about wildcard certificates for headless services:

dtantsur commented 5 months ago

Hi folks! Let me do a recap of our hallway discussions during KubeCon EU.

After some back-and-forth, we seem to be converging towards StatefulSets. I have done experiments with Kind locally, and it seems that our previous concerns are unfounded: EmptyDir volumes work, and deleting a StatefulSet works. And (completely undocumented) it is possible to use a normal (not headless) service with StatefulSets, which will take both roles: a load-balanced service DNS name and pod-specific DNS names.

That means we can generate TLS certificates for service.namespace.svc and *.service.namespace.svc and expect them to work both for accessing the API and for JSON RPC. (This may not play well with OpenShift's CA service per the link above, but that can be solved later.)
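
For concreteness, a hedged sketch of what the operator could create; the replica count, labels, image and the use of a plain (non-headless) governing Service are illustrative, not a final design:

```go
package main

import (
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// ironicStatefulSet builds a 3-replica StatefulSet governed by a Service
// named "ironic", giving each pod a stable DNS name like
// ironic-0.ironic.<namespace>.svc that a *.ironic.<namespace>.svc
// certificate would cover.
func ironicStatefulSet(namespace string) *appsv1.StatefulSet {
	labels := map[string]string{"app": "ironic"}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "ironic", Namespace: namespace},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    int32Ptr(3),
			ServiceName: "ironic", // per the experiment above, a normal (not headless) Service appears to work
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "ironic",
						Image: "quay.io/metal3-io/ironic",
						Ports: []corev1.ContainerPort{{ContainerPort: 6385}},
					}},
				},
			},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(ironicStatefulSet("test"), "", "  ")
	fmt.Println(string(out))
}
```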

There has been some progress on the load balancing discussion, but it still has unsolved issues we need to dive into.

dtantsur commented 5 months ago

Note: the discussion around load balancers and host networking has been split into https://github.com/metal3-io/ironic-standalone-operator/issues/21.

dtantsur commented 5 months ago

/triage accepted

dtantsur commented 5 months ago

An obvious problem with StatefulSets: they don't work as DaemonSets (who even decided these things are orthogonal??).

dtantsur commented 5 months ago

Can we maybe use HostAliases to allow conductors to talk to each other? https://kubernetes.io/docs/tasks/network/customize-hosts-file-for-pods/
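
For reference, a minimal sketch of what HostAliases look like in a pod spec - static /etc/hosts entries so each conductor could resolve its peers by name; the IPs and hostnames are made up:

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	spec := corev1.PodSpec{
		// Each entry becomes a line in the pod's /etc/hosts.
		HostAliases: []corev1.HostAlias{
			{IP: "192.168.111.20", Hostnames: []string{"ironic-0.example.internal"}},
			{IP: "192.168.111.21", Hostnames: []string{"ironic-1.example.internal"}},
			{IP: "192.168.111.22", Hostnames: []string{"ironic-2.example.internal"}},
		},
		Containers: []corev1.Container{{Name: "ironic", Image: "quay.io/metal3-io/ironic"}},
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out))
}
```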

lentzi90 commented 5 months ago

I'm not following. What is the issue with StatefulSets not working as DaemonSets? I guess you want to spread the pods so they don't run on the same host? But do you mean that you want a one-to-one mapping between pods and nodes as a DaemonSet would give? That can probably be useful in some situations but I'm not sure it would be the "default use-case".

With StatefulSets, we would get a "cloud native" configuration:

With DaemonSets, we could do a more "traditional" configuration:

About HostAliases, I think we can use it but I don't get why/how? If we can configure a HostAlias, can we not then just use the IP directly anyway?

matthewei commented 4 months ago

I think we can use this solution to solve HA: https://github.com/aenix-io/dnsmasq-controller

matthewei commented 4 months ago

IronicStandalone contains httpd, dnsmasq, ironic and ramdisk-logs. Only dnsmasq and ironic need host networking: dnsmasq uses it to provide DHCP, and ironic uses it to set boot and power state and for other functions.

dtantsur commented 3 months ago

What is the issue with StatefulSets not working as DaemonSets? I guess you want to spread the pods so they don't run on the same host?

Yes, otherwise we introduce a dependency on the host network removal, which is still far away. With host networking, it's more or less a requirement.

I think we can use this solution to solve HA: https://github.com/aenix-io/dnsmasq-controller

This is interesting, but I wonder if it's maintained.

Only dnsmasq and ironic need host networking.

Let's not mix the host networking question in the picture please.

dtantsur commented 3 months ago

I've just realized we have a possible stopgap measure for JSON RPC TLS. We can make ironic-standalone-operator generate a CA and then make each Ironic instance generate and sign its own certificate based on its IP address. Yes, it's not pretty, but it can work right now.

Then only the Ironic boot configuration API will be a requirement.
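
A minimal sketch of that stopgap using only the Go standard library: the operator side generates the CA once, and each instance then gets a certificate signed for its own IP address. Key types, lifetimes and names are assumptions:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newCA creates a self-signed CA; this would be the operator's part.
func newCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "ironic-rpc-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(1, 0, 0),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// instanceCert signs a certificate for one Ironic instance, valid for its IP.
func instanceCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey, ip string) ([]byte, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: ip},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(1, 0, 0),
		IPAddresses:  []net.IP{net.ParseIP(ip)},
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	return der, key, err
}

func main() {
	ca, caKey, err := newCA()
	if err != nil {
		panic(err)
	}
	der, _, err := instanceCert(ca, caKey, "192.168.111.20")
	if err != nil {
		panic(err)
	}
	fmt.Printf("issued %d-byte certificate for 192.168.111.20\n", len(der))
}
```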

metal3-io-bot commented 1 week ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Rozzii commented 1 week ago

/remove-lifecycle stale