metal3-io / metal3-docs

Architecture documentation that describes the components being built under Metal³.
http://metal3.io
Apache License 2.0

DHCP-less IPAM proposal #373

Open hardys opened 6 months ago

hardys commented 6 months ago

Draft proposal to explore how we can enable the CAPM3 IPAM flow in a DHCP-less environment, following on from recent discussions on Slack

/cc @Rozzii @dtantsur

hardys commented 6 months ago

Thanks for the review @pierrecregut - and apologies for not tagging you originally, I wasn't sure of your github ID but I see now I should have figured it out 😂

pierrecregut commented 5 months ago

@hardys At some point the document was moved to cluster-api-provider-metal3. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I would also not have changed the title to include CAPM3 explicitly.

hardys commented 5 months ago

> @hardys At some point the document was moved to cluster-api-provider-metal3. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I would also not have changed the title to include CAPM3 explicitly.

@pierrecregut this was in response to our previous discussion, and an attempt to constrain the scope of this proposal so that it's actionable - we are exploring some BMO/BMH API dependencies, but the main aim is to extend the current CAPM3 templating so that the existing IPAM integration works in DHCP-less environments?

tuminoid commented 5 months ago

Please rebase at a convenient time to pick up the new markdown linter check.

hardys commented 5 months ago

> @hardys At some point the document was moved to cluster-api-provider-metal3. But the solution should work at the baremetal-operator level and should not imply the use of a cluster. I would also not have changed the title to include CAPM3 explicitly.

I moved the proposal back to the top level since it now contains changes for both CAPM3 and BMO, and I can adjust the PR title if that makes things clearer.

hardys commented 5 months ago

/retitle DHCP-less IPAM proposal

hardys commented 5 months ago

Thanks everyone for your feedback so far - I've reworked the proposal and tried to take all the comments into account; please let me know if there are any further suggestions or things that I missed.

hardys commented 4 months ago

To cross-reference: we had some good discussions related to this proposal in the community meetup call last week.

metal3-io-bot commented 3 months ago

@hardys: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| shellcheck | ff02bb40340e30497a756982568eb4649bb5c186 | link | true | /test shellcheck |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
Rozzii commented 3 months ago

/approve

Rozzii commented 3 months ago

/lgtm

Rozzii commented 1 month ago

/cc @lentzi90 @kashifest please take a look

metal3-io-bot commented 1 month ago

@Rozzii: GitHub didn't allow me to request PR reviews from the following users: please, take, a, look.

Note that only metal3-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/metal3-io/metal3-docs/pull/373#issuecomment-2095429015):

> /cc @lentzi90 @kashifest please take a look

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
tuminoid commented 1 month ago

Hi @hardys, there are a few open review comments/questions here; are you planning on continuing/finishing this?

hardys commented 4 weeks ago

> Hi @hardys, there are a few open review comments/questions here; are you planning on continuing/finishing this?

Hi @tuminoid, yes I would still like to finish this but have been busy with other tasks - I see @lentzi90 has some feedback, so I will look into those comments, and perhaps we will need some discussion on the weekly call to reach consensus.

metal3-io-bot commented 4 weeks ago

New changes are detected. LGTM label has been removed.

metal3-io-bot commented 4 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: **Rozzii**

Once this PR has been reviewed and has the lgtm label, please assign zaneb for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/metal3-io/metal3-docs/blob/main/OWNERS)**
- ~~[design/OWNERS](https://github.com/metal3-io/metal3-docs/blob/main/design/OWNERS)~~ [Rozzii]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
hardys commented 4 weeks ago

> Sorry for the much delayed review. Thanks for the proposal and great work on the writeup!
>
> I have two concerns/questions:
>
> 1. Using an annotation may seem like an easy solution, but I worry that it will not be nice in the long run. We get no schema validation (typos will be nasty to find) and no way of evolving gracefully with conversions between API versions. Because of this, the annotation feels to me like we are bolting something on rather than building a proper feature with full support. There were some alternatives described, with the disadvantage that they would not handle the association of template to BMH. Why is this? Could we not have a proper BMH API for that?

Please could you check the previous thread where this was discussed with @pierrecregut and @zaneb in some detail - the consensus was that adding an annotation was a lightweight option which seemed more appropriate, given that the BMH doesn't consume this template (it's layered on top via another controller).

I'm open to revisiting that direction, but it would be really helpful if you could articulate a specific alternative so we can understand the pros/cons vs the minimally invasive annotation choice previously agreed.

> 2. There are ongoing discussions around how to make CAPM3 support multi-tenancy. One of the use-cases I see is that the owner of the BMH does not have to be the same as the owner of the (Metal3)Machine. By handling the preprovisioningNetworkData through CAPM3 we would essentially tie them together even more and make multi-tenancy harder.

I'm not sure this is true - check IPAM Scenario 2 (the decoupled preprovisioning/provisioning IPPool scenario) in this proposal; I think that describes exactly this use-case?

> My naive proposal would be that BMO should have the new controller that handles the preprovisioningNetworkData and CAPM3 should not know or care about this. If one team owns the whole thing there is no difference, but if it is split up, I expect that the team owning the BMHs would also handle preprovisioning. What do you think about this?

Logically what you describe makes sense - we could add the new controller described in this proposal to BMO instead of CAPM3 - but practically my concern is that it'll basically duplicate the existing CAPM3 controller which handles networkData. We'd need to handle that in a way that won't make future refactoring (for example to enable templating of the nmstate format as an alternative to cloud-init) more difficult.

@pierrecregut @zaneb what are your thoughts? Is it worth reworking this proposal again to consider e.g. adding a preprovisioningDataTemplate field to the BMH directly, along with an optional/pluggable templating controller?

pierrecregut commented 3 weeks ago

On one hand, I also agree that explicit fields are better than labels or annotations. I am not sure there is an issue with version migration, as it is just a reference, but there is also the built-in documentation of the CRD that makes explicit fields much more usable. On the other hand, it is also clear that there is no real consensus on the network format, so the field must be able to consume different templates, and therefore different resource kinds. A ref with a kind and a name may be fine. It is still true that the content will not be handled by the BareMetalHost controller itself but by another one, even if a default may be bundled in the BMH controller pod.
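
For illustration, a minimal sketch of what such a kind-plus-name reference could look like as an explicit BMH field (all names here are hypothetical, not an agreed API):

```go
// Hypothetical sketch only - field and type names are illustrative, not an
// agreed API. It shows how the "ref with a kind and a name" could be expressed
// as an explicit BMH spec field instead of an annotation.
package v1alpha1

// PreprovisioningDataTemplateRef points at a templating resource of an
// arbitrary kind. The BareMetalHost controller would not interpret the
// referenced object itself; it only records the association so that another
// (possibly bundled) controller can render the preprovisioning network data.
type PreprovisioningDataTemplateRef struct {
	// Kind of the template resource, e.g. a CAPM3 or vendor-specific CRD.
	Kind string `json:"kind"`
	// Name of the template resource, in the BareMetalHost's namespace.
	Name string `json:"name"`
}

// On the BareMetalHost spec this could appear as an optional field such as:
//
//	PreprovisioningDataTemplateRef *PreprovisioningDataTemplateRef `json:"preprovisioningDataTemplateRef,omitempty"`
```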

lentzi90 commented 3 weeks ago

>> Sorry for the much delayed review. Thanks for the proposal and great work on the writeup! I have two concerns/questions:
>>
>> 1. Using an annotation may seem like an easy solution, but I worry that it will not be nice in the long run. We get no schema validation (typos will be nasty to find) and no way of evolving gracefully with conversions between API versions. Because of this, the annotation feels to me like we are bolting something on rather than building a proper feature with full support. There were some alternatives described, with the disadvantage that they would not handle the association of template to BMH. Why is this? Could we not have a proper BMH API for that?
>
> Please could you check the previous thread where this was discussed with @pierrecregut and @zaneb in some detail - the consensus was that adding an annotation was a lightweight option which seemed more appropriate, given that the BMH doesn't consume this template (it's layered on top via another controller).
>
> I'm open to revisiting that direction, but it would be really helpful if you could articulate a specific alternative so we can understand the pros/cons vs the minimally invasive annotation choice previously agreed.

Thanks for the pointer, I have now gone through the thread and understand the reasoning better. I would like to suggest at a minimum one change in the design, and also to give a couple of comparisons to alternatives that we could consider.

The minimum change I would like is that BMO should not care about the annotation. Instead, the preprovisioningNetworkDataName would be set to the name of the secret to be created. BMO would wait until it exists before inspection is attempted. The controller reading the annotation would be responsible for checking what name the secret should have and creating it.

This flow is identical to how cert-manager handles annotations on Ingress resources and should be familiar to many users. Note that this way of working is limited: it is not possible to request any special settings through the annotation since it is only a reference to another resource. Because of this limitation, it is quite common to create Certificate resources directly instead. Translating this to Metal3, I think it should be expected that users may want to create explicit Metal3DataClaims or IPAddressClaims to get better control as well. I'm unsure whether this would be possible with the proposal as it is?
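
A minimal sketch of the BMO side of that flow, assuming only the existing spec.preprovisioningNetworkDataName field (this is not actual BMO code and the helper name is made up):

```go
// Illustrative only: a helper a BMO-side reconciler could call before starting
// inspection. The Secret name is whatever was written into the BMH's
// spec.preprovisioningNetworkDataName; BMO never looks at the annotation.
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// preprovisioningDataReady returns true when inspection may proceed: either no
// preprovisioning network data was requested, or the referenced Secret exists.
// When it returns false with no error, the reconciler should requeue and wait
// for the templating controller (the one watching the annotation) to create it.
func preprovisioningDataReady(ctx context.Context, c client.Client, namespace, secretName string) (bool, error) {
	if secretName == "" {
		return true, nil // nothing requested; behave exactly as today
	}
	secret := &corev1.Secret{}
	err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: secretName}, secret)
	if apierrors.IsNotFound(err) {
		return false, nil // not rendered yet; wait before inspecting
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```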

The second comparison I want to make is to PersistentVolume(Claims) and StatefulSets. I think this is also similar to what we want to achieve, but it is solved in a different way. StatefulSets have volumeClaimTemplates that provide a way to dynamically create PersistentVolumeClaims for the Pods, which are then filled from a StorageClass.

I think this has the advantage of giving the user better control than the annotation, while also automating claim creation. However, it would be a larger API change, and we do not have the same need for dynamically creating the claims since the BMHs don't scale. I still wanted to give this as an alternative to consider, but I am fine with the annotation as described above.
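
To make the comparison concrete, here is a purely hypothetical sketch of what a volumeClaimTemplates-style field could look like; this is the larger API change mentioned above, not something being proposed, and every name is made up:

```go
// Hypothetical only: a claim-template field on the BareMetalHost, by analogy
// with StatefulSet volumeClaimTemplates, where a controller would stamp out a
// claim per host and a "class"-like template resource would fill it in.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PreprovisioningDataClaimTemplate describes the claim to generate for a host.
type PreprovisioningDataClaimTemplate struct {
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// TemplateKind and TemplateName select the templating resource that fills
	// the generated claim, playing the role a StorageClass plays for PVCs.
	TemplateKind string `json:"templateKind"`
	TemplateName string `json:"templateName"`
}
```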

>> 2. There are ongoing discussions around how to make CAPM3 support multi-tenancy. One of the use-cases I see is that the owner of the BMH does not have to be the same as the owner of the (Metal3)Machine. By handling the preprovisioningNetworkData through CAPM3 we would essentially tie them together even more and make multi-tenancy harder.

> I'm not sure this is true - check IPAM Scenario 2 (the decoupled preprovisioning/provisioning IPPool scenario) in this proposal; I think that describes exactly this use-case?

You are right! Somehow I missed that. It does address the use-case from the "user point of view", i.e. one user can manage the BMHs and preprovisioning, while another is "consuming" the BMHs for running workloads. This is all good.

I still have a concern when thinking one step further though. If CAPM3 is going to handle the preprovisioningNetworkData and directly interact with the BMHs, then the user must delegate enough privileges to CAPM3 to be able to do this. My long term goal is to add an indirection so that CAPM3 would not need this access. The point is to avoid giving the "consumer" access to BMC credentials, Ironic endpoints and so on.

However, I admit that CAPM3 for now (even with the multi-tenancy proposal) needs this access, so adding the preprovisioningNetworkData is just one more thing. If we can agree that CAPM3 will just read the annotation and create the secret, then that is good enough for me. My concern would be if CAPM3 directly changed the BMH object to add the preprovisioningNetworkData.

>> My naive proposal would be that BMO should have the new controller that handles the preprovisioningNetworkData and CAPM3 should not know or care about this. If one team owns the whole thing there is no difference, but if it is split up, I expect that the team owning the BMHs would also handle preprovisioning. What do you think about this?

> Logically what you describe makes sense - we could add the new controller described in this proposal to BMO instead of CAPM3 - but practically my concern is that it'll basically duplicate the existing CAPM3 controller which handles networkData. We'd need to handle that in a way that won't make future refactoring (for example to enable templating of the nmstate format as an alternative to cloud-init) more difficult.

Good point. I am fine ignoring this for now. As we progress with multi-tenancy and "BMH claims", we may need to revisit, but I don't want to block this proposal based on some imaginary future proposal and design.

pierrecregut commented 3 weeks ago

> Logically what you describe makes sense - we could add the new controller described in this proposal to BMO instead of CAPM3 - but practically my concern is that it'll basically duplicate the existing CAPM3 controller which handles networkData. We'd need to handle that in a way that won't make future refactoring (for example to enable templating of the nmstate format as an alternative to cloud-init) more difficult.

I am not sure code reuse (a Go issue) and the location of controllers need to be mixed. We could have a module (kept separate to avoid linking the dependencies of the BMO and CAPM3 controllers) in the BMO repository that handles network templating with respect to a set of metadata sources. It would also contain the types describing networks, with marshalling annotations. It would be used by the PreprovisioningTemplate controller in BMO with only the BareMetalHost as a source, and in CAPM3 by the DataTemplate controller with the Machine, M3M and BareMetalHost as sources.
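
As a very rough sketch of that module boundary (every name here is hypothetical):

```go
// Purely illustrative sketch of the shared templating module idea: a package
// that renders network data from a set of metadata sources, so both a BMO
// PreprovisioningTemplate controller and the CAPM3 DataTemplate controller
// could reuse it without importing each other.
package networktemplate

// Source exposes the metadata one object (BareMetalHost, Machine,
// Metal3Machine, ...) contributes to template rendering.
type Source interface {
	// Kind identifies the object type, e.g. "BareMetalHost".
	Kind() string
	// Values returns the key/value metadata used while rendering.
	Values() map[string]string
}

// Renderer turns a template plus metadata sources into the secret payload.
// BMO would call it with only a BareMetalHost source; CAPM3 with Machine,
// Metal3Machine and BareMetalHost sources.
type Renderer interface {
	Render(template []byte, sources ...Source) (map[string][]byte, error)
}
```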

Defining the preprovisioning configuration is really a responsibility of the teams handling the hardware, because its primary purpose is to talk to the Ironic instance they manage. They have no reason to deploy CAPM3 on their cluster if BMH and cluster management are separate.

hardys commented 15 hours ago

> However, I admit that CAPM3 for now (even with the multi-tenancy proposal) needs this access, so adding the preprovisioningNetworkData is just one more thing. If we can agree that CAPM3 will just read the annotation and create the secret, then that is good enough for me. My concern would be if CAPM3 directly changed the BMH object to add the preprovisioningNetworkData.

@lentzi90 my main concern with this suggestion is that we'd end up with two slightly different user interfaces for preprovisioningNetworkData and the existing networkData - I guess that could be mitigated by ensuring the "new" behavior is possible for both, e.g. check for a secret name on the BMH and only try to set it when unset (permissions for which could potentially be removed in future).
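
As a minimal sketch of that guard, assuming the BMH Go types come from the usual apis/metal3.io/v1alpha1 package and expose the field as Spec.PreprovisioningNetworkDataName (the function and variable names are made up):

```go
// Sketch of the "only set it when unset" guard: whichever controller renders
// the secret writes the name into the BMH only when nothing is set yet, so an
// explicitly configured name is never overwritten (and this write permission
// could later be dropped entirely).
package sketch

import (
	"context"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func ensurePreprovisioningNetworkDataName(ctx context.Context, c client.Client, bmh *metal3v1alpha1.BareMetalHost, generatedName string) error {
	if bmh.Spec.PreprovisioningNetworkDataName != "" {
		return nil // already set by the user or another controller; leave it alone
	}
	patch := client.MergeFrom(bmh.DeepCopy())
	bmh.Spec.PreprovisioningNetworkDataName = generatedName
	return c.Patch(ctx, bmh, patch)
}
```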

zaneb commented 2 hours ago

After reading through the multitenancy proposal, I think Lennart raises a good point here.

The original sin of Metal³ has always been that we lumped everything into one BMH object and didn't make any distinction between different roles that might be involved - forget multitenancy for a second, we don't even distinguish between a hardware administrator and someone who is allowed to provision. If you can provision a host you can do anything to it.

CAPM3 already crossed the line a bit by giving the user control of the networkData. Note that in OpenStack, the networkData and metaData are provided internally by ironic itself, and only the userData is under the control of the user via the Nova API.

We added the preprovisioningNetworkData field largely so we could keep what is clearly an admin operation (affects whether we can boot IPA at all) separate from the user-supplied data that CAPM3 adds. Now we are finding that CAPM3 doing IPAM for the provisioning image is not actually sufficient, since it needs to be done before provisioning.

Perhaps this is an opportunity to reconsider whether this functionality belongs on the CAPI side of the fence at all. Whether a desire to add a host to a CAPI cluster should be sufficient to imply permission to do that.

In Metal³ we are missing a component that covers some functionality in OpenStack that is spread between Nova, Ironic, and Neutron: given a request to provision a host, find one we're permitted to use, connect it to the right networks, and supply it with the right networkData. Partly that was intentional, because as Steve and I both know it turns out nobody actually wants their datacenter to act like a cloud where they don't care where the resources are coming from.

The HostClaim proposal is essentially a proposal to build that component, starting with disconnecting a request to provision a host from any specific host (which I think may be a mistake), but inevitably leading to demands that we add the ability to reconfigure networks at the switch level (which I also think may be a mistake).

Personally, I'd prefer to pursue multitenancy by separating the different roles clearly and making space for anybody who wants to to build another component in the middle to bridge the gap between user and admin. It's not clear that one size fits all, nor that we want the Metal³ project to be the sole owner of that problem.

In the meantime, it seems to me that IPAM fits pretty clearly on the 'admin' side of the user-admin divide, and is something that should be configured as part of creating the BMH resources, not as part of provisioning them, in applications where those two roles are separate. I'm not sure that this changes this proposal at all, except that maybe infrastructure.cluster.x-k8s.io is the wrong namespace for the Metal3PreprovisioningDataTemplate. But maybe we should start to look at this as a replacement for, rather than an addition to, CAPM3 templates for the networkData.