openstack-k8s-operators / openstack-operator

Meta Operator for OpenStack
https://openstack-k8s-operators.github.io/openstack-operator/
Apache License 2.0

Remove webhook validation for networks #1099

Open rabi opened 1 week ago

rabi commented 1 week ago

We now support custom network names. The ctlplane network (or the network whose serviceNet is ctlplane) can be anywhere in the order and it won't matter.

Jira: https://issues.redhat.com/browse/OSPRH-10125

openshift-ci[bot] commented 1 week ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rabi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

* ~~[OWNERS](https://github.com/openstack-k8s-operators/openstack-operator/blob/main/OWNERS)~~ [rabi]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
softwarefactory-project-zuul[bot] commented 1 week ago

Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8c2584f8647c4ac8890dfe3dd12cd75c

:heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 52m 33s
:heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 12m 29s
:heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 24m 50s
:x: openstack-operator-tempest-multinode FAILURE in 1h 39m 48s

rabi commented 1 week ago

> One still needs to have the required name. So it doesn't make sense to remove the validation, only to place its analogue somewhere else. Instead you should modify it to check for the presence of at least one network with the required name.

Sorry, I don't understand what you mean here. The network name can be anything; what matters is the serviceName in netconfig, and serviceName falls back to the network name if not provided. So there can be networks in netconfig like:

    - name: net1
      serviceName: ctlplane
      ....
    - name: net2
      serviceNet: internalapi

And then you provide them in the nodeset spec as:

    networks:
    - net1
    - net2

So the existing validation is not correct.

jpodivin commented 1 week ago

> > One still needs to have the required name. So it doesn't make sense to remove the validation, only to place its analogue somewhere else. Instead you should modify it to check for the presence of at least one network with the required name.
>
> Sorry, I don't understand what you mean here. The network name can be anything; what matters is the serviceName in netconfig. [...] So the existing validation is not correct.

Sorry, perhaps I didn't make it entirely clear. The current validation, the one you are removing, is incorrect in that it assumes something that was once true, and no longer is, to still be true. That much is correct. However, removing it completely, so that we no longer validate these fields at all, would imply that all network names are acceptable in any form.

But you are also adding this to ipam.go:

    // Accept the node if any of its networks maps to ctlplane, either by
    // name (case-insensitive) or via its serviceNet mapping.
    for _, net := range nets {
        if strings.EqualFold(string(net.Name), dataplanev1.CtlPlaneNetwork) ||
            netServiceNetMap[strings.ToLower(string(net.Name))] == dataplanev1.CtlPlaneNetwork {
            foundCtlPlane = true
        }
    }
    if !foundCtlPlane {
        msg := fmt.Sprintf("ctlplane network should be defined for node %s", nodeName)
        return nil, netServiceNetMap, fmt.Errorf(msg)
    }

This checks whether there is at least one network with a matching name and throws an error if there is none. That seems very similar to the original validation. My question, then, is: why not just fix the original validation to do the above, rather than removing it and placing similar code in the reconciliation loop?

rabi commented 1 week ago

> That seems very similar to the original validation. My question, then, is: why not just fix the original validation to do the above, rather than removing it and placing similar code in the reconciliation loop?

It's different in many aspects:

* the network name may not be any case of "ctlplane"
* the network name or serviceNetName for one of the networks in a nodeset has to be "ctlplane"
* "ctlplane" does not need to be the first network

The reason the validation was moved out of the webhook is that you need to get the netconfig CR with an API call, which we won't do in the webhook. Instead, the netconfig network/serviceNetMap is already available in ipam.go, where this PR does the validation before reserving IPs.

jpodivin commented 1 week ago

> It's different in many aspects:
>
> * the network name may not be any case of "ctlplane"
> * the network name or serviceNetName for one of the networks in a nodeset has to be "ctlplane"
> * "ctlplane" does not need to be the first network
>
> The reason the validation was moved out of the webhook is that you need to get the netconfig CR with an API call, which we won't do in the webhook. Instead, the netconfig network/serviceNetMap is already available in ipam.go, where this PR does the validation before reserving IPs.

It's a loop that checks the names of networks against a preset string and throws an error if the list isn't compliant. That sounds very similar to the original, but I digress.

If you consider API calls in webhooks to be verboten, then I guess you really don't have any choice. But that is very much a self-imposed constraint that I consider unfortunate, especially since there are plenty of examples of doing exactly that across the ecosystem, even in our own projects:

https://github.com/openshift/lvm-operator/blob/main/api/v1alpha1/lvmcluster_webhook.go#L90
https://github.com/openshift/operator-framework-rukpak/blob/main/internal/webhook/configmaps.go#L35
https://github.com/openshift/sandboxed-containers-operator/blob/devel/api/v1/kataconfig_webhook.go#L61
https://github.com/openstack-k8s-operators/openstack-baremetal-operator/blob/main/api/v1beta1/openstackbaremetalset_webhook.go#L68
https://github.com/openstack-k8s-operators/openstack-baremetal-operator/blob/main/api/v1beta1/openstackprovisionserver_webhook.go#L77
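
For context, the pattern in those links looks roughly like the minimal sketch below: a controller-runtime CustomValidator whose webhook makes a client call before admitting the object. This is an illustration, not this repo's actual code; the type name nodeSetValidator and the networkv1.NetConfigList lookup are assumptions.

    // Hypothetical sketch only, in the style of the examples linked above:
    // a validating webhook that makes an API call before admitting the object.
    package webhooks

    import (
        "context"
        "fmt"

        "k8s.io/apimachinery/pkg/runtime"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/webhook/admission"

        dataplanev1 "github.com/openstack-k8s-operators/openstack-operator/apis/dataplane/v1beta1"
        networkv1 "github.com/openstack-k8s-operators/infra-operator/apis/network/v1beta1"
    )

    // nodeSetValidator is an illustrative name; the client is injected when
    // the webhook is registered with the manager.
    type nodeSetValidator struct {
        client client.Client
    }

    // ValidateCreate implements admission.CustomValidator (signature as of
    // recent controller-runtime releases).
    func (v *nodeSetValidator) ValidateCreate(ctx context.Context, obj runtime.Object) (admission.Warnings, error) {
        nodeSet, ok := obj.(*dataplanev1.OpenStackDataPlaneNodeSet)
        if !ok {
            return nil, fmt.Errorf("expected an OpenStackDataPlaneNodeSet, got %T", obj)
        }
        // The external call under debate: the API server is blocked,
        // synchronously waiting for this admission response.
        netConfigs := &networkv1.NetConfigList{}
        if err := v.client.List(ctx, netConfigs, client.InNamespace(nodeSet.Namespace)); err != nil {
            return nil, err
        }
        // ... validate nodeSet.Spec networks against netConfigs here ...
        return nil, nil
    }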

rabi commented 1 week ago

> If you consider API calls in webhooks to be verboten, then I guess you really don't have any choice.

I think there were PRs blocked earlier that were making API calls from webhooks, and I don't think it's used anywhere other than the openstack-baremetal-operator repo for us. I can dig out those discussions if you want. As for openstack-baremetal-operator, some of that was inherited from osp-director-operator and we're planning to change it.

jpodivin commented 1 week ago

> I think there were PRs blocked earlier that were making API calls from webhooks, and I don't think it's used anywhere other than the openstack-baremetal-operator repo for us. I can dig out those discussions if you want. As for openstack-baremetal-operator, some of that was inherited from osp-director-operator and we're planning to change it.

I do believe there were, and I still don't agree that it was a good choice to limit ourselves in this way, especially since it seems to be reasonably common practice.

rabi commented 1 week ago

Check these links; one is from you :smile:

https://github.com/openstack-k8s-operators/dataplane-operator/pull/829#discussion_r1568341183
https://github.com/openstack-k8s-operators/dataplane-operator/pull/829#discussion_r1568910852

gibizer commented 1 week ago

I'm on the side of, by default, not doing external calls in webhooks, as I believe we need to be careful not to slow down API response generation, and we must also be careful not to create an API-call-multiplication overload situation, where handling a single API call needs a lot of extra API calls in the background. Then we can look at each case where this imposes too much restriction. If a validation has a high potential to catch high-impact issues, then we can decide to break our default rule and add a specific external call to the webhook.

Now looking at the case at hand:

* Will the current move of the validation logic cause a deployment to fail in the middle?
* Is it easy for the user to make a mistake, for example a typo?

If the answer to both questions is clearly yes, and there is no easy way to make the situation better other than adding the external call to the webhook, then I think this external call is warranted.

rabi commented 1 week ago

> Will the current move of the validation logic cause a deployment to fail in the middle?

It would fail when reserving IPs. But that's before anything concrete is done by the controller, i.e. creating IPSets, provisioning bare metal, or deploying anything on the nodes. So it's not high impact, I think.

rabi commented 1 week ago

> Is it easy for the user to make a mistake, for example a typo?

Atm (without this PR), if the user makes a typo, i.e. provides a network (other than ctlplane) that's not in netconfig, it would fail at the same place, i.e. reserving IPs, and the nodeset would have the NodeSetIPReservation status as false with the error, for the user to fix the networks section in the nodeset spec.
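
(For illustration, a minimal sketch of how such a reconcile-time failure lands on status, using apimachinery's generic condition helpers rather than this project's own condition package; the reason string is illustrative:)

    package dataplane

    import (
        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // Sketch: record a reconcile-time IP reservation failure on the nodeset
    // status, so the user sees the error and can fix the spec.
    func markIPReservationFailed(conditions *[]metav1.Condition, err error) {
        meta.SetStatusCondition(conditions, metav1.Condition{
            Type:    "NodeSetIPReservation", // condition name as referenced above
            Status:  metav1.ConditionFalse,
            Reason:  "Failed", // illustrative reason string
            Message: err.Error(),
        })
    }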

jpodivin commented 1 week ago

> Atm (without this PR), if the user makes a typo, i.e. provides a network (other than ctlplane) that's not in netconfig, it would fail at the same place, i.e. reserving IPs, and the nodeset would have the NodeSetIPReservation status as false with the error, for the user to fix the networks section in the nodeset spec.

I need to clarify something here. I don't dispute that the current validation is no longer fit for purpose; things have changed since it was implemented, and that has been established. What I do suggest is that instead of removing it and moving the code that verifies network names deeper, it should be rewritten.

As for the comments. I caved to peer pressure at the time. I acknowledge that. However I reserve the right to change my mind. Surely you wouldn't deny me that?

gibizer commented 1 week ago

> > Will the current move of the validation logic cause a deployment to fail in the middle?
>
> It would fail when reserving IPs. But that's before anything concrete is done by the controller, i.e. creating IPSets, provisioning bare metal, or deploying anything on the nodes. So it's not high impact, I think.

OK, I agree this is low impact. It fails relatively fast without making any changes on the EDPM node. So I'm OK with moving the validation logic to the reconcile loop.

gibizer commented 1 week ago

> > Atm (without this PR), if the user makes a typo [...] it would fail at the same place, i.e. reserving IPs [...]
>
> I need to clarify something here. I don't dispute that the current validation is no longer fit for purpose; things have changed since it was implemented, and that has been established. What I do suggest is that instead of removing it and moving the code that verifies network names deeper, it should be rewritten.

I think where we disagree is on the cost of having a webhook with an external call. I'm on the side that it is not cheap, while you are probably on the side that it is cheap.

Btw the validation is not removed; it is moved from the webhook, where it is no longer practical (cheap) to run, to the reconcile loop.

jpodivin commented 1 week ago

> I think where we disagree is on the cost of having a webhook with an external call. I'm on the side that it is not cheap, while you are probably on the side that it is cheap.
>
> Btw the validation is not removed; it is moved from the webhook, where it is no longer practical (cheap) to run, to the reconcile loop.

If we are talking cost/time, then we need to consider the calls that would have been avoided if the spec had been rejected by the webhook. For the sake of argument, let's say that all calls take negligible time, except for those which use the client to query the API. And let's say that every API call takes the same amount of time.

The option with the webhook would, under these assumptions, take 1 unit of time, as it would not let the invalid data anywhere near the reconciliation loop. Now for the option with the check implemented deeper:

First call [1], one call per service [2], cleaning stale reservations [3], and finally the call to get netConfigs [4].

I omit a lot of other, potentially expensive calls. And I don't even consider those that should be relatively cheap, like setting conditions etc.

But even then, the proposed 'fast' approach requires multiple times more API calls than the other option.

[1] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/controllers/dataplane/openstackdataplanenodeset_controller.go#L140
[2] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/service.go#L140 and https://github.com/openstack-k8s-operators/openstack-operator/blob/main/controllers/dataplane/openstackdataplanenodeset_controller.go#L221
[3] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/ipam.go#L307
[4] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/ipam.go#L344

gibizer commented 1 week ago

> If we are talking cost/time, then we need to consider the calls that would have been avoided if the spec had been rejected by the webhook. [...] The option with the webhook would, under these assumptions, take 1 unit of time. [...] But even then, the proposed 'fast' approach requires multiple times more API calls than the other option.

What you miss is where that call takes place. If a call takes place in the webhook, it slows down the response from the API; it is on the critical path to update the NodeSet. Whereas if it takes place in the reconcile loop, then that API call can be (and will be) throttled if the controller is sending too many API requests, to preserve the responsiveness of the API server.
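
(To make the throttling point concrete: controller-runtime clients are rate limited on the client side through the rest.Config, so reconcile-loop API traffic queues up instead of loading the API server. A minimal sketch; the 20/30 values are the usual controller-runtime defaults, shown here explicitly:)

    package setup

    import (
        ctrl "sigs.k8s.io/controller-runtime"
    )

    func newManager() (ctrl.Manager, error) {
        cfg, err := ctrl.GetConfig()
        if err != nil {
            return nil, err
        }
        // Client-side rate limiting: calls made from the reconcile loop are
        // throttled here, off the API server's critical path. Webhook handling,
        // by contrast, blocks the API server until the response arrives.
        cfg.QPS = 20
        cfg.Burst = 30
        return ctrl.NewManager(cfg, ctrl.Options{})
    }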

jpodivin commented 1 week ago

> What you miss is where that call takes place. If a call takes place in the webhook, it slows down the response from the API; it is on the critical path to update the NodeSet. Whereas if it takes place in the reconcile loop, then that API call can be (and will be) throttled if the controller is sending too many API requests, to preserve the responsiveness of the API server.

If calls get throttled, then it will take even longer to see changes take effect, and fail. What time are we trying to save here? The customer wants their valid configuration applied, or to get a response when there is something wrong with it. At the very least, that's what I would have wanted.

In that regard, the webhook takes less time than the reconciliation loop.

There are two options here. Either it will take slightly, barely noticeably, longer to get a response when submitting a new resource, or it will respond immediately, only to fail even later, once the API has processed a bunch of pointless requests. I don't see much benefit in the second.

gibizer commented 6 days ago

> If calls get throttled, then it will take even longer to see changes take effect, and fail. What time are we trying to save here? The customer wants their valid configuration applied, or to get a response when there is something wrong with it. At the very least, that's what I would have wanted.
>
> In that regard, the webhook takes less time than the reconciliation loop.

We get throttled to keep the API server responsive overall, so that is a good thing. The cluster API is still responsive and the user can still change things (e.g. fix the config) in etcd.

In my mind, the reconcile loop is allowed to be significantly slower than the API response, as it only needs to make the world eventually consistent with the config in etcd.

> There are two options here. Either it will take slightly, barely noticeably, longer to get a response when submitting a new resource, or it will respond immediately, only to fail even later, once the API has processed a bunch of pointless requests. I don't see much benefit in the second.

By extension of your first option, we could move most of the active logic from the reconcile loop to the webhook. But we don't do that, I guess for the reason of keeping the webhooks as fast as possible, since they run in a codepath where the client waits for the response synchronously.

Anyhow, I think we can agree that we disagree on where to draw the line. I draw it at the existence of the first external call, as it is easy to spot. You probably draw it somewhere else, maybe where a single resource validation needs more than X (X >> 1) external calls. But somewhere there is a line; otherwise we could implement every mutation on the cluster inline in the webhook path.

jpodivin commented 6 days ago

I need to point out that in this case there can be no eventual consistency, because the input just isn't valid, and never will be. If there were a way to reconcile the input somehow, then the loop would be the right place for it.

rabi commented 6 days ago

> there can be no eventual consistency, because the input just isn't valid

As mentioned by @gibizer, webhooks directly impact the scalability and performance of the API server because the call from the API server is synchronous (in this case, before admitting the request, the API server calls the controller endpoint for the webhook, and then from the webhook you again call the API server to get external/other resources), and the request/thread in the API server is blocked until all of that is finished. That's the reason the best practice is to not make API calls from a webhook if not absolutely necessary.

Regarding ensuring the input is completely valid before persisting it to etcd, you would require lots of validations, i.e. the services listed in a nodeset should be valid, a fixedIP provided in networks should be from the network allocation ranges, enough BMHs should be available for the nodes in the spec when using preProvisioned: false, etc. Ensuring everything in the webhook is not necessary, as the user can see the error during reconcile and update the CR accordingly.

Anyway, I don't see this debate going anywhere. I would suggest we merge this to complete https://issues.redhat.com/browse/OSPRH-10125. If we decide to ensure that the spec/input is fully valid before admitting requests across operators, we can add all the validations later.

rabi commented 6 days ago

recheck

slagle commented 5 days ago

> Anyway, I don't see this debate going anywhere. I would suggest we merge this to complete https://issues.redhat.com/browse/OSPRH-10125. If we decide to ensure that the spec/input is fully valid before admitting requests across operators, we can add all the validations later.

That ought to be for FR2 though, AIUI, so what's the rush to get this in now, given it would end up in FR1 instead? I ask given our retro, where we wanted to be more intentional about the work we accept for each release.

I recognize there's a case for committing work if it's done, but in this case it seems like there's a design discussion worth having before moving forward, especially since the target is FR2.

slagle commented 5 days ago

> As mentioned by @gibizer, webhooks directly impact the scalability and performance of the API server because the call from the API server is synchronous [...]. That's the reason the best practice is to not make API calls from a webhook if not absolutely necessary.

FTR, I also agree that we should not make API calls from webhooks, as a best practice. I think that's where we should draw the line between webhook validation and reconciliation logic: if the operation/validation needs to make an API call, then it ought to be in the reconcile loop.

We can find multiple examples of it working both ways all over the place, but I don't think that's the point. All logic could be in a webhook, or all logic could be in the reconcile loop, so we ought to have a best practice on where to draw the line. For me, it's making API calls, and relying on the pattern of eventual consistency.

rabi commented 5 days ago

> so what's the rush to get this in now, given it would end up in FR1 instead?

There is no rush, other than the fact that, AFAIR, this has been discussed earlier with the same conclusion/agreement, and I wanted to close https://issues.redhat.com/browse/OSPRH-10125 before my upcoming PTO.

> I ask given our retro, where we wanted to be more intentional about the work we accept for each release.

Right, but isn't that for planning? Does it mean we can't accept and work on features that are planned for future releases, and even merge part or all of them if completed? I agree that this is linked to the https://issues.redhat.com/browse/OSPRH-821 feature for FR2, but custom network names were supported with OSP17.1 (many customers use them), and this is possibly a gap in OSP18 that I wanted to close in FR1.

But it's OK if we want further discussion to arrive at an agreement.