rabi opened 1 week ago
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rabi
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Build failed (check pipeline). Post recheck (without leading slash) to rerun all jobs. Make sure the failure cause has been resolved before you rerun jobs.
https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/8c2584f8647c4ac8890dfe3dd12cd75c
:heavy_check_mark: openstack-k8s-operators-content-provider SUCCESS in 1h 52m 33s
:heavy_check_mark: podified-multinode-edpm-deployment-crc SUCCESS in 1h 12m 29s
:heavy_check_mark: cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 24m 50s
:x: openstack-operator-tempest-multinode FAILURE in 1h 39m 48s
One still needs to have the required name. So it doesn't make sense to remove the validation, only to place its analogue somewhere else. Instead you should modify it to check for the presence of at least one network with the required name.

Sorry, I don't understand what you mean here. The network name can be anything. What matters is the serviceName in netconfig, and serviceName falls back to the network name if not provided. So there can be networks in netconfig like:

* name: net1, serviceName: ctlplane, ....
* name: net2, serviceNet: internalapi

And then you provide them in the nodeset spec as:

networks:
* net1
* net2

So the existing validation is not correct.
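To illustrate the fallback being described, here is a standalone sketch. The `network` struct and `serviceNetMap` helper are illustrative stand-ins, not the operator's actual netconfig types:

```go
package main

import (
	"fmt"
	"strings"
)

// network mirrors only the fields discussed above; the real netconfig
// types live in the infra/operator API packages.
type network struct {
	Name       string
	ServiceNet string // optional; falls back to Name when empty
}

// serviceNetMap resolves each network to its service network, falling
// back to the (lowercased) network name when ServiceNet is unset.
func serviceNetMap(nets []network) map[string]string {
	m := make(map[string]string)
	for _, n := range nets {
		sn := n.ServiceNet
		if sn == "" {
			sn = n.Name
		}
		m[strings.ToLower(n.Name)] = strings.ToLower(sn)
	}
	return m
}

func main() {
	nets := []network{
		{Name: "net1", ServiceNet: "ctlplane"},
		{Name: "net2", ServiceNet: "internalapi"},
	}
	// net1 maps to ctlplane even though its name is not "ctlplane"
	fmt.Println(serviceNetMap(nets)["net1"]) // ctlplane
}
```

This is why checking only the literal network name for "ctlplane" is no longer sufficient.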
Sorry. Perhaps I didn't make it entirely clear. The current validation, the one you are removing, is incorrect, in that it assumes something that was true, and no longer is, to still be true. That is correct. However, removing it completely, meaning that we would no longer validate these fields at all, would imply that all network names are acceptable in any form.
But you are also adding this, into ipam.go:
for _, net := range nets {
    if strings.EqualFold(string(net.Name), dataplanev1.CtlPlaneNetwork) ||
        netServiceNetMap[strings.ToLower(string(net.Name))] == dataplanev1.CtlPlaneNetwork {
        foundCtlPlane = true
    }
}
if !foundCtlPlane {
    msg := fmt.Sprintf("ctlplane network should be defined for node %s", nodeName)
    return nil, netServiceNetMap, fmt.Errorf(msg)
}
Checking if there is at least one network with a matching name and throwing an error if there is none. That seems very similar to the original validation. My question then is, why not just fix the original validation to do ^^^, rather than removing it and placing similar code to be used by the reconciliation loop?
That seems very similar to the original validation. My question then is, why not just fix the original validation to do ^^^, rather than removing it and placing similar code to be used by the reconciliation loop?

It's different in many aspects:

* the network name may be "ctlplane" in any case
* the network name or the serviceNetName of one of the nodeset's networks has to be "ctlplane"
* "ctlplane" does not need to be the first network

The reason the validation was moved out of the webhook is that you need to get the netconfig CR with an API call, which we won't do in the webhook. Instead, the netconfig network/serviceNetMap is already there in ipam.go, where the validation is done in this PR before reserving IPs.
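The points above can be condensed into a standalone check. This is a hedged sketch: `hasCtlPlane`, the `ctlPlaneNetwork` constant, and the plain-string types are illustrative stand-ins for the PR's actual `dataplanev1` types and the loop quoted earlier from ipam.go:

```go
package main

import (
	"fmt"
	"strings"
)

const ctlPlaneNetwork = "ctlplane" // stand-in for dataplanev1.CtlPlaneNetwork

// hasCtlPlane reports whether any nodeset network is the ctlplane network,
// either by its own (case-insensitive) name or via its mapped service
// network, regardless of position in the list.
func hasCtlPlane(netNames []string, netServiceNetMap map[string]string) bool {
	for _, name := range netNames {
		if strings.EqualFold(name, ctlPlaneNetwork) ||
			netServiceNetMap[strings.ToLower(name)] == ctlPlaneNetwork {
			return true
		}
	}
	return false
}

func main() {
	m := map[string]string{"net1": "ctlplane", "net2": "internalapi"}
	// true: net1's service network is ctlplane, even though no network
	// is literally named "ctlplane" and net1 is not first.
	fmt.Println(hasCtlPlane([]string{"net2", "net1"}, m))
	// false: nothing resolves to ctlplane.
	fmt.Println(hasCtlPlane([]string{"net2"}, m))
}
```

Note the lookup needs `netServiceNetMap`, which is built from the NetConfig CR, which is what forces the check out of the webhook and into the reconcile path in this PR.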
It's a loop that checks names of networks against a preset string and throws an error if the list isn't compliant. That sounds very similar to the original, but I digress.
If you consider API calls in webhooks to be verboten, then I guess you really don't have any choice. But that is very much a self-imposed constraint that I consider unfortunate, especially since there are plenty of examples of doing exactly that across the ecosystem. Even in our own projects:
https://github.com/openshift/lvm-operator/blob/main/api/v1alpha1/lvmcluster_webhook.go#L90
https://github.com/openshift/operator-framework-rukpak/blob/main/internal/webhook/configmaps.go#L35
https://github.com/openshift/sandboxed-containers-operator/blob/devel/api/v1/kataconfig_webhook.go#L61
https://github.com/openstack-k8s-operators/openstack-baremetal-operator/blob/main/api/v1beta1/openstackbaremetalset_webhook.go#L68
https://github.com/openstack-k8s-operators/openstack-baremetal-operator/blob/main/api/v1beta1/openstackprovisionserver_webhook.go#L77
If you consider API calls in webhooks to be verboten, then I guess you really don't have any choice.
I think there were PRs blocked earlier that were making API calls from webhooks, and I don't think it's used anywhere other than the openstack-baremetal-operator repo for us. I can dig out those discussions if you want. As for openstack-baremetal-operator, some of that has been inherited from osp-director-operator and we're planning to change those.
I do believe there were, and I still don't agree that it was a good choice to limit ourselves in this way, especially since it seems to be reasonably common practice.
I'm on the side of not doing external calls in webhooks by default, as I believe we need to be careful not to slow down API response generation, and also careful not to create an API-call multiplication overload, where a single API call needs a lot of extra API calls in the background to be handled. Then we can look at each case where this imposes too much restriction. If the validation has a high potential to catch high-impact issues, then we can decide to break our default rule and add a specific external call to the webhook.
Now looking at the case at hand:

Will the current move of the validation logic cause a deployment to fail in the middle?

Is it easy for the user to make a mistake, for example a typo?

If the answer to both questions is clearly yes, and there is no easy way to make the situation better other than adding the external call to the webhook, then I think this external call is warranted.
Will the current move of the validation logic cause a deployment to fail in the middle?
It would fail when reserving IPs. But that's before anything concrete is done by the controller, i.e. creating IPSets, provisioning baremetal, or deploying anything on the nodes. So it's not high impact, I think.
Is it easy for the user to make a mistake, for example a typo?
Atm (without this PR), if a user makes a typo, i.e. provides a network (not ctlplane) that's not in netconfig, it would fail at the same place, i.e. reserving IPs, and the nodeset would have the NodeSetIPReservation status as false with the error, for users to fix the networks section in the nodeset spec.
Atm (without this PR), if a user makes a typo, i.e. provides a network (not ctlplane) that's not in netconfig, it would fail at the same place, i.e. reserving IPs, and the nodeset would have the NodeSetIPReservation status as false with the error, for users to fix the networks section in the nodeset spec.
I need to clarify something here. I don't dispute that the current validation is no longer fit for purpose. Things have changed since it was implemented, and that has been established. What I do suggest is that, instead of removing it and moving the code verifying network names deeper, it should be rewritten.
As for the comments. I caved to peer pressure at the time. I acknowledge that. However I reserve the right to change my mind. Surely you wouldn't deny me that?
Will the current move of the validation logic cause a deployment to fail in the middle?
It would fail when reserving IPs. But that's before anything concrete is done by the controller, i.e. creating IPSets, provisioning baremetal, or deploying anything on the nodes. So it's not high impact, I think.
OK. I agree this is low impact. It fails relatively fast without making any changes on the EDPM node. So I'm OK with moving the validation logic to the reconcile loop.
Atm (without this PR), if a user makes a typo, i.e. provides a network (not ctlplane) that's not in netconfig, it would fail at the same place, i.e. reserving IPs, and the nodeset would have the NodeSetIPReservation status as false with the error, for users to fix the networks section in the nodeset spec.
I need to clarify something here. I don't dispute that the current validation is no longer fit for purpose. Things have changed since it was implemented, and that has been established. What I do suggest is that, instead of removing it and moving the code verifying network names deeper, it should be rewritten.
I think where we disagree is the cost of having a webhook with an external call. I'm on the side that it is not cheap, while you are probably on the side that it is cheap.
Btw the validation is not removed; it is moved from the webhook, where it is no longer practical (cheap) to run it, to the reconcile loop.
If we are talking cost/time, then we need to consider the calls that would have been avoided if the spec had been rejected by the webhook. For the sake of argument, let's say that all calls take negligible time, except for those which use the client to query the API, and that every API call takes the same amount of time.
The option with the webhook would, under these assumptions, take 1 unit of time, as it would not let the invalid data anywhere near the reconciliation loop.
Now for the option with the check implemented deeper: first call [1], one call per service [2], cleaning stale reservations [3], and finally the call to get netConfigs [4].
I omit a lot of other, potentially expensive calls, and I don't even consider those that should be relatively cheap, like setting conditions etc. But even then, the proposed 'fast' approach requires multiple times more API calls than the other option.
[1] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/controllers/dataplane/openstackdataplanenodeset_controller.go#L140
[2] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/service.go#L140
    https://github.com/openstack-k8s-operators/openstack-operator/blob/main/controllers/dataplane/openstackdataplanenodeset_controller.go#L221
[3] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/ipam.go#L307
[4] https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/dataplane/ipam.go#L344
What you miss is where that call takes place. If a call takes place in the webhook, it slows down the response from the API; it is on the critical path to update the NodeSet. If it takes place in the reconcile loop, then that API call can be (and will be) throttled if the controller is sending too many API requests, to preserve the responsiveness of the API server.
If calls get throttled, then it will take even longer to see changes take effect, and fail. What time are we trying to save here? The customer wants their valid configuration applied, or to get a response when there is something wrong with it. At the very least, that's what I would have wanted.
In that, the webhook takes less time than the reconciliation loop.
There are two options here. Either it takes slightly, barely noticeably, longer to get a response when submitting a new resource, or it responds immediately, only to fail even later, once the API processes a bunch of pointless requests. I don't see much benefit in the second.
If calls get throttled, then it will take even longer to see changes take effect, and fail. What time are we trying to save here? The customer wants their valid configuration applied, or to get a response when there is something wrong with it. At the very least, that's what I would have wanted.
In that, the webhook takes less time than the reconciliation loop.
We get throttled to keep the API server responsive overall, so that is a good thing. The cluster API is still responsive, and the user can still change things (e.g. fix the config) in etcd.
In my mind, the reconcile loop is allowed to be significantly slower than the API response, as it only needs to make the world eventually consistent with the config in etcd.
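To make the throttling point concrete, here is a stdlib-only toy token bucket. Real controllers get this behavior from client-go's flowcontrol rate limiter, configured via the QPS/Burst fields on rest.Config (historically defaulting to around 5 QPS with a burst of 10); the type below is a simplified illustration, not that library:

```go
package main

import "fmt"

// tokenBucket is a minimal stand-in for a client-go style client-side
// rate limiter. A reconcile loop that runs out of tokens simply waits
// and retries later; a webhook sitting on the API server's synchronous
// request path has no such luxury.
type tokenBucket struct {
	tokens int // remaining burst capacity before refill
}

func (b *tokenBucket) tryAccept() bool {
	if b.tokens > 0 {
		b.tokens--
		return true
	}
	return false // over budget: the caller must wait for a refill
}

func main() {
	b := &tokenBucket{tokens: 10}
	allowed := 0
	for i := 0; i < 20; i++ {
		if b.tryAccept() {
			allowed++
		}
	}
	// Only the burst capacity goes through immediately; the remaining
	// requests are deferred, which keeps the API server responsive.
	fmt.Println(allowed) // 10
}
```

This is the asymmetry being argued: deferring work is acceptable in the reconcile loop's eventual-consistency model, but any delay in the webhook is added directly to the client's API call latency.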
There are two options here. Either it takes slightly, barely noticeably, longer to get a response when submitting a new resource, or it responds immediately, only to fail even later, once the API processes a bunch of pointless requests. I don't see much benefit in the second.
By the extension of your first option, we could move most of the active logic from the reconcile loop to the webhook. But we don't do that, I guess for the reason of keeping the webhooks as fast as possible, as they run in a codepath where the client is waiting for the response synchronously.
Anyhow, I think we can agree to disagree on where to draw the line. I draw it at the existence of the first external call, as it is easy to spot. You probably draw it somewhere else, maybe where a single resource validation needs more than X (X >> 1) external calls. But somewhere there is a line, otherwise we could implement every mutation on the cluster inline in the webhook path.
I need to point out that in this case there can be no eventual consistency, because the input just isn't valid, and never will be. If there were a way to reconcile the input somehow, then the loop would be the right place for it.
there can be no eventual consistency, because the input just isn't valid
As mentioned by @gibizer, webhooks directly impact the scalability and performance of the API server, as the call from the API server is synchronous: before admitting the request, the apiserver calls the controller endpoint for the webhook, then from the webhook you again call the API server to get external/other resources, and the request/thread in the apiserver is blocked until all of that is finished. That's the reason the best practice is to not make API calls from the webhook if not absolutely necessary.
Regarding ensuring the input is completely valid before persisting it to etcd, you would require lots of validations, i.e. the services listed in the nodeset should be valid, fixedIPs provided in networks should be from the network allocation ranges, enough BMHs should be available for the nodes in the spec when using preProvisioned: false, etc. Ensuring everything in the webhook is not necessary, as the user can see the error during reconcile and update the CR accordingly.
Anyway, I don't see this debate going anywhere. I would suggest we merge this to complete https://issues.redhat.com/browse/OSPRH-10125. If we decide to ensure that the spec/input is fully valid before admitting requests across operators, we should add all the validations later.
recheck
Anyway, I don't see this debate going anywhere. I would suggest we merge this to complete https://issues.redhat.com/browse/OSPRH-10125. If we decide to ensure that the spec/input is fully valid before admitting requests across operators, we should add all the validations later.
That ought to be for FR2 though, AIUI, so what's the rush to get this in now, as it would end up in FR1 instead? I ask given our retro, where we wanted to be more intentional about the work we are accepting for each release.
I recognize there's a case for committing work if it's done, but in this case it seems like there's some design discussion that is worth having before moving forward especially since the target is FR2.
FTR, I also agree that we should not make API calls from webhooks, as a best practice. I think that's where we should draw the line between webhook validation and reconciliation logic: if the operation/validation needs to make an API call, then it ought to be in the reconcile loop.
We can find multiple examples of it working both ways all over the place, but I don't think that's the point. All logic could be in a webhook, or all logic could be in the reconcile loop, so we ought to have a best practice on where to draw the line. For me, it's making API calls, and relying on the pattern of eventual consistency.
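As a sketch of where that line lands in code, a webhook-side validator on the cheap side of it inspects only the submitted object and never calls the API server; cross-resource checks (like the NetConfig lookup discussed in this thread) belong to the reconcile loop. The spec type and checks below are illustrative stand-ins, not the operator's actual API types:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// nodeSetSpec is a cut-down, hypothetical stand-in for the nodeset spec
// discussed in this thread.
type nodeSetSpec struct {
	Networks []string
}

// validateLocal does only what a webhook can do without touching the API
// server: checks on the submitted object itself. Anything requiring
// another resource is deliberately left out.
func validateLocal(spec nodeSetSpec) error {
	if len(spec.Networks) == 0 {
		return errors.New("at least one network must be defined")
	}
	seen := map[string]bool{}
	for _, n := range spec.Networks {
		key := strings.ToLower(n)
		if seen[key] {
			return fmt.Errorf("duplicate network %q", n)
		}
		seen[key] = true
	}
	return nil
}

func main() {
	fmt.Println(validateLocal(nodeSetSpec{Networks: []string{"net1", "net2"}})) // <nil>
	fmt.Println(validateLocal(nodeSetSpec{}))                                   // error: no networks
}
```

Checks like these stay synchronous and cheap, while the ctlplane/serviceNet validation, which needs the NetConfig CR, runs in the reconcile loop and surfaces through the NodeSetIPReservation condition.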
so what's the rush to get this in now as it would end up in FR1 instead?
There is no rush, other than the fact that AFAIR this has been discussed earlier with the same conclusion/agreement, and I wanted to close https://issues.redhat.com/browse/OSPRH-10125 before my upcoming PTO.
I ask, given our retro where we wanted to be more intentional about the work we are accepting for each release.
Right, isn't that for planning? Does it also mean we can't accept and work on features that are planned for future releases, and even merge part or all of them if completed? I agree that this is linked to the https://issues.redhat.com/browse/OSPRH-821 feature for FR2, but custom network names were supported with OSP17.1 (many customers use them) and this is possibly a gap in OSP18 that I wanted to close in FR1.
But it's OK if we want further discussion to arrive at an agreement.
We now support custom network names. The ctlplane network name, or the network for which serviceNet is ctlplane, can be anywhere in the order and it won't matter.
Jira: https://issues.redhat.com/browse/OSPRH-10125