solo-io / gloo

The Cloud-Native API Gateway and AI Gateway
https://docs.solo.io/
Apache License 2.0

Gloo Edge validation webhook validates all resources regardless of watchNamespaces value #7339

Open asayah opened 2 years ago

asayah commented 2 years ago

Gloo Edge Version

1.13.x (beta)

Kubernetes Version

No response

Describe the bug

The Gloo Edge validation webhook validates all resources regardless of the watchNamespaces value.

Steps to reproduce the bug

Install Gloo Edge passing watchNamespaces, or in single-cluster mode.
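For reference, a minimal sketch of such an install, assuming the chart's settings.watchNamespaces value and purely illustrative namespace names:

```yaml
# values.yaml for a namespaced Gloo Edge install (sketch; namespace names are illustrative)
settings:
  # Only resources in these namespaces should be translated (and, per this issue, validated)
  watchNamespaces:
    - gloo-system
    - apps-east

# After installing with these values, create a Gloo resource (e.g. a VirtualService)
# in a namespace outside the list, such as apps-west: the validation webhook still
# receives and validates the request.
```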

Expected Behavior

The Gloo Edge validation webhook should validate only the resources in namespaces in the watchNamespaces list.

Additional Context

In scope: configuring the webhook during install.
Out of scope: updating the webhook post-install.

bdecoste commented 1 year ago

The webhook is installed pointing to a single gloo pod, but when there are multiple control planes deployed in a single cluster the webhook only makes the validation call to a single, and possibly incorrect, gloo pod. We need to support multiple, namespaced webhooks when there are multiple, namespaced control planes in a single cluster.

SantoDE commented 1 year ago

The first iteration is fixing the bug Adam raised. I'm splitting your comment out into a separate issue, @bdecoste.

bewebi commented 1 year ago

Based on what @bdecoste described, it seems like it might be preferable to have the ValidatingWebhookConfiguration for each installation select only the watch namespaces it's responsible for, using a namespace selector. This will also ensure that if there are any other discrepancies between the installations (e.g. if they have different versions), validation is done appropriately for each installation.
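As a rough sketch of that direction (not the chart's actual output; the names, path, and rules below are placeholders, and the selector uses the built-in kubernetes.io/metadata.name namespace label):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gloo-gateway-validation-webhook-east   # placeholder name
webhooks:
  - name: gloo.east.svc                        # placeholder name
    clientConfig:
      service:
        name: gloo
        namespace: east                        # this installation's namespace
        path: /validation
    # Only send admission requests for resources in this installation's watch namespaces
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: In
          values: ["east"]                     # this installation's watchNamespaces
    rules:
      - apiGroups: ["gateway.solo.io"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE", "DELETE"]
        resources: ["virtualservices", "routetables"]
    sideEffects: None
    admissionReviewVersions: ["v1"]
    failurePolicy: Ignore
```

With a selector like this, the API server only calls each installation's webhook for resources in that installation's watch namespaces, so other installations (and other versions) never see the request.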

This would be a new Helm feature; I'll need to investigate further to get a better sense of how exactly we expose it, but it should be relatively simple.

On the other hand, I'm not sure whether the ask that a particular Gloo pod validate resources for namespaces it's not watching is even possible, but if it is, I highly doubt it's desirable.

I may need to dive deeper on the details, but in a nutshell, validation works by applying the new/updated resource to the controller's existing snapshot and validating that the resulting snapshot is still valid.

Therefore, currently, in the case where the new resource is not watched by the controller, it will trivially be valid, as the new/updated resource is absent from both the previous and new snapshots. This leads to the scenario Bill described.

On the other hand, accounting for all namespaces in a pod that only watches certain namespaces would allow resources that aren't actually valid. Consider a scenario where we have two namespaces, East and West, each with a Gloo pod watching only that namespace. If we have a VH in the East namespace that delegates to a Route in the West namespace, this is not a valid resource for either controller; however, it will be accepted if validation takes all namespaces into account.
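For concreteness, a sketch of that East/West case using the gateway.solo.io/v1 kinds (names, domains, and actions are illustrative):

```yaml
# VirtualService in "east", watched only by the east Gloo pod
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: east-vs
  namespace: east
spec:
  virtualHost:
    domains: ["east.example.com"]
    routes:
      - matchers:
          - prefix: /
        # Delegates to a route table in a namespace the east pod does not watch
        delegateAction:
          ref:
            name: west-routes
            namespace: west
---
# RouteTable in "west", watched only by the west Gloo pod
apiVersion: gateway.solo.io/v1
kind: RouteTable
metadata:
  name: west-routes
  namespace: west
spec:
  routes:
    - matchers:
        - prefix: /
      directResponseAction:
        status: 200
```

Neither pod's snapshot contains both halves of the delegation chain, so neither can actually serve it, yet validation that considers all namespaces would resolve the reference and accept the resources.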

I'm happy to discuss offline if further clarification is needed or if I'm missing something important about the ask

asayah commented 1 year ago

What is needed here is for a Gloo install to validate only resources in namespaces that are defined in the watchNamespaces value.

On the other hand, I'm not sure if the ask that a particular Gloo pod validate resources for namespaces it's not watching is even possible, but if it is I highly doubt it's desirable

This is out of scope; we don't need this.

bewebi commented 1 year ago

I've got a promising POC underway for a solution, but it relies on adding a new job to label watched namespaces on Helm install so that the ValidatingWebhookConfiguration can select namespaces based on those labels.
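Roughly, the shape of the POC (the label key and values below are hypothetical, not the chart's actual names):

```yaml
# Namespace as labeled by the hypothetical install-time job
apiVersion: v1
kind: Namespace
metadata:
  name: apps-east
  labels:
    # hypothetical label key/value written by the job for the "gloo-east" install
    gloo.example.io/watched-by: gloo-east
```

Each installation's ValidatingWebhookConfiguration would then carry a namespaceSelector with matchLabels on its own label value (instead of the namespace-name selector in the earlier sketch), so only labeled namespaces are sent to that installation for validation.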

I wanted to follow up on a couple of pieces:

While the given solution should work, it is a bit clunky, given that I'm aware some customers have had issues with running our existing jobs. A beta feature in k8s 1.28 will allow us to solve the problem much more simply, without needing to add labels and therefore without any additional jobs (a sketch of that direction follows at the end of this comment). Is it acceptable to introduce this fix later in the 1.16 cycle, require k8s 1.28 in order to use it, and therefore not backport to earlier versions?

I also wanted to clarify under what circumstances this bug causes issues with functionality, as opposed to just performance. It seems like, as it stands, all Gloo services should receive validation requests on resource modification, and for those services which do not watch the namespace the resource is in, validation will be a no-op. This is what I've observed in attempting to reproduce the scenario locally. Have we observed scenarios where not every Gloo service gets a validation request (and it therefore results in an incorrect validation decision)? If so, can additional details be shared in order to be able to reproduce?
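For what it's worth, my assumption is that the k8s 1.28 beta feature referred to above is matchConditions on admission webhooks, which filters requests with a CEL expression before they reach the webhook; if so, a fragment of the webhook entry might look like this (namespace list illustrative):

```yaml
# Fragment of a ValidatingWebhookConfiguration webhook entry; other fields as in
# the earlier sketch. matchConditions is beta in Kubernetes 1.28.
webhooks:
  - name: gloo.gloo-system.svc               # placeholder
    matchConditions:
      - name: only-watched-namespaces
        # request.namespace is the namespace of the resource under review
        expression: "request.namespace in ['gloo-system', 'apps-east']"
```

This would avoid both the namespace labels and the extra job, since the restriction is expressed directly against the request.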

asayah commented 1 year ago
bewebi commented 1 year ago

We do have logic (here) to short-circuit and skip the overwhelming majority of processing that occurs when a validation request is received by a Gloo pod for a resource that is not in a watched namespace. On my local KinD cluster there was about a 3ms response time on a validation request for an unwatched namespace, compared to 2-3s for a resource in a watched namespace.

With that in mind, I'm not sure how much performance benefit there is to be gained by preventing the request entirely.

I'm also not clear on the "blocker" scenario described there. If there are multiple VSs with the same host (i.e. a domain collision) but in different namespaces not watched by the same Gloo deployments, I don't think there should be any issue/conflict.

If there is a scenario like that, can we reproduce it?

asayah commented 1 year ago

I think you are right; domain collision is not a problem here, since I remember that the customer doesn't reuse the same domain in each namespace. This use case was discussed last year with the customer, so it may not be an issue anymore. Here is an example to test:

Do the same test as above, but this time tweak the 3 created ValidatingWebhookConfigurations to select only gloo1, gloo2, and gloo3 respectively.

bewebi commented 1 year ago

I've run a rudimentary, highly unscientific experiment. tl;dr: it does seem that there is a significant performance reduction caused by validating webhooks watching all namespaces.

I started with a cluster with three Gloo installs in separate instances and 1000 VSs in each. I used log activity as a rough timing mechanism and found the following:

Time to process a new VS for a deployment...

Additionally, deployments with validation not namespace-limited appeared to do ~1s of processing when a VS was added in another namespace, with that processing occurring around the time processing finished on the deployment the VS was added to.

Per discussion in Slack, we will proceed with a solution that works on k8s versions prior to 1.28 and follow up with the cleaner solution for 1.28.

The pre-k8s 1.28 solution will be opt-in. The k8s 1.28+ solution will always be enabled. The Helm chart will detect the k8s version; when k8s 1.28+ is detected it will apply the k8s 1.28+ solution and ignore any settings related to the <1.28 solution.
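A sketch of what the chart-side detection could look like, using Helm's built-in .Capabilities.KubeVersion; the gateway.validation.namespacedValidation flag and the label key below are hypothetical, not the chart's actual names:

```yaml
{{- /* Fragment of a webhook template choosing between the two approaches (sketch). */}}
{{- if semverCompare ">= 1.28.0-0" .Capabilities.KubeVersion.Version }}
    # k8s 1.28+: always restrict requests with a CEL match condition
    matchConditions:
      - name: only-watched-namespaces
        expression: {{ printf "request.namespace in %s" (toJson .Values.settings.watchNamespaces) | quote }}
{{- else if .Values.gateway.validation.namespacedValidation }}
    # pre-1.28, opt-in: fall back to a label-based namespaceSelector
    namespaceSelector:
      matchLabels:
        gloo.example.io/watched-by: {{ .Release.Name }}
{{- end }}
```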

sam-heilbron commented 1 year ago

Added release/1.15 label per: https://solo-io-corp.slack.com/archives/C0289JZGV8R/p1693419395706599

bewebi commented 1 year ago

After further exploration and discussion with the team, I believe the "time to process" findings from yesterday were misleading, and possibly incorrect.

We're finding that, in all cases under the given test circumstances, validation requests occur up to 20-40s after the new resource is applied and are made by any validating webhooks selecting for the namespace of the resource (including those which select all namespaces, as is currently the case in all Gloo releases).

What is more significant is that memory usage does not appear to be impacted under our test circumstances.

Here is a screenshot of a profile taken on a pod handling a validation request for a resource in its watched namespace, when the webhook makes requests for changes on all namespaces: [Screenshot 2023-08-30 at 4.36.02 PM]

Here is a screenshot of a profile taken on a pod handling a validation request for a resource in its watched namespace, when the webhook watches only the watched namespace: [Screenshot 2023-08-30 at 4.36.35 PM]

While there is a difference, the processing time is of the same order of magnitude and shows that, despite validation requests being logged 20-40s after the initial resource creation, that is not reflective of time spent processing in the validation handler.

Also consider a pod for a deployment not watching the namespace a resource is applied in. Here is a screenshot of a profile taken on a pod handling a validation request for a resource not in its watched namespace, when the webhook makes requests for changes on all namespaces: [Screenshot 2023-08-30 at 4.37.03 PM]

And finally, here is a screenshot of a pod not handling a validation request when a resource is applied in an unwatched namespace and the webhook makes requests for changes only on watched namespaces: [Screenshot 2023-08-30 at 4.36.52 PM]

The 0.05s processing time is non-zero, but quite small, and unlikely to be a performance issue under our test circumstances.

It is conceivable that there are circumstances we haven't tested where performance would be more of a concern, with increased CPU load and/or increased latency on resource validation. We will do some experimentation to try to identify whether we can create such a scenario, and @sam-heilbron has also reached out to the customer experiencing pain around this to better understand what they are experiencing and under what circumstances.

We will determine the appropriate solution for Gloo 1.15 and pre-k8s 1.28 installations of Gloo 1.16 based on the response and our findings.

Regardless, we continue to intend to fix this using the previously mentioned k8s 1.28 beta feature for Gloo installations that use k8s 1.28+.

bewebi commented 1 year ago

After customer call it was determined that more research was required on the customer end to determine the performance impacts of this issue as distinct from validation performance overall. We will determine a path forward depending on those findings.

Until then this issue will be moved to the backlog and reassigned to @SantoDE on the product side.

soloio-bot commented 11 months ago

Zendesk ticket #3011 has been linked to this issue.

github-actions[bot] commented 5 months ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.