Validate telemetry configurations + provide safety

sync-by-unito[bot] commented 2 years ago

When the user has requested ServiceMonitor deployment via the k8ssandra CRs, but Prometheus operator (and therefore the ServiceMonitor CRs) has not been installed, the requested resources should not be created and the user should be informed of the error.

Ideally, the validation should be implemented in a validating webhook.

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: K8OP-107

jsanda commented 2 years ago

I'm kinda torn on whether we should prevent creating objects if the ServiceMonitor CRD is not installed. It won't cause an invalid K8ssandraCluster. If the configuration would result in some sort of broken state, then I would definitely favor the validation.

Miles-Garnsey commented 2 years ago

If a user is requesting a state that cannot be delivered due to pre-requisites not being fulfilled, I think we should be rejecting the request as a general principle.

Imagine a situation where you've bootstrapped a 1000 node cluster only to discover that you configured something wrong and need to do a full rolling restart to rectify it. That might be a bit hypothetical and may not fully apply in this case, but I still think failing fast is the best user experience.

I've certainly had trouble as a user when trying to configure things in CassandraDatacenters due to the fact that unknown fields are allowed in that CR. It slows down your debugging loop when there's no webhook to reject invalid requests and you have to wait for Cassandra logging to come up to discover what you've done wrong (which can take 5-10 minutes).

jsanda commented 2 years ago

I was briefly looking at some of the discussion in #482, and it had me wondering again whether this is a good use case for the validating webhook. After further consideration I do not think checking for the ServiceMonitor CRDs across remote clusters is a good use case for the validating webhook for multiple reasons.

First, latency is a big concern. Here is what the official Kubernetes docs have say:

It is recommended that admission webhooks should evaluate as quickly as possible (typically in milliseconds), since they add to API request latency. It is encouraged to use a small timeout for webhooks. See Timeouts for more detail.

We will only be as fast as our slowest connection. Also keep in mind that there could be other webhooks that get invoked as part of the client request.

Secondly, Kubernetes is an eventually consistent system. Controllers continually try to make the desired state match the actual state, right? Is the scenario of the ServiceMonitor CRD missing something that could eventually be resolved? The answer of course is yes. Contrast that with something that could never be considered valid. For example, a negative JVM heap size will always be invalid.

Next, I question whether it is a good practice to perform checks agains objects other than the one which the client is trying to create/update. Suppose my automation runs kubectl create -f /path/to/manifests where the manifests dir includes a K8ssandraCluster manifest as well as the ServiceMonitor CRD manifest. Furthermore, suppose that the K8ssandraCluster manifest is processed first. The request will get rejected, and the ServiceMontior CRD won't get installed. This is not eventually consistent.

For reference here are a couple relevant threads on the Kubernetes slack:

Lastly, I am now of the opinion that we should remove the k8sContexts checks in the webhook and consider how to handle that differently. Maybe a combo of events and status condition. @burmanm wdty?

Miles-Garnsey commented 2 years ago

We will only be as fast as our slowest connection. Also keep in mind that there could be other webhooks that get invoked as part of the client request.

Are webhooks called serially? I assume validating webhooks are called in parallel. In either event, I don't really see the relevance? We don't control "other webhooks".

since they add to API request latency.

My understanding is that latency will only be added to the K8ssandraCluster type. I don't see this as a problem as I do not envisage thousands on K8ssandra clusters running against a single control plane, this isn't like the Pod type where there might be thousands running. I also don't envisage that the K8ssandraCluster is going to be mutated that frequently.

Is the scenario of the ServiceMonitor CRD missing something that could eventually be resolved?

Only via user intervention to install the Prometheus operator. Given how critical monitoring is, it is important that users are notified that things aren't working so that they can undertake that intervention.

It is also important to note that - if Prometheus is not installed - the reconciliation will fall into a loop waiting for it to be ready, which - as well as preventing deployment of Reaper and Stargate - also means that further changes cannot be made to the K8ssandraCluster object.

Next, I question whether it is a good practice to perform checks agains objects other than the one which the client is trying to create/update. Suppose my automation runs kubectl create -f /path/to/manifests where the manifests dir includes a K8ssandraCluster manifest as well as the ServiceMonitor CRD manifest. Furthermore, suppose that the K8ssandraCluster manifest is processed first. The request will get rejected, and the ServiceMontior CRD won't get installed. This is not eventually consistent.

Kustomize has explicitly dealt with this and applies manifests in the order in which they are referenced in the kustomize.yaml. What you're talking about is whether operations are commutative, which has nothing to do with whether a system is eventually consistent. (And Kubernetes' data store, etcd, is actually strongly consistent??)

There shouldn't be any fuzziness regarding what order manifests are applied in, and it is typical that the order needs to be fixed (e.g. you can't install cert-manager after k8ssandra-operator, it must be installed beforehand).

Given the criticality of monitoring and the ease with which users might miss deploying Prometheus on a cluster if they have many DCs, I'm convinced that webhooks should be used to validate prometheus installation status if it is at all feasible to do so within reasonable latency budgets. I can't see that you've presented any evidence indicating that this is not possible.

Miles-Garnsey commented 2 years ago

Also, it is worth noting that ping times between e.g. Dublin -> Adelaide are 280ms and this is one of the longest hops you're ever likely to see.

I do not think it likely that (in a healthy cluster whose API server is responding quickly and is not overburdened) this functionality will take webhook latencies above 1s, which you've referenced as your criteria for acceptance.

The reality is that we are a multi-cluster project, and doing anything useful is going to incur more latency than we'd normally expect in a single-cluster scenario.

In saying that, I'm open to making the API server calls concurrent, although this obviously adds complexity.

burmanm commented 2 years ago

Only via user intervention to install the Prometheus operator. Given how critical monitoring is, it is important that users are notified that things aren't working so that they can undertake that intervention.

If the user doesn't notice they don't have Prometheus (or similar) installed through other means than K8ssandraCluster unable to being created, that monitoring is not important for them. They're not actually worried in that case if the metrics are deployed at all as those are not going to be scraped or used.

This isn't Prometheus operator. The operational capability of Prometheus should be done by the prometheus-operator. Just like we don't monitor that metrics-server is installed or kube-state-metrics, even though they have operational significance to how Kubernetes can be monitored or how functions like HPA work. Or node monitoring.

burmanm commented 2 years ago

Kustomize has explicitly dealt with this and applies manifests in the order in which they are referenced in the kustomize.yaml. What you're talking about is whether operations are commutative, which has nothing to do with whether a system is eventually consistent. (And Kubernetes' data store, etcd, is actually strongly consistent??)

Kustomize doesn't maintain state, so comparing it to a operator sounds odd.

Kubernetes' data-storage and Kubernetes API server consistency are two different things. While etcd itself is consistent to serial writing (not necessarily parallel reads as it does have multiple linearization choices), the processing pipeline of Kubernetes is not guaranteed to be serially consistent. Thus, the read request can serve timestamp wise out-of-date data, as the read could be answered before the write was even processed or acked. This is just the reality of parallel nodes / parallel processing.

This is visible in the semantics definition of Kubernetes API also. Our controller then adds another layer of caching to avoid even calling the API server to make it even "more eventually" consistent when it comes to reads.

burmanm commented 2 years ago

Lastly, I am now of the opinion that we should remove the k8sContexts checks in the webhook and consider how to handle that differently. Maybe a combo of events and status condition. @burmanm wdty?

Depends how we view the "client". It's true that user could be writing ClientConfig on parallel and then the deployment of K8ssandraCluster fails to missing K8sContext, however on the other hand, we do not actually verify that the ClientConfigs are real in sense of what's stored in the Kubernetes but how k8ssandra-operator sees them.

I do agree that It is problematic if the ClientConfig must be connected before we allow creating the K8ssandraCluster (and this is sort of true at the moment as the ClientConfig can't be read if it can't be connected to - but we only verify this when ClientConfig is generated). There could be ClientConfig webhook API though which could verify this, but at the moment our handling is pretty crude and fragile when it comes to ClientConfigs in general.

burmanm commented 2 years ago

owever on the other hand, we do not actually verify that the ClientConfigs are real in sense of what's stored in the Kubernetes but how k8ssandra-operator sees them

Similar to how we treat Cassandra versions also. One can't deploy Cassandra 4.1 or 3.12 unless one lies in the version and overrides with a image. The first version is probably something we should allow in our regexp now that I think about it..

Miles-Garnsey commented 2 years ago

If the user doesn't notice they don't have Prometheus (or similar) installed through other means than K8ssandraCluster unable to being created, that monitoring is not important for them. They're not actually worried in that case if the metrics are deployed at all as those are not going to be scraped or used.

If users have specified telemetry in the manifest then they want telemetry enabled. Your approach here isn't what I'd describe as user friendly.

It would be easy to overlook missing metrics from one datacenter if you have 20+, in that case the cluster is in a very dangerous state and cannot be monitored or even reconciled properly.

This isn't Prometheus operator. The operational capability of Prometheus should be done by the prometheus-operator. Just like we don't monitor that metrics-server is installed or kube-state-metrics, even though they have operational significance to how Kubernetes can be monitored or how functions like HPA work. Or node monitoring.

I do not see the relevance. We are not managing Prometheus resources, we are checking that a dependancy is installed. If operators can't even validate that pre-requisites are installed on a cluster I'm not sure what use they are.

Kustomize doesn't maintain state, so comparing it to a operator sounds odd.

The context for the comment related to about applying manifests in an arbitrary order. The Kubernetes ecosystem does not allow for arbitrarily ordered manifest application in general. We aren't comparing Kustomize to an operator.

Can you folks please take conversations about validation of ClientConfigs to a different ticket. This ticket is about validating telemetry configurations.

Miles-Garnsey commented 2 years ago

I've discussed this with John in a bit of depth over the past couple of days. One path we took was looking at which operators in the community do deep vs shallow validation of incoming resources;

Prometheus operator: looks like shallow validation only, no calls to the API server from inside the webhook.
Banzai Koperator (Kafka operator): does deep validation to confirm whether Kafka clusters exist on the API server.
Kyverno: a lot of the security policies query the API server to determine what roles a subject has, others look at whether a NetworkPolicy is in effect for an incoming pod.

It seems that there are cases where webhooks call back to the API server to do deeper validations. Given that there is still some concern about whether this is good practice we are going to consult internally and place this on hold pending those discussions.

adejanovski commented 2 years ago

It seems like we've hit a blocker as we cannot get to a consensus on this PR. I like the idea that we're doing this check in a webhook as it will clearly not be obvious for everyone that the prometheus operator needs to be installed, but the outstanding question was if status updates would trigger reconciles. It seems like we ran into this recently in the MedusaBackupSchedule work, and @burmanm made it so that the reconciliation wouldn't trigger on status updates. @burmanm, is it something that we could apply here as well? Other question: is it possible to add a flag to disable the webhook in case it creates problems?

jsanda commented 2 years ago

I am a still a 👎 on doing this validation in the webhook.

predicate.GenerationChangedPredicate is used so that writes only to the status subresource do not trigger events; however, this shouldn't be a concern. If we have a validation check in the controller for CRDs being installed and that check fails, the Reconcile method should return an error which will retry the request with a configurable exponential backoff.

Miles-Garnsey commented 2 years ago

I believe the concern was that we were going to end up with a positive feedback loop, where

The webhook would reject a request based on it having read from the API server.
The operator sets a failure status because it can't update the resource.
The setting of the status re-triggers the webhook, which again rejects it.
The operator retries setting the failure status.

3-4 then loop infinitely, and may in fact end up escalating into a retry storm. This might happen anyway, but if we are reading from the API server as part of each retry we get nastier outcome w.r.t. load. (@burmanm can you confirm this was your concern? You expressed it best.)

If we are able to filter out status updates from the webhook's view, then I think that breaks the retry loop outlined above.

@jsanda not sure what you're referring to RE the reconcile method returning an error before retrying - are you just restating your opposition to validating via a webhook and your preferred solution?

k8ssandra / k8ssandra-operator

Validate telemetry configurations + provide safety #235