Unify the use of taints and tolerations in deploy, docs, examples and tests

tnozicka commented 4 months ago

In general, and especially with the BestEffort QoS class or performance testing, it's good practice to separate workloads onto dedicated nodes. In Kubernetes this is done through Taints and Tolerations, which we use in a few places but lack in others. The ones we have don't currently use a domain name prefix, nor are they consistently defined across the code base.

The good thing about tolerations is that, with well-defined naming, they take effect only if the user chooses to taint their nodes accordingly; otherwise they stay benign.

Generally, the toleration (matching a taint with the same key, value and effect) could look like this:

key: "scylla-operator.scylladb.com/dedicated"
operator: "Equal"
value: "infra"
effect: "NoExecute"
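
The snippet above carries an `operator` field, which in Kubernetes belongs to a toleration; a taint itself has only a key, a value and an effect. A minimal sketch of how the pair could fit together is shown below. The node name, pod name and container image are hypothetical and only there to make the fragments complete.

```yaml
# Fragment of a Node object as it would look once tainted
# (the node name is hypothetical).
apiVersion: v1
kind: Node
metadata:
  name: infra-node-1
spec:
  taints:
  - key: scylla-operator.scylladb.com/dedicated
    value: infra
    effect: NoExecute
---
# A pod that tolerates the taint above (name and image are hypothetical).
# Pods without this toleration are not scheduled onto the tainted node,
# and with NoExecute already-running ones are evicted from it.
apiVersion: v1
kind: Pod
metadata:
  name: example-infra-pod
spec:
  tolerations:
  - key: scylla-operator.scylladb.com/dedicated
    operator: Equal
    value: infra
    effect: NoExecute
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```

If no node carries the taint, the toleration has no effect and the pod schedules as usual, which is the benign behaviour described earlier.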

There are roughly these classes:

operator infra (operator, webhooks, cert-manager, ...)
manager shared infra (scylla-manager)
tuning infra (scylla-operator-node-tuning)
monitoring (ScyllaDBMonitoring)
regular workloads
scylladb workloads (ScyllaCluster, ScyllaDBDatacenter) (sensitive, resource intensive, with special disk requirements)

The issue is that on real clusters some of these classes need to be merged, and there is no single set of classes that fits every deployment, which makes it hard to define generic tolerations. In one case tuning infra, monitoring and the operator land on the same nodes, while in another the tolerations split them (for example, when only some of them need local-SSD-based storage). So the only rule that could roughly be applied everywhere is to have one default toleration for the scylladb workload; everything else is repelled by the taints and lands anywhere else, until a toleration is explicitly added. But it also means we have to adjust/extend the storage provisioner settings to configure the non-scylladb nodes as well.
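
As a rough illustration of the "one default toleration for the scylladb workload" rule, the fragment below sketches a ScyllaCluster rack that tolerates a dedicated-node taint. It assumes the rack-level `placement.tolerations` field of the ScyllaCluster API; the taint value `scyllaclusters`, the cluster name and the datacenter and rack names are hypothetical, and required fields such as version, members, storage and resources are omitted for brevity.

```yaml
# Sketch only: not a complete ScyllaCluster manifest.
apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: example            # hypothetical name
spec:
  datacenter:
    name: dc1              # hypothetical name
    racks:
    - name: rack1          # hypothetical name
      placement:
        tolerations:
        - key: scylla-operator.scylladb.com/dedicated
          operator: Equal
          value: scyllaclusters   # hypothetical taint value
          effect: NoExecute
```

Workloads from the other classes carry no such toleration by default, so nodes tainted this way would repel them until a matching toleration is added explicitly.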

rzetelskik commented 3 months ago

> it also means we have to adjust/extend the storage provisioner settings to configure the non-scylladb nodes as well.

Should https://github.com/scylladb/scylla-operator/pull/2046 be extended to handle this in general case, or should I take care of this as part of this issue?

For generic/gke/eks examples that means having a separate NodeConfig manifest for nodes not dedicated for ScyllaDB clusters, most likely creating a loop device since we don't expect having local SSDs there?

tnozicka commented 3 months ago

> Should https://github.com/scylladb/scylla-operator/pull/2046 be extended to handle this in general case, or should I take care of this as part of this issue?

Addressing it separately seems cleaner and doesn't bundle their fate.

> For generic/gke/eks examples that means having a separate NodeConfig manifest for nodes not dedicated for ScyllaDB clusters, most likely creating a loop device since we don't expect having local SSDs there?

Yep, which will probably need a second csi-driver instance and a different storage class, so that the two storage types don't get mixed.
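
To make the "different storage class" idea concrete, here is a minimal sketch of two StorageClass objects, one per storage type. Everything specific in it is an assumption: both class names are hypothetical, `local.csi.scylladb.com` is assumed to be the provisioner name of the existing local CSI driver, and `local-loop.csi.example.com` is a made-up provisioner name standing in for the second driver instance.

```yaml
# Storage class for ScyllaDB nodes backed by local SSDs
# (class name hypothetical, provisioner name assumed).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scylladb-local-xfs
provisioner: local.csi.scylladb.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: xfs
---
# Separate storage class for non-ScyllaDB nodes backed by a loop device
# (class name and provisioner name are hypothetical).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: infra-loop-xfs
provisioner: local-loop.csi.example.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: xfs
```

Keeping the provisioner names distinct is what would stop a volume claim for one storage type from being served by the driver instance that manages the other.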

scylladb / scylla-operator

Unify the use of taints and tolerations in deploy, docs, examples and tests #2049