scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

Cover the optimization of multiple containers running on the same Node with an e2e scenario #1149

Open · zimnx opened this issue 1 year ago

zimnx commented 1 year ago

We lack e2e coverage for the scenario where the Operator optimizes a Node that has multiple optimizable Scylla Pods. To prevent further regressions (#1148), we should add an e2e test for the following case (a test skeleton sketch follows the list):

  1. Create a NodeConfig to optimize Nodes, and pick one Node where it's running
  2. Create a ScyllaCluster able to get Guaranteed QoS and land on the chosen Node
  3. Observe the tuning job for its containers being created
  4. Create a second ScyllaCluster able to get Guaranteed QoS, with at least one Pod landing on the chosen Node
  5. Verify that the previously observed tuning job is updated with a different template
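
A rough Ginkgo-style skeleton of such a test is sketched below. The helper functions are hypothetical placeholders (not the operator's actual e2e framework API) and would need to be backed by the existing test utilities; only the overall flow is meant to be illustrative.

```go
package e2e

import (
	"context"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// Hypothetical helpers standing in for the real e2e framework utilities
// (client setup, fixtures, waiting helpers); they are placeholders only.
func createNodeConfigAndPickNode(ctx context.Context) string { panic("placeholder") }

func createGuaranteedScyllaClusterOnNode(ctx context.Context, node string) { panic("placeholder") }

func getTuningJobForNode(ctx context.Context, node string) *batchv1.Job { panic("placeholder") }

var _ = Describe("Node tuning with multiple ScyllaClusters on one Node", func() {
	It("updates the per-container tuning Job when a second optimizable Pod lands on the Node", func(ctx context.Context) {
		// 1. Create a NodeConfig that enables tuning and pick one Node it runs on.
		node := createNodeConfigAndPickNode(ctx)

		// 2. Create a ScyllaCluster with Guaranteed QoS pinned to the chosen Node.
		createGuaranteedScyllaClusterOnNode(ctx, node)

		// 3. Observe the per-container tuning Job being created for that Node.
		Eventually(func() *batchv1.Job { return getTuningJobForNode(ctx, node) }).ShouldNot(BeNil())
		originalTemplate := getTuningJobForNode(ctx, node).Spec.Template.DeepCopy()

		// 4. Create a second Guaranteed ScyllaCluster with at least one Pod on the same Node.
		createGuaranteedScyllaClusterOnNode(ctx, node)

		// 5. The previously observed tuning Job should eventually get a different Pod template,
		//    because the set of exclusively allocated CPUs on the Node has changed.
		Eventually(func() bool {
			job := getTuningJobForNode(ctx, node)
			return !equality.Semantic.DeepEqual(&job.Spec.Template, originalTemplate)
		}).Should(BeTrue())
	})
})
```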
mykaul commented 1 year ago

Shouldn't we optimize a node once?

mykaul commented 1 year ago

> Shouldn't we optimize a node once?

Or am I confusing K8S node and Scylla node?!

zimnx commented 1 year ago

We have two jobs: one tunes common node-level stuff, like the clock source; the second one tunes stuff that depends on the container resource allocation.

In Kube we request resource quantities, but we don't have any control over, for example, which CPUs will be assigned to a Pod. This is decided by the kubelet when the container is started, so we observe those assignments and calculate what needs to be optimized. For example, we pin network IRQs to the CPUs not exclusively assigned to Scylla containers running on a particular Node. So the job is updated every time a new optimizable Scylla Pod lands on the Node.
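
To make the mechanism concrete (an illustration, not the operator's actual code): the IRQ-eligible set can be thought of as all online CPUs minus the CPUs exclusively assigned to Scylla containers, so it shrinks whenever another Guaranteed Scylla Pod lands on the Node. A minimal sketch using the k8s.io/utils/cpuset package, with made-up inputs:

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// Made-up example data: the CPUs online on the Node and the exclusive
	// cpusets the kubelet assigned to two Guaranteed Scylla containers.
	online := cpuset.New(0, 1, 2, 3, 4, 5, 6, 7)
	scylla1, _ := cpuset.Parse("2-3")
	scylla2, _ := cpuset.Parse("4-5") // a second optimizable Pod landed on the same Node

	// Network IRQs are pinned to CPUs that are NOT exclusively assigned to
	// Scylla containers. This set shrinks whenever a new optimizable Pod lands
	// on the Node, which is why the tuning Job has to be updated.
	exclusive := scylla1.Union(scylla2)
	irqCPUs := online.Difference(exclusive)

	fmt.Println("exclusive Scylla CPUs:", exclusive.String()) // 2-5
	fmt.Println("IRQ-eligible CPUs:    ", irqCPUs.String())   // 0-1,6-7
}
```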

mykaul commented 1 year ago

> We have two jobs: one tunes common node-level stuff, like the clock source; the second one tunes stuff that depends on the container resource allocation.
>
> In Kube we request resource quantities, but we don't have any control over, for example, which CPUs will be assigned to a Pod. This is decided by the kubelet when the Pod is started, so we observe those assignments and calculate what needs to be optimized. Say we pin network IRQs to the CPUs not exclusively assigned to Scylla containers running on a particular Node. So the job is updated every time a new optimizable Scylla Pod lands on the Node.

Why don't you have control over it? Isn't that what cpuset is supposed to do? Depending on the K8S node size (number of cores), you probably want to 'sacrifice' the first 2-4 cores on the 1st NUMA node and use them for network traffic. Consult with @vladzcloudius, who's the master of this (for example, if your Scylla pod is on the 2nd NUMA node, etc. - unsure what needs to happen there).

zimnx commented 1 year ago

> Why don't you have control over it?

Resource allocation is realized by cgroups in Kube, and it's managed by the kubelet, not by us. As I said, in Kube you request a resource quantity, not a specific entity. If you want 1 CPU, you will get one, but there's no API to say which one you want at runtime.

What you can do is choose the policy used for scheduling:
https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy
https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-options
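
For reference, this is what "Guaranteed QoS with an integer CPU count" looks like in a container spec; only then can the kubelet's static CPU manager hand the container exclusive CPUs, and even then the spec only names a quantity, not specific CPUs. A sketch using client-go types (the values are examples):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// guaranteedScyllaResources returns requests equal to limits with an integer
// CPU count, so the Pod gets Guaranteed QoS and the kubelet's static CPU
// manager can assign the container exclusive CPUs. Note that the spec only
// says "4 CPUs"; which physical CPUs are used is decided by the kubelet when
// the container starts.
func guaranteedScyllaResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("4"),
			corev1.ResourceMemory: resource.MustParse("16Gi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("4"),
			corev1.ResourceMemory: resource.MustParse("16Gi"),
		},
	}
}
```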

mykaul commented 1 year ago

> Resource allocation is realized by cgroups in Kube, and it's managed by the kubelet, not by us. As I said, in Kube you request a resource quantity, not a specific entity. If you want 1 CPU, you will get one, but there's no API to say which one you want at runtime.
>
> What you can do is choose the policy used for scheduling: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-options

To the best of my knowledge, you can have more granular scheduling. https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/ and https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/ and another one I can't find right now allow you to achieve that (perhaps EKS doesn't yet have all of them).
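
For concreteness, here is a sketch of the kubelet settings those pages describe: the static CPU manager policy, a reserved CPU set kept away from Guaranteed Pods, and the topology manager pinning allocations to a single NUMA node. The concrete values ("0-3", "single-numa-node") are example choices, not recommendations from this thread; the snippet just renders a KubeletConfiguration using the published k8s.io/kubelet config types.

```go
package main

import (
	"fmt"

	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Example kubelet configuration: the static CPU manager gives Guaranteed
	// Pods exclusive CPUs, reservedSystemCPUs keeps a few CPUs for
	// housekeeping/IRQ work, and the topology manager keeps allocations on a
	// single NUMA node. Values like "0-3" are illustrative only.
	cfg := kubeletconfigv1beta1.KubeletConfiguration{
		CPUManagerPolicy:      "static",
		ReservedSystemCPUs:    "0-3",
		TopologyManagerPolicy: "single-numa-node",
	}
	cfg.APIVersion = "kubelet.config.k8s.io/v1beta1"
	cfg.Kind = "KubeletConfiguration"

	out, err := yaml.Marshal(&cfg)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```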

vladzcloudius commented 1 year ago

@zimnx @mykaul If you plan to run multiple (small) Pods on the same machine (I assume this is the case), then in the long run I would recommend allocating IRQ cores in advance, as if you were running a single big Scylla on that node. The modern perftune.py (from master, plus the "perftune.py: auto-select the same number of IRQ cores on each NUMA" patch I sent today) can be used for that.

perftune.py allows you to get both the IRQ CPU mask and the compute CPU mask. I would also recommend using perftune.py for the node tweaking.

So... Once you get the compute CPU mask, you can restrict K8S to use only these CPUs for assignment to your Pods.
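
As a rough sketch of that last step (an illustration, not code from this repo): perftune.py prints CPU masks as hex strings, so turning the compute mask into CPU lists is mostly a bit-twiddling exercise, and the complement could then be fed to something like the kubelet's reservedSystemCPUs (see the kubelet sketch above) to keep Guaranteed Pods on the compute CPUs. The exact perftune.py flag and mask format below are assumptions to be checked against the script version being shipped.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"

	"k8s.io/utils/cpuset"
)

// cpusFromHexMask converts a hex CPU mask such as "0xfc" into a CPU set.
// Caveat: on large machines perftune.py may print the mask as comma-separated
// 32-bit groups (e.g. "0xffffffff,0x000000ff"); handling that is left out here.
func cpusFromHexMask(mask string) (cpuset.CPUSet, error) {
	mask = strings.TrimPrefix(strings.TrimSpace(mask), "0x")
	n, ok := new(big.Int).SetString(mask, 16)
	if !ok {
		return cpuset.CPUSet{}, fmt.Errorf("invalid mask %q", mask)
	}
	var cpus []int
	for i := 0; i < n.BitLen(); i++ {
		if n.Bit(i) == 1 {
			cpus = append(cpus, i)
		}
	}
	return cpuset.New(cpus...), nil
}

func main() {
	// Example only: pretend `perftune.py --get-cpu-mask` printed "0xfc"
	// (compute CPUs 2-7) on an 8-CPU node.
	compute, err := cpusFromHexMask("0xfc")
	if err != nil {
		panic(err)
	}
	online := cpuset.New(0, 1, 2, 3, 4, 5, 6, 7)

	// CPUs outside the compute mask are the IRQ/housekeeping CPUs; reserving
	// them (e.g. via reservedSystemCPUs) keeps Guaranteed Pods off them.
	reserved := online.Difference(compute)
	fmt.Println("compute CPUs:      ", compute.String())  // 2-7
	fmt.Println("reservedSystemCPUs:", reserved.String()) // 0-1
}
```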

scylla-operator-bot[bot] commented 4 months ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle stale

scylla-operator-bot[bot] commented 3 months ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle rotten

tnozicka commented 3 months ago

/remove-lifecycle stale
/triage accepted

scylla-operator-bot[bot] commented 2 months ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle stale

scylla-operator-bot[bot] commented 1 month ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle rotten