zimnx opened 1 year ago
Shouldn't we optimize a node once?
Or am I confusing K8S node and Scylla node?!
We have two jobs: one tunes common node stuff, like the clock source; the second one tunes stuff that depends on the container resource allocation.
In Kube we request resource quantities, but we don't have any control over, for example, which CPUs will be assigned to a Pod. This is decided by the kubelet when the container is started, so we observe those assignments and calculate what needs to be optimized. For example, we pin network IRQs to CPUs not exclusively assigned to Scylla containers running on a particular Node. So the job is updated every time a new optimizable Scylla Pod lands on the Node.
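To make that concrete, here is a minimal Go sketch (not the Operator's actual code) of the idea: take the node's CPUs, subtract the CPUs the kubelet ended up assigning exclusively to Scylla containers, and use the remainder as the IRQ-affinity set. The CPU ids are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// cpuSet is a minimal CPU-set representation keyed by CPU id.
type cpuSet map[int]struct{}

// diff returns the CPUs in a that are not in b.
func diff(a, b cpuSet) cpuSet {
	out := cpuSet{}
	for cpu := range a {
		if _, ok := b[cpu]; !ok {
			out[cpu] = struct{}{}
		}
	}
	return out
}

func main() {
	// In the real flow these would be discovered at runtime: allCPUs from the
	// node, scyllaCPUs by observing each running Scylla container's cpuset
	// (e.g. its cgroup cpuset.cpus file). Values here are illustrative.
	allCPUs := cpuSet{0: {}, 1: {}, 2: {}, 3: {}, 4: {}, 5: {}, 6: {}, 7: {}}
	scyllaCPUs := cpuSet{2: {}, 3: {}, 4: {}, 5: {}, 6: {}, 7: {}}

	// CPUs not exclusively owned by Scylla are the candidates for network IRQ affinity.
	irqCPUs := diff(allCPUs, scyllaCPUs)

	var ids []int
	for cpu := range irqCPUs {
		ids = append(ids, cpu)
	}
	sort.Ints(ids)
	fmt.Println("IRQ CPUs:", ids) // recomputed whenever a new Scylla Pod lands on the node
}
```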
Why don't you have control over it? Isn't that what cpuset is supposed to do? Depending on the K8S node size (no. of cores), you probably want to 'sacrifice' the first 2-4 cores on the 1st NUMA node and use them for network traffic. Consult with @vladzcloudius, who's the master of this (for example, if your Scylla pod is on the 2nd NUMA node, etc. - unsure what needs to happen there).
Why don't you have control over it?
Resource allocation is realized by cgroups in Kube, and it's managed by the kubelet, not by us. As I said, in Kube you request a resource quantity, not a specific entity. If you want 1 CPU, you will get one, but there's no API to say which one you want at runtime.
What you can do is choose the policy used for scheduling: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#static-policy-options
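For context, the static policy those docs describe only hands out exclusive CPUs to containers in Guaranteed-QoS Pods with integer CPU requests, and even then you only pick the quantity, not which CPUs. A hedged sketch of such a Pod spec built with client-go types (name, image and sizes are illustrative):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// scyllaPod builds a Pod that qualifies for exclusive CPUs under the kubelet's
// static CPU manager policy: Guaranteed QoS (requests == limits) and an
// integer CPU count. Which physical CPUs it gets is still the kubelet's call.
func scyllaPod() *corev1.Pod {
	cpu := resource.MustParse("4")    // integer CPU count
	mem := resource.MustParse("16Gi") // illustrative
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "scylla-0"}, // illustrative name
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "scylla",
				Image: "scylladb/scylla", // tag omitted
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{corev1.ResourceCPU: cpu, corev1.ResourceMemory: mem},
					Limits:   corev1.ResourceList{corev1.ResourceCPU: cpu, corev1.ResourceMemory: mem},
				},
			}},
		},
	}
}

func main() {
	fmt.Println(scyllaPod().Name)
}
```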
To the best of my knowledge, you can have more granular scheduling. https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/ and https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/ and another one I can't find right now allow you to achieve that (perhaps EKS doesn't yet have all of them).
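If it helps, the kubelet-side knobs those pages describe look roughly like this when expressed with the kubeletconfig Go types; the chosen values are assumptions about what a latency-sensitive node might use, not something the thread prescribes:

```go
package main

import (
	"fmt"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

func main() {
	cfg := kubeletconfig.KubeletConfiguration{
		// Static policy: Guaranteed pods with integer CPU requests get exclusive CPUs.
		CPUManagerPolicy: "static",
		// Try to keep a container's resources on a single NUMA node.
		TopologyManagerPolicy: "single-numa-node",
	}
	fmt.Printf("cpuManagerPolicy=%s topologyManagerPolicy=%s\n",
		cfg.CPUManagerPolicy, cfg.TopologyManagerPolicy)
}
```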
@zimnx @mykaul If you plan to run multiple (small) PODs on the same machine (I assume this is the case), then in the long run I would recommend allocating IRQ cores in advance as if you were running a single big Scylla on that node. The modern perftune.py (from the master
+ the "perftune.py: auto-select the same number of IRQ cores on each NUMA" patch I sent today) can be used for that.
perftune.py allows you to get both the IRQ CPUs and compute CPUs masks. I would also recommend using perftune.py for the node tweaking too.
So... once you get the compute CPU mask, you can restrict K8S to use only these CPUs for assigning to your PODs.
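A rough sketch of that flow, assuming your perftune.py version supports the --get-cpu-mask flag (verify against the version shipped with your Scylla image): ask perftune.py for the compute CPU mask, then feed it into whatever mechanism you use to constrain Kubernetes to those CPUs.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// computeCPUMask asks perftune.py which CPUs remain for compute after it
// reserves cores for network IRQs. The flags are assumptions based on common
// perftune.py usage; check the version shipped with your Scylla image.
func computeCPUMask(nic string) (string, error) {
	out, err := exec.Command("perftune.py",
		"--tune", "net", "--nic", nic, "--get-cpu-mask").Output()
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	mask, err := computeCPUMask("eth0") // NIC name is illustrative
	if err != nil {
		log.Fatal(err)
	}
	// The complement of this mask is the IRQ CPU set; the mask itself is what
	// Kubernetes should be restricted to when placing Scylla Pods (for example
	// via the kubelet's reservedSystemCPUs / --reserved-cpus setting).
	fmt.Println("compute CPU mask:", mask)
}
```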
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
/lifecycle stale
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
/lifecycle rotten
/remove-lifecycle stale
/triage accepted
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
/lifecycle stale
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
/lifecycle rotten
We lack e2e coverage for the scenario where the Operator is optimizing a Node when there are multiple optimizable Scylla Pods. To prevent more regressions (#1148), we should add an e2e test for the following case: