nokia / CPU-Pooler

A Device Plugin for Kubernetes, which exposes the CPU cores as consumable Devices to the Kubernetes scheduler.
BSD 3-Clause "New" or "Revised" License

Adding hyper-threading awareness to exclusive CPU pool management #56

Closed Levovar closed 3 years ago

Levovar commented 3 years ago

Pool configuration is expanded with a new attribute called "hyperThreadingPolicy". The current behaviour is referred to as "singleThreaded" and remains the default for backward compatibility. When this parameter is set to "multiThreaded", however, CPUSetter expands the cpuset of an exclusive container with all HT siblings of all the assigned exclusive cores.

This behaviour is useful for multi-threaded DPDK applications which can squeeze out extra performance by utilizing the sibling(s) of their own exclusive cores, while still ensuring guaranteed performance and allocation: no other application can utilize these threads and introduce noisy-neighbour problems (unlike CPU Manager, where this behaviour is best-effort). The implementation assumes the exclusive pools are defined the same way as before, i.e. only main physical core IDs are listed in the pool configuration.

Still missing: documentation update, testing, and thinking about the possibility to provide a Pod-level HT policy overwrite capability (i.e. the pool-level config overwritten with a user-defined value when a Pod is instantiated with a specific annotation), so the functionality can be enabled on demand after a SW upgrade, without the need to re-create existing CPU pools.
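To make the new attribute concrete, here is a minimal sketch of how the pool configuration type could carry the policy; the field names, YAML tags, and struct layout are assumptions for illustration, not the actual CPU-Pooler definitions:

```go
// Hypothetical sketch only: names and structure are assumptions, not the
// actual CPU-Pooler types.
package types

// Pool describes one named CPU pool from the poolconfig file.
type Pool struct {
	// CPUs is the cpuset of the pool, listing physical core IDs only,
	// e.g. "2-6,20-34".
	CPUs string `yaml:"cpus"`
	// HyperThreadingPolicy is the new attribute: "singleThreaded" (default,
	// the current behaviour) or "multiThreaded" (CPUSetter adds all HT
	// siblings of the allocated exclusive cores to the container cpuset).
	HyperThreadingPolicy string `yaml:"hyperThreadingPolicy,omitempty"`
}

// PoolConfig is the per-node pool configuration.
type PoolConfig struct {
	Pools        map[string]Pool   `yaml:"pools"`
	NodeSelector map[string]string `yaml:"nodeSelector"`
}
```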

TimoLindqvist commented 3 years ago

I haven't experimented with the latest DPDK, but at least in earlier versions the DPDK application had to receive the core(s) it runs on as an argument (there are different options for how to pass them). So the environment variable EXCLUSIVE_CORES should contain the full (logical) CPU list. With hyperthreading enabled, it should list all threads.

Here is a link to DPDK EAL parameters: http://doc.dpdk.org/guides/linux_gsg/linux_eal_parameters.html#common-eal-parameters

So I think the CPU device plugin needs to know the CPU topology. Based on the CPU requests, it could select physical cores and add the thread siblings to the list of (logical) CPUs where the application can run. Another issue is how the pools should be configured: should they always contain the physical core IDs, or the "logical" CPUs ("processor" in Linux terms)? When hyperthreading is enabled, finding out the physical core IDs is a bit trickier, but there are tools available, so it shouldn't be an issue.
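For reference, the kernel already exposes the logical-to-physical mapping under /sys/devices/system/cpu; a minimal sketch of reading it (the sysfs paths are standard Linux, the helper itself is only illustrative):

```go
// Illustrative sketch: map each logical CPU to its physical core ID by
// reading the standard Linux sysfs topology files.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func physicalCoreIDs() (map[int]int, error) {
	coreOf := map[int]int{} // logical CPU -> physical core ID
	dirs, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, dir := range dirs {
		cpu, err := strconv.Atoi(strings.TrimPrefix(filepath.Base(dir), "cpu"))
		if err != nil {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dir, "topology", "core_id"))
		if err != nil {
			continue // e.g. offline CPU
		}
		core, err := strconv.Atoi(strings.TrimSpace(string(raw)))
		if err != nil {
			continue
		}
		// Note: core_id is unique only within a physical package; on
		// multi-socket hosts combine it with physical_package_id.
		coreOf[cpu] = core
	}
	return coreOf, nil
}

func main() {
	coreOf, err := physicalCoreIDs()
	if err != nil {
		panic(err)
	}
	fmt.Println(coreOf)
}
```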

I was also wondering whether some applications actually need to know which CPUs belong to the same physical core. An application might decide not to use the thread sibling even when hyperthreading is enabled.

Levovar commented 3 years ago

So the environment variable EXCLUSIVE_CORES should contain the full (logical) CPU list.

good call, I will look into this!

So I think the CPU device plugin needs to know the CPU topology.

yeah, my idea behind the PR is to leave the Device Plugin component out of this and only make CPUSetter HT-aware. I don't think there is a need to expose the logical cores to the scheduler, because we would like to implement a guaranteed HT allocation policy, unlike CPU Manager. So the PR assumes device advertisement works as before: the operator includes physical cores in the pool, those cores get advertised and scheduling decisions are made based on them, and at the end CPUSetter expands the list with the logical cores if needed (see the sketch below). Selecting physical cores in an HT system might not be trivial, true, but anyone already using Pooler has done it already, considering this is how it works :) This approach also has the extra benefit that no migration is needed when upgrading to the new release.
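A minimal sketch of that expansion step, relying on the kernel's standard thread_siblings_list files; the function is illustrative only and not the actual CPUSetter code:

```go
// Illustrative sketch: expand a list of exclusive physical cores with their
// hyper-threaded siblings using the kernel's thread_siblings_list files.
package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// expandWithSiblings returns the given CPUs plus all of their HT siblings.
func expandWithSiblings(cpus []int) ([]int, error) {
	set := map[int]struct{}{}
	for _, cpu := range cpus {
		path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu)
		raw, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		// The file contains a cpuset-style list such as "2,42" or "2-3".
		for _, part := range strings.Split(strings.TrimSpace(string(raw)), ",") {
			if bounds := strings.SplitN(part, "-", 2); len(bounds) == 2 {
				lo, err1 := strconv.Atoi(bounds[0])
				hi, err2 := strconv.Atoi(bounds[1])
				if err1 != nil || err2 != nil {
					return nil, fmt.Errorf("bad range %q", part)
				}
				for id := lo; id <= hi; id++ {
					set[id] = struct{}{}
				}
			} else {
				id, err := strconv.Atoi(part)
				if err != nil {
					return nil, err
				}
				set[id] = struct{}{}
			}
		}
	}
	expanded := make([]int, 0, len(set))
	for id := range set {
		expanded = append(expanded, id)
	}
	sort.Ints(expanded)
	return expanded, nil
}

func main() {
	// e.g. physical core 2 allocated exclusively -> cpuset becomes "2,42"
	// on a system where CPU 42 is its HT sibling.
	fmt.Println(expandWithSiblings([]int{2}))
}
```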

An application might decide not to use the thread sibling even when hyperthreading is enabled.

yeah, this is why I will be adding a Pod-level control option too in the next commit. The operator will be able to set a default policy on the pool level, but apps will be able to selectively overwrite it in case they have special needs. I don't think a cluster-wide policy can work for all apps in a multi-threaded system; I have already heard real-life use-cases for each approach. A Pod-level overwrite would also lessen resource fragmentation, because the different apps could still use the same pool.

Levovar commented 3 years ago

Created a new package and re-factored all topology discovery functionality from both CPUSetter and the Device Plugin component into it. We now also store both the HT and NUMA topology in the Device Manager memory area, so if we want to make some extra magic happen in the future, the information is readily available.
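Conceptually, the stored topology could look something like this; the package and field names below are assumptions, not the actual layout of the new package:

```go
// Illustrative sketch of an in-memory topology store; names are assumptions.
package topology

// Topology keeps the hyper-threading and NUMA layout discovered at startup,
// so later allocation logic can consult it without re-reading sysfs.
type Topology struct {
	// Siblings maps every logical CPU to the full set of logical CPUs
	// sharing its physical core (including itself), e.g. 2 -> {2, 42}.
	Siblings map[int][]int
	// NUMANode maps every logical CPU to the NUMA node it belongs to.
	NUMANode map[int]int
}

// SiblingsOf returns the HT siblings of the given logical CPU.
func (t Topology) SiblingsOf(cpu int) []int {
	return t.Siblings[cpu]
}
```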

EXCLUSIVE_CPUS is now filled via the new package, so it correctly contains all IDs based on the newly introduced HT policy parameter.

Levovar commented 3 years ago

sure! trying to test it on a real system for now

I might do the Pod-level overwrite in a different PR, because we need to involve the DP in it due to the exclusive CPU environment variable. The problem is that the DP has absolutely no Pod-level information available to it, so to correctly handle a Pod-level overwrite of the HT policy, the DP would need to start parsing the Kubelet checkpoint file too to figure out which Pod the device is allocated to, and then read its spec from the API server. So it is a bigger change than initially anticipated and probably needs one more refactoring.
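For context, a rough sketch of what that checkpoint lookup could involve. The file path is the kubelet's device-manager checkpoint, but the JSON schema shown here is a simplified assumption and varies between Kubernetes versions:

```go
// Illustrative sketch only: map allocated device IDs back to the Pod UID by
// reading the kubelet device-manager checkpoint. The schema below is a
// simplified assumption and differs between Kubernetes versions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type podDeviceEntry struct {
	PodUID        string   `json:"PodUID"`
	ContainerName string   `json:"ContainerName"`
	ResourceName  string   `json:"ResourceName"`
	DeviceIDs     []string `json:"DeviceIDs"`
}

type checkpoint struct {
	Data struct {
		PodDeviceEntries []podDeviceEntry `json:"PodDeviceEntries"`
	} `json:"Data"`
}

// podForDevice returns the UID of the Pod that was allocated the given device.
func podForDevice(resource, deviceID string) (string, error) {
	raw, err := os.ReadFile("/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint")
	if err != nil {
		return "", err
	}
	var cp checkpoint
	if err := json.Unmarshal(raw, &cp); err != nil {
		return "", err
	}
	for _, e := range cp.Data.PodDeviceEntries {
		if e.ResourceName != resource {
			continue
		}
		for _, id := range e.DeviceIDs {
			if id == deviceID {
				return e.PodUID, nil
			}
		}
	}
	return "", fmt.Errorf("device %s of %s not found in checkpoint", deviceID, resource)
}

func main() {
	uid, err := podForDevice("nokia.k8s.io/exclusive_pool", "6")
	fmt.Println(uid, err)
}
```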

Levovar commented 3 years ago

Current content is at least backward compatible:

]# docker logs fb6566385047
I0125 16:13:19.672145       1 pool.go:82] Using configuration file /etc/cpu-pooler/poolconfig-baremetal-baremetal-fi-805-workerbm-0.yaml for pool config
I0125 16:13:19.672199       1 cpu-device-plugin.go:250] Pool configuration {map[default_pool:{18-19,58-59 18-19,58-59 singleThreaded} exclusive_pool:{2-6,20-34 2-6,20-34 **singleThreaded**} shared_pool:{7-17,35-39,47-57,75-79 7-17,35-39,47-57,75-79 singleThreaded}] map[kubernetes.io/hostname:baremetal-baremetal-fi-805-workerbm-0]}
I0125 16:13:19.672242       1 cpu-device-plugin.go:181] Starting plugin for pool: exclusive_pool
I0125 16:13:20.173106       1 cpu-device-plugin.go:45] Starting CPU Device Plugin server at: /var/lib/kubelet/device-plugins/cpudp_exclusive_pool.sock
I0125 16:13:20.173666       1 cpu-device-plugin.go:68] CPU Device Plugin server started serving
I0125 16:13:20.174436       1 cpu-device-plugin.go:231] CPU device plugin registered with the Kubelet
I0125 16:13:20.174446       1 cpu-device-plugin.go:181] Starting plugin for pool: shared_pool
I0125 16:13:20.571726       1 cpu-device-plugin.go:45] Starting CPU Device Plugin server at: /var/lib/kubelet/device-plugins/cpudp_shared_pool.sock
I0125 16:13:20.574100       1 cpu-device-plugin.go:68] CPU Device Plugin server started serving
I0125 16:13:20.574782       1 cpu-device-plugin.go:231] CPU device plugin registered with the Kubelet
I0125 16:31:43.002346       1 cpu-device-plugin.go:142] CPUs allocated: 6: Num of CPUs 1
[root@cpupod-exclusive /]# printenv | grep EXCLUSIVE
EXCLUSIVE_CPUS=6
[root@cpupod-exclusive /]# cat /sys/fs/cgroup/cpuset/cpuset.cpus
6

On another note: the starter binary also needs to be adapted so it expects the right CPUSet and does not fail when siblings are assigned together.

Levovar commented 3 years ago

Added new Setter unit tests for the HT policies. Tests actually found issues in the code: insane ;) Corrected them all, and also all the linting issues, and the existing tests.

I'm not yet sure how to integrate the new fake lscpu into Travis, but at least locally it works nicely if it is installed directly into /usr/bin.

Levovar commented 3 years ago

Ended up containerizing the UT execution so it can be reliably run on any platform

running this container is now integrated into Travis, CI is now green!

Levovar commented 3 years ago

@TimoLindqvist : documentation updated, only final functional testing remains. Barring some bug fixes if needed, the PR is ready from my perspective.

TimoLindqvist commented 3 years ago

This looks good from my perspective.

Levovar commented 3 years ago

and it works in a real environment as well

# kubectl get po cpupod-exclusive -o yaml | grep -i exclusive_pool
        nokia.k8s.io/exclusive_pool: "1"
        nokia.k8s.io/exclusive_pool: "1"
# docker exec -ti a588c5b36185 bash
[root@cpupod-exclusive /]# printenv | grep -i exclusive
HOSTNAME=cpupod-exclusive
CPU_POOLS=exclusive
EXCLUSIVE_CPUS=2,42
CONTAINER_NAME=cputest-exclusive
[root@cpupod-exclusive /]# cat /sys/fs/cgroup/cpuset/cpuset.cpus
2,42