intel / CPU-Manager-for-Kubernetes

Kubernetes Core Manager for NFV workloads
Apache License 2.0
187 stars 83 forks source link

CMK reconfiguration function errors #300

Open ylhsiehitri opened 3 years ago

ylhsiehitri commented 3 years ago

Hi folks,

I use CMK v1.5.2 release, and Kubernetes v.1.16.7 I tried the CMK reconfigure function but fails, it seems the code has some errors.

My patch for the issues 1~5 below is here: https://github.com/ylhsiehitri/CPU-Manager-for-Kubernetes/commit/c195694cec1168ecfecbfefa688f1d223dce538d With the patch, reconfigure function can resize pool successfully now. But the problem in 5(b) is left to be solved, core affinity in cmk configmap is changed successfully but in OS it is unchanged actually. (this case could be produced by: create two pods with one exclusive core for each --> make pod-1 completed while pod-2 keeps running --> reconfigure exclusive pool size to be only 1 --> CMK would change pod-2's core affinity to pod-1's, but actually unchanged in OS)

  1. Bug in lock: Within reconfigure.reconfigure(), it goes as follows into error loop (a) it calls config.lock(): in the config.lock(), the "owner" annotation derived from configmap is empty, so it is set to be the name of the associated pod (b) then, reconfigure_directory() is called: (i) In reconfigure_directory(), discover.add_node_er() is called, in which config.lock() is called again (ii) now in the config.lock(), the "owner" is not empty (it was just set in (a)), so following the if-else statement, it goes to the "if" section and leads to infinite loop. ==> This error infinite loop could be solved, just correct the if statement in config.lock(): if owner != "" and owner != self.owner

  2. Missing argument "namespace" These places miss required argument "namespace": (a) In reconfigure.reconfigure_directory() that calls discover.add_node_er() and discover.add_node_oir() (b) In describe.describe() that calls config.Config()" (c) In isolate.isolate() that calls config.Config() ==> Fixed, also the function headers of reconfigure.reconfigure() and describe.describe() are also corrected, added with argument "namespace"

  3. Non-default OS environment variable, e.g. NODE_NAME, should be given in pod yaml In CMK code, it often uses os.getenv("NODE_NAME") to get environment variable. However it is not given by default. ==> Should define env NODE_NAME in pod yaml. I added it in the cmk-isolate-pod.yaml as an example.

  4. (Discussion) Redundant augument In reconfigure.reconfigure() function header, the argument "node_name" is redundant since it will be reassigned in the function body ==> Or, the reconfigure function was designed to allow user to specify node_name?

  5. yaml.safe_load() fails for "Procs" object deserialization In reaffinitize.py, it uses yaml.safe_load() to deserialize the yaml-serialized Procs object, but fails (a) solved bug: source code: yaml.safe_load(config["config"]) correct to : yaml.safe_load(config.data["config"]) (b) unsolved: We expect the corrected yaml.safe_load() in (a) would return a Procs object, but actually it encounters exception error that it does not know how do the convertion. ==> Anyone knows how to deal with it simply? As far as I know, yaml.safe_load() can do deserialize for object of simple structure. But like the class Procs defined in reconfigure.py, it is not of simple structure: it has an dict, of which the values are Pid object. For such case, python provide yaml.add_constructor() mechnism to define how to reconstruct the object.