rh-ecosystem-edge / kernel-module-management

The kernel module management operator builds, signs and loads kernel modules on OpenShift.
https://openshift-kmm.netlify.app
Apache License 2.0

Day1 kernel module support via KMM in OCP #410

Closed yevgeny-shnaidman closed 1 year ago

yevgeny-shnaidman commented 1 year ago

Issue Summary

Currently KMM supports only Day2 operations: kernel modules can be loaded, replaced, or upgraded only after the OCP cluster is fully installed. Adding Day1 support (loading a kernel module before cluster installation completes) will increase the usability of KMM.

Proposed solution

The solution works at the root FS level, not at the initramfs level. It uses the MCO, Ignition configuration, and the same driver containers that KMM uses.

Example MachineConfig

```yaml
apiVersion: machineconfiguration.openshift.io/v
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: sfc
  name: replace-sfc
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - contents: |
            [Unit]
            Description=Replace in-tree sfc driver with oot sfc driver
            Before=network-pre.target
            Wants=network-pre.target
            DefaultDependencies=no
            [Service]
            User=root
            Type=oneshot
            TimeoutSec=10
            ExecStartPre=ls /usr/local/bin
            ExecStart=/usr/local/bin/replace-sfc-driver.sh
            PrivateTmp=yes
            RemainAfterExit=no
            TimeoutSec=60
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: "replace-sfc.service"
        - contents: |
            [Unit]
            Description=Pull oot sfc driver container
            After=network-online.target
            Wants=network-online.target
            DefaultDependencies=no
            [Service]
            User=root
            Type=oneshot
            ExecStart=/usr/local/bin/pull-sfc-driver.sh
            PrivateTmp=yes
            RemainAfterExit=no
            TimeoutSec=900
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: "pull-sfc-image.service"
        - enabled: false
          mask: true
          name: crio-wipe.service
    storage:
      files:
        - path: "/usr/local/bin/replace-sfc-driver.sh"
          mode: 511
          overwrite: true
          user:
            name: "root"
          contents:
            source: "data:text/plain;base64,IyEvYmluL2Jhc2gKSU1BR0U9InF1YXkuaW8veXNobmFpZG0vY2l0aS1kcml2ZXJzIgpUQUc9IjQuMTAuMjUiCmVjaG8gImJlZm9yZSBjaGVja2luZyBwb2RtYW4gaW1hZ2VzIgppZiBwb2RtYW4gaW1hZ2VzIHwgZ3JlcCAkSU1BR0UgfCBncmVwIC1xICRUQUc7IHRoZW4KICAgIGVjaG8gIkltYWdlICRJTUFHRTokVEFHIGZvdW5kIGluIHRoZSBsb2NhbCByZWdpc3RyeSwgcmVtb3ZpbmcgaW4tdHJlZSBzZmMiCiAgICBtb2Rwcm9iZSAtciBzZmMKICAgIGVjaG8gIlJ1bm5pbmcgY29udGFpbmVyIGltYWdlIHRvIGluc2VydCB0aGUgb290IHNmYyIKICAgIHBvZG1hbiBydW4gLS1wcml2aWxlZ2VkIC0tZW50cnlwb2ludCBtb2Rwcm9iZSAkSU1BR0U6JFRBRyAtZCAvb3B0IHNmYwogICAgZWNobyAiT09UIHNmYyBpcyBpbnNlcnRlZCIKZWxzZQogICBlY2hvICJJbWFnZSAkSU1BR0U6JFRBRyBpcyBub3QgcHJlc2VudCBpbiBsb2NhbCByZWdpc3RyeSwgd2lsbCB0cnkgYWZ0ZXIgcmVib290IgpmaQo="
        - path: "/usr/local/bin/pull-sfc-driver.sh"
          mode: 493
          overwrite: true
          user:
            name: "root"
          contents:
            source: "data:text/plain;base64,IyEvYmluL2Jhc2gKSU1BR0U9InF1YXkuaW8veXNobmFpZG0vY2l0aS1kcml2ZXJzIgpUQUc9IjQuMTAuMjUiCmlmIHBvZG1hbiBpbWFnZSBsaXN0IHwgZ3JlcCAkSU1BR0UgfCBncmVwIC1xICRUQUc7IHRoZW4KICAgIGVjaG8gIkltYWdlICRJTUFHRSBmb3VuZCBpbiB0aGUgbG9jYWwgcmVnaXN0cnkuTm90aGluZyB0byBkbyIKZWxzZQogICAgZWNobyAiSW1hZ2UgJElNQUdFIG5vdCBmb3VuZCBpbiB0aGUgbG9jYWwgcmVnaXN0cnksIHB1bGxpbmciCiAgICBwb2RtYW4gcHVsbCAkSU1BR0U6JFRBRwogICAgZWNobyAiSW1hZ2UgJElNQUdFOiRUQUcgaGFzIGJlZW4gc3VjY2Vzc2Z1bGx5IHB1bGxlZCwgcmVib290aW5nLi4iCiAgICByZWJvb3QKZmkK"
```

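For reference, decoding the two base64 payloads in the MachineConfig gives two short bash scripts. The sketch below reproduces their logic wrapped in functions so it can be read in one place. The function names pull_sfc_driver and replace_sfc_driver are illustrative and not part of the original, and variable quoting was added; the commands and messages are otherwise as decoded.

```shell
#!/bin/bash
# Logic decoded from the two base64 payloads in the MachineConfig.
# Function wrappers and variable quoting are additions for readability.
IMAGE="quay.io/yshnaidm/citi-drivers"
TAG="4.10.25"

# Body of /usr/local/bin/pull-sfc-driver.sh, started by pull-sfc-image.service
# once the network is online.
pull_sfc_driver() {
    if podman image list | grep "$IMAGE" | grep -q "$TAG"; then
        echo "Image $IMAGE found in the local registry.Nothing to do"
    else
        echo "Image $IMAGE not found in the local registry, pulling"
        podman pull "$IMAGE:$TAG"
        echo "Image $IMAGE:$TAG has been successfully pulled, rebooting.."
        reboot
    fi
}

# Body of /usr/local/bin/replace-sfc-driver.sh, started by replace-sfc.service
# before network-pre.target.
replace_sfc_driver() {
    echo "before checking podman images"
    if podman images | grep "$IMAGE" | grep -q "$TAG"; then
        echo "Image $IMAGE:$TAG found in the local registry, removing in-tree sfc"
        modprobe -r sfc
        echo "Running container image to insert the oot sfc"
        podman run --privileged --entrypoint modprobe "$IMAGE:$TAG" -d /opt sfc
        echo "OOT sfc is inserted"
    else
        echo "Image $IMAGE:$TAG is not present in local registry, will try after reboot"
    fi
}
```

As far as can be read from the units and scripts, the flow has two phases: on the first boot the image is absent, so the pull script fetches it and reboots; on the next boot the replace script finds the image locally and swaps the module before networking is brought up.
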
Replacing in-tree kernel module

The install service always runs modprobe -r before installing the kernel driver. This either removes the in-tree kernel module or does nothing (modprobe -r does not return an error if the kernel module is not loaded). To be safe, the command will be run from the entry point of the DriverContainer image.
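
A minimal sketch of what such an entry point could look like, assuming the image ships its modules under a directory tree that modprobe -d can use as its root. The function name swap_module and its arguments are hypothetical; only modprobe -r and modprobe -d are real options.

```shell
#!/bin/bash
# Hypothetical driver-container entry point logic: unload first, then load
# the out-of-tree build shipped in the image.
swap_module() {
    local module="$1" modules_root="$2"

    # Unload the currently loaded (for example in-tree) module. If the unload
    # fails, for instance because the module is in use, stop here instead of
    # reporting a swap that did not happen.
    if ! modprobe -r "$module"; then
        echo "could not unload $module; leaving the current module in place" >&2
        return 1
    fi

    # -d makes modprobe look for modules under <modules_root>/lib/modules/<kernel version>.
    modprobe -d "$modules_root" "$module"
}
```

With the values from the example MachineConfig this would be called as `swap_module sfc /opt`. Stopping on a failed unload is a design choice of this sketch: loading a module whose name is already registered would otherwise appear to succeed while the in-tree driver stays active.
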

Integration with Day2 KMM

Once the cluster is installed, KMMO can be deployed with a Module CR that targets the same kernel module (with the same DriverContainer image or a different one). This won't unload the kernel module installed on Day1, and it lets the customer upgrade the kernel module (without a node restart, if possible) and get cluster upgrade predictions via the Preflight CRD.
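
As an illustration, a Day2 Module CR for the same module could be rendered as follows. This is a sketch that assumes the kmm.sigs.x-k8s.io/v1beta1 Module API; the function name render_module_cr, the catch-all kernel regexp, and the worker node selector are assumptions made for the example, not something specified in this issue.

```shell
#!/bin/bash
# Print a KMM Module manifest for a given module name and DriverContainer image.
# The selector and the catch-all regexp are placeholders; adjust them to the
# nodes and kernels that should actually receive the module.
render_module_cr() {
    local module="$1" image="$2"
    cat <<EOF
apiVersion: kmm.sigs.x-k8s.io/v1beta1
kind: Module
metadata:
  name: ${module}
spec:
  moduleLoader:
    container:
      modprobe:
        moduleName: ${module}
      kernelMappings:
        - regexp: '^.+\$'
          containerImage: "${image}"
  selector:
    node-role.kubernetes.io/worker: ""
EOF
}
```

For example, `render_module_cr sfc quay.io/yshnaidm/citi-drivers:4.10.25 | oc apply -f -` would target the same module and image as the Day1 MachineConfig.
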

MCO/Day1 support models

We can provide 2 support models: off-cluster and in-cluster

off-cluster

In addition to the operator image, KMM will provide an executable utility that can run on any x86_64/arm server. It takes the DriverContainer image and the kernel module location as input and produces the MCO YAML, which can be applied as manifests during cluster installation.
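
To make the idea concrete, here is a reduced sketch of such a generator in bash. It is not the proposed utility: the function names, argument order, file paths, and unit name are hypothetical, the image pull and reboot step from the example MachineConfig is left out, and apiVersion machineconfiguration.openshift.io/v1 is my assumption for the MachineConfig API version.

```shell
#!/bin/bash
# Emit the script the node runs to swap the module (illustrative content).
render_replace_script() {
    local image="$1" module="$2" modules_root="$3"
    cat <<EOF
#!/bin/bash
if podman image exists "${image}"; then
    modprobe -r "${module}"
    podman run --privileged --entrypoint modprobe "${image}" -d "${modules_root}" "${module}"
else
    echo "Image ${image} is not present locally, will try after reboot"
fi
EOF
}

# Emit a MachineConfig that installs the script and a oneshot unit running it.
render_machineconfig() {
    local name="$1" role="$2" image="$3" module="$4" modules_root="$5"
    local script_b64
    # Ignition accepts file contents as a data URL; strip base64 line wraps.
    script_b64="$(render_replace_script "$image" "$module" "$modules_root" | base64 | tr -d '\n')"
    cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: ${role}
  name: ${name}
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: "${name}.service"
          enabled: true
          contents: |
            [Unit]
            Description=Replace in-tree ${module} with the out-of-tree build
            Before=network-pre.target
            Wants=network-pre.target
            [Service]
            Type=oneshot
            ExecStart=/usr/local/bin/${name}.sh
            [Install]
            WantedBy=multi-user.target
    storage:
      files:
        - path: "/usr/local/bin/${name}.sh"
          mode: 493
          overwrite: true
          contents:
            source: "data:text/plain;base64,${script_b64}"
EOF
}
```

A call such as `render_machineconfig replace-sfc sfc quay.io/yshnaidm/citi-drivers:4.10.25 sfc /opt > replace-sfc.yaml` would write a manifest that can be added to the installation manifests.
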

in-cluster

KMMO will support an additional CRD that receives the inputs defined above and produces the same MCOs. KMMO might even apply them itself, although this option is less viable. The executable from the "off-cluster" solution will be reused in the "in-cluster" solution.

Pros/Cons

Pros

  1. no layering solution is needed; it can be supported as soon as the functionality is implemented
  2. supports cluster upgradability
  3. the customer still gets full RH support and does not need to manage their own OCP images

Cons

  1. does not support adding/replacing a kernel module at the initramfs/kernel level
  2. does not support kernel modules that must be available before switching to the root FS (mainly storage drivers for HW that holds the root FS image, or network drivers needed for network access before the root FS is loaded)
  3. adds complexity to the KMM code base

ybettan commented 1 year ago

I think it is worth adding a MachineConfig example for better understanding of what is actually being applied to the cluster/MCO-static-pod.

hershpa commented 1 year ago

Would it be possible to leverage existing KMM functionality for driver container management directly in Day1? Can KMM become a cluster operator on OCP? The general idea is to let MCO embrace its strengths (machine configuration etc) and KMM to embrace its strengths and avoid overlap between operators.

uMartinXu commented 1 year ago

We are now using the MCO to configure the node system so that KMM can load the modules. For example, we have to use the MCO to prevent some in-tree drivers from being loaded and to add some kernel boot parameters that are necessary for KMM to load the module. To avoid rebooting the system, we prefer to apply this configuration on Day1 instead of Day2, so that we can keep KMM as simple as possible. But for driver container image management and module management, we still think it is good for KMM to handle them. Of course we could let the MCO handle them, but that might introduce a lot of complication if KMMO and the MCO handle the same thing from different operators on Day1 or Day2.
And we all know KMM can already handle driver container images and modules very well.

yevgeny-shnaidman commented 1 year ago

@uMartinXu This issue concerns use cases where Day2 KMM is not applicable. It does not replace Day2 KMM; it expands the general KMM options to handle kernel modules that need to be loaded very soon after boot, well before the KMM Operator starts running. The customer can choose whichever option is more compatible with their use case.

yevgeny-shnaidman commented 1 year ago

@hershpa Currently there are no plans to make KMM a core operator. In addition, even if KMM became a core operator, we would still need Day1 functionality: even as a core operator, KMM starts running only after the node and OS have fully booted. So if kernel modules need to be loaded before that, we need the functionality described in this issue.

ybettan commented 1 year ago

/assign @yevgeny-shnaidman

qbarrand commented 1 year ago

@yevgeny-shnaidman I believe we can close this?

yevgeny-shnaidman commented 1 year ago

yes, closing