intel / intel-data-center-gpu-driver-for-openshift

Intel Data Center GPU Drivers for Red Hat OpenShift Container Platform
https://catalog.redhat.com/software/containers/intel/intel-data-center-gpu-driver-container/6495ee55c8b2461e35fb8264
Apache License 2.0

Kernel ABI stability in OCP Minor Version may reduce rebuild efforts of driver container #56

Open hershpa opened 1 year ago

hershpa commented 1 year ago

Summary:

The Kernel Application Binary Interface (kABI) is a set of in-kernel symbols used by drivers and other kernel modules. Currently, the general idea is to rebuild and test the Intel GPU driver container image whenever the kernel version associated with a particular OCP z stream changes. This is the safest approach. Unfortunately, it requires continuous rebuild and test efforts that can be facilitated by automation but still carry a non-zero cost. It may be possible to reduce rebuild efforts based on the theory that no rebuild is required if the kernel ABI does not change across all z streams in a particular OCP minor version X.Y.

Potential Idea:

Assuming that the driver is using the list of stable symbols for which Red Hat guarantees ABI compatibility, consider the following.

Based on RHEL KB,

The kernel-abi-stablelists packages contain reference files, /lib/modules/kabi-/kabistablelist, listing interfaces provided by the kernel that are considered to be stable by Red Hat engineering. Such interfaces are safe for long-term use by third-party loadable device drivers, as well as for other purposes. With Red Hat Enterprise Linux 7 and 8, the stablelist is valid for the particular major release. This means that once a symbol has been introduced into kABI for a particular major release, it will not be removed, nor will its meaning be changed during that kernel major release complete life cycle. With Red Hat Enterprise Linux 9, each minor release will have a unique stablelist that is valid throughout the minor release lifecycle. For more information on this, please refer to the following knowledgebase article;

Red Hat Enterprise Linux 9 kABI Policy

Red Hat recommends recompiling kernel modules against every minor release of Red Hat Enterprise Linux.
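The assumption that the driver only uses stablelist symbols can be checked mechanically. The following is a minimal sketch, not taken from the KB article: it assumes the stablelist is a text file with one symbol name per line plus optional bracketed section headers, and that the list of symbols imported by the driver module has already been extracted by some other means. The function names and the sample symbols are hypothetical.

```python
def parse_stablelist(text):
    """Return the set of symbol names found in stablelist-style text.

    Assumed format: one symbol per line, optionally indented, with
    bracketed section headers (e.g. "[example_stablelist]") to skip.
    """
    symbols = set()
    for line in text.splitlines():
        name = line.strip()
        # Skip blank lines and section headers.
        if not name or name.startswith("["):
            continue
        symbols.add(name)
    return symbols


def symbols_outside_stablelist(driver_symbols, stablelist):
    """Return the driver's imported symbols that the stablelist does not cover."""
    return sorted(set(driver_symbols) - stablelist)


# Hypothetical data for illustration only; not a real stablelist.
stablelist = parse_stablelist("[example_stablelist]\n\t__kmalloc\n\tkfree\n\tprintk\n")
unprotected = symbols_outside_stablelist(
    ["__kmalloc", "kfree", "dma_buf_export"], stablelist
)
print(unprotected)
```

Any symbol reported here would fall outside the compatibility guarantee, so the single-image idea would not safely apply to a driver that depends on it.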

Based on this other KB, an OCP minor version always uses a certain minor RHEL version.

RHCOS/OCP Versions | RHEL Versions
-- | --
4.11 | RHEL 8.6
4.12 | RHEL 8.6
4.13 | RHEL 9.2

Tentative Conclusion:

It would be reasonable to conclude that for OCP 4.12, which is based on RHEL 8.6, only one driver container image is required to support all OCP 4.12.z versions as long as the kernel ABI stays the same. Similarly, all z streams for OCP 4.13, which is based on RHEL 9.2, would require a single driver container image.
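To illustrate what this conclusion would mean in practice, here is a small sketch that groups OCP z-stream versions by the driver image they would share. The mapping contains only the three releases listed in this issue; the function names are hypothetical, and the grouping is per OCP minor version because that is the only sharing proposed here.

```python
# Mapping taken from the OCP/RHEL table in this issue; other releases are not covered.
OCP_TO_RHEL = {"4.11": "8.6", "4.12": "8.6", "4.13": "9.2"}


def ocp_minor(version):
    """Reduce an OCP version such as '4.12.7' to its minor version '4.12'."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def group_by_driver_image(ocp_versions):
    """Group OCP z-stream versions that would share one driver container image."""
    groups = {}
    for version in ocp_versions:
        key = ocp_minor(version)
        if key not in OCP_TO_RHEL:
            raise ValueError(f"no RHEL mapping known for OCP {key}")
        groups.setdefault((key, OCP_TO_RHEL[key]), []).append(version)
    return groups


groups = group_by_driver_image(["4.12.1", "4.12.9", "4.13.0", "4.13.4"])
for (ocp, rhel), versions in groups.items():
    print(f"OCP {ocp} (RHEL {rhel}): one image for {versions}")
```

This only holds under the stated condition that the kernel ABI does not change within the minor version; it says nothing about sharing an image between OCP 4.11 and 4.12.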

Goal:

The goal is to understand the pros, cons, and potential risks of this approach. Theoretically, it is possible to use the same driver container with different kernel versions as long as the kernel ABI remains stable. It is important to note that:

in very rare and special circumstances, a symbol in a kABI stablelist needs to be changed. For example, Red Hat could introduce kABI breakage when a critical security issue cannot be resolved without breaking kABI. Red Hat will inform the partners if such a situation should occur.

In general, even if rebuilds are avoided, it is reasonable to use automation to retest the existing driver container whenever the kernel version changes, to ensure compatibility and functionality.

chaitanya1731 commented 8 months ago

@mregmi

hershpa commented 7 months ago

@qbarrand, any comments/feedback on this appreciated. Thanks!

mregmi commented 7 months ago

To add some more info, we are trying to detect whether there are any kABI changes between OCP z-stream releases. If a change is detected, we rebuild the driver image; if not, we reuse the previous driver image. For this purpose, we are planning to use this project (https://github.com/skozina/kabi-dw), which can detect kABI changes between two kernel versions.
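The rebuild-or-reuse decision could look roughly like the sketch below. Note that this is not how kabi-dw works: kabi-dw compares DWARF type information, whereas this sketch swaps in a simpler technique, comparing per-symbol CRCs from `Module.symvers`-style text (one line per symbol, starting with the CRC and the symbol name). The function names, the list of symbols used by the driver, and the CRC values are all hypothetical.

```python
def parse_symvers(text):
    """Map symbol name -> CRC from Module.symvers-style text.

    Each line is expected to start with "<crc> <symbol>"; any further
    whitespace-separated fields are ignored.
    """
    crcs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            crcs[fields[1]] = fields[0]
    return crcs


def changed_symbols(old, new, used):
    """Return used symbols that are missing from the new kernel or whose CRC differs."""
    return sorted(
        s for s in used if new.get(s) is None or old.get(s) != new.get(s)
    )


def needs_rebuild(old, new, used):
    """True if any symbol the driver uses changed between the two kernels."""
    return bool(changed_symbols(old, new, used))


# Hypothetical CRC values for illustration only.
old = parse_symvers(
    "0x11111111\t__kmalloc\tvmlinux\tEXPORT_SYMBOL\n"
    "0x22222222\tkfree\tvmlinux\tEXPORT_SYMBOL\n"
)
new = parse_symvers(
    "0x11111111\t__kmalloc\tvmlinux\tEXPORT_SYMBOL\n"
    "0x33333333\tkfree\tvmlinux\tEXPORT_SYMBOL\n"
)

action = "rebuild" if needs_rebuild(old, new, ["__kmalloc", "kfree"]) else "reuse"
print(action, changed_symbols(old, new, ["__kmalloc", "kfree"]))
```

Restricting the comparison to the symbols the driver actually uses keeps unrelated kernel changes from triggering rebuilds. Even when the result is "reuse", the existing image would still need to be retested against the new kernel.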