rbo closed this issue 1 year ago
The NVIDIA GPU Operator provides several channels:
Sadly, all of them are too new...
The index registry.redhat.io/redhat/certified-operator-index:v4.8 provides gpu-operator-certified.v1.7.1 in channel v1.7.
# Expose image registry
oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
# Login
podman login -u kubeadmin -p $(oc whoami -t) $HOST
# Switch to the project and create the image stream
oc project openshift-marketplace
oc create imagestream nvidia-gpu-operator-index
# Build index
opm index add \
--bundles <registry>/<namespace>/<bundle_image_name>:<tag> \
--tag <registry>/<namespace>/<index_image_name>:<tag> \
[--binary-image <registry_base_image>]
# Add RedHat ISV PGP: https://access.redhat.com/solutions/6542281
opm index prune \
-f registry.redhat.io/redhat/certified-operator-index:v4.8 \
-p gpu-operator-certified \
-t $HOST/openshift-marketplace/nvidia-gpu-operator-index:v4.8
podman push $HOST/openshift-marketplace/nvidia-gpu-operator-index:v4.8
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: nvidia-gpu-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: image-registry.openshift-image-registry.svc:5000/openshift-marketplace/nvidia-gpu-operator-index:v4.8
  displayName: Nvidia GPU Operator from OpenShift 4.8
  publisher: grpc
EOF
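Before installing from the new catalog, it can help to confirm that OLM has actually connected to the index image. The following is a minimal sketch, assuming the CatalogSource name and namespace used above; the function names and the retry values are hypothetical, and `oc` must already be logged in to the cluster.

```shell
# Sketch: poll the CatalogSource until OLM reports it as READY.
# Defaults match the CatalogSource created above; retry values are illustrative.
catalog_state() {
  oc get catalogsource "${1:-nvidia-gpu-operator}" \
    -n "${2:-openshift-marketplace}" \
    -o jsonpath='{.status.connectionState.lastObservedState}'
}

wait_for_catalog() {
  local tries="${1:-30}"
  local delay="${2:-10}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # READY means the registry pod serves the index over gRPC.
    if [ "$(catalog_state)" = "READY" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "CatalogSource not READY after $tries attempts" >&2
  return 1
}
```

Calling `wait_for_catalog` with no arguments checks up to 30 times, 10 seconds apart. If it times out, the catalog pod in openshift-marketplace is the first place to look, for example for an image pull error caused by a wrong tag.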
Install the GPU Operator from the catalog "Nvidia GPU Operator from OpenShift 4.8" and select version 1.7.1 (channel v1.7).
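The same installation could probably be done from the CLI with a Subscription instead of the web console. This is only a sketch: the package, channel, CSV, and catalog names come from the steps above, while the target namespace (`openshift-operators`) and the manual approval strategy are assumptions, and `render_subscription` is a hypothetical helper name.

```shell
# Sketch: print a Subscription that pins the operator to the pruned catalog.
# ASSUMPTION: the namespace default and Manual approval are illustrative
# choices, not taken from the original steps.
render_subscription() {
  local namespace="${1:-openshift-operators}"
  cat <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: ${namespace}
spec:
  channel: v1.7
  name: gpu-operator-certified
  source: nvidia-gpu-operator
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.7.1
  installPlanApproval: Manual
EOF
}
```

To apply it, pipe the output into the cluster with `render_subscription | oc apply -f -`. With `Manual` approval the generated InstallPlan has to be approved before anything is installed, which keeps OLM from moving to another version on its own.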
I have to entitle my cluster too...
oc create -f 0003-cluster-wide-machineconfigs.yaml
Cluster-wide entitlement applied...
The NVIDIA driver won't build:
+ make -s -j SYSSRC=/lib/modules/4.18.0-372.36.1.el8_6.x86_64/build nv-linux.o nv-modeset-linux.o
In file included from /usr/src/nvidia-460.73.01/kernel/nvidia/os-interface.c:17:
/usr/src/nvidia-460.73.01/kernel/common/inc/nv-time.h: In function 'nv_sleep_ms':
/usr/src/nvidia-460.73.01/kernel/common/inc/nv-time.h:208:18: error: 'struct task_struct' has no member named 'state'; did you mean '__state'?
current->state = TASK_INTERRUPTIBLE;
^~~~~
__state
make[2]: *** [scripts/Makefile.build:317: /usr/src/nvidia-460.73.01/kernel/nvidia/os-interface.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:1584: _module_/usr/src/nvidia-460.73.01/kernel] Error 2
make: *** [Makefile:80: modules] Error 2
++ make -s -j SYSSRC=/lib/modules/4.18.0-372.36.1.el8_6.x86_64/build clean
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
Stopping NVIDIA persistence daemon...
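The compiler error above says that driver 460.73.01 writes to `current->state`, while the headers for kernel 4.18.0-372.36.1.el8_6 only offer `__state`. A rough way to check a kernel tree for this before starting a build is sketched below. The helper name is hypothetical, and the check is a plain text match on `sched.h`, so treat it as a heuristic rather than a reliable compatibility test.

```shell
# Sketch: report whether a kernel build tree uses the renamed task_struct field.
# $1 is a kernel build directory, such as the SYSSRC path shown in the log.
# ASSUMPTION: a text match for "__state" in sched.h is a good enough signal.
uses_renamed_task_state() {
  grep -q '__state' "$1/include/linux/sched.h"
}
```

For example, `uses_renamed_task_state "/lib/modules/$(uname -r)/build" && echo "kernel uses __state"` would flag a node whose kernel this driver release is unlikely to compile against.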
As K20 GPUs are no longer supported, I am closing this issue.
The latest driver, 525.60.13 (Operator 22.9.1), no longer supports this GPU.
Latest supported driver for the K20Xm: 460.106.00
According to https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/release-notes.html, NVIDIA GPU Operator 1.7.x ships driver 460.73.01.
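The version reasoning in this comment can be sketched as a small comparison: a driver is only a candidate for the K20Xm if it is not newer than 460.106.00. The helper below is an illustration, not part of any NVIDIA tooling, and it assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
# Returns success when version $1 is less than or equal to version $2.
# Relies on `sort -V` (version sort), e.g. from GNU coreutils.
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

latest_supported="460.106.00"   # K20Xm, as stated in the comment above

for driver in 460.73.01 525.60.13; do
  if version_le "$driver" "$latest_supported"; then
    echo "$driver: not newer than $latest_supported"
  else
    echo "$driver: too new for this GPU"
  fi
done
```

This only answers whether the GPU is still covered by a driver release. As the build log shows, a driver that passes this check can still fail to compile against the node's kernel.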