Nvidia GPU k20xm - driver installation fail

rbo commented 1 year ago

The installation fails because the latest driver, 525.60.13 (GPU Operator 22.9.1), no longer supports this GPU.

Latest supported driver for k20xm: 460.106.00

According to the release notes, NVIDIA GPU Operator 1.7.x appears to ship driver 460.73.01: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/release-notes.html
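
The version reasoning above can be sketched as a small shell check. This is only an illustration: the function name `driver_supported` and the variable `MAX_K20XM_DRIVER` are made up for this example, and it assumes a `sort` that supports `-V` (GNU coreutils).

```shell
#!/bin/sh
# Highest driver release named in this thread as still supporting the K20Xm.
MAX_K20XM_DRIVER="460.106.00"

# driver_supported VERSION [MAX]   (hypothetical helper)
# Prints "yes" if VERSION is less than or equal to MAX in version order,
# otherwise "no".
driver_supported() {
    version="$1"
    max="${2:-$MAX_K20XM_DRIVER}"
    # sort -V orders version strings numerically per component; if MAX ends
    # up last, VERSION is not newer than MAX.
    highest=$(printf '%s\n%s\n' "$version" "$max" | sort -V | tail -n 1)
    if [ "$highest" = "$max" ]; then
        echo yes
    else
        echo no
    fi
}

driver_supported 525.60.13   # driver shipped with Operator 22.9.1
driver_supported 460.73.01   # driver shipped with Operator 1.7.x
```

By this check, only the 1.7.x driver is old enough for the K20Xm.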

rbo commented 1 year ago

The NVIDIA GPU Operator provides these channels:

v1.10
v1.11
v22.9
stable

Sadly, all too new...

rbo commented 1 year ago

The index registry.redhat.io/redhat/certified-operator-index:v4.8 provides gpu-operator-certified.v1.7.1 in channel v1.7.

rbo commented 1 year ago

# Expose image registry
oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')

# Login
podman login -u kubeadmin -p $(oc whoami -t) $HOST

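# Switch to the marketplace project and create the image stream
# (an image stream name cannot contain a tag; the tag is set on push)
oc project openshift-marketplace
oc create imagestream nvidia-gpu-operator-index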

# Build index (general syntax, for reference only; not run here):
# opm index add \
#     --bundles <registry>/<namespace>/<bundle_image_name>:<tag> \
#     --tag <registry>/<namespace>/<index_image_name>:<tag> \
#     [--binary-image <registry_base_image>]

# Add the Red Hat ISV PGP key first (https://access.redhat.com/solutions/6542281),
# then prune the certified index down to the GPU operator package
opm index prune \
    -f registry.redhat.io/redhat/certified-operator-index:v4.8 \
    -p  gpu-operator-certified \
    -t $HOST/openshift-marketplace/nvidia-gpu-operator-index:v4.8

podman push $HOST/openshift-marketplace/nvidia-gpu-operator-index:v4.8

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: nvidia-gpu-operator
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: image-registry.openshift-image-registry.svc:5000/openshift-marketplace/nvidia-gpu-operator-index:v4.8
  displayName: Nvidia GPU Operator from OpenShift 4.8
  publisher: grpc
EOF

Install the GPU Operator from the catalog "Nvidia GPU Operator from OpenShift 4.8" and select channel v1.7 (version 1.7.1).
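
For anyone who prefers the CLI over the console, the same selection could probably be expressed as a Subscription. This is a rough sketch, not what was actually run here: the helper `render_subscription` is hypothetical, and the target namespace `openshift-operators` is an assumption (pass a different one as the first argument). The package, channel, CSV and catalog source names are taken from the comments above.

```shell
#!/bin/sh
# render_subscription [NAMESPACE]   (hypothetical helper)
# Prints a Subscription manifest for the pruned catalog to stdout.
render_subscription() {
    # Assumption: install into openshift-operators unless told otherwise.
    target_namespace="${1:-openshift-operators}"
    cat <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: ${target_namespace}
spec:
  channel: v1.7
  name: gpu-operator-certified
  source: nvidia-gpu-operator
  sourceNamespace: openshift-marketplace
  startingCSV: gpu-operator-certified.v1.7.1
EOF
}

# Print the manifest for review.
render_subscription
```

To apply it, pipe the output into the cluster: `render_subscription | oc apply -f -`.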

rbo commented 1 year ago

I have to entitle my cluster too...

rbo commented 1 year ago

oc create -f 0003-cluster-wide-machineconfigs.yaml

Cluster wide entitlement applied...

rbo commented 1 year ago

The NVIDIA driver won't build:

+ make -s -j SYSSRC=/lib/modules/4.18.0-372.36.1.el8_6.x86_64/build nv-linux.o nv-modeset-linux.o
In file included from /usr/src/nvidia-460.73.01/kernel/nvidia/os-interface.c:17:
/usr/src/nvidia-460.73.01/kernel/common/inc/nv-time.h: In function 'nv_sleep_ms':
/usr/src/nvidia-460.73.01/kernel/common/inc/nv-time.h:208:18: error: 'struct task_struct' has no member named 'state'; did you mean '__state'?
         current->state = TASK_INTERRUPTIBLE;
                  ^~~~~
                  __state
make[2]: *** [scripts/Makefile.build:317: /usr/src/nvidia-460.73.01/kernel/nvidia/os-interface.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:1584: _module_/usr/src/nvidia-460.73.01/kernel] Error 2
make: *** [Makefile:80: modules] Error 2
++ make -s -j SYSSRC=/lib/modules/4.18.0-372.36.1.el8_6.x86_64/build clean
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
Stopping NVIDIA persistence daemon...
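
The compiler hint in the log suggests that the headers for this kernel declare `__state` instead of `state` in `struct task_struct`, which the 460.73.01 sources do not expect. A quick way to check a node before attempting the build might look like this. It is a sketch only: `uses_renamed_task_state` is a made-up helper, and the header path is an assumption derived from the SYSSRC value in the log.

```shell
#!/bin/sh
# uses_renamed_task_state SCHED_H   (hypothetical helper)
# Returns 0 if the given header declares a member named '__state',
# as the compiler suggests in the log above; returns 1 otherwise.
uses_renamed_task_state() {
    grep -Eq '[[:space:]]__state[[:space:]]*;' "$1"
}

# Assumed location of the kernel headers, following SYSSRC in the build log.
SCHED_H="/lib/modules/$(uname -r)/build/include/linux/sched.h"

if [ -r "$SCHED_H" ] && uses_renamed_task_state "$SCHED_H"; then
    echo "task_struct declares __state: driver 460.73.01 will likely fail to build"
else
    echo "no __state member found (or kernel headers not installed)"
fi
```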

DanielFroehlich commented 1 year ago

As K20 GPUs are no longer supported, I am closing this issue.
