openshift-psap / special-resource-operator-deprecated

segmentation fault during nvidia driver validation #24

Open michaelcourcy opened 4 years ago

michaelcourcy commented 4 years ago

Hi

SRO=release-4.3 
OCP=4.3.19
Kernel Version= 4.18.0-147.8.1.el8_1.x86_64
OS Image=Red Hat Enterprise Linux CoreOS 43.81.202005071438.0 (Ootpa)

I got a segfault during the NVIDIA driver validation phase:

NAME                                       READY   STATUS                 RESTARTS   AGE
nvidia-driver-daemonset-frmw5              1/1     Running                0          2m30s
nvidia-driver-daemonset-gsm2k              1/1     Running                0          2m30s
nvidia-driver-daemonset-swtht              1/1     Running                0          2m30s
nvidia-driver-internal-1-build             0/1     Completed              0          5m42s
nvidia-driver-validation                   0/1     CreateContainerError   0          82s
special-resource-operator-9445d58f-xfkn7   1/1     Running                0          5m48s

and in the events:

20s         Warning   Failed             pod/nvidia-driver-validation         Error: container create failed: time="2020-05-18T17:21:35Z" level=warning msg="signal: killed"
time="2020-05-18T17:21:35Z" level=error msg="container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"process_linux.go:432: running prestart hook 0 caused \\\\\\\"error running hook: signal: segmentation fault (core dumped), stdout: , stderr: \\\\\\\"\\\"\""
container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: signal: segmentation fault (core dumped), stdout: , stderr: \\\"\""
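In case it helps, this is roughly how I pulled those events (the openshift-sro namespace is an assumption based on the ConfigMap metadata further down):

oc describe pod nvidia-driver-validation -n openshift-sro
oc get events -n openshift-sro --sort-by=.lastTimestamp | grep nvidia-driver-validation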

I also checked the version of the driver that SRO installed:

oc exec nvidia-driver-daemonset-frmw5 -- nvidia-smi 
Mon May 18 17:33:27 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks

michaelcourcy commented 4 years ago

I also gave a P3 instance on AWS a chance, but the result is the same: I still get a segfault.

michaelcourcy commented 4 years ago

Digging a bit, I understand that it is actually the prestart hook installed on every NVIDIA node that is responsible for the segmentation fault:

apiVersion: v1
data:
  oci-nvidia-hook-json: |
    {
        "version": "1.0.0",
        "hook": {
            "path": "/run/nvidia/driver/usr/bin/nvidia-container-toolkit",
            "args": ["nvidia-container-runtime-hook", "prestart"],
            "env": [
                "PATH=/run/nvidia/driver/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ]
        },
        "when": {
            "always": true,
            "commands": [".*"]
        },
        "stages": ["prestart"]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2020-05-24T17:58:10Z"
  name: nvidia-driver
  namespace: openshift-sro
  ownerReferences:
  - apiVersion: sro.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: SpecialResource
    name: nvidia-gpu
    uid: 34aa2cbf-3391-4f52-a048-4cf452f163c7
  resourceVersion: "163000746"
  selfLink: /api/v1/namespaces/openshift-sro/configmaps/nvidia-driver
  uid: 1fc70991-5afd-4825-b400-dc4e145a5b4f

The segmentation fault shows up if the following env vars are in the container:

env:
- name: NVIDIA_VISIBLE_DEVICES
  value: all
- name: NVIDIA_DRIVER_CAPABILITIES
  value: "compute,utility"
- name: NVIDIA_REQUIRE_CUDA
  value: "cuda>=5.0"

I wonder if that is related to this issue; if yes, we should try to update the nvidia-container-runtime-hook executable, but it's not clear to me how to do that. I'll dig into the previous BuildConfig, 0000-state-driver-buildconfig; as I understand it, that is the one the daemonset uses to provide the hook and the executable.
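To see where that executable comes from, I'll start from the build SRO ran (the build name is a guess based on the nvidia-driver-internal-1-build pod above, and again I'm assuming the openshift-sro namespace):

oc get bc,builds -n openshift-sro
# build name guessed from the completed build pod listed earlier
oc logs build/nvidia-driver-internal-1 -n openshift-sro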