Open michaelcourcy opened 4 years ago
I just gave a P3 instance on AWS a try, but same result: I still get a segfault.
Digging a bit, I understand that it's actually the prestart hook installed on every NVIDIA node that is responsible for the segmentation fault:
```yaml
apiVersion: v1
data:
  oci-nvidia-hook-json: |
    {
      "version": "1.0.0",
      "hook": {
        "path": "/run/nvidia/driver/usr/bin/nvidia-container-toolkit",
        "args": ["nvidia-container-runtime-hook", "prestart"],
        "env": [
          "PATH=/run/nvidia/driver/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
        ]
      },
      "when": {
        "always": true,
        "commands": [".*"]
      },
      "stages": ["prestart"]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2020-05-24T17:58:10Z"
  name: nvidia-driver
  namespace: openshift-sro
  ownerReferences:
  - apiVersion: sro.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: SpecialResource
    name: nvidia-gpu
    uid: 34aa2cbf-3391-4f52-a048-4cf452f163c7
  resourceVersion: "163000746"
  selfLink: /api/v1/namespaces/openshift-sro/configmaps/nvidia-driver
  uid: 1fc70991-5afd-4825-b400-dc4e145a5b4f
```
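For what it's worth, a quick way to check that the hook file and the binary it points at actually exist on a GPU node could be something like this (the node name is a placeholder, and the CRI-O hooks directory may differ on your cluster):

```sh
# Node name is a placeholder; list your nodes first with `oc get nodes`.
# The hooks directory may be /etc/containers/oci/hooks.d or
# /usr/share/containers/oci/hooks.d depending on the CRI-O setup.
oc debug node/ip-10-0-1-23.ec2.internal -- chroot /host \
  ls -l /etc/containers/oci/hooks.d /run/nvidia/driver/usr/bin/nvidia-container-toolkit
```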
The hook segfaults if the following env vars are set in the container:
```yaml
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: all
- name: NVIDIA_DRIVER_CAPABILITIES
  value: "compute,utility"
- name: NVIDIA_REQUIRE_CUDA
  value: "cuda>=5.0"
```
I wonder if that relates to this issue. If so, we should try to update the nvidia-container-runtime-hook executable, but it's not clear to me how to do that. I'll dig into the previous BuildConfig, 0000-state-driver-buildconfig; as I understand it, it's the one the DaemonSet uses to provide the hook and the executable.
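Something along these lines should let me inspect that BuildConfig and force a rebuild (the DaemonSet name below is a guess on my part):

```sh
# Look at the BuildConfig the SRO created for the driver toolkit
oc get bc -n openshift-sro
oc describe bc 0000-state-driver-buildconfig -n openshift-sro
# Kick off a fresh build and follow its logs
oc start-build 0000-state-driver-buildconfig -n openshift-sro --follow
# DaemonSet name is a guess; find the real one with `oc get ds -n openshift-sro`
oc rollout restart ds/nvidia-driver-daemonset -n openshift-sro
```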
Hi,
I got a segfault during the nvidia driver validation phase:
[validation log output omitted]
and in the events:
[event output omitted]
I also checked the version of the driver that SRO installed:
[version output omitted]
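For context, the checks above map to commands roughly like these (pod names are placeholders):

```sh
# Logs from the driver validation pod (name is a placeholder)
oc logs -n openshift-sro nvidia-driver-validation
# Recent events in the namespace
oc get events -n openshift-sro --sort-by=.lastTimestamp
# Driver version as reported inside the driver pod (name is a placeholder)
oc exec -n openshift-sro nvidia-driver-daemonset-xxxxx -- nvidia-smi
```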
Thanks