openshift-psap / special-resource-operator-deprecated

Support for RHEL 7 Operating System for GPUs in Openshift 4.3 #22

Closed relyt0925 closed 3 years ago

relyt0925 commented 4 years ago

There appears to be some conflicting documentation around the official support for GPUs on the RHEL 7 operating system. Various docs point to this being at a GA level of support:

https://access.redhat.com/solutions/4908611

With the general availability of the NVIDIA driver container and NVIDIA GPU operator, NVIDIA GPUs are now enabled on OpenShift 4 on both RHEL CoreOS and RHEL 7 worker nodes.

https://www.openshift.com/blog/creating-a-gpu-enabled-node-with-openshift-4-2-in-amazon-ec2 ^ This one goes as far as using the Node Feature Discovery operator on GPU workers but does not include the steps for deploying the GPU Operator, and it does not result in a fully functioning GPU environment.

However, when I go through the steps of deploying NFD and the Special Resource Operator through the Installed Operators add-ons, it breaks all future pod creation on the node, since it appears the GLIBC version required by the nvidia toolkit container is newer than what RHEL 7 provides:

Events:

  Type     Reason     Age        From                    Message

  ----     ------     ----       ----                    -------

  Normal   Scheduled  <unknown>  default-scheduler       Successfully assigned nvidia-gpu/nvidia-gpu-device-plugin-qdfmn to 10.177.155.99

  Warning  Failed     7m55s      kubelet, 10.177.155.99  Error: container create failed: time="2020-04-27T19:53:10-05:00" level=warning msg="signal: killed"

time="2020-04-27T19:53:10-05:00" level=error msg="container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"process_linux.go:432: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: /run/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /run/nvidia/toolkit/libnvidia-container.so.1)\\\\\\\\n\\\\\\\"\\\"\""

container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: /run/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /run/nvidia/toolkit/libnvidia-container.so.1)\\\\n\\\"\""

  Warning  Failed  7m54s  kubelet, 10.177.155.99  Error: container create failed: time="2020-04-27T19:53:11-05:00" level=warning msg="signal: killed"

time="2020-04-27T19:53:11-05:00" level=error msg="container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"process_linux.go:432: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: /run/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /run/nvidia/toolkit/libnvidia-container.so.1)\\\\\\\\n\\\\\\\"\\\"\""
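
For reference, one quick way to confirm the mismatch on a RHEL 7 worker (just a diagnostic sketch, with the path taken from the error above) is to compare the glibc the node ships with against the symbol versions the toolkit library requires:

# glibc shipped by the worker (RHEL 7 ships glibc 2.17)
rpm -q glibc
# GLIBC symbol versions required by the library the toolkit container installed (needs binutils on the node)
objdump -T /run/nvidia/toolkit/libnvidia-container.so.1 | grep -o 'GLIBC_[0-9.]*' | sort -u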

I was also unable to find any documentation on how to build that specifically for RHEL 7, so I opened an upstream NVIDIA issue: https://github.com/NVIDIA/gpu-operator/issues/58

What I'm trying to clarify is the current support for RHEL 7 GPUs in OpenShift 4.3. Is that currently at a GA level of support? And if it's not supported, are there plans to support it in a future release of OpenShift 4.3?

If it is supported: are there some additional steps I need to take in order to get it working properly?

cc @zvonkok as this relates to our emails, but I wanted to open this issue as a central place for information.

relyt0925 commented 4 years ago

Another user with the GPU Operator failing, in case it matters: https://github.com/NVIDIA/gpu-operator/issues/55

relyt0925 commented 4 years ago

I also tried the ubi8 image; it goes into CrashLoopBackOff on my RHEL 7 workers and does not work.

Tylers-MBP:release tylerlisowski$ kubectl logs  -n nvidia-gpu nvidia-gpu-runtime-enablement-nwfz2 
chcon: failed to change context of '/run/nvidia/driver/dev/core' to 'system_u:object_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/fd' to 'system_u:system_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/stderr' to 'system_u:system_r:container_file_t:s0': Permission denied
chcon: failed to change context of '/run/nvidia/driver/dev/stdout' to 'system_u:system_r:container_file_t:s0': Permission denied
+ shopt -s lastpipe
+++ realpath /work/run.sh
++ dirname /work/run.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ DAEMON=0
+ '[' 1 -eq 0 ']'
+ main /run/nvidia
+ local -r destination=/run/nvidia
+ shift
+ RUNTIME=crio
+ TOOLKIT_ARGS='--symlink /usr/local/nvidia'
+ RUNTIME_ARGS=
++ getopt -l no-daemon,toolkit-args:,runtime:,runtime-args: -o nt:r:u: --
+ options=' --'
+ [[ 0 -ne 0 ]]
+ eval set -- ' --'
++ set -- --
+ for opt in ${options}
+ case "${opt}" in
+ shift
+ break
+ ensure::oneof docker crio
+ echo crio
++ cat -
+ local -r val=crio
+ for match in "$@"
+ [[ crio == \d\o\c\k\e\r ]]
+ for match in "$@"
+ [[ crio == \c\r\i\o ]]
+ return 0
+ _init
+ log INFO _init
+ local -r level=INFO
+ shift
+ local -r message=_init
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' _init
[INFO] _init
+ exec
+ flock -n 3
+ echo 1545397
+ trap _shutdown EXIT
+ log INFO '=================Starting the NVIDIA Container Toolkit================='
+ local -r level=INFO
+ shift
+ local -r 'message==================Starting the NVIDIA Container Toolkit================='
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' '=================Starting the NVIDIA Container Toolkit================='
[INFO] =================Starting the NVIDIA Container Toolkit=================
+ toolkit /run/nvidia --symlink /usr/local/nvidia
+ shopt -s lastpipe
+++ realpath /work/toolkit
++ dirname /work/toolkit.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ packages=("/usr/bin/nvidia-container-runtime" "/usr/bin/nvidia-container-toolkit" "/usr/bin/nvidia-container-cli" "/etc/nvidia-container-runtime/config.toml" "/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1")
+ '[' 3 -eq 0 ']'
+ toolkit::install /run/nvidia --symlink /usr/local/nvidia
+ local destination=/run/nvidia/toolkit
+ shift
+ [[ 2 -ne 0 ]]
+ toolkit::usage
+ cat
Usage: /work/toolkit COMMAND [ARG...]

Commands:
  install DESTINATION
+ exit 1
+ _shutdown
+ log INFO _shutdown
+ local -r level=INFO
+ shift
+ local -r message=_shutdown
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' _shutdown
[INFO] _shutdown
+ rm -f /run/nvidia/toolkit.pid
Tylers-MBP:release tylerlisowski$ kubectl get pods  -n nvidia-gpu 
NAME                                      READY   STATUS                  RESTARTS   AGE
nvidia-gpu-device-plugin-d468x            0/1     Init:CrashLoopBackOff   269        2d8h
nvidia-gpu-driver-build-1-build           0/1     Completed               0          7m9s
nvidia-gpu-driver-container-rhel7-dc8j8   1/1     Running                 0          5m12s
nvidia-gpu-runtime-enablement-nwfz2       0/1     Error                   2          25s
Tylers-MBP:release tylerlisowski$ kubectl get pods  -n nvidia-gpu 
Tylers-MBP:release tylerlisowski$ kubectl get pods  -n nvidia-gpu  nvidia-gpu-runtime-enablement-nwfz2 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 172.30.229.142/32
    cni.projectcalico.org/podIPs: 172.30.229.142/32
    k8s.v1.cni.cncf.io/networks-status: |-
      [{
          "name": "k8s-pod-network",
          "ips": [
              "172.30.229.142"
          ],
          "dns": {}
      }]
    openshift.io/scc: privileged
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: "2020-05-01T01:44:24Z"
  generateName: nvidia-gpu-runtime-enablement-
  labels:
    app: nvidia-gpu-runtime-enablement
    controller-revision-hash: 85cfb5fb99
    pod-template-generation: "4"
  name: nvidia-gpu-runtime-enablement-nwfz2
  namespace: nvidia-gpu
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: nvidia-gpu-runtime-enablement
    uid: 3adc65f8-db4f-49ed-831b-b3cf28cbee6f
  resourceVersion: "35877790"
  selfLink: /api/v1/namespaces/nvidia-gpu/pods/nvidia-gpu-runtime-enablement-nwfz2
  uid: db23f6da-622d-43a4-b24d-5ac4e9f208f8
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - 10.177.155.99
  containers:
  - command:
    - /bin/entrypoint.sh
    env:
    - name: TOOLKIT_ARGS
      value: --symlink /usr/local/nvidia
    - name: RUNTIME_ARGS
    - name: RUNTIME
      value: crio
    image: nvidia/container-toolkit:1.0.0-beta.1-ubi8
    imagePullPolicy: IfNotPresent
    name: nvidia-gpu-runtime-enablement-ctr
    resources: {}
    securityContext:
      privileged: true
      seLinuxOptions:
        level: s0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /bin/entrypoint.sh
      name: entrypoint
      readOnly: true
      subPath: entrypoint.sh
    - mountPath: /var/run/docker.sock
      name: docker-socket
    - mountPath: /run/nvidia
      mountPropagation: Bidirectional
      name: nvidia-install-path
    - mountPath: /etc/docker
      name: docker-config
    - mountPath: /usr/local/nvidia
      name: nvidia-local
    - mountPath: /usr/share/containers/oci/hooks.d
      name: crio-hooks
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: nvidia-gpu-runtime-enablement-token-brtk9
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostPID: true
  imagePullSecrets:
  - name: nvidia-gpu-runtime-enablement-dockercfg-kxc9d
  initContainers:
  - command:
    - /bin/entrypoint.sh
    image: quay.io/openshift-psap/ubi8-kmod
    imagePullPolicy: Always
    name: specialresource-driver-validation-nvidia-gpu
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /bin/entrypoint.sh
      name: init-entrypoint
      readOnly: true
      subPath: entrypoint.sh
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: nvidia-gpu-runtime-enablement-token-brtk9
      readOnly: true
  nodeName: 10.177.155.99
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: nvidia-gpu-runtime-enablement
  serviceAccountName: nvidia-gpu-runtime-enablement
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - configMap:
      defaultMode: 448
      name: nvidia-gpu-runtime-enablement-entrypoint
    name: entrypoint
  - configMap:
      defaultMode: 448
      name: nvidia-gpu-runtime-enablement-init-entrypoint
    name: init-entrypoint
  - hostPath:
      path: /var/run/docker.sock
      type: ""
    name: docker-socket
  - hostPath:
      path: /run/nvidia
      type: ""
    name: nvidia-install-path
  - hostPath:
      path: /etc/docker
      type: ""
    name: docker-config
  - hostPath:
      path: /usr/local/nvidia
      type: ""
    name: nvidia-local
  - hostPath:
      path: /etc/containers/oci/hooks.d
      type: ""
    name: crio-hooks
  - name: nvidia-gpu-runtime-enablement-token-brtk9
    secret:
      defaultMode: 420
      secretName: nvidia-gpu-runtime-enablement-token-brtk9
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-05-01T01:44:27Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-05-01T01:44:24Z"
    message: 'containers with unready status: [nvidia-gpu-runtime-enablement-ctr]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-05-01T01:44:24Z"
    message: 'containers with unready status: [nvidia-gpu-runtime-enablement-ctr]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-05-01T01:44:24Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://b44b4cd97b9e938450a4506547aba5d32ef73ddfad1d2b623420a9ed1fdecc8b
    image: docker.io/nvidia/container-toolkit:1.0.0-beta.1-ubi8
    imageID: docker.io/nvidia/container-toolkit@sha256:4f610626ace87b2a94531f036d8068fcd4fd471061bd43862123cd34d9e3bab0
    lastState:
      terminated:
        containerID: cri-o://b44b4cd97b9e938450a4506547aba5d32ef73ddfad1d2b623420a9ed1fdecc8b
        exitCode: 1
        finishedAt: "2020-05-01T01:47:28Z"
        reason: Error
        startedAt: "2020-05-01T01:47:28Z"
    name: nvidia-gpu-runtime-enablement-ctr
    ready: false
    restartCount: 5
    started: false
    state:
      waiting:
        message: back-off 2m40s restarting failed container=nvidia-gpu-runtime-enablement-ctr
          pod=nvidia-gpu-runtime-enablement-nwfz2_nvidia-gpu(db23f6da-622d-43a4-b24d-5ac4e9f208f8)
        reason: CrashLoopBackOff
  hostIP: 10.177.155.99
  initContainerStatuses:
  - containerID: cri-o://59a706380937ae2c9151481a72dec720932bf240c056c46d8e5b5ad79c5f72c7
    image: quay.io/openshift-psap/ubi8-kmod:latest
    imageID: quay.io/openshift-psap/ubi8-kmod@sha256:3f64911fc7ffcbd5e1dce2b831e0c13cc3ceece8b662c97538c5b9f368b638ea
    lastState: {}
    name: specialresource-driver-validation-nvidia-gpu
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: cri-o://59a706380937ae2c9151481a72dec720932bf240c056c46d8e5b5ad79c5f72c7
        exitCode: 0
        finishedAt: "2020-05-01T01:44:27Z"
        reason: Completed
        startedAt: "2020-05-01T01:44:27Z"
  phase: Running
  podIP: 172.30.229.142
  podIPs:
  - ip: 172.30.229.142
  qosClass: BestEffort
  startTime: "2020-05-01T01:44:24Z"
Tylers-MBP:release tylerlisowski$ 
relyt0925 commented 4 years ago

Note the runtime is using the nvidia/container-toolkit:1.0.0-beta.1-ubi8 container.

relyt0925 commented 4 years ago

The same errors are seen with nvidia/container-toolkit:1.0.2-ubi8.

zvonkok commented 4 years ago

OK, pushed a new version, please update to master. @relyt0925 PTAL

tweeje commented 4 years ago

Are the same issues happening on RHEL CoreOS (RHEL 8) with OCP 4.3?

relyt0925 commented 4 years ago

Thanks @zvonkok, that looks to have gotten past that issue!!

Currently I see two things (the first of which I worked around).

The container installs the oci hook at

/etc/containers/oci/hooks.d/oci-nvidia-hook.json

Instead of

/usr/share/containers/oci/hooks.d/oci-nvidia-hook.json

Fixed that by simply copying the installed hook with

cp /etc/containers/oci/hooks.d/oci-nvidia-hook.json /usr/share/containers/oci/hooks.d/
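
As a sanity check, this is roughly how I verified which hook directory CRI-O actually scans on the worker (the config location is the RHEL 7 default and may differ):

# hook directories configured for CRI-O on the worker
grep -r hooks_dir /etc/crio/ 2>/dev/null
# compare the two candidate directories and their SELinux labels
ls -lZ /etc/containers/oci/hooks.d/ /usr/share/containers/oci/hooks.d/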

However, now I see the validator init container on the device plugin fail with

Tylers-MBP:openshift4-bm-2-bol8bfp20g31gc9rli70-admin tylerlisowski$ kubectl logs -n nvidia-gpu nvidia-gpu-device-plugin-zfx2n -c specialresource-runtime-validation-nvidia-gpu
checkCudaErrors() Driver API error = 0003 "CUDA_ERROR_NOT_INITIALIZED" from file <../../Common/helper_cuda_drvapi.h>, line 229.
Tylers-MBP:openshift4-bm-2-bol8bfp20g31gc9rli70-admin tylerlisowski$ kubectl get pods -n nvidia-gpu
NAME                                         READY   STATUS       RESTARTS   AGE
nvidia-gpu-device-plugin-zfx2n               0/1     Init:Error   0          5s
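
In case it helps narrow this down, the quick check I run on the worker itself (a diagnostic sketch, not from the operator docs) is to look at the GPU device nodes and their SELinux labels, since CUDA_ERROR_NOT_INITIALIZED usually means the process cannot open them:

# device nodes created by the driver container, as seen from the host
ls -lZ /run/nvidia/driver/dev/nvidia* 2>/dev/null
# device nodes on the host itself
ls -lZ /dev/nvidia* 2>/dev/null
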
relyt0925 commented 4 years ago

Changing the security context of the init container to be

    securityContext:
      privileged: true

Looks to have fixed my problem
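
For anyone else hitting this, I applied it with a patch roughly like the one below; the DaemonSet name and init-container name are taken from my cluster above and may differ, and the operator may reconcile the change back:

# make the device plugin's validation init container privileged (strategic merge patch, merged by container name)
kubectl -n nvidia-gpu patch daemonset nvidia-gpu-device-plugin --type=strategic -p '
spec:
  template:
    spec:
      initContainers:
      - name: specialresource-runtime-validation-nvidia-gpu
        securityContext:
          privileged: true
'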

relyt0925 commented 4 years ago

However, I cannot schedule a GPU pod unless I also give it a privileged security context.

relyt0925 commented 4 years ago
  initContainers:
  - env:
    - name: SKIP_P2P
      value: "True"
    image: registry.ng.bluemix.net/armada-master/gpu-cuda-tests-8-0:b9a418cfcd1e21b68d47ca1104a44ec078c3e12b
    imagePullPolicy: IfNotPresent
    name: gpu-cuda-tests-8-0
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-b69vv
      readOnly: true

^ That for example will work

relyt0925 commented 4 years ago
  initContainers:
  - env:
    - name: SKIP_P2P
      value: "True"
    image: registry.ng.bluemix.net/armada-master/gpu-cuda-tests-8-0:b9a418cfcd1e21b68d47ca1104a44ec078c3e12b
    imagePullPolicy: IfNotPresent
    name: gpu-cuda-tests-8-0
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-b69vv
      readOnly: true

That will fail to initialize CUDA.

relyt0925 commented 4 years ago

So far, one workaround for me: installing the hook in the proper place on my worker with

cp /etc/containers/oci/hooks.d/oci-nvidia-hook.json /usr/share/containers/oci/hooks.d/

Current failure afterwards: code can only run on the GPUs if the container has a privileged security context. I had to make both the device plugin's init container privileged (so it started up properly) and any GPU containers I scheduled afterwards.

relyt0925 commented 4 years ago

It looks like it is potentially related to this issue: https://github.com/NVIDIA/nvidia-container-runtime/issues/42

Trying to add that policy locally to see if it changes anything.
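
For the record, the generic way to build and load such a local policy from the AVC denials on the worker is the standard audit2allow workflow below (requires policycoreutils-python on RHEL 7); this is not the exact policy from that issue, and the module name is arbitrary:

# turn recent AVC denials into a local policy module; review nvidia-local.te before loading
ausearch -m avc -ts recent | audit2allow -M nvidia-local
# load the generated module
semodule -i nvidia-local.pp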

relyt0925 commented 4 years ago

Found the other workaround I needed: https://github.com/NVIDIA/gpu-operator/issues/32 ^ I don't believe this issue has been fixed yet, at least for the ubi8 containers.

@zvonkok should the container be installing the hook at /etc/containers/oci/hooks.d/oci-nvidia-hook.json in OpenShift 4? I thought the hook path was /usr/share/containers/oci/hooks.d/?

Once that is fixed and NVIDIA publishes a new ubi8 container, this should be usable for RHEL 7 + RHEL CoreOS.

zvonkok commented 4 years ago

The fix for #32 should have landed long ago, which is weird; I need to look into it. The right path is /etc/containers/oci/hooks.d/oci-nvidia-hook.json, not the /usr/ one, because /etc/ is writable on RHEL 7, RHEL 8, and RHEL CoreOS.

relyt0925 commented 4 years ago

OK, we will adjust our hooks path to point there.

zvonkok commented 4 years ago

@relyt0925 No, this does not need to be installed; that is for OCP 3.11. Here we are dealing solely with the container_file_t context in OCP 4.

relyt0925 commented 4 years ago
Tylers-MBP:special-resource-operator tylerlisowski$ oc -n nvidia-gpu logs nvidia-gpu-runtime-enablement-9c8dc
chcon: failed to change context of '/run/nvidia/driver/dev/core' to 'system_u:object_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/fd' to 'system_u:system_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/stderr' to 'system_u:system_r:container_file_t:s0': Permission denied
chcon: failed to change context of '/run/nvidia/driver/dev/stdout' to 'system_u:system_r:container_file_t:s0': Permission denied
+ shopt -s lastpipe
+++ realpath /work/nvidia-toolkit
++ dirname /work/run.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ DAEMON=0
+ '[' 1 -eq 0 ']'
+ main /usr/local/nvidia
+ local -r destination=/usr/local/nvidia
+ shift
+ RUNTIME=crio
+ TOOLKIT_ARGS=
+ RUNTIME_ARGS=
++ getopt -l no-daemon,toolkit-args:,runtime:,runtime-args: -o nt:r:u: --
+ options=' --'
+ [[ 0 -ne 0 ]]
+ eval set -- ' --'
++ set -- --
+ for opt in ${options}
+ case "${opt}" in
+ shift
+ break
+ ensure::oneof docker crio
+ echo crio
++ cat -
+ local -r val=crio
+ for match in "$@"
+ [[ crio == \d\o\c\k\e\r ]]
+ for match in "$@"
+ [[ crio == \c\r\i\o ]]
+ return 0
+ _init
+ log INFO _init
+ local -r level=INFO
+ shift
+ local -r message=_init
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' _init
[INFO] _init
+ exec
+ flock -n 3
+ echo 356078
+ trap _shutdown EXIT
+ log INFO '=================Starting the NVIDIA Container Toolkit================='
+ local -r level=INFO
+ shift
+ local -r 'message==================Starting the NVIDIA Container Toolkit================='
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' '=================Starting the NVIDIA Container Toolkit================='
[INFO] =================Starting the NVIDIA Container Toolkit=================
+ toolkit /usr/local/nvidia
+ shopt -s lastpipe
+++ realpath /work/toolkit
++ dirname /work/toolkit.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ packages=("/usr/bin/nvidia-container-runtime" "/usr/bin/nvidia-container-toolkit" "/usr/bin/nvidia-container-cli" "/etc/nvidia-container-runtime/config.toml")
+ '[' 1 -eq 0 ']'
+ toolkit::install /usr/local/nvidia
+ local destination=/usr/local/nvidia/toolkit
+ shift
+ [[ 0 -ne 0 ]]
+ toolkit::remove /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::remove /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::remove /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::remove /usr/local/nvidia/toolkit'
[INFO] toolkit::remove /usr/local/nvidia/toolkit
+ rm -rf /usr/local/nvidia/toolkit
+ log INFO 'toolkit::install '
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::install '
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::install '
[INFO] toolkit::install 
+ '[' -e /etc/debian_version ']'
+ packages+=("/usr/lib64/libnvidia-container.so.1")
+ toolkit::install::packages /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ mkdir -p /usr/local/nvidia/toolkit
+ mkdir -p /usr/local/nvidia/toolkit/.config/nvidia-container-runtime
+ (( i=0 ))
+ (( i < 5 ))
++ readlink -f /usr/bin/nvidia-container-runtime
+ packages[$i]=/usr/bin/nvidia-container-runtime
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /usr/bin/nvidia-container-toolkit
+ packages[$i]=/usr/bin/nvidia-container-toolkit
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /usr/bin/nvidia-container-cli
+ packages[$i]=/usr/bin/nvidia-container-cli
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /etc/nvidia-container-runtime/config.toml
+ packages[$i]=/etc/nvidia-container-runtime/config.toml
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /usr/lib64/libnvidia-container.so.1
+ packages[$i]=/usr/lib64/libnvidia-container.so.1.0.7
+ (( i++ ))
+ (( i < 5 ))
+ cp /usr/bin/nvidia-container-runtime /usr/bin/nvidia-container-toolkit /usr/bin/nvidia-container-cli /etc/nvidia-container-runtime/config.toml /usr/lib64/libnvidia-container.so.1.0.7 /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/config.toml /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/
+ toolkit::setup::config /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ local -r config_path=/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ log INFO 'toolkit::setup::config /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::config /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::config /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::config /usr/local/nvidia/toolkit
+ sed -i 's/^#root/root/;' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ sed -i 's@/run/nvidia/driver@/run/nvidia/driver@;' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ sed -i 's;@/sbin/ldconfig.real;@/run/nvidia/driver/sbin/ldconfig.real;' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ toolkit::setup::cli_binary /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::setup::cli_binary /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::cli_binary /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::cli_binary /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::cli_binary /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/nvidia-container-cli /usr/local/nvidia/toolkit/nvidia-container-cli.real
+ tr -s ' \t'
+ cat
+ chmod +x /usr/local/nvidia/toolkit/nvidia-container-cli
+ toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/nvidia-container-toolkit /usr/local/nvidia/toolkit/nvidia-container-toolkit.real
+ tr -s ' \t'
+ cat
+ chmod +x /usr/local/nvidia/toolkit/nvidia-container-toolkit
+ toolkit::setup::runtime_binary /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::setup::runtime_binary /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::runtime_binary /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::runtime_binary /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::runtime_binary /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/nvidia-container-runtime /usr/local/nvidia/toolkit/nvidia-container-runtime.real
+ tr -s ' \t'
+ cat
+ chmod +x /usr/local/nvidia/toolkit/nvidia-container-runtime
+ cd /usr/local/nvidia/toolkit
+ ln -s ./nvidia-container-toolkit /usr/local/nvidia/toolkit/nvidia-container-runtime-hook
+ ln -s ./libnvidia-container.so.1.0.7 /usr/local/nvidia/toolkit/libnvidia-container.so.1
+ cd -
/work
+ crio setup /usr/local/nvidia
+ shopt -s lastpipe
+++ realpath /work/crio
++ dirname /work/crio.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ '[' 2 -eq 0 ']'
+ command=setup
+ shift
+ case "${command}" in
+ crio::setup /usr/local/nvidia
+ '[' 1 -eq 0 ']'
+ local hooksd=/usr/share/containers/oci/hooks.d
+ local ensure=TRUE
+ local -r destination=/usr/local/nvidia/toolkit
+ shift
++ getopt -l hooks-dir:,no-check -o d:c --
+ options=' --'
+ [[ 0 -ne 0 ]]
+ eval set -- ' --'
++ set -- --
+ for opt in ${options}
+ case "${opt}" in
+ shift
+ break
+ [[ TRUE = \T\R\U\E ]]
+ ensure::mounted /usr/share/containers/oci/hooks.d
+ local -r directory=/usr/share/containers/oci/hooks.d
+ grep -q /usr/share/containers/oci/hooks.d
+ mount
+ [[ /usr/local/nvidia/toolkit == *\#* ]]
+ mkdir -p /usr/share/containers/oci/hooks.d
+ cp /work/oci-nvidia-hook.json /usr/share/containers/oci/hooks.d
+ sed -i s#@DESTINATION@#/usr/local/nvidia/toolkit# /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
+ [[ 0 -ne 0 ]]
+ log INFO '=================Done, Now Waiting for signal================='
+ local -r level=INFO
+ shift
+ local -r 'message==================Done, Now Waiting for signal================='
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' '=================Done, Now Waiting for signal================='
[INFO] =================Done, Now Waiting for signal=================
+ trap 'echo '\''Caught signal'\'';         _shutdown;      crio cleanup /usr/local/nvidia;         { kill 356127; exit 0; }' HUP INT QUIT PIPE TERM
+ trap - EXIT
+ true
+ sleep infinity
+ wait 356127
Tylers-MBP:special-resource-operator tylerlisowski$ 
relyt0925 commented 4 years ago
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0          195, 254 May  6 15:30 nvidia-modeset
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0          195,   0 May  6 15:30 nvidia0
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0          195,   1 May  6 15:30 nvidia1
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0          195,   2 May  6 15:30 nvidia2
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0          195,   3 May  6 15:30 nvidia3
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0          195, 255 May  6 15:30 nvidiactl
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0           10, 144 May  6 15:28 nvram
crw-rw-rw-.  1 root root    system_u:object_r:container_file_t:s0            1,  12 May  6 15:28 oldmem
relyt0925 commented 4 years ago
plugin_dir = "/var/lib/cni/bin"[root@test-bol8b9220mt1momn6k90-openshift4b-gpu-00000262 hooks.d]# ls -laZ /etc/containers/oci/hooks.d/
drwxr-xr-x. root root system_u:object_r:etc_t:s0       .
drwxr-xr-x. root root system_u:object_r:etc_t:s0       ..
-rw-r--r--. root root system_u:object_r:etc_t:s0       oci-nvidia-hook.json
[root@test-bol8b9220mt1momn6k90-openshift4b-gpu-00000262 hooks.d]# 
relyt0925 commented 4 years ago

@zvonkok after making the changes we discussed (pre-creating the container hooks dir before starting CRI-O), I still see the initial rollout fail with the following error:

crw-rw-rw-. root root    system_u:object_r:container_runtime_tmpfs_t:s0 nvidia-uvm
crw-rw-rw-. root root    system_u:object_r:container_runtime_tmpfs_t:s0 nvidia-uvm-tools
crw-rw-rw-. root root    system_u:object_r:container_file_t:s0 nvidia0
crw-rw-rw-. root root    system_u:object_r:container_file_t:s0 nvidia1
crw-rw-rw-. root root    system_u:object_r:container_file_t:s0 nvidia2
crw-rw-rw-. root root    system_u:object_r:container_file_t:s0 nvidia3

Note how nvidia-uvm and nvidia-uvm-tools remain container_runtime_tmpfs_t while everything else is container_file_t. Changing the context on them solves the problem.
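
Side note: the "pre-create the hooks dir" change from the start of this comment is just the following, run on the worker before CRI-O starts:

# pre-create the CRI-O hooks directory so it exists before CRI-O comes up
mkdir -p /etc/containers/oci/hooks.d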

relyt0925 commented 4 years ago

This happened in two separate GPU node deployments I did from scratch, and it is reproducible just by doing a fresh rollout of the components.

If you exec in and run the chcon command, everything proceeds to roll out.
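
For completeness, the chcon I run inside the driver container is roughly the following; it assumes the device nodes live under /run/nvidia/driver/dev as in the earlier output:

# relabel the two device nodes that stay container_runtime_tmpfs_t after a fresh rollout
chcon -t container_file_t /run/nvidia/driver/dev/nvidia-uvm /run/nvidia/driver/dev/nvidia-uvm-tools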