Closed: relyt0925 closed this issue 3 years ago.
Another user is seeing the GPU Operator fail, if it matters: https://github.com/NVIDIA/gpu-operator/issues/55
I also tried the ubi8 image; it goes into CrashLoopBackOff on my RHEL 7 workers and does not work:
Tylers-MBP:release tylerlisowski$ kubectl logs -n nvidia-gpu nvidia-gpu-runtime-enablement-nwfz2
chcon: failed to change context of '/run/nvidia/driver/dev/core' to 'system_u:object_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/fd' to 'system_u:system_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/stderr' to 'system_u:system_r:container_file_t:s0': Permission denied
chcon: failed to change context of '/run/nvidia/driver/dev/stdout' to 'system_u:system_r:container_file_t:s0': Permission denied
+ shopt -s lastpipe
+++ realpath /work/run.sh
++ dirname /work/run.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ DAEMON=0
+ '[' 1 -eq 0 ']'
+ main /run/nvidia
+ local -r destination=/run/nvidia
+ shift
+ RUNTIME=crio
+ TOOLKIT_ARGS='--symlink /usr/local/nvidia'
+ RUNTIME_ARGS=
++ getopt -l no-daemon,toolkit-args:,runtime:,runtime-args: -o nt:r:u: --
+ options=' --'
+ [[ 0 -ne 0 ]]
+ eval set -- ' --'
++ set -- --
+ for opt in ${options}
+ case "${opt}" in
+ shift
+ break
+ ensure::oneof docker crio
+ echo crio
++ cat -
+ local -r val=crio
+ for match in "$@"
+ [[ crio == \d\o\c\k\e\r ]]
+ for match in "$@"
+ [[ crio == \c\r\i\o ]]
+ return 0
+ _init
+ log INFO _init
+ local -r level=INFO
+ shift
+ local -r message=_init
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' _init
[INFO] _init
+ exec
+ flock -n 3
+ echo 1545397
+ trap _shutdown EXIT
+ log INFO '=================Starting the NVIDIA Container Toolkit================='
+ local -r level=INFO
+ shift
+ local -r 'message==================Starting the NVIDIA Container Toolkit================='
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' '=================Starting the NVIDIA Container Toolkit================='
[INFO] =================Starting the NVIDIA Container Toolkit=================
+ toolkit /run/nvidia --symlink /usr/local/nvidia
+ shopt -s lastpipe
+++ realpath /work/toolkit
++ dirname /work/toolkit.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ packages=("/usr/bin/nvidia-container-runtime" "/usr/bin/nvidia-container-toolkit" "/usr/bin/nvidia-container-cli" "/etc/nvidia-container-runtime/config.toml" "/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1")
+ '[' 3 -eq 0 ']'
+ toolkit::install /run/nvidia --symlink /usr/local/nvidia
+ local destination=/run/nvidia/toolkit
+ shift
+ [[ 2 -ne 0 ]]
+ toolkit::usage
+ cat
Usage: /work/toolkit COMMAND [ARG...]
Commands:
install DESTINATION
+ exit 1
+ _shutdown
+ log INFO _shutdown
+ local -r level=INFO
+ shift
+ local -r message=_shutdown
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' _shutdown
[INFO] _shutdown
+ rm -f /run/nvidia/toolkit.pid
Tylers-MBP:release tylerlisowski$ kubectl get pods -n nvidia-gpu
NAME                                      READY   STATUS                  RESTARTS   AGE
nvidia-gpu-device-plugin-d468x            0/1     Init:CrashLoopBackOff   269        2d8h
nvidia-gpu-driver-build-1-build           0/1     Completed               0          7m9s
nvidia-gpu-driver-container-rhel7-dc8j8   1/1     Running                 0          5m12s
nvidia-gpu-runtime-enablement-nwfz2       0/1     Error                   2          25s
Tylers-MBP:release tylerlisowski$ kubectl get pods -n nvidia-gpu
Tylers-MBP:release tylerlisowski$ kubectl get pods -n nvidia-gpu nvidia-gpu-runtime-enablement-nwfz2 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
cni.projectcalico.org/podIP: 172.30.229.142/32
cni.projectcalico.org/podIPs: 172.30.229.142/32
k8s.v1.cni.cncf.io/networks-status: |-
[{
"name": "k8s-pod-network",
"ips": [
"172.30.229.142"
],
"dns": {}
}]
openshift.io/scc: privileged
scheduler.alpha.kubernetes.io/critical-pod: ""
creationTimestamp: "2020-05-01T01:44:24Z"
generateName: nvidia-gpu-runtime-enablement-
labels:
app: nvidia-gpu-runtime-enablement
controller-revision-hash: 85cfb5fb99
pod-template-generation: "4"
name: nvidia-gpu-runtime-enablement-nwfz2
namespace: nvidia-gpu
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: nvidia-gpu-runtime-enablement
uid: 3adc65f8-db4f-49ed-831b-b3cf28cbee6f
resourceVersion: "35877790"
selfLink: /api/v1/namespaces/nvidia-gpu/pods/nvidia-gpu-runtime-enablement-nwfz2
uid: db23f6da-622d-43a4-b24d-5ac4e9f208f8
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- 10.177.155.99
containers:
- command:
- /bin/entrypoint.sh
env:
- name: TOOLKIT_ARGS
value: --symlink /usr/local/nvidia
- name: RUNTIME_ARGS
- name: RUNTIME
value: crio
image: nvidia/container-toolkit:1.0.0-beta.1-ubi8
imagePullPolicy: IfNotPresent
name: nvidia-gpu-runtime-enablement-ctr
resources: {}
securityContext:
privileged: true
seLinuxOptions:
level: s0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /bin/entrypoint.sh
name: entrypoint
readOnly: true
subPath: entrypoint.sh
- mountPath: /var/run/docker.sock
name: docker-socket
- mountPath: /run/nvidia
mountPropagation: Bidirectional
name: nvidia-install-path
- mountPath: /etc/docker
name: docker-config
- mountPath: /usr/local/nvidia
name: nvidia-local
- mountPath: /usr/share/containers/oci/hooks.d
name: crio-hooks
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: nvidia-gpu-runtime-enablement-token-brtk9
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostPID: true
imagePullSecrets:
- name: nvidia-gpu-runtime-enablement-dockercfg-kxc9d
initContainers:
- command:
- /bin/entrypoint.sh
image: quay.io/openshift-psap/ubi8-kmod
imagePullPolicy: Always
name: specialresource-driver-validation-nvidia-gpu
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /bin/entrypoint.sh
name: init-entrypoint
readOnly: true
subPath: entrypoint.sh
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: nvidia-gpu-runtime-enablement-token-brtk9
readOnly: true
nodeName: 10.177.155.99
nodeSelector:
feature.node.kubernetes.io/pci-10de.present: "true"
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: nvidia-gpu-runtime-enablement
serviceAccountName: nvidia-gpu-runtime-enablement
terminationGracePeriodSeconds: 30
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
volumes:
- configMap:
defaultMode: 448
name: nvidia-gpu-runtime-enablement-entrypoint
name: entrypoint
- configMap:
defaultMode: 448
name: nvidia-gpu-runtime-enablement-init-entrypoint
name: init-entrypoint
- hostPath:
path: /var/run/docker.sock
type: ""
name: docker-socket
- hostPath:
path: /run/nvidia
type: ""
name: nvidia-install-path
- hostPath:
path: /etc/docker
type: ""
name: docker-config
- hostPath:
path: /usr/local/nvidia
type: ""
name: nvidia-local
- hostPath:
path: /etc/containers/oci/hooks.d
type: ""
name: crio-hooks
- name: nvidia-gpu-runtime-enablement-token-brtk9
secret:
defaultMode: 420
secretName: nvidia-gpu-runtime-enablement-token-brtk9
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2020-05-01T01:44:27Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2020-05-01T01:44:24Z"
message: 'containers with unready status: [nvidia-gpu-runtime-enablement-ctr]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2020-05-01T01:44:24Z"
message: 'containers with unready status: [nvidia-gpu-runtime-enablement-ctr]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2020-05-01T01:44:24Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: cri-o://b44b4cd97b9e938450a4506547aba5d32ef73ddfad1d2b623420a9ed1fdecc8b
image: docker.io/nvidia/container-toolkit:1.0.0-beta.1-ubi8
imageID: docker.io/nvidia/container-toolkit@sha256:4f610626ace87b2a94531f036d8068fcd4fd471061bd43862123cd34d9e3bab0
lastState:
terminated:
containerID: cri-o://b44b4cd97b9e938450a4506547aba5d32ef73ddfad1d2b623420a9ed1fdecc8b
exitCode: 1
finishedAt: "2020-05-01T01:47:28Z"
reason: Error
startedAt: "2020-05-01T01:47:28Z"
name: nvidia-gpu-runtime-enablement-ctr
ready: false
restartCount: 5
started: false
state:
waiting:
message: back-off 2m40s restarting failed container=nvidia-gpu-runtime-enablement-ctr
pod=nvidia-gpu-runtime-enablement-nwfz2_nvidia-gpu(db23f6da-622d-43a4-b24d-5ac4e9f208f8)
reason: CrashLoopBackOff
hostIP: 10.177.155.99
initContainerStatuses:
- containerID: cri-o://59a706380937ae2c9151481a72dec720932bf240c056c46d8e5b5ad79c5f72c7
image: quay.io/openshift-psap/ubi8-kmod:latest
imageID: quay.io/openshift-psap/ubi8-kmod@sha256:3f64911fc7ffcbd5e1dce2b831e0c13cc3ceece8b662c97538c5b9f368b638ea
lastState: {}
name: specialresource-driver-validation-nvidia-gpu
ready: true
restartCount: 0
state:
terminated:
containerID: cri-o://59a706380937ae2c9151481a72dec720932bf240c056c46d8e5b5ad79c5f72c7
exitCode: 0
finishedAt: "2020-05-01T01:44:27Z"
reason: Completed
startedAt: "2020-05-01T01:44:27Z"
phase: Running
podIP: 172.30.229.142
podIPs:
- ip: 172.30.229.142
qosClass: BestEffort
startTime: "2020-05-01T01:44:24Z"
Tylers-MBP:release tylerlisowski$
Note that the runtime enablement pod is using the nvidia/container-toolkit:1.0.0-beta.1-ubi8 container.
The same errors are seen with nvidia/container-toolkit:1.0.2-ubi8.
OK, pushed a new version; please update to master. @relyt0925 PTAL
Are the same issues happening in CoreOS (RHEL 8) with OCP 4.3?
Thanks @zvonkok, that looks to have gotten past that issue!
Currently I see two things (I worked around the first one).
The container installs the OCI hook at
/etc/containers/oci/hooks.d/oci-nvidia-hook.json
instead of
/usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
I fixed that by simply copying the installed hook with
cp /etc/containers/oci/hooks.d/oci-nvidia-hook.json /usr/share/containers/oci/hooks.d/
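As a sanity check, a small sketch to confirm which hook directories CRI-O is actually configured to scan (it assumes CRI-O's config lives at the usual /etc/crio/crio.conf; the drop-in directory may not exist on every version):
# Show any explicit hooks_dir setting; if nothing is set, CRI-O falls back to its built-in defaults
grep -R "hooks_dir" /etc/crio/crio.conf /etc/crio/crio.conf.d/ 2>/dev/null
# See which of the two candidate directories currently contains the NVIDIA hook
ls -l /etc/containers/oci/hooks.d/ /usr/share/containers/oci/hooks.d/ 2>/dev/null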
However, I now see the validator init container on the device plugin fail with:
Tylers-MBP:openshift4-bm-2-bol8bfp20g31gc9rli70-admin tylerlisowski$ kubectl logs -n nvidia-gpu nvidia-gpu-device-plugin-zfx2n -c specialresource-runtime-validation-nvidia-gpu
checkCudaErrors() Driver API error = 0003 "CUDA_ERROR_NOT_INITIALIZED" from file <../../Common/helper_cuda_drvapi.h>, line 229.
Tylers-MBP:openshift4-bm-2-bol8bfp20g31gc9rli70-admin tylerlisowski$ kubectl get pods -n nvidia-gpu
NAME                             READY   STATUS       RESTARTS   AGE
nvidia-gpu-device-plugin-zfx2n   0/1     Init:Error   0          5s
Changing the security context of the init container to
securityContext:
  privileged: true
looks to have fixed my problem.
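For reference, a minimal sketch of applying that change with kubectl patch (the DaemonSet name and the init container index are inferred from the pod names and logs above and may differ; the operator may also reconcile the patch away):
kubectl -n nvidia-gpu patch daemonset nvidia-gpu-device-plugin --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/initContainers/0/securityContext", "value": {"privileged": true}}]'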
However, I cannot schedule a GPU pod unless I also give it a privileged security context:
initContainers:
- env:
- name: SKIP_P2P
value: "True"
image: registry.ng.bluemix.net/armada-master/gpu-cuda-tests-8-0:b9a418cfcd1e21b68d47ca1104a44ec078c3e12b
imagePullPolicy: IfNotPresent
name: gpu-cuda-tests-8-0
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
securityContext:
privileged: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-b69vv
readOnly: true
^ That, for example, will work.
initContainers:
- env:
- name: SKIP_P2P
value: "True"
image: registry.ng.bluemix.net/armada-master/gpu-cuda-tests-8-0:b9a418cfcd1e21b68d47ca1104a44ec078c3e12b
imagePullPolicy: IfNotPresent
name: gpu-cuda-tests-8-0
resources:
limits:
nvidia.com/gpu: "1"
requests:
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-b69vv
readOnly: true
That one will fail to initialize CUDA.
So far, one workaround for me: installing the hook at the proper place on my worker with
cp /etc/containers/oci/hooks.d/oci-nvidia-hook.json /usr/share/containers/oci/hooks.d/
The remaining failure afterwards: code can only run on GPUs if the container has a privileged security context. I had to make the device plugin's init container privileged so it started up properly, and do the same for any GPU containers I scheduled afterwards.
It looks potentially related to this issue: https://github.com/NVIDIA/nvidia-container-runtime/issues/42
I'm trying to add that policy locally to see if it changes anything.
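To narrow down whether this is an SELinux labeling problem rather than a missing capability, a small check sketch (the device paths assume the driver-container layout used above and may not all exist):
# Unprivileged containers are generally only allowed to use device nodes labeled container_file_t;
# anything else showing up here points at a labeling/policy gap rather than a CUDA problem
ls -laZ /run/nvidia/driver/dev/nvidia* 2>/dev/null
ls -laZ /dev/nvidia* 2>/dev/null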
Found the other workaround I needed: https://github.com/NVIDIA/gpu-operator/issues/32 ^ I don't believe that issue has been fixed yet, at least for ubi8 containers.
@zvonkok should the container be installing the hook at /etc/containers/oci/hooks.d/oci-nvidia-hook.json
in OpenShift 4? I thought the hook path was /usr/share/containers/oci/hooks.d/?
Once that is fixed and NVIDIA publishes a new ubi8 container, this should be usable for RHEL 7 + CoreOS.
The fix for #32 should have landed long ago; weird. Need to look into it.
The right path is /etc/containers/oci/hooks.d/oci-nvidia-hook.json,
not the /usr/ one, because /etc/ is writable on RHEL 7, RHEL 8, and RH CoreOS.
OK, we will adjust our hooks path to point there.
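If CRI-O on the node ever needs to be pointed at the writable location explicitly, one way to do it is sketched below (the drop-in directory is an assumption about the node's CRI-O version; on older versions, set hooks_dir directly in /etc/crio/crio.conf instead):
# Add the writable /etc hook directory to CRI-O's search path and restart it
cat <<'EOF' > /etc/crio/crio.conf.d/99-nvidia-hooks.conf
[crio.runtime]
hooks_dir = ["/etc/containers/oci/hooks.d", "/usr/share/containers/oci/hooks.d"]
EOF
systemctl restart crio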
@relyt0925 No, this does not need to be installed; that is for OCP 3.11. Here we are dealing solely with the container_file_t context in OCP 4.
Tylers-MBP:special-resource-operator tylerlisowski$ oc -n nvidia-gpu logs nvidia-gpu-runtime-enablement-9c8dc
chcon: failed to change context of '/run/nvidia/driver/dev/core' to 'system_u:object_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/fd' to 'system_u:system_r:container_file_t:s0': Operation not supported
chcon: failed to change context of '/run/nvidia/driver/dev/stderr' to 'system_u:system_r:container_file_t:s0': Permission denied
chcon: failed to change context of '/run/nvidia/driver/dev/stdout' to 'system_u:system_r:container_file_t:s0': Permission denied
+ shopt -s lastpipe
+++ realpath /work/nvidia-toolkit
++ dirname /work/run.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ DAEMON=0
+ '[' 1 -eq 0 ']'
+ main /usr/local/nvidia
+ local -r destination=/usr/local/nvidia
+ shift
+ RUNTIME=crio
+ TOOLKIT_ARGS=
+ RUNTIME_ARGS=
++ getopt -l no-daemon,toolkit-args:,runtime:,runtime-args: -o nt:r:u: --
+ options=' --'
+ [[ 0 -ne 0 ]]
+ eval set -- ' --'
++ set -- --
+ for opt in ${options}
+ case "${opt}" in
+ shift
+ break
+ ensure::oneof docker crio
+ echo crio
++ cat -
+ local -r val=crio
+ for match in "$@"
+ [[ crio == \d\o\c\k\e\r ]]
+ for match in "$@"
+ [[ crio == \c\r\i\o ]]
+ return 0
+ _init
+ log INFO _init
+ local -r level=INFO
+ shift
+ local -r message=_init
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' _init
[INFO] _init
+ exec
+ flock -n 3
+ echo 356078
+ trap _shutdown EXIT
+ log INFO '=================Starting the NVIDIA Container Toolkit================='
+ local -r level=INFO
+ shift
+ local -r 'message==================Starting the NVIDIA Container Toolkit================='
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' '=================Starting the NVIDIA Container Toolkit================='
[INFO] =================Starting the NVIDIA Container Toolkit=================
+ toolkit /usr/local/nvidia
+ shopt -s lastpipe
+++ realpath /work/toolkit
++ dirname /work/toolkit.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ packages=("/usr/bin/nvidia-container-runtime" "/usr/bin/nvidia-container-toolkit" "/usr/bin/nvidia-container-cli" "/etc/nvidia-container-runtime/config.toml")
+ '[' 1 -eq 0 ']'
+ toolkit::install /usr/local/nvidia
+ local destination=/usr/local/nvidia/toolkit
+ shift
+ [[ 0 -ne 0 ]]
+ toolkit::remove /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::remove /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::remove /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::remove /usr/local/nvidia/toolkit'
[INFO] toolkit::remove /usr/local/nvidia/toolkit
+ rm -rf /usr/local/nvidia/toolkit
+ log INFO 'toolkit::install '
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::install '
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::install '
[INFO] toolkit::install
+ '[' -e /etc/debian_version ']'
+ packages+=("/usr/lib64/libnvidia-container.so.1")
+ toolkit::install::packages /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ mkdir -p /usr/local/nvidia/toolkit
+ mkdir -p /usr/local/nvidia/toolkit/.config/nvidia-container-runtime
+ (( i=0 ))
+ (( i < 5 ))
++ readlink -f /usr/bin/nvidia-container-runtime
+ packages[$i]=/usr/bin/nvidia-container-runtime
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /usr/bin/nvidia-container-toolkit
+ packages[$i]=/usr/bin/nvidia-container-toolkit
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /usr/bin/nvidia-container-cli
+ packages[$i]=/usr/bin/nvidia-container-cli
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /etc/nvidia-container-runtime/config.toml
+ packages[$i]=/etc/nvidia-container-runtime/config.toml
+ (( i++ ))
+ (( i < 5 ))
++ readlink -f /usr/lib64/libnvidia-container.so.1
+ packages[$i]=/usr/lib64/libnvidia-container.so.1.0.7
+ (( i++ ))
+ (( i < 5 ))
+ cp /usr/bin/nvidia-container-runtime /usr/bin/nvidia-container-toolkit /usr/bin/nvidia-container-cli /etc/nvidia-container-runtime/config.toml /usr/lib64/libnvidia-container.so.1.0.7 /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/config.toml /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/
+ toolkit::setup::config /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ local -r config_path=/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ log INFO 'toolkit::setup::config /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::config /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::config /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::config /usr/local/nvidia/toolkit
+ sed -i 's/^#root/root/;' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ sed -i 's@/run/nvidia/driver@/run/nvidia/driver@;' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ sed -i 's;@/sbin/ldconfig.real;@/run/nvidia/driver/sbin/ldconfig.real;' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
+ toolkit::setup::cli_binary /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::setup::cli_binary /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::cli_binary /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::cli_binary /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::cli_binary /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/nvidia-container-cli /usr/local/nvidia/toolkit/nvidia-container-cli.real
+ tr -s ' \t'
+ cat
+ chmod +x /usr/local/nvidia/toolkit/nvidia-container-cli
+ toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::toolkit_binary /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/nvidia-container-toolkit /usr/local/nvidia/toolkit/nvidia-container-toolkit.real
+ tr -s ' \t'
+ cat
+ chmod +x /usr/local/nvidia/toolkit/nvidia-container-toolkit
+ toolkit::setup::runtime_binary /usr/local/nvidia/toolkit
+ local -r destination=/usr/local/nvidia/toolkit
+ log INFO 'toolkit::setup::runtime_binary /usr/local/nvidia/toolkit'
+ local -r level=INFO
+ shift
+ local -r 'message=toolkit::setup::runtime_binary /usr/local/nvidia/toolkit'
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' 'toolkit::setup::runtime_binary /usr/local/nvidia/toolkit'
[INFO] toolkit::setup::runtime_binary /usr/local/nvidia/toolkit
+ mv /usr/local/nvidia/toolkit/nvidia-container-runtime /usr/local/nvidia/toolkit/nvidia-container-runtime.real
+ tr -s ' \t'
+ cat
+ chmod +x /usr/local/nvidia/toolkit/nvidia-container-runtime
+ cd /usr/local/nvidia/toolkit
+ ln -s ./nvidia-container-toolkit /usr/local/nvidia/toolkit/nvidia-container-runtime-hook
+ ln -s ./libnvidia-container.so.1.0.7 /usr/local/nvidia/toolkit/libnvidia-container.so.1
+ cd -
/work
+ crio setup /usr/local/nvidia
+ shopt -s lastpipe
+++ realpath /work/crio
++ dirname /work/crio.sh
+ readonly basedir=/work
+ basedir=/work
+ source /work/common.sh
++ readonly RUN_DIR=/run/nvidia
++ RUN_DIR=/run/nvidia
++ readonly LOCAL_DIR=/usr/local/nvidia
++ LOCAL_DIR=/usr/local/nvidia
++ readonly TOOLKIT_DIR=/run/nvidia/toolkit
++ TOOLKIT_DIR=/run/nvidia/toolkit
++ readonly PID_FILE=/run/nvidia/toolkit.pid
++ PID_FILE=/run/nvidia/toolkit.pid
++ readonly CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ CRIO_HOOKS_DIR=/usr/share/containers/oci/hooks.d
++ readonly CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ CRIO_HOOK_FILENAME=oci-nvidia-hook.json
++ '[' -t 2 ']'
++ readonly LOG_NO_TTY=1
++ LOG_NO_TTY=1
++ '[' 0 -eq 1 ']'
+ '[' 2 -eq 0 ']'
+ command=setup
+ shift
+ case "${command}" in
+ crio::setup /usr/local/nvidia
+ '[' 1 -eq 0 ']'
+ local hooksd=/usr/share/containers/oci/hooks.d
+ local ensure=TRUE
+ local -r destination=/usr/local/nvidia/toolkit
+ shift
++ getopt -l hooks-dir:,no-check -o d:c --
+ options=' --'
+ [[ 0 -ne 0 ]]
+ eval set -- ' --'
++ set -- --
+ for opt in ${options}
+ case "${opt}" in
+ shift
+ break
+ [[ TRUE = \T\R\U\E ]]
+ ensure::mounted /usr/share/containers/oci/hooks.d
+ local -r directory=/usr/share/containers/oci/hooks.d
+ grep -q /usr/share/containers/oci/hooks.d
+ mount
+ [[ /usr/local/nvidia/toolkit == *\#* ]]
+ mkdir -p /usr/share/containers/oci/hooks.d
+ cp /work/oci-nvidia-hook.json /usr/share/containers/oci/hooks.d
+ sed -i s#@DESTINATION@#/usr/local/nvidia/toolkit# /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
+ [[ 0 -ne 0 ]]
+ log INFO '=================Done, Now Waiting for signal================='
+ local -r level=INFO
+ shift
+ local -r 'message==================Done, Now Waiting for signal================='
+ local fmt_on=
+ local -r fmt_off=
+ case "${level}" in
+ fmt_on=
+ printf '%s[%s]%s %b\n' '' INFO '' '=================Done, Now Waiting for signal================='
[INFO] =================Done, Now Waiting for signal=================
+ trap 'echo '\''Caught signal'\''; _shutdown; crio cleanup /usr/local/nvidia; { kill 356127; exit 0; }' HUP INT QUIT PIPE TERM
+ trap - EXIT
+ true
+ sleep infinity
+ wait 356127
Tylers-MBP:special-resource-operator tylerlisowski$
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 195, 254 May 6 15:30 nvidia-modeset
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 195, 0 May 6 15:30 nvidia0
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 195, 1 May 6 15:30 nvidia1
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 195, 2 May 6 15:30 nvidia2
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 195, 3 May 6 15:30 nvidia3
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 195, 255 May 6 15:30 nvidiactl
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 10, 144 May 6 15:28 nvram
crw-rw-rw-. 1 root root system_u:object_r:container_file_t:s0 1, 12 May 6 15:28 oldmem
plugin_dir = "/var/lib/cni/bin"
[root@test-bol8b9220mt1momn6k90-openshift4b-gpu-00000262 hooks.d]# ls -laZ /etc/containers/oci/hooks.d/
drwxr-xr-x. root root system_u:object_r:etc_t:s0 .
drwxr-xr-x. root root system_u:object_r:etc_t:s0 ..
-rw-r--r--. root root system_u:object_r:etc_t:s0 oci-nvidia-hook.json
[root@test-bol8b9220mt1momn6k90-openshift4b-gpu-00000262 hooks.d]#
@zvonkok after making the changes we discussed (pre-creating the container hooks dir before starting CRI-O), I still see the initial rollout fail to deploy with the following error:
crw-rw-rw-. root root system_u:object_r:container_runtime_tmpfs_t:s0 nvidia-uvm
crw-rw-rw-. root root system_u:object_r:container_runtime_tmpfs_t:s0 nvidia-uvm-tools
crw-rw-rw-. root root system_u:object_r:container_file_t:s0 nvidia0
crw-rw-rw-. root root system_u:object_r:container_file_t:s0 nvidia1
crw-rw-rw-. root root system_u:object_r:container_file_t:s0 nvidia2
crw-rw-rw-. root root system_u:object_r:container_file_t:s0 nvidia3
Note how nvidia-uvm and nvidia-uvm-tools remain container_runtime_tmpfs_t while everything else is container_file_t. Changing the context on them solves the problem.
This happened in two separate deploys of GPU nodes I did from scratch, and it is reproducible just by doing a fresh rollout of the components.
If you exec in and run the chcon command, everything proceeds to roll out.
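For reference, a sketch of the manual relabel I run to unblock the rollout (the device paths assume the /run/nvidia/driver/dev layout from the earlier logs):
# Relabel the two stragglers so unprivileged containers can use them, then verify
chcon -t container_file_t /run/nvidia/driver/dev/nvidia-uvm /run/nvidia/driver/dev/nvidia-uvm-tools
ls -laZ /run/nvidia/driver/dev/nvidia-uvm*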
There appears to be some conflicting documentation around the official support for GPUs in the RHEL 7 operating system. There are various docs that point to this being at a GA level of support:
https://access.redhat.com/solutions/4908611
https://www.openshift.com/blog/creating-a-gpu-enabled-node-with-openshift-4-2-in-amazon-ec2 ^ This one goes as far as using the Node Feature Detector on GPU workers but does not include the steps for deploying the GPU Operator described here, and it does not result in a fully functioning GPU environment.
However, when I go through the steps of deploying NFD and the Special Resource Operator through the installed operator add-ons, it breaks all future pod creation on the node, since the GLIBC version in the nvidia container toolkit image appears not to match what RHEL 7 expects.
I was also unable to find any documentation on how to build that specifically for RHEL 7, so I opened the upstream NVIDIA issue: https://github.com/NVIDIA/gpu-operator/issues/58
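A quick way to see the mismatch (a sketch; it assumes podman is available on the worker and that the image tag matches the one used above):
ldd --version | head -n1   # RHEL 7 hosts ship glibc 2.17
podman run --rm --entrypoint /bin/sh nvidia/container-toolkit:1.0.2-ubi8 \
  -c 'ldd --version | head -n1'   # the ubi8-based image carries a newer glibc (2.28)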
What I'm trying to clarify is the current support status of RHEL 7 GPU workers in OpenShift 4.3. Is that currently at a GA level of support? And if it's not supported, are there plans to support it in a future release of OpenShift 4.3?
If it is supported: are there special additional steps I need to take in order to get it working properly?
cc @zvonkok, as this relates to our emails, but I wanted to open this issue as a central place for information.