Open bzizou opened 2 years ago
No errors in the logs.
Message [TAKTUK OUTPUT] bigfoot10-3: perl - init (4265): output > [job_resource_manager_cgroups][30][bigfoot10][DEBUG] Deny NVIDIA GPUs: 1
Denying the device manually in the job's cgroup does hide the GPU:
root@bigfoot10:~# echo 'c 195:1 rwm' > /dev/oar_cgroups_links/devices/oar/bzizou_31/devices.deny
Workaround: running /usr/bin/nvidia-smi -L || exit 5
from the /etc/default/oar-node
startup script fixes the problem (probably by loading the NVIDIA drivers). As a side benefit, it also checks at boot time that the NVIDIA drivers are working.
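For anyone who wants the workaround to leave a trace in the logs, here is a minimal sketch of how it could be wrapped. It assumes /etc/default/oar-node is sourced by a POSIX shell; the function name check_gpu_driver and the error message are hypothetical and not part of OAR. Only the nvidia-smi -L call and the exit code 5 come from the workaround itself.

```shell
# Hypothetical helper (not part of OAR): run "nvidia-smi -L" once so the
# NVIDIA driver is presumably loaded before the first job starts.
# $1: path to nvidia-smi (optional, defaults to the path used in the workaround)
check_gpu_driver() {
    smi="${1:-/usr/bin/nvidia-smi}"
    # Listing the GPUs is enough; the output itself is not needed here.
    if "$smi" -L >/dev/null 2>&1; then
        return 0
    fi
    echo "nvidia-smi -L failed: NVIDIA driver may not be loaded" >&2
    return 5
}

# Usage in /etc/default/oar-node (same effect as the one-liner above):
#   check_gpu_driver || exit 5
```

The optional path argument is only there so the check can be tried with a stand-in command on a machine without a GPU.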
The
Enable_devices_cg = "YES"
setting enables hiding of GPU devices that are not reserved by the current job. However, the feature doesn't seem to work for the first job right after a reboot of the node; subsequent jobs are fine. Tested on Debian 9.13 nodes with V100 and A100 GPUs, rebooted several times: the problem is reproducible.