oar-team / oar

OAR is a versatile resource and task manager (also called a batch scheduler) for clusters and other computing infrastructures.
http://oar.imag.fr/
GNU General Public License v2.0

job_resource_manager_cgroups: Nvidia devices not hidden for the first job after boot #193

Open bzizou opened 2 years ago

bzizou commented 2 years ago

Setting Enable_devices_cg = "YES" hides the GPU devices that are not reserved by the current job. However, the feature does not seem to work for the first job right after a node reboot; subsequent jobs are fine. Tested on Debian 9.13 nodes with V100 and A100 GPUs, rebooted several times: the problem is reproducible.

bzizou commented 2 years ago

No errors in the logs. The relevant message is:

[TAKTUK OUTPUT] bigfoot10-3: perl - init (4265): output > [job_resource_manager_cgroups][30][bigfoot10][DEBUG] Deny NVIDIA GPUs: 1

Doing a manual configuration hides the GPU:

root@bigfoot10:~# echo 'c 195:1 rwm' > /dev/oar_cgroups_links/devices/oar/bzizou_31/devices.deny
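
For reference, the rule written above can be built from the GPU number. The following is only a sketch: it assumes the usual NVIDIA convention that /dev/nvidiaN is a character device with major 195 and minor N, and the helper names nvidia_deny_rule and deny_nvidia_gpu are hypothetical, not part of OAR.

```shell
# Assumption: the NVIDIA driver uses character-device major 195,
# and /dev/nvidiaN has minor number N.
NVIDIA_MAJOR=195

# Hypothetical helper: print the cgroup v1 devices rule for GPU number $1.
nvidia_deny_rule() {
    printf 'c %s:%s rwm\n' "$NVIDIA_MAJOR" "$1"
}

# Hypothetical helper: write the rule for GPU $1 into the devices.deny file $2.
deny_nvidia_gpu() {
    nvidia_deny_rule "$1" > "$2"
}
```

With these definitions, deny_nvidia_gpu 1 /dev/oar_cgroups_links/devices/oar/bzizou_31/devices.deny would be equivalent to the manual command above.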

bzizou commented 2 years ago

Workaround: running /usr/bin/nvidia-smi -L || exit 5 from the /etc/default/oar-node startup script fixes the problem (probably by loading the NVIDIA drivers). As a side benefit, it also checks at boot time that the NVIDIA drivers are working.
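
A slightly more verbose form of the same workaround could look like the sketch below. It is only an illustration: the function name check_nvidia_gpus and the overridable NVIDIA_SMI variable are hypothetical, and the comment about driver initialisation repeats the guess made above rather than a confirmed cause.

```shell
# Path to nvidia-smi; overridable, which is an assumption added for testing.
NVIDIA_SMI="${NVIDIA_SMI:-/usr/bin/nvidia-smi}"

# Hypothetical helper for /etc/default/oar-node.
# Listing the GPUs presumably makes the NVIDIA driver initialise before
# the first job starts; it also detects a broken driver at boot time.
check_nvidia_gpus() {
    if "$NVIDIA_SMI" -L >/dev/null 2>&1; then
        return 0
    fi
    echo "nvidia-smi -L failed: NVIDIA driver not ready" >&2
    return 5
}
```

In /etc/default/oar-node it would then be called as check_nvidia_gpus || exit 5, which keeps the exit code used in the one-line workaround.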