Open pcanal opened 1 month ago
Thanks for this report. These errors refer to fork: are we sure the resource we are lacking is threads and not PIDs? Is the configuration of the machine "sane", i.e. does it allow an adequate number of subprocesses per process?
It looks okay:
$ cat /proc/sys/kernel/threads-max
7897651
$ cat /proc/sys/kernel/pid_max
4194304
$ cat /proc/sys/vm/max_map_count
262144
jupyter-pcanal-rootdevel:quick-devel pcanal$ ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3948825
max locked memory (kbytes, -l) 8192
max memory size (kbytes, -m) unlimited
open files (-n) 1048576
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4194304
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
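The values above cover the kernel-wide and per-user limits. Since the node is reached through Jupyter, one more place that might be worth checking is the cgroup "pids" controller: according to the fork(2) man page, reaching its pids.max limit also produces EAGAIN ("Resource temporarily unavailable"), and it does not show up in ulimit -a. The following is only a sketch, assuming Linux with cgroup v2 mounted at /sys/fs/cgroup; whether such a limit is actually in effect on this node is not known, and read_limit is a helper name made up for this example.

```shell
#!/bin/sh
# Sketch: print limits that can make fork() fail with EAGAIN, including the
# cgroup pids limit. Assumes Linux; the cgroup v2 mount point is an assumption.

# read_limit FILE: print the first line of FILE, or "n/a" if it is unreadable.
read_limit() {
    if [ -r "$1" ]; then
        head -n 1 "$1"
    else
        echo "n/a"
    fi
}

# cgroup v2 only: the unified hierarchy appears as the "0::<path>" line.
# Note that an ancestor cgroup may impose a tighter limit than this one.
cg_path=$(sed -n 's/^0:://p' /proc/self/cgroup 2>/dev/null)
cg_dir="/sys/fs/cgroup${cg_path}"

echo "kernel.threads-max:  $(read_limit /proc/sys/kernel/threads-max)"
echo "kernel.pid_max:      $(read_limit /proc/sys/kernel/pid_max)"
echo "cgroup pids.max:     $(read_limit "$cg_dir/pids.max")"
echo "cgroup pids.current: $(read_limit "$cg_dir/pids.current")"

# Threads currently owned by this user (procps ps; one line per thread).
echo "threads for $(id -un): $(ps -L -u "$(id -un)" -o lwp= 2>/dev/null | wc -l)"
```

If pids.max prints a number rather than "max" or "n/a", comparing it with pids.current during a test run would show whether that limit is the one being hit.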
OK, I think we have at least two problems here. The first is related to these errors: "Unable to register cuDNN/cuFFT/cuBLAS factory: Attempting to register factory for plugin cuDNN/cuFFT/cuBLAS when one has already been registered".
For those, I propose you set up your machine following the hints in this thread: https://github.com/tensorflow/tensorflow/issues/62075 (it's a bug).
As for "fork: retry: Resource temporarily unavailable", again it looks like a configuration matter relative to the node. Some research turns up pages like this one, https://unix.stackexchange.com/questions/205016/fork-retry-resource-temporarily-unavailable, which hint at settings like those in /etc/sysctl.conf.
All in all, I am inclined to consider this item related to the platform at hand and not to ROOT.
Just trying to understand whether more information is available about this item. I would like to find out whether this is an issue of ROOT(test) or the configuration of the machine...
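One way to gather that information might be to record how many threads exist on the node while the tests run, and compare the peak with the limits listed earlier. This is a minimal sketch, assuming Linux with /proc available; count_threads and sample_threads are hypothetical helper names, and the log file names and the one-second interval are arbitrary choices.

```shell
#!/bin/sh
# Sketch: sample the number of threads visible in /proc while a command runs.
# Assumes Linux /proc; inside a container only its own processes are counted.

# count_threads: sum the "Threads:" field over every readable /proc/<pid>/status.
count_threads() {
    cat /proc/[0-9]*/status 2>/dev/null |
        awk '/^Threads:/ { n += $2 } END { print n + 0 }'
}

# sample_threads PID LOG: append "<epoch seconds> <thread count>" to LOG once
# per second for as long as process PID is alive.
sample_threads() {
    while kill -0 "$1" 2>/dev/null; do
        echo "$(date +%s) $(count_threads)" >> "$2"
        sleep 1
    done
}

# Usage, from the build directory:
#   ctest -j 32 > ctest.log 2>&1 &
#   sample_threads $! threads.log
```

A peak far below the configured limits would point away from the node configuration; a peak close to one of them would point towards it.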
Hi @pcanal, can you check if the situation is better with https://github.com/root-project/root/pull/16717 merged?
Description
When running with
ctest -j 32
on a node with 127 cores (see below for more details), one of the runs had many failures due to running out of thread resources. The list of affected tests includes those (and possibly
tutorial-tmva-keras-MulticlassKeras-py
which did not run because it requires the previous test).
Reproducer
Other errors:
ROOT version
master
Installation method
hand build
Operating system
Alma9
Additional context
The node is a VM with 128 GB of RAM and is accessed via a Jupyter notebook.