root-project / root

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
https://root.cern

roottest running out of threads !? #16552

Open pcanal opened 1 month ago

pcanal commented 1 month ago

Description

When running ctest -j 32 on a node with 127 cores (see below for more details), one of the runs had many failures caused by running out of thread resources. The affected tests include:

347:PyMVA-Keras-Classification
348:PyMVA-Keras-Regression
349:PyMVA-Keras-Multiclass
985:tutorial-tmva-TMVA_SOFIE_Keras
1238:tutorial-tmva-RBatchGenerator_PyTorch-py
1239:tutorial-tmva-RBatchGenerator_TensorFlow-py
1247:tutorial-tmva-TMVA_SOFIE_RDataFrame-py
1252:tutorial-tmva-keras-GenerateModel-py
1253:tutorial-tmva-keras-MulticlassKeras-py
1584:roottest-root-io-evolution-make
1641:roottest-root-io-newstl-make

These tests failed, possibly along with tutorial-tmva-keras-MulticlassKeras-py, which did not run because it requires the previous test.
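
For a rough sense of scale, here is a back-of-the-envelope estimate of how many threads such a run could request. It assumes that every concurrently running test sizes one worker pool to the number of logical CPUs, which is a common default for thread pools but is not confirmed anywhere in this report, so treat it as an upper bound. The function name and the pools_per_process parameter are illustrative.

```python
def estimated_threads(parallel_jobs: int, logical_cpus: int,
                      pools_per_process: int = 1) -> int:
    """Upper-bound estimate of threads requested by all concurrent tests.

    Assumption (not from the report): each test process creates
    `pools_per_process` thread pools with one worker per logical CPU.
    """
    return parallel_jobs * logical_cpus * pools_per_process


# Values from the report: ctest -j 32 on a node with 127 CPUs.
print(estimated_threads(32, 127))  # 4064
```

If a framework creates more than one pool per process, pools_per_process scales the estimate accordingly.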

Reproducer

347/2278 Testing: PyMVA-Keras-Classification
347/2278 Test: PyMVA-Keras-Classification
Command: "/usr/bin/cmake" "-DCMD=/home/pcanal/root_working/build/quick-devel/tmva/pymva/test/testPyKerasClassification" "-DSYS=/home/pcanal/root_working/build/quick-devel" "-P" "/home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake"
Directory: /home/pcanal/root_working/build/quick-devel/tmva/pymva/test
"PyMVA-Keras-Classification" start time: Sep 24 20:01 UTC
Output:
----------------------------------------------------------
Get test data...
Generate keras model...
2024-09-24 20:01:12.572604: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 20:01:12.572668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 20:01:12.573914: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-24 20:01:12.581129: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-24 20:01:15.157134: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:549: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/pcanal/.local/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
2024-09-24 20:01:26.401521: F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0)Thread tf_numa_-1_Eigen creation via pthread_create() failed.
[ERROR] Failed to generate model using python
CMake Error at /home/pcanal/root_working/code/quick-devel/cmake/modules/RootTestDriver.cmake:232 (message):
  error code: 1

<end of output>
Test time =  54.61 sec
----------------------------------------------------------
Test Failed.
"PyMVA-Keras-Classification" end time: Sep 24 20:02 UTC
"PyMVA-Keras-Classification" time elapsed: 00:00:54

Other errors:

14323:    system_error: Resource temporarily unavailable
614356:/bin/sh: fork: retry: Resource temporarily unavailable
614357:/bin/sh: fork: retry: Resource temporarily unavailable
614358:/bin/sh: fork: retry: Resource temporarily unavailable
614359:/bin/sh: fork: retry: Resource temporarily unavailable
614360:/bin/sh: fork: Resource temporarily unavailable
614444:/bin/sh: fork: retry: Resource temporarily unavailable
614445:/bin/sh: fork: retry: Resource temporarily unavailable
614446:/bin/sh: fork: retry: Resource temporarily unavailable
614447:/bin/sh: fork: retry: Resource temporarily unavailable
616571:LLVM ERROR: pthread_create failed: Resource temporarily unavailable
616573:sh: fork: retry: Resource temporarily unavailable
616574:sh: fork: retry: Resource temporarily unavailable
616575:sh: fork: retry: Resource temporarily unavailable
616576:sh: fork: retry: Resource temporarily unavailable
616577:sh: fork: Resource temporarily unavailable
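
The "11" in the TensorFlow check failure above and the "Resource temporarily unavailable" messages look like the same error: on Linux, errno 11 is EAGAIN, which both pthread_create() and fork() return when a thread or process limit is reached. A minimal Python sketch to confirm the mapping on the affected node (the variable names are illustrative):

```python
import errno
import os

# Error code printed by the failed TensorFlow check ("11 vs. 0").
logged_code = 11

# Compare against the platform's EAGAIN value instead of hard-coding the
# answer, because the numeric value of EAGAIN differs between systems.
matches_eagain = (logged_code == errno.EAGAIN)
message = os.strerror(errno.EAGAIN)

print(f"code {logged_code} is EAGAIN on this platform: {matches_eagain}")
print(f"EAGAIN message: {message}")
```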

ROOT version

master

Installation method

hand build

Operating system

Alma9

Additional context

The node is a VM with 128 GB of RAM and is accessed via a Jupyter notebook.

jupyter-pcanal-rootdevel:quick-devel pcanal$ uname -a
Linux jupyter-pcanal-rootdevel 6.3.12-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul  6 04:05:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
CPU(s):                  127
  On-line CPU(s) list:   0-126
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7543 32-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  1
    Core(s) per socket:  1

dpiparo commented 1 month ago

Thanks for this report. These errors refer to fork: are we sure the resource we are lacking is threads and not PIDs? Is the configuration of the machine "sane", i.e. does it allow an adequate number of subprocesses per process?

pcanal commented 1 month ago

It looks okay:

$ cat /proc/sys/kernel/threads-max
7897651
$ cat /proc/sys/kernel/pid_max 
4194304
$ cat /proc/sys/vm/max_map_count
262144
jupyter-pcanal-rootdevel:quick-devel pcanal$ ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 3948825
max locked memory           (kbytes, -l) 8192
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 4194304
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited
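
The same values can be gathered in one go, which may help when comparing nodes. This is a minimal sketch that reads the three /proc entries shown above and the "max user processes" limit through Python's resource module. The cgroup entry is an assumption that is not discussed in this thread: in a containerized session, a cgroup v2 pids.max file can cap the number of tasks independently of these settings, and the path used here may not exist on every system. The function names are illustrative.

```python
import resource
from pathlib import Path


def read_first_line(path: str):
    """Return the first line of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().split("\n", 1)[0].strip()
    except OSError:
        return None


def collect_limits() -> dict:
    """Collect the kernel and per-user limits relevant to thread creation."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "threads-max": read_first_line("/proc/sys/kernel/threads-max"),
        "pid_max": read_first_line("/proc/sys/kernel/pid_max"),
        "max_map_count": read_first_line("/proc/sys/vm/max_map_count"),
        # Same limit as "max user processes" in `ulimit -a`;
        # resource.RLIM_INFINITY means unlimited.
        "RLIMIT_NPROC (soft, hard)": (soft, hard),
        # Assumption: cgroup v2 mounted at /sys/fs/cgroup (may be absent).
        "cgroup pids.max": read_first_line("/sys/fs/cgroup/pids.max"),
    }


if __name__ == "__main__":
    for name, value in collect_limits().items():
        print(f"{name}: {value}")
```

A value of None means the file could not be read on this system.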

dpiparo commented 1 month ago

Ok, I think we have at least 2 problems here. The first is related to these errors: "Unable to register cuDNN/cuFFT/cuBLAS factory: Attempting to register factory for plugin cuDNN/cuFFT/cuBLAS when one has already been registered". For those, I propose you set up your machine following the hints in this thread: https://github.com/tensorflow/tensorflow/issues/62075 (it's a bug).

As for "fork: retry: Resource temporarily unavailable", again it looks like a matter of how the node is configured. Some research turns up pages like this one, https://unix.stackexchange.com/questions/205016/fork-retry-resource-temporarily-unavailable, which hints at settings such as those in /etc/sysctl.conf.

All in all, I am inclined to consider this issue specific to the platform at hand and not to ROOT.

dpiparo commented 1 month ago

Just trying to understand whether more information is available about this issue. I would like to find out whether this is a problem with ROOT (or roottest) or with the configuration of the machine...

guitargeek commented 2 weeks ago

Hi @pcanal, can you check if the situation is better with https://github.com/root-project/root/pull/16717 merged?