idiap / pkwrap

A pytorch wrapper for LF-MMI training and parallel training in Kaldi

all CUDA-capable devices are busy or unavailable #10

Closed pchampio closed 3 years ago

pchampio commented 3 years ago

Upon training, pkwrap fails at around iteration 25. The node I'm training on has 2 GTX 1080Ti GPUs. Do I need to use a node with more GPUs?

$ local/chain/train.py --stage 4 --conf configs/tdnnf_e2e
pkwrap INFO: Reading config
pkwrap INFO: Initializing model
pkwrap INFO: Starting training from stage=0
pkwrap INFO: 2021-02-19 01:38:19 Running iter=0 of 150 with 2 jobs and lr=0.002000
[...]
pkwrap INFO: 2021-02-19 02:09:12 Running iter=24 of 150 with 2 jobs and lr=0.001313
pkwrap INFO: 2021-02-19 02:10:28 Running iter=25 of 150 with 3 jobs and lr=0.001935
run.pl: job failed, log is in exp/chain/e2e_tdnnf/log/train.25.2.log
# local/chain/e2e/tuning/tdnnf.py --dir exp/chain/e2e_tdnnf --mode training --lr 0.0019348400313112875 --frame-shift 0 --egs ark:exp/chain/e2e_tdnnf/egs/cegs.17.ark --l2-regularize-factor 0.3333333333333333 --minibatch-size 16 --new-model exp/chain/e2e_tdnnf/25.2.pt exp/chain/e2e_tdnnf/25.pt 
# Started at Fri Feb 19 02:10:28 CET 2021
#
WARNING ([5.5.888~1-d619]:SelectGpuId():cu-device.cc:197) Will try again to get a GPU after 20 seconds.
Fri Feb 19 02:10:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:03:00.0 Off |                  N/A |
| 38%   63C    P2   251W / 250W |  11021MiB / 11178MiB |     52%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:82:00.0 Off |                  N/A |
| 35%   60C    P2   186W / 250W |  10741MiB / 11178MiB |     92%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5315      C   python3                         11017MiB |
|    1   N/A  N/A      5306      C   python3                         10737MiB |
+-----------------------------------------------------------------------------+
LOG ([5.5.888~1-d619]:SelectGpuId():cu-device.cc:206) num-gpus=2. Device 0: all CUDA-capable devices are busy or unavailable.  Device 1: all CUDA-capable devices are busy or unavailable.  
ERROR ([5.5.888~1-d619]:SelectGpuId():cu-device.cc:207) Failed to create CUDA context, no more unused GPUs? 

[ Stack-Trace: ]
/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/pkwrap/egs/librispeech/v1/../../../../kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x7a4) [0x7f4dbb9019d6]
/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/pkwrap/egs/librispeech/v1/../../../../kaldi/src/lib/libkaldi-base.so(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x7f4dbb90328d]
/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/pkwrap/egs/librispeech/v1/../../../../kaldi/src/lib/libkaldi-cudamatrix.so(kaldi::CuDevice::SelectGpuId(std::string)+0x850) [0x7f4dba4f65dc]
/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/venv/lib/python3.8/site-packages/pkwrap-0.2.27.2-py3.8-linux-x86_64.egg/_pkwrap.cpython-38-x86_64-linux-gnu.so(InstantiateKaldiCuda()+0x63) [0x7f4dbb936f23]
/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/venv/lib/python3.8/site-packages/pkwrap-0.2.27.2-py3.8-linux-x86_64.egg/_pkwrap.cpython-38-x86_64-linux-gnu.so(+0x3086a) [0x7f4dbb93986a]
/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/venv/lib/python3.8/site-packages/pkwrap-0.2.27.2-py3.8-linux-x86_64.egg/_pkwrap.cpython-38-x86_64-linux-gnu.so(+0x43424) [0x7f4dbb94c424]
python3(PyCFunction_Call+0x56) [0x561dcfccaf76]
python3(_PyObject_MakeTpCall+0x22f) [0x561dcfc8885f]
python3(_PyEval_EvalFrameDefault+0x4596) [0x561dcfd0ff56]
python3(+0x18bc0b) [0x561dcfcd6c0b]
python3(+0x10077f) [0x561dcfc4b77f]
python3(+0x18bc0b) [0x561dcfcd6c0b]
python3(+0x10077f) [0x561dcfc4b77f]
python3(_PyEval_EvalCodeWithName+0x659) [0x561dcfcd5e19]
python3(_PyFunction_Vectorcall+0x1e3) [0x561dcfcd6943]
python3(_PyObject_FastCallDict+0x24b) [0x561dcfcd74cb]
python3(_PyObject_Call_Prepend+0x63) [0x561dcfcd7733]
python3(+0x18c8ca) [0x561dcfcd78ca]
python3(_PyObject_MakeTpCall+0x1a4) [0x561dcfc887d4]
python3(_PyEval_EvalFrameDefault+0x11d0) [0x561dcfd0cb90]
python3(_PyEval_EvalCodeWithName+0x2d2) [0x561dcfcd5a92]
python3(PyEval_EvalCodeEx+0x44) [0x561dcfcd6754]
python3(PyEval_EvalCode+0x1c) [0x561dcfd64edc]
python3(+0x219f84) [0x561dcfd64f84]
python3(+0x24c1f4) [0x561dcfd971f4]
python3(PyRun_FileExFlags+0xa1) [0x561dcfc5f6e1]
python3(PyRun_SimpleFileExFlags+0x3b4) [0x561dcfc5fac6]
python3(+0x11598b) [0x561dcfc6098b]
python3(Py_BytesMain+0x39) [0x561dcfd99d19]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f4e836c509b]
python3(+0x1dee93) [0x561dcfd29e93]

Traceback (most recent call last):
  File "local/chain/e2e/tuning/tdnnf.py", line 139, in <module>
    ChainE2EModel(Net, cmd_line=True)
  File "/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/venv/lib/python3.8/site-packages/pkwrap-0.2.27.2-py3.8-linux-x86_64.egg/pkwrap/chain.py", line 422, in __init__
    self.call_by_mode()
  File "/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/venv/lib/python3.8/site-packages/pkwrap-0.2.27.2-py3.8-linux-x86_64.egg/pkwrap/chain.py", line 447, in call_by_mode
    self.train()
  File "/srv/storage/talc@talc-data.nancy/multispeech/calcul/users/pchampion/lab/pkwrap/venv/lib/python3.8/site-packages/pkwrap-0.2.27.2-py3.8-linux-x86_64.egg/pkwrap/chain.py", line 744, in train
    kaldi.InstantiateKaldiCuda()
RuntimeError: kaldi::KaldiFatalError
# Accounting: time=21 threads=1
# Ended (code 1) at Fri Feb 19 02:10:49 CET 2021, elapsed time 21 seconds
pchampio commented 3 years ago

BTW, I've set nvidia-smi -c 3. Is this correct?

pchampio commented 3 years ago

Oh, it's because configs/tdnnf_e2e specifies:

num_jobs_initial = 2
num_jobs_final = 5

(But again, I'd like to know whether nvidia-smi -c 3 is correct.) At iteration 25 the schedule ramps up to 3 jobs, but this node only has 2 GPUs. Should I expect similar results even though I'm running on a node with fewer GPUs? Is there a way to 'fake' being on multiple GPUs (a wait mode)?

mrsrikanth commented 3 years ago

Hi,

Maybe you can just set num_jobs_final to 2? That way the training will never use more GPUs than are present.
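
For example (just a sketch, keeping the key names from the configs/tdnnf_e2e snippet above):

num_jobs_initial = 2
num_jobs_final = 2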

I don't use nvidia-smi -c 3, but I think that depends on the hardware setup.
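
If it helps, a quick way to inspect and change the compute mode (just a sketch; -c 3 means EXCLUSIVE_PROCESS, i.e. only one CUDA context per GPU, and changing it needs root):

$ nvidia-smi -q -d COMPUTE   # show the current compute mode of each GPU
$ sudo nvidia-smi -c 3       # set EXCLUSIVE_PROCESS (one process per GPU)
$ sudo nvidia-smi -c 0       # back to DEFAULT (processes may share a GPU)

In EXCLUSIVE_PROCESS mode a GPU that already holds a CUDA context refuses a second one, which is presumably why the third job at iteration 25 hit "all CUDA-capable devices are busy or unavailable" on a 2-GPU node.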

Srikanth

pchampio commented 3 years ago

Okay! Thanks for the information.