Could you show us the output of nvidia-smi?
Thanks. It seems the issue is related to the GPU and NVCC. I ran on a particular GPU node (r2n02) and training completed successfully, with the following output:
# Running on r2n02
# Started at Sun Oct 31 22:32:08 EDT 2021
# /home/hltcoe/aarora/miniconda3/envs/k2_scratch2/bin/python3 ./tdnn/train.py
2021-10-31 22:32:10,245 INFO [train.py:481] Training started
2021-10-31 22:32:10,245 INFO [train.py:482] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7178d67e594bc7fa89c2b331ad7bd1c62a6a9eb4', 'k2-git-date': 'Tue Oct 26 10:12:54 2021', 'lhotse-version': '0.11.0.dev+git.7f56dd1.clean', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'coe_asr2', 'icefall-git-sha1': 'e06baf3-dirty', 'icefall-git-date': 'Sun Oct 31 19:53:21 2021', 'icefall-path': '/exp/aarora/icefall_work_env/icefall', 'k2-path': '/exp/aarora/icefall_work_env/k2_me/k2/python/k2/__init__.py', 'lhotse-path': '/exp/aarora/icefall_work_env/lhotse/lhotse/__init__.py'}}
2021-10-31 22:32:10,281 INFO [lexicon.py:176] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-31 22:32:13,366 INFO [asr_datamodule.py:145] About to get train cuts
2021-10-31 22:32:13,367 INFO [asr_datamodule.py:242] About to get train cuts
2021-10-31 22:32:13,385 INFO [asr_datamodule.py:148] About to create train dataset
2021-10-31 22:32:13,385 INFO [asr_datamodule.py:199] Using SingleCutSampler.
2021-10-31 22:32:13,388 INFO [asr_datamodule.py:205] About to create train dataloader
2021-10-31 22:32:13,388 INFO [asr_datamodule.py:218] About to get test cuts
2021-10-31 22:32:13,388 INFO [asr_datamodule.py:248] About to get test cuts
2021-10-31 22:32:14,011 INFO [train.py:420] Epoch 0, batch 0, loss[loss=1.061, over 2805 frames.], tot_loss[loss=1.061, over 2805 frames.], batch size: 5
2021-10-31 22:32:14,524 INFO [train.py:420] Epoch 0, batch 10, loss[loss=0.4313, over 2695 frames.], tot_loss[loss=0.6688, over 22140.152947017563 frames.], batch size: 5
2021-10-31 22:32:15,141 INFO [train.py:444] Epoch 0, validation loss=0.862, over 17976 frames.
2021-10-31 22:32:51,819 INFO [train.py:444] Epoch 14, validation loss=0.01105, over 17976 frames.
2021-10-31 22:32:52,079 INFO [checkpoint.py:62] Saving checkpoint to tdnn/exp/epoch-14.pt
2021-10-31 22:32:52,086 INFO [train.py:553] Done!
# Accounting: time=44 threads=1
# Finished at Sun Oct 31 22:32:52 EDT 2021 with status 0
The nvidia-smi output on this node (r2n02) is as follows:
Sun Oct 31 22:49:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA TITAN RTX On | 00000000:3B:00.0 Off | N/A |
| 41% 26C P8 15W / 200W | 1MiB / 24220MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN RTX On | 00000000:5E:00.0 Off | N/A |
| 40% 25C P8 10W / 200W | 1MiB / 24220MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA TITAN RTX On | 00000000:B1:00.0 Off | N/A |
| 41% 25C P8 15W / 200W | 1MiB / 24220MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA TITAN RTX On | 00000000:D9:00.0 Off | N/A |
| 40% 26C P8 14W / 200W | 1MiB / 24220MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The nvidia-smi output on the other node (r7n04) is as follows:
Sun Oct 31 22:51:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Tesla V1... On | 00000000:1A:00.0 Off | 0 |
| N/A 27C P0 24W / 200W | 0MiB / 32510MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Tesla V1... On | 00000000:1B:00.0 Off | 0 |
| N/A 52C P0 180W / 200W | 31729MiB / 32510MiB | 100% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA Tesla V1... On | 00000000:1C:00.0 Off | 0 |
| N/A 27C P0 24W / 200W | 0MiB / 32510MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA Tesla V1... On | 00000000:3D:00.0 Off | 0 |
| N/A 28C P0 24W / 200W | 0MiB / 32510MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA Tesla V1... On | 00000000:3E:00.0 Off | 0 |
| N/A 28C P0 24W / 200W | 0MiB / 32510MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA Tesla V1... On | 00000000:8B:00.0 Off | 0 |
| N/A 27C P0 24W / 200W | 0MiB / 32510MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA Tesla V1... On | 00000000:8C:00.0 Off | 0 |
| N/A 25C P0 24W / 200W | 0MiB / 32510MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA Tesla V1... On | 00000000:B4:00.0 Off | 0 |
| N/A 25C P0 25W / 200W | 0MiB / 32510MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
I suspect that the issue is caused by https://github.com/k2-fsa/k2/blob/master/CMakeLists.txt#L211
# see https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
# https://www.myzhar.com/blog/tutorials/tutorial-nvidia-gpu-cuda-compute-capability/
set(K2_COMPUTE_ARCH_CANDIDATES 35 50 60 61 70 75)
You can use the above two links to look up your GPU architecture. If it is not listed in K2_COMPUTE_ARCH_CANDIDATES, you can add it and recompile k2.
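If you just want to see the compute capability of the GPUs on a node without looking it up in those tables, you can also query it through PyTorch. This is only a convenience sketch (it assumes a CUDA-enabled PyTorch, which icefall already requires), not part of k2:
import torch

if not torch.cuda.is_available():
    print("CUDA is not available in this PyTorch build")
else:
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        # Compute capability 7.5 corresponds to arch candidate 75, 7.0 to 70, etc.
        print(f"GPU {i}: {name}, compute capability {major}.{minor} -> arch {major}{minor}")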
ok, thank you so much, got it.
The compute capability (70) of the GPU in r7n04 (NVIDIA Tesla V100) is listed in K2_COMPUTE_ARCH_CANDIDATES. Do you think compiling on the V100 would help?
Did you compile k2 from source on the machine with NVIDIA TITAN RTX GPUs and run it on another machine with V100 GPUs?
Yeah, I compiled k2 on the NVIDIA TITAN RTX machine and ran it on the V100 GPUs. I will now compile it on the V100 machine.
If yes, I would suggest two ways:
(1) Compile k2 separately on each machine
(2) Modify https://github.com/k2-fsa/k2/blob/master/CMakeLists.txt#L232
message(STATUS "K2_COMPUTE_ARCHS: ${K2_COMPUTE_ARCHS}")
Add the following line just before it:
set(K2_COMPUTE_ARCHS 70 75)
and then compile k2 on either machine. The two machines can share a single version with this approach.
A third alternative is to compile k2 on the machine with V100 GPUs and run it on the other machine without modifying k2's source code, but runtime speed may be affected.
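As a related sanity check in the same spirit, you can compare the architectures a binary was built for against the local GPU. The sketch below inspects the installed PyTorch wheel, not k2 itself, and assumes a PyTorch recent enough to provide torch.cuda.get_arch_list:
import torch

# Architectures the installed PyTorch was compiled for, e.g. ['sm_60', 'sm_70', 'sm_75'].
built_archs = torch.cuda.get_arch_list()
major, minor = torch.cuda.get_device_capability(0)
local_arch = f"sm_{major}{minor}"
status = "covered" if local_arch in built_archs else "NOT covered"
print(f"built for: {built_archs}; local GPU is {local_arch} ({status})")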
ok, thank you, got it.
Thanks, my scripts are now running without the invalid device function error.
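For future reference, a quicker installation check than running a full recipe is a short script that exercises a k2 CUDA kernel directly; if the build does not match the GPU architecture, it fails with the same invalid device function error. This is only a sketch, assuming k2, lhotse and a CUDA-enabled PyTorch are importable (the tiny FSA and the choice of k2.arc_sort are just illustrative):
import torch
import k2
import lhotse

print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("k2:", getattr(k2, "__version__", "unknown"))
print("lhotse:", getattr(lhotse, "__version__", "unknown"))

# Build a tiny acceptor, move it to the GPU and arc-sort it.
# arc_sort launches a k2 CUDA kernel, so an architecture mismatch
# shows up here as "invalid device function".
s = """
0 1 1 0.1
1 2 -1 0.2
2
"""
fsa = k2.Fsa.from_str(s.strip())
fsa = fsa.to(torch.device("cuda", 0))
sorted_fsa = k2.arc_sort(fsa)
print("k2 GPU test OK:", sorted_fsa.device)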
Hi, I installed k2 from source and lhotse via pip. To check whether my k2 and lhotse installation is OK, I am trying to run the yesno recipe. I did not change anything in the scripts; however, while running the yesno recipe I get an error (RuntimeError: invalid device function). I get the same error in the librispeech recipe and in a recipe I wrote myself. It seems to be caused by the installation, probably the nvcc version. Could anybody help me with this error? My log with environment information is as follows: