k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

y = k2.RaggedTensor(y).to(device) RuntimeError: Some bad things happened. Please read the above error messages and stack #1214

Open · TszSimLaw opened this issue 1 year ago

TszSimLaw commented 1 year ago

Running `./pruned_transducer_stateless7_bbpe/train.py` fails with:

```
Exception:

-- Process 3 terminated with the following error:

    y = k2.RaggedTensor(y).to(device)
RuntimeError: Some bad things happened. Please read the above error messages and stack trace.
```

csukuangfj commented 1 year ago

Could you post more error logs?

TszSimLaw commented 1 year ago

File "train.py", line 1249, in main() File "train.py", line 1240, in main mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True) File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method='spawn') File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes while not context.join(): File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join raise Exception(msg) Exception:

-- Process 1 terminated with the following error: Traceback (most recent call last): File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap fn(i, *args) File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 1097, in run scan_pessimistic_batches_for_oom( File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 1206, in scan_pessimistic_batches_foroom loss, = compute_loss( File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 672, in compute_loss y = k2.RaggedTensor(y).to(device) RuntimeError: Some bad things happened. Please read the above error messages and stack trace. If you are using Python, the following command may be helpful:

  gdb --args python /path/to/your/code.py

(You can use `gdb` to debug the code. Please consider compiling
a debug version of k2.)
TszSimLaw commented 1 year ago

> Could you post more error logs?

I used `gdb` and ran `train.py --world-size 4 --num-epochs 30 --start-epoch 1 --exp-dir pruned_transducer_stateless7_bbpe/exp --max-duration 400`; the error is the same as above. My torch version is 1.7.1 and my k2 version is 1.23.4.

csukuangfj commented 1 year ago

File "train.py", line 1249, in main() File "train.py", line 1240, in main mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True) File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method='spawn') File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes while not context.join(): File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join raise Exception(msg) Exception:

-- Process 1 terminated with the following error: Traceback (most recent call last): File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap fn(i, *args) File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 1097, in run scan_pessimistic_batches_for_oom( File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 1206, in scan_pessimistic_batches_foroom loss, = compute_loss( File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 672, in compute_loss y = k2.RaggedTensor(y).to(device) RuntimeError: Some bad things happened. Please read the above error messages and stack trace. If you are using Python, the following command may be helpful:

  gdb --args python /path/to/your/code.py

(You can use `gdb` to debug the code. Please consider compiling
a debug version of k2.)

Could you give even more error logs?

TszSimLaw commented 1 year ago

File "train.py", line 1249, in main() File "train.py", line 1240, in main mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True) File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method='spawn') File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes while not context.join(): File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join raise Exception(msg) Exception: -- Process 1 terminated with the following error: Traceback (most recent call last): File "/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap fn(i, *args) File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 1097, in run scan_pessimistic_batches_for_oom( File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 1206, in scan_pessimistic_batches_foroom loss, = compute_loss( File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 672, in compute_loss y = k2.RaggedTensor(y).to(device) RuntimeError: Some bad things happened. Please read the above error messages and stack trace. If you are using Python, the following command may be helpful:

  gdb --args python /path/to/your/code.py

(You can use `gdb` to debug the code. Please consider compiling
a debug version of k2.)

Could you give even more error logs?

The training log is as follows:

```
2023-06-21 10:17:20,038 INFO [train.py:951] (3/4) Training started
2023-06-21 10:17:20,038 INFO [train.py:961] (3/4) Device: cuda:3
2023-06-21 10:17:20,039 INFO [train.py:970] (3/4) {'frame_shift_ms': 10.0, 'allowed_excess_duration_ratio': 0.1, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.23.4', 'k2-build-type': 'Release', 'k2-with-cuda': False, 'k2-git-sha1': '1031d2c7a6d64fe733283a498495030b6005fb97', 'k2-git-date': 'Thu Feb 23 14:26:54 2023', 'lhotse-version': '1.15.0', 'torch-version': '1.7.1+cu110', 'torch-cuda-available': True, 'torch-cuda-version': '11.0', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/home/res1/yx_data/nlp/asr_pro', 'k2-path': '/home/h3c/anaconda3/envs/kaldi/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/home/h3c/anaconda3/envs/kaldi/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'h3c-R5300-G5', 'IP address': '127.0.1.1'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('pruned_transducer_stateless7_bbpe/exp'), 'bpe_model': 'data/lang_bbpe_500/bbpe.model', 'base_lr': 0.05, 'lr_batches': 5000, 'lr_epochs': 3.5, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': False, 'num_encoder_layers': '2,4,3,2,4', 'feedforward_dims': '1024,1024,2048,2048,1024', 'nhead': '8,8,8,8,8', 'encoder_dims': '384,384,384,384,384', 'attention_dims': '192,192,192,192,192', 'encoder_unmasked_dims': '256,256,256,256,256', 'zipformer_downsampling_factors': '1,2,4,8,2', 'cnn_module_kernels': '31,31,31,31,31', 'decoder_dim': 512, 'joiner_dim': 512, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 400, 'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2023-06-21 10:17:20,039 INFO [train.py:972] (3/4) About to create model
2023-06-21 10:17:20,550 INFO [zipformer.py:178] (3/4) At encoder stack 4, which has downsampling_factor=2, we will combine the outputs of layers 1 and 3, with downsampling_factors=2 and 8.
2023-06-21 10:17:20,562 INFO [train.py:976] (3/4) Number of model parameters: 70369391
2023-06-21 10:17:24,876 INFO [train.py:991] (3/4) Using DDP
2023-06-21 10:17:24,976 INFO [asr_datamodule.py:407] (3/4) About to get train cuts
2023-06-21 10:17:24,977 INFO [train.py:1072] (3/4) Filtering short and long utterances.
2023-06-21 10:17:24,978 INFO [train.py:1075] (3/4) Tokenizing and encoding texts in train cuts.
2023-06-21 10:17:24,978 INFO [asr_datamodule.py:224] (3/4) About to get Musan cuts
2023-06-21 10:17:24,978 INFO [asr_datamodule.py:229] (3/4) Enable MUSAN
2023-06-21 10:17:24,978 INFO [asr_datamodule.py:252] (3/4) Enable SpecAugment
2023-06-21 10:17:24,978 INFO [asr_datamodule.py:253] (3/4) Time warp factor: 80
2023-06-21 10:17:24,978 INFO [asr_datamodule.py:263] (3/4) Num frame mask: 10
2023-06-21 10:17:24,978 INFO [asr_datamodule.py:276] (3/4) About to create train dataset
2023-06-21 10:17:24,978 INFO [asr_datamodule.py:303] (3/4) Using DynamicBucketingSampler.
2023-06-21 10:17:28,180 INFO [asr_datamodule.py:320] (3/4) About to create train dataloader
2023-06-21 10:17:28,181 INFO [asr_datamodule.py:414] (3/4) About to get dev cuts
2023-06-21 10:17:28,181 INFO [train.py:1091] (3/4) Tokenizing and encoding texts in valid cuts.
2023-06-21 10:17:28,181 INFO [asr_datamodule.py:351] (3/4) About to create dev dataset
2023-06-21 10:17:28,401 INFO [asr_datamodule.py:370] (3/4) About to create dev dataloader
2023-06-21 10:17:28,401 INFO [train.py:1198] (3/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2023-06-21 10:29:52,105 INFO [train.py:1176] (3/4) Saving batch to pruned_transducer_stateless7_bbpe/exp/batch-24933b83-7577-50a9-a491-f0b2ea1fca65.pt
2023-06-21 10:29:52,127 INFO [train.py:1182] (3/4) features shape: torch.Size([20, 2000, 80])
2023-06-21 10:29:52,128 INFO [train.py:1186] (3/4) num tokens: 1981
```

All error logs have been posted.

csukuangfj commented 1 year ago

Could you change

```
File "/home/icefall/egs/tal_csasr/ASR/pruned_transducer_stateless7_bbpe/train.py", line 672, in compute_loss
  y = k2.RaggedTensor(y).to(device)
```

to

```python
print(y)
print(device)
y = k2.RaggedTensor(y).to(device)
```

and post the output?
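
For context, a self-contained sketch of the pattern being asked for here, using toy token ids and a hypothetical `to_ragged` helper in place of the actual `compute_loss` code: print the inputs immediately before the call that crashes, so the failing rank logs them itself even when `mp.spawn` only re-raises a generic `Exception`.

```python
# Hedged sketch of the suggested debugging step; toy data, and to_ragged is a
# hypothetical stand-in for the relevant lines of compute_loss in train.py.
import torch
import k2


def to_ragged(y, device):
    # Print the raw inputs right before the call that is known to crash,
    # so the failing process records them in its own log.
    print(y)        # a plain nested Python list of token ids
    print(device)   # the device assigned to this rank, e.g. cuda:3
    return k2.RaggedTensor(y).to(device)


if __name__ == "__main__":
    toy_y = [[17, 14, 11], [391, 266, 229]]
    dev = torch.device("cuda", 0) if torch.cuda.is_available() else torch.device("cpu")
    print(to_ragged(toy_y, dev))
```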

TszSimLaw commented 1 year ago

> print(y)
> print(device)
> y = k2.RaggedTensor(y).to(device)
>
> and post the output?

OK, the output is as follows:

2023-06-21 12:15:43,212 INFO [train.py:1201] (3/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM. 2023-06-21 12:15:43,232 INFO [asr_datamodule.py:370] (1/4) About to create dev dataloader 2023-06-21 12:15:43,232 INFO [train.py:1201] (1/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM. 2023-06-21 12:15:43,268 INFO [asr_datamodule.py:370] (2/4) About to create dev dataloader 2023-06-21 12:15:43,268 INFO [train.py:1201] (2/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM. [[17, 14, 11, 276, 233, 24, 31, 19, 11, 39, 76, 9, 9, 12, 179, 216, 299, 24, 106, 187, 290, 20, 24, 23, 7, 127, 290, 20, 11, 95, 21, 115, 92, 192, 37, 29, 14, 51, 58, 24, 30, 19, 323, 4, 44, 140, 390, 51, 129, 21, 158, 80, 44, 327, 134, 293, 190, 129, 21, 158, 80, 44, 327, 134, 293, 190, 123, 6, 332, 35, 318, 37, 27, 95, 12, 143, 115, 12, 91, 109, 24, 217, 213, 321, 111, 84, 189, 189, 89, 61, 4, 44, 140, 405, 213, 72, 192, 117, 83, 134, 293, 190, 66, 23, 390, 51, 27], [391, 266, 229, 228, 33, 17, 26, 15, 4, 46, 120, 46, 238, 3, 29, 14, 93, 9, 3, 208, 81, 20, 99, 84, 7, 5, 87, 75, 25, 27, 208, 81, 20, 99, 7, 5, 17, 5, 17, 5, 129, 21, 158, 80, 342, 20, 30, 19, 174, 99, 6, 5, 46, 120, 30, 19, 29, 14, 3, 331, 12, 69, 279, 445, 150, 46, 331, 12, 69, 279, 445, 150, 46, 9, 3, 21, 144, 125, 331, 12, 69, 279, 445, 150, 20, 37, 8, 141, 176, 8, 90, 141, 4, 7, 5, 46, 120], [6, 9, 3, 25, 114, 113, 178, 4, 8, 121, 225, 416, 3, 484, 114, 113, 178, 6, 3, 7, 5, 381, 12, 205, 132, 16, 47, 94, 31, 10, 15, 35, 6, 5, 329, 170, 9, 13, 76, 186, 24, 11, 6, 5, 114, 113, 178, 4, 8, 121, 225, 416, 3, 29, 14, 26, 395, 302, 25, 24, 114, 113, 178, 9, 3, 7, 5, 381, 12, 205, 132, 9, 3, 25, 129, 410, 4, 23, 187, 428, 22, 226, 244, 22, 244, 308, 32, 47, 123, 26, 217, 3, 26, 250, 12, 145, 244, 285, 3, 7, 5, 408, 80, 22, 69, 109, 8, 218, 141, 94, 31, 10, 15, 6, 169, 129, 378, 25, 4, 3, 39, 76, 185, 28, 37, 184, 29, 14, 10, 15, 6, 5, 9, 13, 186, 24], [17, 10, 15, 9, 250, 12, 145, 244, 39, 31, 191, 73, 34, 212, 424, 129, 398, 249, 128, 8, 116, 141, 4, 44, 140, 21, 64, 125, 25, 12, 244, 160, 8, 121, 168, 24, 67, 12, 244, 160, 265, 22, 70, 116, 32, 67, 6, 429, 22, 90, 179, 150, 4, 46, 120, 382, 6, 154, 22, 90, 179, 150, 4, 46, 120, 26, 9, 3, 445, 150, 46, 10, 15, 6, 335, 236, 9, 37, 195, 337, 7, 28, 377, 54, 212, 445, 150, 46, 3, 105, 14, 112], [217, 3, 6, 5, 54, 268, 405, 213, 3, 322, 8, 90, 118, 4, 7, 5, 348, 104, 232, 15, 117, 34, 39, 76, 273, 167, 46, 120, 66, 29, 14, 32, 32, 124, 4, 30, 19, 54, 268, 72, 66, 29, 14, 32, 165, 12, 261, 121, 4, 217, 221, 213, 9, 3, 10, 15, 34, 19, 52, 4, 44, 140, 122, 348, 104, 422, 280, 7, 28, 87, 75, 25, 189, 339, 10, 15, 354, 73, 17, 5, 392, 4, 3, 7, 5, 199, 132, 45, 132, 116, 12, 272, 362, 29, 14, 240, 209, 95, 39, 31, 49, 17, 221, 213, 221, 213, 13, 7, 98, 232, 15, 392, 148, 156, 86, 50, 346, 3, 23, 422, 280, 4, 87, 75, 25], [26, 204, 230, 95, 13, 122, 13, 122, 17, 63, 217, 207, 85, 6, 3, 7, 5, 454, 98, 46, 10, 15, 37, 35, 7, 28, 6, 5, 328, 57, 328, 57, 3, 49, 34, 396, 98, 46, 96, 3, 25, 148, 4, 51, 58, 30, 19, 6, 5, 53, 254, 3, 49, 34, 454, 98, 46, 148, 445, 150, 46, 207, 85, 6, 3, 7, 5, 454, 98, 46, 94, 31, 34, 360, 93, 4, 289, 461, 28, 10, 15, 112, 203, 360, 93, 4, 51, 58, 9, 3, 10, 13, 122, 8, 144, 125, 12, 126, 115, 148, 22, 103, 216, 22, 64, 261], [17, 5, 12, 143, 115, 12, 91, 109, 95, 12, 115, 77, 32, 4, 307, 205, 300, 9, 3, 7, 127, 314, 214, 136, 335, 66, 108, 12, 91, 109, 6, 7, 459, 61, 4, 65, 26, 322, 8, 90, 118, 4, 46, 268, 87, 281, 187, 94, 31, 
25, 4, 65, 6, 5, 4, 65, 42, 405, 213, 87, 281, 427, 282, 306, 76, 299, 292, 13, 83, 6, 159, 159, 12, 92, 118, 6, 326, 5, 89, 4, 300, 201, 12, 220, 153, 95, 12, 115, 77, 32, 4, 12, 115, 133, 21, 111, 158, 17, 5, 209, 268, 37, 4, 221, 213, 395, 302, 12, 144, 194, 73, 7, 104, 149, 12, 176, 64, 24], [235, 110, 42, 11, 35, 6, 14, 482, 95, 35, 73, 19, 52, 24, 10, 10, 15, 159, 52, 60, 66, 124, 233, 48, 6, 5, 20, 189, 339, 13, 3, 25, 73, 25, 6, 154, 311, 6, 127, 210, 11, 199, 101, 4, 6, 127, 95, 3, 3, 9, 3, 3, 108, 131, 168, 336, 4, 163, 238, 9, 3, 131, 168, 336, 412, 12, 90, 287, 163, 238, 247, 149, 247, 149, 364, 147, 163, 238, 355, 16, 13, 16], [16, 94, 31, 13, 98, 238, 6, 5, 46, 120, 26, 89, 24, 344, 54, 27, 17, 22, 138, 151, 30, 6, 5, 59, 33, 197, 26, 3, 81, 20, 217, 3, 26, 13, 7, 98, 3, 404, 54, 26, 66, 23, 34, 404, 54, 4, 53, 497, 151, 287, 21, 151, 69, 84, 26, 31, 221, 63, 163, 238, 192, 117, 24, 10, 15, 108, 6, 93, 4, 81, 20, 333, 300, 404, 54, 81, 20, 106, 201, 385, 4, 60, 23, 280, 4, 27, 87, 75, 25, 425, 20, 19, 52, 4, 81, 20, 7, 258, 95, 3, 300, 404, 54, 81, 20, 27], [124, 7, 28, 358, 81, 20, 4, 329, 250, 429, 16, 13, 16, 358, 81, 20, 23, 329, 250, 429, 357, 8, 70, 136, 7, 28, 17, 329, 250, 429, 56, 7, 429, 3, 29, 14, 193, 16, 13, 16, 56, 7, 429, 3, 208, 81, 20, 17, 14, 56, 211, 429, 9, 3, 42, 56, 211, 429, 3, 13, 3, 9, 3, 368, 8, 69, 243, 81, 20, 8, 176, 145, 302, 201, 192, 117, 4, 368, 8, 69, 243, 81, 20], [104, 104, 104, 17, 10, 15, 60, 3, 249, 70, 388, 120, 37, 25, 366, 26, 25, 6, 5, 8, 262, 152, 12, 152, 90, 22, 115, 194, 105, 14, 93, 469, 43, 303, 33, 55, 36, 43, 74, 55, 107, 270, 171, 18, 41, 43, 74, 55, 107, 270, 171, 18, 41, 4, 51, 58, 9, 3, 7, 12, 152, 90, 22, 115, 194, 17, 469, 43, 303, 33, 55, 36, 9, 3, 29, 14, 29, 14, 105, 14, 93, 9, 150, 11, 25, 398, 8, 125, 218, 12, 200, 200, 12, 64, 91, 105, 14, 93, 27, 9, 3], [16, 16, 106, 32, 8, 115, 111, 249, 248, 245, 12, 218, 176, 11, 95, 39, 31, 260, 6, 150, 150, 6, 5, 137, 88, 366, 87, 75, 25, 463, 4, 8, 145, 64, 8, 145, 64, 78, 24, 131, 248, 8, 128, 126, 169, 72, 39, 31, 366, 32, 10, 273, 167, 11, 12, 143, 115, 12, 91, 109, 395, 302, 106, 32, 24, 10, 15, 60, 3, 7, 318, 37, 35, 7, 28, 47, 42, 365, 177, 3, 6, 5, 7, 258, 445, 150, 46, 29, 14, 3, 7, 258, 445, 150, 46, 9, 3, 39, 31, 49, 391, 266, 229, 228, 33, 357, 386, 4, 150, 46], [26, 260, 298, 26, 430, 5, 20, 9, 25, 11, 252, 259, 166, 22, 168, 115, 30, 19, 27, 320, 21, 138, 151, 29, 29, 14, 17, 10, 15, 11, 35, 282, 306, 8, 200, 168, 21, 151, 69, 252, 199, 275, 3, 13, 3, 25, 27, 34, 8, 116, 160, 131, 126, 242, 159, 11, 38, 108, 26, 233, 149, 347, 34, 326, 8, 91, 103, 242, 159, 11, 38, 108, 26, 233, 149, 347, 34, 29, 14, 29, 14, 242, 159, 245, 5, 307, 97, 12, 92, 118, 9, 49, 245, 5, 425, 20, 34, 29, 14, 242, 159], [26, 3, 7, 5, 29, 14, 20, 425, 20, 16, 67, 34, 29, 14, 242, 19, 10, 15, 25, 24, 35, 7, 5, 8, 64, 130, 12, 262, 126, 7, 5, 20, 4, 20, 371, 26, 3, 7, 5, 29, 14, 20, 3, 7, 5, 175, 20, 163, 253, 20, 229, 230, 3, 433, 20, 229, 230, 3, 425, 20, 26, 4, 20, 371, 10, 15, 3, 35, 6, 5, 20, 26, 19, 52, 122, 210, 11, 7, 5, 87, 75, 25, 6, 96, 4, 155, 310, 183, 155, 239, 36], [53, 110, 18, 18, 107, 27, 6, 5, 379, 54, 27, 357, 78, 38, 187, 35, 204, 21, 92, 141, 353, 193, 6, 5, 161, 182, 57, 208, 36, 36, 68, 72, 3, 161, 182, 57, 6, 5, 232, 4, 331, 12, 69, 279, 289, 461, 96, 72, 124, 24, 27, 6, 5, 72, 393, 38, 35, 7, 28, 161, 182, 57, 148, 208, 36, 36, 68, 474, 4, 66, 23, 367, 119, 53, 497, 151, 176, 309, 358, 4, 27], [75, 206, 3, 12, 152, 143, 177, 290, 4, 65, 3, 13, 3, 11, 196, 164, 129, 
378, 8, 158, 256, 12, 152, 143, 30, 19, 72, 196, 164, 129, 378, 4, 175, 412, 193, 16, 13, 16, 11, 3, 442, 409, 4, 217, 3, 75, 206, 99, 84, 43, 4, 65, 10, 15, 9, 3, 285, 3, 196, 164, 129, 378, 8, 158, 256, 12, 152, 143, 217, 3, 221, 26, 4, 95, 13, 196, 164, 207, 85, 13, 442, 409, 27, 16, 47, 94, 31, 76, 99, 84, 43, 17, 14, 63, 9, 3, 7, 5, 112, 203, 13, 442, 409, 4, 129, 75, 206, 99, 7, 5, 44, 246, 4, 65, 9, 3, 7, 5, 13, 405, 98, 4, 44, 246], [208, 9, 3, 284, 43, 162, 43, 88, 4, 81, 20, 387, 163, 17, 10, 15, 34, 6, 5, 46, 120, 212, 113, 55, 274, 19, 52, 26, 9, 13, 76, 99, 284, 43, 162, 43, 88, 26, 9, 38, 99, 208, 6, 5, 20, 334, 22, 77, 225, 9, 3, 81, 20, 387, 163, 32, 67, 94, 31, 25, 10, 15, 6, 5, 240, 209, 147, 166, 3], [42, 16, 7, 358, 8, 64, 172, 4, 425, 20, 26, 95, 23, 7, 7, 127, 334, 22, 77, 225, 4, 8, 121, 225, 416, 27, 9, 3, 10, 15, 242, 19, 122, 35, 73, 106, 187, 4, 379, 54, 9, 3, 106, 187, 4, 81, 20, 26, 260, 13, 7, 93, 425, 20, 12, 121, 126, 45, 111, 287, 192, 37, 10, 15, 296, 317, 3, 13, 7, 93, 4, 11, 476, 10, 51, 58, 67, 9, 3, 10, 15, 302, 201, 34, 233, 411, 323, 402, 96, 3, 13, 3, 399, 11, 186, 379, 54, 27], [307, 179, 39, 31, 381, 44, 246, 104, 72, 39, 31, 381, 7, 432, 44, 246, 382, 156, 86, 50, 271, 285, 381, 7, 432, 44, 246, 72, 9, 3, 63, 15, 22, 90, 179, 22, 90, 179, 150, 44, 246, 22, 115, 168, 8, 128, 135, 4, 13, 360, 156, 86, 146, 39, 31, 450, 21, 272, 101, 405, 381, 7, 5, 44, 246, 104, 72, 39, 31, 450, 72, 39, 31, 87, 281, 8, 362, 70, 12, 216, 128, 381, 7, 432, 44, 246, 382, 156, 86, 50, 271, 285, 76, 381, 7, 432, 44, 246], [6, 5, 299, 292, 24, 67, 42, 11, 405, 98, 37, 17, 6, 5, 282, 306, 395, 302, 448, 473, 11, 386, 486, 9, 3, 178, 170, 50, 57, 237, 313, 36, 171, 86, 255, 110, 17, 10, 150, 11, 366, 75, 206, 11, 38, 108, 376, 170, 197, 313, 36, 43, 171, 86, 255, 110]] cuda:3

csukuangfj commented 1 year ago

Does it crash after printing?

TszSimLaw commented 1 year ago

> Does it crash after printing?

No crash. It then prints the following log:

```
[F] /home/runner/work/k2/k2/k2/csrc/device_guard.h:66:static int32_t k2::DeviceGuard::GetDevice() k2 compiled without CUDA support

[ Stack-Trace: ]
/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/k2/lib/libk2_log.so(k2::internal::GetStackTrace()+0x47) [0x7ffeb6ab7077]
/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x8439a) [0x7ffeb759e39a]
/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x116daf) [0x7ffeb7630daf]
/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0x9418d) [0x7ffeb75ae18d]
/home/anaconda3/envs/kaldi/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so(+0xecf15) [0x7ffeb7606f15]
/home/anaconda3/envs/kaldi/bin/python(PyCFunction_Call+0x52) [0x4e1072]
/home/anaconda3/envs/kaldi/bin/python(_PyObject_MakeTpCall+0x3eb) [0x4d1f7b]
/home/anaconda3/envs/kaldi/bin/python() [0x4e965b]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0x4d48) [0x4ccdd8]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c6f45]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x19c) [0x4db1ac]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0x172c) [0x4c97bc]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c6f45]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x19c) [0x4db1ac]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0x172c) [0x4c97bc]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c6f45]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x19c) [0x4db1ac]
/home/anaconda3/envs/kaldi/bin/python(PyObject_Call+0x5e) [0x4ed53e]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0x1f03) [0x4c9f93]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x106) [0x4db116]
/home/anaconda3/envs/kaldi/bin/python(PyObject_Call+0x5e) [0x4ed53e]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0x1f03) [0x4c9f93]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x106) [0x4db116]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0xa3e) [0x4c8ace]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c6f45]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x19c) [0x4db1ac]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0xa3e) [0x4c8ace]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x106) [0x4db116]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0x907) [0x4c8997]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c6f45]
/home/anaconda3/envs/kaldi/bin/python(_PyFunction_Vectorcall+0x19c) [0x4db1ac]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalFrameDefault+0x172c) [0x4c97bc]
/home/anaconda3/envs/kaldi/bin/python(_PyEval_EvalCodeWithName+0x1f5) [0x4c6f45]
/home/anaconda3/envs/kaldi/bin/python(PyEval_EvalCodeEx+0x39) [0x4c6d49]
/home/anaconda3/envs/kaldi/bin/python(PyEval_EvalCode+0x1b) [0x56d7eb]
/home/anaconda3/envs/kaldi/bin/python() [0x58cb21]
/home/anaconda3/envs/kaldi/bin/python() [0x5868df]
/home/anaconda3/envs/kaldi/bin/python(PyRun_StringFlags+0x7b) [0x584eab]
/home/anaconda3/envs/kaldi/bin/python(PyRun_SimpleStringFlags+0x3b) [0x584d8b]
/home/anaconda3/envs/kaldi/bin/python(Py_RunMain+0x15b) [0x583f7b]
/home/anaconda3/envs/kaldi/bin/python(Py_BytesMain+0x39) [0x5618a9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7ffff703fc87]
/home//anaconda3/envs/kaldi/bin/python() [0x56175e]

2023-06-21 12:21:57,449 INFO [train.py:1179] (3/4) Saving batch to pruned_transducer_stateless7_bbpe/exp/batch-24933b83-7577-50a9-a491-f0b2ea1fca65.pt
2023-06-21 12:21:57,469 INFO [train.py:1185] (3/4) features shape: torch.Size([20, 2000, 80])
2023-06-21 12:21:57,470 INFO [train.py:1189] (3/4) num tokens: 1981
```

csukuangfj commented 1 year ago

> [F] /home/runner/work/k2/k2/k2/csrc/device_guard.h:66:static int32_t k2::DeviceGuard::GetDevice() k2 compiled without CUDA support

You are using a CPU version of k2. Please install a CUDA version.

Please follow the documentation to check that you have indeed installed a CUDA version by running:

```
python3 -m k2.version
```
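
For reference, a minimal sketch (toy data, not the training code) of the mismatch behind this crash: the training log above reports `'torch-cuda-available': True` but `'k2-with-cuda': False`, so `train.py` selects `cuda:N` devices while the installed k2 build cannot move its tensors there.

```python
# Hedged reproduction sketch; assumes a machine whose PyTorch build has CUDA,
# as reported in the training log ('torch-cuda-available': True).
import torch
import k2

print(torch.cuda.is_available())          # True on the reporter's machine
y = k2.RaggedTensor([[1, 2, 3], [4, 5]])  # CPU construction works on any k2 build
y = y.to(torch.device("cuda", 0))         # this step fails on a CPU-only k2 build
print(y)
```

After installing a CUDA-enabled k2 wheel that matches the PyTorch and CUDA versions shown in the log, rerunning `python3 -m k2.version` should confirm the new build, and this step should pass.
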
TszSimLaw commented 1 year ago

> You are using a CPU version of k2. Please install a CUDA version.

Many thanks, I'll try.