k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

Issue with swoosh_r #1209

Closed: AmirHussein96 closed this issue 1 year ago

AmirHussein96 commented 1 year ago

I compiled the latest version of k2 successfully as shown below:

python -m k2.version
Collecting environment information...

k2 version: 1.24.3
Build type: Release
Git SHA1: 1a76309e5c6343c4d18965b7ce134a7f311d9d3a
Git date: Sun May 28 09:04:03 2023
Cuda used to build k2: 10.2
cuDNN used to build k2: 8.0.5
Python version used to build k2: 3.8
OS used to build k2: Red Hat Enterprise Linux Server release 7.8 (Maipo)
CMake version: 3.18.0
GCC version: 6.5.0
CMAKE_CUDA_FLAGS:   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_60,code=sm_60 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable  -Wno-strict-overflow
PyTorch version used to build k2: 1.12.1+cu102
PyTorch is using Cuda: 10.2
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False

However, when I tried to train a zipformer I got an error from the swoosh_r function, something related to an invalid device function. The log is below:

2023-06-12 06:09:48,606 INFO [train_asr.py:1109] (0/4) Training started
2023-06-12 06:09:48,614 INFO [train_asr.py:1109] (1/4) Training started
2023-06-12 06:09:48,615 INFO [train_asr.py:1119] (1/4) Device: cuda:1
2023-06-12 06:09:48,616 INFO [train_asr.py:1119] (0/4) Device: cuda:0
2023-06-12 06:09:48,617 INFO [train_asr.py:1109] (3/4) Training started
2023-06-12 06:09:48,617 INFO [train_asr.py:1119] (3/4) Device: cuda:3
2023-06-12 06:09:48,618 INFO [train_asr.py:1109] (2/4) Training started
2023-06-12 06:09:48,619 INFO [train_asr.py:1119] (2/4) Device: cuda:2
2023-06-12 06:09:48,629 INFO [train_asr.py:1130] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 1000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 15000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '1a76309e5c6343c4d18965b7ce134a7f311d9d3a', 'k2-git-date': 'Sun May 28 09:04:03 2023', 'lhotse-version': '1.16.0.dev+git.4fe02b9.clean', 'torch-version': '1.12.1+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/alt-arabic/speech/amir/competitions/IWSLT/icefall', 'k2-path': '/alt-arabic/speech/amir/competitions/IWSLT/k2/k2/python/k2/__init__.py', 'lhotse-path': '/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'crimv3mgpu005', 'IP address': '10.141.0.4'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-asr-small'), 'bpe_model': 'data/lang_bpe_ta_1000/bpe.model', 'bpe_tgt_model': 'data/lang_bpe_en_1000/bpe.model', 'base_lr': 0.01, 'lr_batches': 5000, 'lr_epochs': 4, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 16000, 'keep_last_k': 10, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '256,512,768,1024,768,512', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '128,256,256,512,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '64,128,128,256,128,128', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 256, 'joiner_dim': 256, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 1000}
2023-06-12 06:09:48,629 INFO [train_asr.py:1132] (1/4) About to create model
2023-06-12 06:09:48,629 INFO [train_asr.py:1130] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 1000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 15000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '1a76309e5c6343c4d18965b7ce134a7f311d9d3a', 'k2-git-date': 'Sun May 28 09:04:03 2023', 'lhotse-version': '1.16.0.dev+git.4fe02b9.clean', 'torch-version': '1.12.1+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/alt-arabic/speech/amir/competitions/IWSLT/icefall', 'k2-path': '/alt-arabic/speech/amir/competitions/IWSLT/k2/k2/python/k2/__init__.py', 'lhotse-path': '/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'crimv3mgpu005', 'IP address': '10.141.0.4'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-asr-small'), 'bpe_model': 'data/lang_bpe_ta_1000/bpe.model', 'bpe_tgt_model': 'data/lang_bpe_en_1000/bpe.model', 'base_lr': 0.01, 'lr_batches': 5000, 'lr_epochs': 4, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 16000, 'keep_last_k': 10, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '256,512,768,1024,768,512', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '128,256,256,512,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '64,128,128,256,128,128', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 256, 'joiner_dim': 256, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 1000}
2023-06-12 06:09:48,629 INFO [train_asr.py:1130] (0/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 1000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 15000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '1a76309e5c6343c4d18965b7ce134a7f311d9d3a', 'k2-git-date': 'Sun May 28 09:04:03 2023', 'lhotse-version': '1.16.0.dev+git.4fe02b9.clean', 'torch-version': '1.12.1+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/alt-arabic/speech/amir/competitions/IWSLT/icefall', 'k2-path': '/alt-arabic/speech/amir/competitions/IWSLT/k2/k2/python/k2/__init__.py', 'lhotse-path': '/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'crimv3mgpu005', 'IP address': '10.141.0.4'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-asr-small'), 'bpe_model': 'data/lang_bpe_ta_1000/bpe.model', 'bpe_tgt_model': 'data/lang_bpe_en_1000/bpe.model', 'base_lr': 0.01, 'lr_batches': 5000, 'lr_epochs': 4, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 16000, 'keep_last_k': 10, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '256,512,768,1024,768,512', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '128,256,256,512,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '64,128,128,256,128,128', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 256, 'joiner_dim': 256, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 1000}
2023-06-12 06:09:48,629 INFO [train_asr.py:1132] (3/4) About to create model
2023-06-12 06:09:48,629 INFO [train_asr.py:1132] (0/4) About to create model
2023-06-12 06:09:48,631 INFO [train_asr.py:1130] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 1000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 15000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '1a76309e5c6343c4d18965b7ce134a7f311d9d3a', 'k2-git-date': 'Sun May 28 09:04:03 2023', 'lhotse-version': '1.16.0.dev+git.4fe02b9.clean', 'torch-version': '1.12.1+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': None, 'icefall-git-sha1': None, 'icefall-git-date': None, 'icefall-path': '/alt-arabic/speech/amir/competitions/IWSLT/icefall', 'k2-path': '/alt-arabic/speech/amir/competitions/IWSLT/k2/k2/python/k2/__init__.py', 'lhotse-path': '/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'crimv3mgpu005', 'IP address': '10.141.0.4'}, 'world_size': 4, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-asr-small'), 'bpe_model': 'data/lang_bpe_ta_1000/bpe.model', 'bpe_tgt_model': 'data/lang_bpe_en_1000/bpe.model', 'base_lr': 0.01, 'lr_batches': 5000, 'lr_epochs': 4, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 16000, 'keep_last_k': 10, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '256,512,768,1024,768,512', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '128,256,256,512,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '64,128,128,256,128,128', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 256, 'joiner_dim': 256, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 200, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'blank_id': 0, 'vocab_size': 1000}
2023-06-12 06:09:48,631 INFO [train_asr.py:1132] (2/4) About to create model
2023-06-12 06:09:49,006 INFO [train_asr.py:1136] (1/4) Number of model parameters: 28487467
2023-06-12 06:09:49,015 INFO [train_asr.py:1136] (0/4) Number of model parameters: 28487467
2023-06-12 06:09:49,015 INFO [train_asr.py:1136] (2/4) Number of model parameters: 28487467
2023-06-12 06:09:49,035 INFO [train_asr.py:1136] (3/4) Number of model parameters: 28487467
2023-06-12 06:09:53,096 INFO [train_asr.py:1151] (2/4) Using DDP
2023-06-12 06:09:53,136 INFO [train_asr.py:1151] (0/4) Using DDP
2023-06-12 06:09:53,187 INFO [train_asr.py:1151] (1/4) Using DDP
2023-06-12 06:09:53,242 INFO [train_asr.py:1151] (3/4) Using DDP
2023-06-12 06:09:53,461 INFO [asr_datamodule.py:383] (3/4) About to get train cuts
2023-06-12 06:09:53,463 INFO [asr_datamodule.py:383] (1/4) About to get train cuts
2023-06-12 06:09:53,464 INFO [asr_datamodule.py:383] (0/4) About to get train cuts
2023-06-12 06:09:53,465 INFO [asr_datamodule.py:383] (2/4) About to get train cuts
2023-06-12 06:09:53,466 INFO [asr_datamodule.py:202] (0/4) Enable MUSAN
2023-06-12 06:09:53,466 INFO [asr_datamodule.py:203] (0/4) About to get Musan cuts
2023-06-12 06:09:53,467 INFO [asr_datamodule.py:202] (3/4) Enable MUSAN
2023-06-12 06:09:53,468 INFO [asr_datamodule.py:203] (3/4) About to get Musan cuts
2023-06-12 06:09:53,468 INFO [asr_datamodule.py:202] (2/4) Enable MUSAN
2023-06-12 06:09:53,468 INFO [asr_datamodule.py:203] (2/4) About to get Musan cuts
2023-06-12 06:09:53,469 INFO [asr_datamodule.py:202] (1/4) Enable MUSAN
2023-06-12 06:09:53,469 INFO [asr_datamodule.py:203] (1/4) About to get Musan cuts
2023-06-12 06:09:55,057 INFO [asr_datamodule.py:232] (0/4) Enable SpecAugment
2023-06-12 06:09:55,058 INFO [asr_datamodule.py:233] (0/4) Time warp factor: 80
2023-06-12 06:09:55,058 INFO [asr_datamodule.py:245] (0/4) Num frame mask: 10
2023-06-12 06:09:55,058 INFO [asr_datamodule.py:258] (0/4) About to create train dataset
2023-06-12 06:09:55,058 INFO [asr_datamodule.py:286] (0/4) Using DynamicBucketingSampler.
2023-06-12 06:09:55,072 INFO [asr_datamodule.py:232] (1/4) Enable SpecAugment
2023-06-12 06:09:55,073 INFO [asr_datamodule.py:233] (1/4) Time warp factor: 80
2023-06-12 06:09:55,073 INFO [asr_datamodule.py:245] (1/4) Num frame mask: 10
2023-06-12 06:09:55,073 INFO [asr_datamodule.py:258] (1/4) About to create train dataset
2023-06-12 06:09:55,073 INFO [asr_datamodule.py:286] (1/4) Using DynamicBucketingSampler.
2023-06-12 06:09:55,077 INFO [asr_datamodule.py:232] (2/4) Enable SpecAugment
2023-06-12 06:09:55,077 INFO [asr_datamodule.py:233] (2/4) Time warp factor: 80
2023-06-12 06:09:55,077 INFO [asr_datamodule.py:245] (2/4) Num frame mask: 10
2023-06-12 06:09:55,077 INFO [asr_datamodule.py:258] (2/4) About to create train dataset
2023-06-12 06:09:55,077 INFO [asr_datamodule.py:286] (2/4) Using DynamicBucketingSampler.
2023-06-12 06:09:55,289 INFO [asr_datamodule.py:232] (3/4) Enable SpecAugment
2023-06-12 06:09:55,289 INFO [asr_datamodule.py:233] (3/4) Time warp factor: 80
2023-06-12 06:09:55,289 INFO [asr_datamodule.py:245] (3/4) Num frame mask: 10
2023-06-12 06:09:55,289 INFO [asr_datamodule.py:258] (3/4) About to create train dataset
2023-06-12 06:09:55,290 INFO [asr_datamodule.py:286] (3/4) Using DynamicBucketingSampler.
2023-06-12 06:09:58,082 INFO [asr_datamodule.py:301] (2/4) About to create train dataloader
2023-06-12 06:09:58,083 INFO [asr_datamodule.py:390] (2/4) About to get dev cuts
2023-06-12 06:09:58,163 INFO [asr_datamodule.py:301] (0/4) About to create train dataloader
2023-06-12 06:09:58,163 INFO [asr_datamodule.py:390] (0/4) About to get dev cuts
2023-06-12 06:09:58,240 INFO [train_asr.py:1350] (2/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2023-06-12 06:09:58,283 INFO [asr_datamodule.py:301] (1/4) About to create train dataloader
2023-06-12 06:09:58,284 INFO [asr_datamodule.py:390] (1/4) About to get dev cuts
2023-06-12 06:09:58,324 INFO [train_asr.py:1350] (0/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2023-06-12 06:09:58,463 INFO [train_asr.py:1350] (1/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
2023-06-12 06:09:58,600 INFO [asr_datamodule.py:301] (3/4) About to create train dataloader
2023-06-12 06:09:58,601 INFO [asr_datamodule.py:390] (3/4) About to get dev cuts
2023-06-12 06:09:58,769 INFO [train_asr.py:1350] (3/4) Sanity check -- see if any of the batches in epoch 1 would cause OOM.
[F] /alt-arabic/speech/amir/competitions/IWSLT/k2/k2/csrc/eval.h:148:void k2::EvalDevice(cudaStream_t, int32_t, LambdaT&) [with LambdaT = __nv_dl_wrapper_t<__nv_dl_tag<at::Tensor (*)(torch::autograd::AutogradContext*, at::Tensor, float), k2::SwooshFunction<k2::SwooshRConstants>::forward, 1>, const float*, float, float, float, float, const float*, float*, const float*, unsigned char*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (98 vs. 0)  Error: invalid device function. 

[ Stack-Trace: ]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x2aab4822f624]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x3d9da) [0x2aab45e299da]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x1436e8) [0x2aab45f2f6e8]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x14a13e) [0x2aab45f3613e]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x14ae19) [0x2aab45f36e19]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x13e9d0) [0x2aab45f2a9d0]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x37d51) [0x2aab45e23d51]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(PyCFunction_Call+0x52) [0x4e0212]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x5265) [0x4cc245]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyFunction_Vectorcall+0x106) [0x4da136]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x4e8817]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(PyObject_Call+0x5e) [0x4ec60e]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x204f) [0x4c902f]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_FastCallDict+0x21b) [0x4d049b]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_Call_Prepend+0x60) [0x4e4bb0]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x542697]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x4f58) [0x4cbf38]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyFunction_Vectorcall+0x106) [0x4da136]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x4e8817]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(PyObject_Call+0x5e) [0x4ec60e]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x204f) [0x4c902f]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_FastCallDict+0x21b) [0x4d049b]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_Call_Prepend+0x60) [0x4e4bb0]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x542697]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x5265) [0x4cc245]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyFunction_Vectorcall+0x106) [0x4da136]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x4e8817]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(PyObject_Call+0x5e) [0x4ec60e]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x204f) [0x4c902f]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_FastCallDict+0x21b) [0x4d049b]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_Call_Prepend+0x60) [0x4e4bb0]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x542697]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x5265) [0x4cc245]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyFunction_Vectorcall+0x19c) [0x4da1cc]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x4e8817]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(PyObject_Call+0x5e) [0x4ec60e]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalFrameDefault+0x204f) [0x4c902f]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyFunction_Vectorcall+0x19c) [0x4da1cc]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_FastCallDict+0x25f) [0x4d04df]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3(_PyObject_Call_Prepend+0x60) [0x4e4bb0]
/home/local/QCRI/ahussein/anaconda3/envs/k23/bin/python3() [0x542697]

2023-06-12 06:11:46,686 INFO [train_asr.py:1327] (2/4) Saving batch to zipformer/exp-asr-small/batch-0e51f30d-c6a7-ee39-c4b0-32ccd7c524a5.pt
2023-06-12 06:11:46,717 INFO [train_asr.py:1333] (2/4) features shape: torch.Size([16, 1209, 80])
2023-06-12 06:11:46,718 INFO [train_asr.py:1337] (2/4) num tokens: 566
Traceback (most recent call last):
  File "./zipformer/train_asr.py", line 1402, in <module>
    main()
  File "./zipformer/train_asr.py", line 1393, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/alt-arabic/speech/amir/competitions/IWSLT/icefall_recipe/st/zipformer/train_asr.py", line 1246, in run
    scan_pessimistic_batches_for_oom(
  File "/alt-arabic/speech/amir/competitions/IWSLT/icefall_recipe/st/zipformer/train_asr.py", line 1358, in scan_pessimistic_batches_for_oom
    loss, _ = compute_loss(
  File "/alt-arabic/speech/amir/competitions/IWSLT/icefall_recipe/st/zipformer/train_asr.py", line 793, in compute_loss
    simple_loss, pruned_loss = model(
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/alt-arabic/speech/amir/competitions/IWSLT/icefall_recipe/st/zipformer/model.py", line 127, in forward
    x, x_lens = self.encoder_embed(x, x_lens)
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/alt-arabic/speech/amir/competitions/IWSLT/icefall_recipe/st/zipformer/subsampling.py", line 310, in forward
    x = self.conv(x)
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/local/QCRI/ahussein/anaconda3/envs/k23/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/alt-arabic/speech/amir/competitions/IWSLT/icefall_recipe/st/zipformer/scaling.py", line 1388, in forward
    return k2.swoosh_r(x)
RuntimeError: 
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

I also tried running python k2/k2/python/tests/swoosh_r_test.py, and it gave a similar error:

k2::SwooshFunction<k2::SwooshRConstants>::forward, 1>, const float*, float, float, float, float, const float*, float*, const float*, unsigned char*>; cudaStream_t = CUstream_st*; int32_t = int] Check failed: e == cudaSuccess (98 vs. 0)  Error: invalid device function. 

[ Stack-Trace: ]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/libk2_log.so(k2::internal::GetStackTrace()+0x34) [0x2aab48225624]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x3d9da) [0x2aab45e1f9da]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x1436e8) [0x2aab45f256e8]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x14a13e) [0x2aab45f2c13e]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x14ae19) [0x2aab45f2ce19]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x13e9d0) [0x2aab45f209d0]
/alt-arabic/speech/amir/competitions/IWSLT/k2/build/lib/_k2.cpython-38-x86_64-linux-gnu.so(+0x37d51) [0x2aab45e19d51]
python(PyCFunction_Call+0x52) [0x4e0212]
python(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
python(_PyEval_EvalFrameDefault+0x5605) [0x4cc5e5]
python() [0x4e8717]
python(_PyEval_EvalFrameDefault+0x907) [0x4c78e7]
python(_PyFunction_Vectorcall+0x106) [0x4da136]
python(_PyEval_EvalFrameDefault+0xa3e) [0x4c7a1e]
python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
python(_PyFunction_Vectorcall+0x19c) [0x4da1cc]
python() [0x4e8817]
python(PyObject_Call+0x5e) [0x4ec60e]
python(_PyEval_EvalFrameDefault+0x204f) [0x4c902f]
python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
python(_PyObject_FastCallDict+0x21b) [0x4d049b]
python(_PyObject_Call_Prepend+0x60) [0x4e4bb0]
python() [0x542697]
python(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
python(_PyEval_EvalFrameDefault+0x4f58) [0x4cbf38]
python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
python(_PyFunction_Vectorcall+0x19c) [0x4da1cc]
python() [0x4e8817]
python(PyObject_Call+0x5e) [0x4ec60e]
python(_PyEval_EvalFrameDefault+0x204f) [0x4c902f]
python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
python(_PyObject_FastCallDict+0x21b) [0x4d049b]
python(_PyObject_Call_Prepend+0x60) [0x4e4bb0]
python() [0x542697]
python(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
python(_PyEval_EvalFrameDefault+0x4f58) [0x4cbf38]
python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
python(_PyFunction_Vectorcall+0x19c) [0x4da1cc]
python() [0x4e8817]
python(PyObject_Call+0x5e) [0x4ec60e]
python(_PyEval_EvalFrameDefault+0x204f) [0x4c902f]
python(_PyEval_EvalCodeWithName+0x1f5) [0x4c5e95]
python(_PyObject_FastCallDict+0x21b) [0x4d049b]
python(_PyObject_Call_Prepend+0x60) [0x4e4bb0]
python() [0x542697]
python(_PyObject_MakeTpCall+0x3eb) [0x4d0eab]
python(_PyEval_EvalFrameDefault+0x4f58) [0x4cbf38]
python(_PyFunction_Vectorcall+0x106) [0x4da136]
python(_PyEval_EvalFrameDefault+0xa3e) [0x4c7a1e]
python(_PyFunction_Vectorcall+0x106) [0x4da136]

E
======================================================================
ERROR: test (__main__.TestSwooshR)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "../../k2/k2/python/tests/swoosh_r_test.py", line 127, in test
    k2_y = k2.swoosh_r(x=k2_x, dropout_prob=dropout)
RuntimeError: 
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

----------------------------------------------------------------------
Ran 1 test in 3.616s

FAILED (errors=1)
csukuangfj commented 1 year ago

Please have a look at https://k2-fsa.github.io/k2/installation/from_source.html

[Screenshot of the relevant section of the installation documentation, 2023-06-12 at 11:51:39]
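
For context, an "invalid device function" error from a CUDA kernel usually means the GPU running the code has a compute capability that is not among the architectures the library was compiled for; the CMAKE_CUDA_FLAGS in the version output above only contain -gencode arch=compute_60,code=sm_60 (compute capability 6.0). The following diagnostic is a minimal sketch based on that assumption; it is not reproduced from the linked page or the screenshot:

# Hypothetical diagnostic: check whether the GPU's compute capability matches
# one of the -gencode entries that k2 was built with. Only standard PyTorch
# APIs are used here.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: sm_{major}{minor}")
# Compare this value against the -gencode arch=compute_XX,code=sm_XX entries
# printed by:  python -m k2.version

If the values do not match, the usual remedy is to rebuild k2 from source following the linked instructions, on (or targeting) the GPUs that will run training, so that the compiled architectures include the one reported above.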

csukuangfj commented 1 year ago

Does it fix your problem?

AmirHussein96 commented 1 year ago

@csukuangfj Yes, it works now, thank you.
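
For readers who hit the same error, a quick sanity check after rebuilding is to call k2.swoosh_r directly on a CUDA tensor; this is the same function that the failing test and the zipformer recipe exercise. A minimal sketch, assuming a CUDA-enabled rebuild that matches the GPU:

# Post-rebuild sanity check: both the forward and backward swoosh_r CUDA
# kernels should now run without "invalid device function".
import torch
import k2

x = torch.randn(4, 8, device="cuda", requires_grad=True)
y = k2.swoosh_r(x)     # same call as in zipformer/scaling.py
y.sum().backward()     # exercises the backward kernel as well
print("swoosh_r OK:", tuple(y.shape), tuple(x.grad.shape))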