Open sanjuktasr opened 2 weeks ago
How large is your GPU RAM?
24 GB each, for 2 GPUs.
Can you reproduce it?
```
2024-11-07 13:12:14,793 INFO [train.py:1120] (1/2) Device: cuda:1
2024-11-07 13:12:14,793 INFO [train.py:1120] (0/2) Device: cuda:0
2024-11-07 13:12:14,797 INFO [train.py:1132] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7ff6d891905ff482364f2d0015867b00d89dd8c7', 'k2-git-date': 'Fri Jun 16 12:10:37 2023', 'lhotse-version': '1.16.0.dev+git.aa073f6a.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'fc2df07-dirty', 'icefall-git-date': 'Wed Aug 16 20:02:41 2023', 'icefall-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/icefall', 'k2-path': '/home/sanjukta/anaconda3/envs/zipf1/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/lhotse/lhotse/__init__.py', 'hostname': 'asus-System-Product-Name', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-mmi_online/30_05_2024'), 'bpe_model': 'data/8k/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/8k/fbank/', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-07 13:12:14,797 INFO [train.py:1134] (0/2) About to create model
2024-11-07 13:12:14,797 INFO [train.py:1132] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7ff6d891905ff482364f2d0015867b00d89dd8c7', 'k2-git-date': 'Fri Jun 16 12:10:37 2023', 'lhotse-version': '1.16.0.dev+git.aa073f6a.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'fc2df07-dirty', 'icefall-git-date': 'Wed Aug 16 20:02:41 2023', 'icefall-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/icefall', 'k2-path': '/home/sanjukta/anaconda3/envs/zipf1/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/NAS1/sanjukta_repo_falcon1/zip_exp_6/lhotse/lhotse/__init__.py', 'hostname': 'asus-System-Product-Name', 'IP address': '127.0.1.1'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-mmi_online/30_05_2024'), 'bpe_model': 'data/8k/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/8k/fbank/', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-07 13:12:14,798 INFO [train.py:1134] (1/2) About to create model
2024-11-07 13:12:15,083 INFO [train.py:1138] (0/2) Number of model parameters: 23285615
2024-11-07 13:12:15,085 INFO [train.py:1138] (1/2) Number of model parameters: 23285615
2024-11-07 13:12:15,984 INFO [train.py:1153] (1/2) Using DDP
2024-11-07 13:12:16,069 INFO [train.py:1153] (0/2) Using DDP
```

I executed the same code; it hangs at "Using DDP" and makes no progress after that. I waited for 5 minutes to check, but there was still no progress. The code runs fine on each GPU individually.
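When DDP hangs right after "Using DDP", a useful first step is to turn on NCCL's and torch.distributed's debug logging, and to rule out broken GPU peer-to-peer transport, which is a common cause of first-all-reduce hangs on dual-GPU desktop boards. A sketch, assuming the usual practice of setting these environment variables before `torch.distributed` initializes (the chosen values are suggestions, not the only valid ones):

```python
import os

# Real NCCL / torch.distributed debug switches; set them before launching
# train.py (or export them in the shell that launches it).
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL setup/collective logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra DDP diagnostics

# If the NCCL logs stall inside the first all-reduce, broken PCIe
# peer-to-peer between the two GPUs is a frequent culprit; disabling
# P2P often unblocks consumer dual-GPU machines:
os.environ["NCCL_P2P_DISABLE"] = "1"
```

With `NCCL_DEBUG=INFO`, the last NCCL line printed before the hang usually indicates whether the rendezvous or a collective is stuck.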
Are you able to reproduce it with librispeech?
Yes, this is with librispeech only.
Then why is the data manifest dir data/8k/fbank in your log?
Could you tell us what changes you have made?
I am using different data, but the codebase is the same as the librispeech recipe; no changes, especially in training.
What is the duration distribution of your data?
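For reference, lhotse can print this summary directly from a cut manifest via `CutSet.describe()`; the same statistics can also be computed by hand. A minimal stdlib sketch, where the duration values are placeholders rather than real manifest data:

```python
import statistics

# Placeholder utterance durations in seconds (stand-ins for the real cuts)
durations = [2.1, 3.4, 0.8, 15.2, 7.7, 4.0, 1.2, 30.5, 5.5, 2.9]

# Summary statistics of the kind CutSet.describe() reports
print(f"n={len(durations)}")
print(f"total={sum(durations):.1f}s")
print(f"mean={statistics.mean(durations):.2f}s")
print(f"min={min(durations):.1f}s max={max(durations):.1f}s")
```

Very long utterances relative to `max_duration` (200.0 s here) are worth checking, since they affect how the bucketing sampler forms batches.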
are you able to reproduce it with the librispeech dataset?
It is a small experimental dataset for testing the codebase, placed under the librispeech recipe. The training runs fine on a single GPU.
```
2024-11-05 12:55:26,724 INFO [train.py:1231] (0/2) Training will start from epoch : 1
2024-11-05 12:55:26,725 INFO [train.py:1243] (0/2) Training started
2024-11-05 12:55:26,726 INFO [train.py:1253] (0/2) Device: cuda:0
2024-11-05 12:55:26,728 INFO [train.py:1265] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.24.0.dev+git.866e4a80.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '144163c-clean', 'icefall-git-date': 'Fri Oct 18 14:09:24 2024', 'icefall-path': '/builds/mihup/asr/zipformer/icefall', 'k2-path': '/usr/local/lib/python3.9/dist-packages/k2/__init__.py', 'lhotse-path': '/workspace/lhotse/lhotse/__init__.py', 'hostname': 'runner-t2iavcpo-project-47789012-concurrent-0', 'IP address': '172.17.0.3'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-Hindi/2024-11-05T10:55:25Z'), 'bpe_model': 'data/2024-11-05T10:55:25Z/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'aws_access_key_id': None, 'aws_secret_access_key': None, 'finetune': None, 'av': 9, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/2024-11-05T10:55:25Z/fbank', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-05 12:55:26,728 INFO [train.py:1267] (0/2) About to create model
2024-11-05 12:55:26,733 INFO [train.py:1231] (1/2) Training will start from epoch : 1
2024-11-05 12:55:26,734 INFO [train.py:1243] (1/2) Training started
2024-11-05 12:55:26,734 INFO [train.py:1253] (1/2) Device: cuda:1
2024-11-05 12:55:26,736 INFO [train.py:1265] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.24.0.dev+git.866e4a80.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '144163c-clean', 'icefall-git-date': 'Fri Oct 18 14:09:24 2024', 'icefall-path': '/builds/mihup/asr/zipformer/icefall', 'k2-path': '/usr/local/lib/python3.9/dist-packages/k2/__init__.py', 'lhotse-path': '/workspace/lhotse/lhotse/__init__.py', 'hostname': 'runner-t2iavcpo-project-47789012-concurrent-0', 'IP address': '172.17.0.3'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-Hindi/2024-11-05T10:55:25Z'), 'bpe_model': 'data/2024-11-05T10:55:25Z/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'aws_access_key_id': None, 'aws_secret_access_key': None, 'finetune': None, 'av': 9, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/2024-11-05T10:55:25Z/fbank', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-05 12:55:26,736 INFO [train.py:1267] (1/2) About to create model
2024-11-05 12:55:26,998 INFO [train.py:1271] (0/2) Number of model parameters: 23627887
2024-11-05 12:55:27,047 INFO [train.py:1271] (1/2) Number of model parameters: 23627887
2024-11-05 12:55:27,934 INFO [train.py:1286] (0/2) Using DDP
2024-11-05 12:55:27,986 INFO [train.py:1286] (1/2) Using DDP
Traceback (most recent call last):
  File "/builds/mihup/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1530, in <module>
    main()
  File "/builds/mihup/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1521, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGTERM
WARNING: script canceled externally (UI, API)
```
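For context: `ProcessExitedException: process 1 terminated with signal SIGTERM` is how `mp.spawn` (with `join=True`) surfaces a worker that was killed from outside, which matches the "script canceled externally" warning from the CI runner, so the traceback is a symptom of the cancellation rather than of the training code. A minimal stdlib illustration (not the icefall code) of a worker dying by SIGTERM:

```python
import multiprocessing as mp
import os
import signal
import time


def worker() -> None:
    # Stand-in for a long-running training process
    time.sleep(60)


if __name__ == "__main__":
    proc = mp.Process(target=worker)
    proc.start()
    # Simulate an external cancellation, as a CI runner does on cancel
    os.kill(proc.pid, signal.SIGTERM)
    proc.join()
    # On POSIX, a child killed by a signal reports exitcode == -signum
    print(proc.exitcode)
```

`torch.multiprocessing.spawn` converts such a negative exit code into the `ProcessExitedException` seen above, naming the signal that killed the worker.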