k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

DDP training / MMI recipe convergence #1156

Open umbertocappellazzo opened 1 year ago

umbertocappellazzo commented 1 year ago

Hi,

I'm trying to run the conformer_ctc recipe for LibriSpeech. If I use a single gpu (i.e., world-size = 1), the recipe does work without any issue.

If I use multiple GPUs, the recipe gets stuck while creating the model:

stek@8c5594d300fb:/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR$ ./conformer_ctc/train.py --world-size 2 --num-epochs 80 --exp-dir conformer_ctc/exp_librifull --att-rate 0. --num-decoder-layers 0 --max-duration 400 --num-workers 6
2023-06-29 17:54:50,720 INFO [train.py:610] (0/2) Training started
2023-06-29 17:54:50,720 INFO [train.py:611] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'weight_decay': 1e-06, 'warm_step': 80000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '8dfba614b516b38f09692b94c69c10a7b9fad6e8', 'k2-git-date': 'Wed May 24 11:59:53 2023', 'lhotse-version': '1.16.0.dev+git.8abbf9f.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': '4ee6006-clean', 'icefall-git-date': 'Wed Jun 28 11:00:11 2023', 'icefall-path': '/cappellazzo/icefall_forked/icefall', 'k2-path': '/home/stek/.local/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/home/stek/.local/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': '8c5594d300fb', 'IP address': '172.17.0.8'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 80, 'start_epoch': 0, 'exp_dir': PosixPath('conformer_ctc/exp_librifull'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'att_rate': 0.0, 'num_decoder_layers': 0, 'lr_factor': 5.0, 'seed': 42, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 400, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 6, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures'}
2023-06-29 17:54:50,722 INFO [train.py:610] (1/2) Training started
2023-06-29 17:54:50,722 INFO [train.py:611] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'weight_decay': 1e-06, 'warm_step': 80000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '8dfba614b516b38f09692b94c69c10a7b9fad6e8', 'k2-git-date': 'Wed May 24 11:59:53 2023', 'lhotse-version': '1.16.0.dev+git.8abbf9f.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': '4ee6006-clean', 'icefall-git-date': 'Wed Jun 28 11:00:11 2023', 'icefall-path': '/cappellazzo/icefall_forked/icefall', 'k2-path': '/home/stek/.local/lib/python3.10/site-packages/k2/__init__.py', 'lhotse-path': '/home/stek/.local/lib/python3.10/site-packages/lhotse/__init__.py', 'hostname': '8c5594d300fb', 'IP address': '172.17.0.8'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 80, 'start_epoch': 0, 'exp_dir': PosixPath('conformer_ctc/exp_librifull'), 'lang_dir': PosixPath('data/lang_bpe_500'), 'att_rate': 0.0, 'num_decoder_layers': 0, 'lr_factor': 5.0, 'seed': 42, 'full_libri': True, 'mini_libri': False, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 400, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 6, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'input_strategy': 'PrecomputedFeatures'}
2023-06-29 17:54:50,946 INFO [lexicon.py:168] (0/2) Loading pre-compiled data/lang_bpe_500/Linv.pt
2023-06-29 17:54:50,962 INFO [lexicon.py:168] (1/2) Loading pre-compiled data/lang_bpe_500/Linv.pt
2023-06-29 17:54:51,213 INFO [train.py:659] (0/2) About to create model
2023-06-29 17:54:51,222 INFO [train.py:659] (1/2) About to create model

Basically, the code gets stuck at "About to create model" and doesn't proceed. What are the requirements for running ddp? I'm working with multiple A40 GPUs.

Thank you

csukuangfj commented 1 year ago

Could you please post the logs after pressing ctrl + C?

csukuangfj commented 1 year ago

By the way, our latest and best-performing recipe is https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer

I suggest you try zipformer instead. You can find the training commands and decoding results at https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#zipformer-zipformer--pruned-stateless-transducer--ctc

umbertocappellazzo commented 1 year ago

^CTraceback (most recent call last):
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/./conformer_ctc/train.py", line 819, in <module>
    main()
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/./conformer_ctc/train.py", line 810, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 109, in join
    ready = multiprocessing.connection.wait(
  File "/opt/conda/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
^CException ignored in atexit callback: <function _exit_function at 0x7f0419dbb7f0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt:

This is the log after the keyboard interruption.

Btw, right now I'm interested in conformer_ctc since my colleagues are working with this kind of pipeline outside icefall, so we want to be consistent. Happy to switch to better models in the future, though.

csukuangfj commented 1 year ago

Are there any more logs? Sorry, I cannot figure out what happened from the above logs.

csukuangfj commented 1 year ago

If there are no more logs, I suggest that you use py-spy to get the call stack and find out where it gets stuck.

py-spy: https://github.com/benfred/py-spy

(Note: We are not using py-spy for profiling. We only need py-spy dump --pid <your_pid>)

umbertocappellazzo commented 1 year ago

No additional logs, unfortunately. I'll try with py-spy and let you know. Btw, are there any requirements in icefall for running ddp? I can check if the server complies with them. Since the server is quite new and nobody has used ddp before, maybe some installations are required.

csukuangfj commented 1 year ago

Since the server is quite new and nobody has used ddp before, maybe some installations are required.

Have you tried to run PyTorch DDP training before without icefall?

umbertocappellazzo commented 1 year ago

Nope, will try with a simple pytorch ddp script then.
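
For anyone who hits the same hang, a minimal DDP smoke test along these lines (illustrative only, not part of icefall) can show whether plain PyTorch DDP works on the machine at all:

# ddp_sanity_check.py -- minimal DDP smoke test (illustrative, not from icefall)
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # A tiny model wrapped in DDP; the icefall run hangs around this point.
    model = DDP(torch.nn.Linear(10, 10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(5):
        optimizer.zero_grad()
        loss = model(torch.randn(8, 10, device=f"cuda:{rank}")).sum()
        loss.backward()  # triggers the NCCL all-reduce across GPUs
        optimizer.step()

    print(f"rank {rank}: DDP ok")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)

If this also hangs at the DDP wrapping step, the problem is in the NCCL/GPU setup rather than in the icefall recipe.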

umbertocappellazzo commented 1 year ago

Hi FangJun, I managed to solve the problem with ddp: basically, the A40 GPUs cause some problems by default, and a certain prefix must be prepended to the command to make it work. On multiple V100s, for example, everything works fine.
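
For reference, the prefix in question is the NCCL_P2P_LEVEL=NVL environment variable that appears in the training command below; setting it programmatically would look roughly like this (a sketch, assuming it is set in the launching process before the workers are spawned):

import os

# Restrict NCCL peer-to-peer transfers to NVLink-connected GPU pairs; without
# this, DDP hung at model creation on the A40 node. Equivalent to prefixing
# the training command with NCCL_P2P_LEVEL=NVL.
os.environ["NCCL_P2P_LEVEL"] = "NVL"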

Apart from this, I'm trying to make the conformer_mmi recipe for LibriSpeech work, but I get this error:

tal avg loss: 0.3903, batch size: 13
2023-07-07 08:51:41,608 INFO [train.py:584] (0/3) Epoch 1, batch 1000, batch avg mmi loss 0.4202, batch avg att loss 0.0000, batch avg loss 0.4202, total avg mmi loss: 0.3952, total avg att loss: 0.0000, total avg loss: 0.3952, batch size: 12
2023-07-07 08:51:41,608 INFO [train.py:584] (1/3) Epoch 1, batch 1000, batch avg mmi loss 0.3977, batch avg att loss 0.0000, batch avg loss 0.3977, total avg mmi loss: 0.3912, total avg att loss: 0.0000, total avg loss: 0.3912, batch size: 19
2023-07-07 08:51:41,611 INFO [train.py:584] (2/3) Epoch 1, batch 1000, batch avg mmi loss 0.5049, batch avg att loss 0.0000, batch avg loss 0.5049, total avg mmi loss: 0.3964, total avg att loss: 0.0000, total avg loss: 0.3964, batch size: 14
[I] /home/runner/work/k2/k2/k2/csrc/intersect_dense.cu:314:k2::FsaVec k2::MultiGraphDenseIntersect::FormatOutput(k2::Array1, k2::Array1) Num-arcs 2172024179 exceeds limit 2147483600, decreasing beam from 6.000000 to 4.500000
Traceback (most recent call last):
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/./conformer_mmi/train.py", line 869, in <module>
    main()
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/./conformer_mmi/train.py", line 860, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/conformer_mmi/train.py", line 821, in run
    train_one_epoch(
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/conformer_mmi/train.py", line 635, in train_one_epoch
    compute_validation_loss(
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/conformer_mmi/train.py", line 456, in compute_validation_loss
    loss, mmi_loss, att_loss = compute_loss(
  File "/cappellazzo/icefall_forked/icefall/egs/librispeech/ASR/conformer_mmi/train.py", line 409, in compute_loss
    mmi_loss = loss_fn(dense_fsa_vec=dense_fsa_vec, texts=texts)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/cappellazzo/icefall_forked/icefall/icefall/mmi.py", line 215, in forward
    return func(
  File "/cappellazzo/icefall_forked/icefall/icefall/mmi.py", line 118, in _compute_mmi_loss_exact_non_optimized
    den_lats = k2.intersect_dense(
  File "/home/stek/.local/lib/python3.10/site-packages/k2/autograd.py", line 805, in intersect_dense
    _IntersectDenseFunction.apply(a_fsas, b_fsas, out_fsa, output_beam,
  File "/home/stek/.local/lib/python3.10/site-packages/k2/autograd.py", line 562, in forward
    ragged_arc, arc_map_a, arc_map_b = _k2.intersect_dense(
ValueError: cannot create std::vector larger than max_size()

I'm using this command:

NCCL_P2P_LEVEL=NVL python3 ./conformer_mmi/train.py --world-size 3 --num-epochs 40 --exp-dir ./conformer_mmi/exp --full-libri False

Any clue on how to solve this issue? I tried to reduce the max-duration and got the same error (the default is 200).

I talked with Povey here at JSALT 2023, and he told me that it should be possible to change some parameters related to the arcs, etc.

Thanks!

umbertocappellazzo commented 1 year ago

Also, Dan mentioned that it could be useful to turn off the mmi loss during the very first epochs and just use ctc, and then switch to mmi. I remember this is done in other recipes. Could it be a possible solution?

csukuangfj commented 1 year ago

Also, Dan mentioned that it could be useful to turn off the mmi loss during the very first epochs and just use ctc, and then switch to mmi. I remember this is done in other recipes. Could it be a possible solution?

Yes, Dan is right.

Please have a look at the command-line argument of train.py:

--use-pruned-intersect

If you set it to True, it can further reduce RAM usage.
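
For context, this flag switches the denominator lattice computation in icefall/mmi.py from the exact k2.intersect_dense (the call that fails in the traceback above) to the pruned k2.intersect_dense_pruned. A rough sketch of the two paths, with illustrative beam and active-state values rather than the recipe's defaults:

import k2


def compute_den_lats(den_graphs: k2.Fsa, dense_fsa_vec: k2.DenseFsaVec, use_pruned: bool) -> k2.Fsa:
    """Sketch of the exact vs. pruned denominator intersection."""
    if not use_pruned:
        # Exact intersection: keeps the full denominator lattice, which is what
        # overflows the ~2.1e9 arc limit in the error above.
        return k2.intersect_dense(den_graphs, dense_fsa_vec, output_beam=6.0)
    # Pruned intersection: prunes frame by frame, bounding the lattice size
    # (and hence RAM) at the cost of an approximate denominator score.
    return k2.intersect_dense_pruned(
        den_graphs,
        dense_fsa_vec,
        search_beam=20.0,
        output_beam=6.0,
        min_active_states=30,
        max_active_states=10000,
    )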

csukuangfj commented 1 year ago

Also, Dan mentioned that it could be useful to turn off the mmi loss during the very first epochs and just use ctc, and then switch to mmi. I remember this is done in other recipes. Could it be a possible solution?

Yes, that is also possible.

CC @yaozengwei . Zengwei has implemented MMI with zipformer. Please try that recipe.

umbertocappellazzo commented 1 year ago

Also, Dan mentioned that it could be useful to turn off the mmi loss during the very first epochs and just use ctc, and then switch to mmi. I remember this is done in other recipes. Could it be a possible solution?

Yes, Dan is right.

Please have a look at the command-line argument of train.py:

--use-pruned-intersect

If you set it to True, it can further reduce RAM usage.

I've already tried it, but the validation loss is very bad and it diverges after some iterations.

umbertocappellazzo commented 1 year ago

Also, Dan mentioned that it could be useful to turn off the mmi loss during the very first epochs and just use ctc, and then switch to mmi. I remember this is done in other recipes. Could it be a possible solution?

Yes, that is also possible.

CC @yaozengwei . Zengwei has implemented MMI with zipformer. Please try that recipe.

The problem is that, for my experiments with early exit, I need to compute the ctc or mmi loss at some intermediate layers. Desh told me that doing something like that is not possible with the zipformer. Also, I saw that the zipformer works with transducers, whereas in my case, as for conformer_ctc, I just need the encoder plus a linear layer and then compute the ctc/mmi loss. Maybe I can try to compute the ctc loss for the first iterations and then switch to mmi in the conformer_mmi recipe, i.e. adapt that recipe accordingly. But I remember you suggested that I should switch to the zipformer recipes; for me that would be fine, provided that I can carry out early exit and dispense with the transducer decoder.

csukuangfj commented 1 year ago

Then I recommend trying the second suggestion from Dan: only apply the MMI loss after the model has been trained with the CTC or transducer loss for some batches/epochs.
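
A minimal sketch of such a schedule (the epoch threshold and the two loss callables are placeholders, not the recipe's actual names):

from typing import Callable

import torch

MMI_START_EPOCH = 5  # e.g. train with CTC only for the first 5 epochs


def compute_loss(
    epoch: int,
    ctc_loss: Callable[[], torch.Tensor],
    mmi_loss: Callable[[], torch.Tensor],
) -> torch.Tensor:
    # Early epochs: plain CTC avoids the huge denominator lattices and gives
    # the encoder a stable start.
    if epoch < MMI_START_EPOCH:
        return ctc_loss()
    # Later epochs: switch to the LF-MMI objective.
    return mmi_loss()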

desh2608 commented 1 year ago

I agree with @csukuangfj's comment. If you already have a model trained with CTC, you can initialize the encoder from there and continue training with MMI loss.
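
A minimal sketch of that warm start (the checkpoint path, the "model" key, and the "encoder" parameter prefix are assumptions about the checkpoint layout, not something the recipe guarantees):

import torch


def load_encoder_from_ctc_checkpoint(model: torch.nn.Module, ckpt_path: str) -> None:
    """Copy encoder weights from a CTC-trained checkpoint into `model`."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)
    # Keep only encoder weights so any differently-shaped heads start fresh.
    encoder_state = {k: v for k, v in state_dict.items() if k.startswith("encoder")}
    missing, unexpected = model.load_state_dict(encoder_state, strict=False)
    print(f"loaded {len(encoder_state)} encoder tensors; "
          f"{len(missing)} missing, {len(unexpected)} unexpected keys")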

danpovey commented 1 year ago

guys, I noticed dense_intersect has max_states and max_arcs options, that may not be used in the MMI recipe, but it seems to me we could solve this problem in a more general way by using those options-- perhaps they were not available at the time we were working on MMI.

The max_arcs could be set to, for example, 100 million, and max_states to 10 million.
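
A hedged sketch of what that would look like at the k2.intersect_dense call site in icefall/mmi.py (the graph and lattice variable names are illustrative; the limits follow Dan's example values and the beam matches the 6.0 shown in the error log):

import k2


def compute_den_lats_capped(den_graphs: k2.Fsa, dense_fsa_vec: k2.DenseFsaVec) -> k2.Fsa:
    # Cap the denominator lattice so FormatOutput never hits the ~2.1e9 arc
    # limit reported in the error above.
    return k2.intersect_dense(
        den_graphs,
        dense_fsa_vec,
        output_beam=6.0,
        max_states=10_000_000,   # ~10 million states
        max_arcs=100_000_000,    # ~100 million arcs
    )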

umbertocappellazzo commented 1 year ago

Quick update: the conformer_mmi recipe initialized from a checkpoint trained with the ctc loss (5 epochs) seems to work fine now; no strange errors, and the curves are reasonable.

desh2608 commented 1 year ago

guys, I noticed dense_intersect has max_states and max_arcs options, that may not be used in the MMI recipe, but it seems to me we could solve this problem in a more general way by using those options-- perhaps they were not available at the time we were working on MMI.

The max_arcs could be set to, for example, 100 million, and max_states to 10 million.

Yeah, I noticed those options, but I was worried that it would blow up GPU memory and we would have to reduce the batch size too much (which may be bad for convergence, especially since Umberto is training on just 1 GPU, I believe).

umbertocappellazzo commented 1 year ago

guys, I noticed dense_intersect has max_states and max_arcs options, that may not be used in the MMI recipe, but it seems to me we could solve this problem in a more general way by using those options-- perhaps they were not available at the time we were working on MMI. The max_arcs could be set to, for example, 100 million, and max_states to 10 million.

Yeah, I noticed those options, but I was worried that it would blow up GPU memory and we would have to reduce the batch size too much (which may be bad for convergence, especially since Umberto is training on just 1 GPU, I believe).

Now I can use up to 4 (even 6 or 8) A40 GPUs, so I don't have the single-GPU constraint any longer.

Anyway, when I was using the mmi loss from the very beginning, I noticed that the learning curves were pretty irregular and fluctuating. Now, thanks to the ctc warm-up, the trend is smooth and regular. I have a hunch that the ctc warm-up leads to better results than using mmi from the beginning, even if you fix the issue of the arcs.