hirofumi0810 / neural_sp

End-to-end ASR/LM implementation with PyTorch
Apache License 2.0

Errors in training streaming MMA decoder for librispeech corpus #223

Closed cjw414 closed 3 years ago

cjw414 commented 3 years ago

Hello hiro. First of all, thank you for your awesome project! It's helping me a lot!

Anyway, I was trying to train the MMA decoder for LibriSpeech with the asr_conf

However, just as the model started to train, an error occurred. The error log was as follows:

============================================================================
                                LibriSpeech
============================================================================
============================================================================
                       ASR Training stage (stage:4)
============================================================================
Original utterance num: 281241
Removed 15689 utterances (threshold)
Removed 0 utterances (for CTC)
Original utterance num: 2864
Removed 105 utterances (threshold)
Removed 0 utterances (for CTC)
  0%|                                                                                      | 0/265552 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/jungwonchang/projects1/neural_sp/examples/librispeech/s5/../../../neural_sp/bin/asr/train.py", line 516, in <module>
    save_path = pr.runcall(main)
  File "/home/jungwonchang/projects1/neural_sp/tools/miniconda/lib/python3.7/cProfile.py", line 121, in runcall
    return func(*args, **kw)
  File "/home/jungwonchang/projects1/neural_sp/examples/librispeech/s5/../../../neural_sp/bin/asr/train.py", line 322, in main
    teacher=teacher, teacher_lm=teacher_lm)
  File "/home/jungwonchang/projects1/neural_sp/tools/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/jungwonchang/projects1/neural_sp/tools/miniconda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/jungwonchang/projects1/neural_sp/tools/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data1/jungwonchang/projects/neural_sp/neural_sp/models/seq2seq/speech2text.py", line 256, in forward
    loss, observation = self._forward(batch, task, teacher, teacher_lm)
  File "/mnt/data1/jungwonchang/projects/neural_sp/neural_sp/models/seq2seq/speech2text.py", line 287, in _forward
    batch['trigger_points'])
  File "/home/jungwonchang/projects1/neural_sp/tools/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data1/jungwonchang/projects/neural_sp/neural_sp/models/seq2seq/decoders/transformer.py", line 347, in forward
    eouts, elens, ys, trigger_points=trigger_points)
  File "/mnt/data1/jungwonchang/projects/neural_sp/neural_sp/models/seq2seq/decoders/transformer.py", line 399, in forward_att
    tgt_mask = tgt_mask & causal_mask  # `[B, L (query), L (key)]`
RuntimeError: result type Byte can't be cast to the desired output type Bool
  0%|                                                                                      | 0/265552 [00:00<?, ?it/s]

There was no problem when training with MoChA attention, but this error occurred for streaming MMA. As I mentioned in the previous issue #201, my environment settings are as follows:

CUDA=10.2
pytorch=1.6.0

Is this an environment issue?

Also, do you have any recommendations among the streaming MMA configurations with respect to training speed (i.e., which is fastest to train)?
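
For reference, here is a minimal sketch of the dtype mismatch the traceback points at (shapes are assumed for illustration; this is not the exact neural_sp code):

import torch

# Assumed shapes, just to illustrate the two mask dtypes involved.
B, L = 1, 4
tgt_mask = torch.ones(B, 1, L, dtype=torch.bool)               # target padding mask (bool)
causal_mask = torch.tril(torch.ones(L, L, dtype=torch.uint8))  # subsequent/causal mask (uint8)
causal_mask = causal_mask.unsqueeze(0)                         # `[1, L, L]`

# On PyTorch 1.6 (my environment), combining the bool and uint8 masks with '&'
# raises: RuntimeError: result type Byte can't be cast to the desired output type Bool
# tgt_mask = tgt_mask & causal_mask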

hirofumi0810 commented 3 years ago

This is a bug caused by PyTorch compatibility. You can quickly fix it with causal_mask = causal_mask.byte(). I'll fix it properly later.

cjw414 commented 3 years ago

@hirofumi0810 Thanks for the quick reply!

cjw414 commented 3 years ago

By the way, I think there was a slight mistake in the comment:

You can quickly fix it by causal_mask = causal_mask.byte().

since causal_mask was already a uint8 tensor, as shown below:

(Pdb) causal_mask
tensor([[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0],
         [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0],
         ...
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1]]], device='cuda:0', dtype=torch.uint8)

So I temporarily fixed it with causal_mask = causal_mask.bool() instead!
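
For context, a self-contained sketch of that workaround (shapes assumed; it mirrors the failing line in neural_sp/models/seq2seq/decoders/transformer.py, forward_att()):

import torch

# Assumed shapes, just to illustrate the cast.
B, L = 1, 4
tgt_mask = torch.ones(B, 1, L, dtype=torch.bool)                             # padding mask (bool)
causal_mask = torch.tril(torch.ones(L, L, dtype=torch.uint8)).unsqueeze(0)   # causal mask (uint8)

causal_mask = causal_mask.bool()   # cast uint8 -> bool so both masks have the same dtype
tgt_mask = tgt_mask & causal_mask  # `[B, L (query), L (key)]`, now all bool, no RuntimeError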

hirofumi0810 commented 3 years ago

@Jungwon-Chang Thank you! I'll fix it soon.

hirofumi0810 commented 3 years ago

Fixed by #232.