LibriSpeechVocabulary does not have a blank_id

ghost commented 3 years ago

LibriSpeechVocabulary does not have a self.blank_id instance param

Description

When running on LibriSpeech with conformer-small and subword units, the bug manifests as:

TypeError: ctc_loss() received an invalid combination of arguments - got (Tensor, Tensor, Tensor, Tensor, NoneType, int, bool), but expected one of:
 * (Tensor log_probs, Tensor targets, tuple of ints input_lengths, tuple of ints target_lengths, int blank, int reduction, bool zero_infinity)
      didn't match because some of the arguments have invalid types: (Tensor, Tensor, Tensor, Tensor, NoneType, int, bool)
 * (Tensor log_probs, Tensor targets, Tensor input_lengths, Tensor target_lengths, int blank, int reduction, bool zero_infinity)
      didn't match because some of the arguments have invalid types: (Tensor, Tensor, Tensor, Tensor, NoneType, int, bool)

[To reproduce the bug]

```bash
python ./bin/main.py \
    audio=melspectrogram \
    model=conformer-small \
    train=conformer_small_train \
    audio.audio_extension=flac \
    train.dataset_path=/home/jupyter/LibriSpeech/ \
    train.transcripts_path=/home/dan/KoSpeech/data/train.txt \
    audio_extension=flac \
    audio.feature_extract_by=torchaudio \
    train.dataset=libri
```

[Full Hydra config]

```yaml audio: audio_extension: flac sample_rate: 16000 frame_length: 20 frame_shift: 10 normalize: true del_silence: true feature_extract_by: torchaudio time_mask_num: 4 freq_mask_num: 2 spec_augment: true input_reverse: false transform_method: mel n_mels: 80 freq_mask_para: 18 audio_extension: flac transform_method: mel sample_rate: 16000 frame_length: 20 frame_shift: 10 n_mels: 80 normalize: true del_silence: true feature_extract_by: kaldi freq_mask_para: 18 time_mask_num: 4 freq_mask_num: 2 spec_augment: true input_reverse: false model: architecture: conformer teacher_forcing_ratio: 1.0 teacher_forcing_step: 0.01 min_teacher_forcing_ratio: 0.9 dropout: 0.3 bidirectional: false joint_ctc_attention: false max_len: 400 feed_forward_expansion_factor: 4 conv_expansion_factor: 2 input_dropout_p: 0.1 feed_forward_dropout_p: 0.1 attention_dropout_p: 0.1 conv_dropout_p: 0.1 decoder_dropout_p: 0.1 conv_kernel_size: 31 half_step_residual: true num_decoder_layers: 1 decoder_rnn_type: lstm decoder: None encoder_dim: 144 decoder_dim: 320 num_encoder_layers: 16 num_attention_heads: 4 architecture: conformer teacher_forcing_step: 0.0 min_teacher_forcing_ratio: 1.0 joint_ctc_attention: false feed_forward_expansion_factor: int = 4 conv_expansion_factor: 2 input_dropout_p: 0.1 feed_forward_dropout_p: 0.1 attention_dropout_p: 0.1 conv_dropout_p: 0.1 decoder_dropout_p: 0.1 conv_kernel_size: 31 half_step_residual: true encoder_dim: 144 decoder_dim: 320 num_encoder_layers: 16 num_decoder_layers: 1 num_attention_heads: 4 decoder: None train: dataset: libri dataset_path: /home/dan/LibriSpeech/ transcripts_path: /home/dan/KoSpeech/data/train.txt output_unit: character batch_size: 32 save_result_every: 1000 checkpoint_every: 5000 print_every: 10 mode: train num_workers: 4 use_cuda: true init_lr_scale: 0.01 final_lr_scale: 0.001 max_grad_norm: 400 weight_decay: 1.0e-06 seed: 777 resume: false optimizer: adam reduction: mean lr_scheduler: transformer_lr_scheduler optimizer_betas: - 0.9 - 
0.98 optimizer_eps: 1.0e-09 warmup_steps: 10000 decay_steps: 80000 peak_lr: 0.0001 final_lr: 1.0e-07 num_epochs: 20 ```

Causes

The int blank argument of ctc_loss receives None instead of an int; the value comes from vocab.blank_id.
self.blank_id is never set in the LibriSpeechVocabulary class, so it keeps the value inherited from the Vocabulary class.
self.blank_id is None by default in the Vocabulary class.
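
The chain of causes above can be sketched with stand-in classes. This is a minimal illustration, not the actual kospeech code: the class bodies, the vocab_size argument, the pad/sos/eos values and the require_blank_id helper are all assumptions made for the example. It shows how a subclass that never assigns blank_id silently keeps None, and how an early check would give a clearer message than the TypeError raised inside ctc_loss.

```python
class Vocabulary:
    """Stand-in for the base class: special-token ids default to None."""

    def __init__(self):
        self.sos_id = None
        self.eos_id = None
        self.pad_id = None
        self.blank_id = None


class LibriSpeechVocabulary(Vocabulary):
    """Stand-in for the subclass: sets some ids but never blank_id."""

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # Arbitrary illustrative values, not taken from kospeech.
        self.pad_id = 0
        self.sos_id = 1
        self.eos_id = 2

    def __len__(self):
        return self.vocab_size


def require_blank_id(vocab):
    """Hypothetical guard: fail early instead of inside ctc_loss."""
    if not isinstance(vocab.blank_id, int):
        raise ValueError(
            f"{type(vocab).__name__}.blank_id must be an int, "
            f"got {vocab.blank_id!r}"
        )
    return vocab.blank_id
```

Calling require_blank_id(LibriSpeechVocabulary(vocab_size=5000)) raises a ValueError that names the class and the offending value, which points directly at the missing assignment.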

Possible Fixes

The LibriSpeech dataset seems to be used only in subword settings. By analogy with the KsponSpeechVocabulary class in subword settings, blank_id should be set to len(self) in the LibriSpeechVocabulary class. However, the label set must then be extended by one element to account for the blank label; otherwise the criterion raises RuntimeError: blank must be in label range.
The default blank value of CTCLoss is 0. Another possibility is to give blank and padding the same id and let the network learn to treat them the same way.
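
The first option can be sketched as follows. This is only an illustration under stated assumptions: SubwordVocabulary, its pieces argument, the num_classes property and blank_in_label_range are hypothetical names, not part of kospeech. The range check mirrors the condition behind the blank must be in label range error, namely that the blank index has to be smaller than the number of output classes.

```python
class SubwordVocabulary:
    """Hypothetical vocabulary that reserves one extra id for the CTC blank."""

    def __init__(self, pieces):
        self.pieces = list(pieces)
        # The blank takes the first id after the real subword labels.
        self.blank_id = len(self.pieces)

    def __len__(self):
        return len(self.pieces)

    @property
    def num_classes(self):
        # The model's output layer must cover the subwords plus the blank.
        return len(self) + 1


def blank_in_label_range(blank_id, num_classes):
    """True when blank_id is a valid class index for a CTC criterion."""
    return 0 <= blank_id < num_classes


vocab = SubwordVocabulary(["a", "b", "c"])
# Sizing the output layer with len(vocab) leaves the blank out of range.
too_small = blank_in_label_range(vocab.blank_id, len(vocab))
# Sizing it with num_classes makes the blank a valid label.
large_enough = blank_in_label_range(vocab.blank_id, vocab.num_classes)
```

With three subwords, the blank gets id 3, so the output layer needs four classes; sizing it from len(vocab) alone is what would trigger the range error.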

I am happy to open a pull request once you decide on a fix strategy; don't hesitate to let me know if you want me to handle it.

Best regards,

Dan Ringwald dan.ringwald12@gmail.com

sooftware commented 3 years ago

Hi! This project is focused on Korean ASR. I previously wrote code to support LibriSpeech as well, but it has not been kept up to date through the many recent code changes. Sorry for the inconvenience.

ghost commented 3 years ago

Hello,

Thanks for the super fast reply.

No worries about LibriSpeech. If I manage to make it work, I will just open a pull request.

Best regards,

Dan

sooftware / kospeech