openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/
MIT License
670 stars 112 forks

How long does the training take? #204

Closed EomSooHwan closed 1 year ago

EomSooHwan commented 1 year ago

❓ Questions & Help

Could you let me know how long the training should take for each epoch?

Details

The model I am training is a Conformer-small encoder-only model, and I am using LibriSpeech 960h with a character-level tokenizer. The batch size is 32, and I am running the code on two RTX 8000 GPUs. More details are below. Training takes about 8 hours per epoch, which seems quite long, even considering the dataset size.

audio:
  name: melspectrogram
  sample_rate: 16000
  frame_length: 20.0
  frame_shift: 10.0
  del_silence: false
  num_mels: 80
  apply_spec_augment: true
  apply_noise_augment: false
  apply_time_stretch_augment: false
  apply_joining_augment: false
augment:
  apply_spec_augment: false
  apply_noise_augment: false
  apply_joining_augment: false
  apply_time_stretch_augment: false
  freq_mask_para: 27
  freq_mask_num: 2
  time_mask_num: 4
  noise_dataset_dir: None
  noise_level: 0.7
  time_stretch_min_rate: 0.7
  time_stretch_max_rate: 1.4
dataset:
  dataset: librispeech
  dataset_path: $DATASET_PATH
  dataset_download: false
  manifest_file_path: $MANIFEST_PATH
criterion:
  criterion_name: ctc
  reduction: mean
  zero_infinity: true
lr_scheduler:
  lr: 0.0001
  scheduler_name: warmup_reduce_lr_on_plateau
  lr_patience: 10
  lr_factor: 0.3
  peak_lr: 0.001
  init_lr: 1.0e-05
  warmup_steps: 20000
model:
  model_name: conformer
  encoder_dim: 144
  num_encoder_layers: 16
  num_attention_heads: 4
  feed_forward_expansion_factor: 4
  conv_expansion_factor: 2
  input_dropout_p: 0.1
  feed_forward_dropout_p: 0.1
  attention_dropout_p: 0.1
  conv_dropout_p: 0.1
  conv_kernel_size: 31
  half_step_residual: true
  optimizer: adam
trainer:
  seed: 1
  accelerator: dp
  accumulate_grad_batches: 1
  num_workers: 4
  batch_size: 32
  check_val_every_n_epoch: 1
  gradient_clip_val: 3.0
  logger: wandb
  max_epochs: 20
  save_checkpoint_n_steps: 10000
  auto_scale_batch_size: binsearch
  sampler: else
  name: gpu-fp16
  device: gpu
  use_cuda: true
  auto_select_gpus: true
  precision: 16
  amp_backend: apex
  apex_backend: native
tokenizer:
  sos_token:
  eos_token:
  pad_token:
  blank_token:
  encoding: utf-8
  unit: libri_character
  vocab_path: $VOCAB_PATH
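For context, here is a rough back-of-the-envelope calculation of what 8 hours per epoch implies per optimizer step. The utterance count (~281k for LibriSpeech train-960) is an assumption taken from the published corpus statistics, not something in this config:

```python
# Rough sanity check (not part of openspeech): estimate seconds per optimizer
# step from the numbers in this issue.
num_utterances = 281_000   # assumption: approx. LibriSpeech 960h training utterances
batch_size = 32            # from trainer.batch_size above
epoch_seconds = 8 * 3600   # ~8 hours per epoch as observed

steps_per_epoch = num_utterances / batch_size
seconds_per_step = epoch_seconds / steps_per_epoch
print(f"{steps_per_epoch:.0f} steps/epoch, {seconds_per_step:.2f} s/step")
# -> roughly 8.8k steps/epoch at ~3.3 s per step
```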

upskyy commented 1 year ago

Speech recognition tends to take a long time to train because the input sequences are long. If you want to reduce the training time, I recommend increasing the batch size or cutting the sequence length short before training.
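For example, one way to cut the sequence length is to drop the longest utterances from the manifest before training. This is only a sketch, not an openspeech API; it assumes a tab-separated manifest whose first column is the audio path, and the file names and the 16-second threshold are placeholders you would adjust to your setup:

```python
# Hypothetical helper (not part of openspeech): drop utterances longer than
# a duration threshold from a manifest file before training.
import soundfile as sf

def filter_manifest(src: str, dst: str, max_seconds: float = 16.0) -> None:
    kept, dropped = 0, 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            # Assumes the audio path is the first tab-separated field;
            # adjust the parsing to your actual manifest layout.
            audio_path = line.split("\t", 1)[0]
            info = sf.info(audio_path)
            duration = info.frames / info.samplerate
            if duration <= max_seconds:
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} utterances, dropped {dropped} longer than {max_seconds}s")

filter_manifest("train_manifest.txt", "train_manifest_short.txt")
```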

EomSooHwan commented 1 year ago

I do understand that it is natural for training to take a long time, but even accounting for that, 9 hours per epoch still seems too long... I will try increasing the batch size and see if it helps. Also, do you know whether there is an option in the configuration to set a maximum sequence length?