tencent-ailab / pika

a lightweight speech processing toolkit based on Pytorch and (Py)Kaldi
Apache License 2.0

How to build NN LM #6

Open Durgesh92 opened 3 years ago

Durgesh92 commented 3 years ago

I have trained a custom model and would like to know how you built the NN LM.

cweng6 commented 3 years ago

If you have trained an RNNT model, you should be able to run recognition directly, since it's an end-to-end model; an LM is not mandatory for decoding. If you are talking about training an external NN LM, the pipeline is pretty much the same as any language model training pipeline in PyTorch, nothing special.
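For reference, a generic PyTorch language model training step might look like the sketch below. This is not code from pika: the class name `TinyLSTMLM`, the function `train_step`, the layer sizes, and the choice of 0 as the padding index are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyLSTMLM(nn.Module):
    """Hypothetical LSTM language model: embedding -> LSTM -> vocabulary logits."""

    def __init__(self, vocab_size, embd_dim=100, hidden=256, layers=2, pad_idx=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embd_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embd_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, time) integer ids -> logits: (batch, time, vocab_size)
        out, _ = self.lstm(self.embed(tokens))
        return self.proj(out)


def train_step(model, optimizer, batch, pad_idx=0, grad_clip=3.0):
    """One next-token prediction step on a padded (batch, time) tensor of ids."""
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    # Padding positions in the targets are excluded from the loss.
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
    )
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    return loss.item()
```

The token ids would come from the same label inventory the acoustic model uses, so that the LM scores can later be combined with the recognizer's hypotheses.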

Durgesh92 commented 3 years ago

Yes, I was talking about an external NN LM. The performance is not up to the mark with RNNT, or even with LAS, so I thought of integrating an external LM.
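One common way to integrate an external LM is to rescore the recognizer's n-best list with a weighted sum of the two log-scores. The sketch below only illustrates that idea and is not pika's decoder: `rescore_nbest`, `lm_score_fn`, `lm_weight`, and `length_bonus` are hypothetical names, and the scores in the usage example are made up.

```python
def rescore_nbest(nbest, lm_score_fn, lm_weight=0.3, length_bonus=0.0):
    """Return the (tokens, asr_log_prob) pair with the best combined score.

    nbest: list of (tokens, asr_log_prob) pairs from the end-to-end decoder.
    lm_score_fn: callable giving the external LM log-probability of a token list.
    """
    def combined(item):
        tokens, asr_logp = item
        # Log-linear interpolation, plus an optional per-token bonus.
        return asr_logp + lm_weight * lm_score_fn(tokens) + length_bonus * len(tokens)

    return max(nbest, key=combined)


# Toy usage with invented scores: the LM prefers "the cat sat".
toy_lm = {"the cat sat": -2.0, "the cat sad": -6.0}
nbest = [("the cat sad".split(), -4.0), ("the cat sat".split(), -4.5)]
best = rescore_nbest(nbest, lambda toks: toy_lm[" ".join(toks)], lm_weight=0.5)
```

Here the recognizer alone ranks "the cat sad" first, while the combined score (-5.5 versus -7.0) selects "the cat sat". The LM weight would normally be tuned on a development set.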

cweng6 commented 3 years ago

Could you elaborate on what kind of training data you used for RNNT training, and at what scale (number of hours)? As listed in the README, the default hyper-parameters of the example were chosen for large-scale training. If you are using several hundred to a thousand hours of data, you will definitely need to re-tune.

BTW, MBR and the LAS rescorer should boost performance further, but it looks like you need a decent baseline first.

Durgesh92 commented 3 years ago

It's around 240 hours of data from CommonVoice. My training parameters are:

      --verbose \
      --optim sgd \
      --initial_lr 0.003 \
      --final_lr 0.0001 \
      --grad_clip 3.0 \
      --num_batches_per_epoch 10000 \
      --num_epochs 30 \
      --momentum 0.9 \
      --block_momentum 0.9 \
      --sync_period 5 \
      --feats_dim 80 \
      --cuda \
      --batch_size 1 \
      --encoder_type transformer \
      --enc_layers 9 \
      --decoder_type rnn \
      --dec_layers 2 \
      --rnn_type LSTM \
      --rnn_size 1024 \
      --embd_dim 100 \
      --dropout 0.2 \
      --brnn \
      --padding_idx 33 \
      --padding_tgt 33 \
      --stride 1 \
      --queue_size 4 \
      --loader otf_utt \
      --batch_first \
      --cmn \
      --cmvn_stats $exp_dir/global_cmvn.stats \
      --output_dim 33 \
      --num_workers 1 \
      --sample_rate 16000 \
      --feat_config conf/fbank.conf \
      --TU_limit 15000 \
      --gain_range 50,10 \
      --speed_rate 0.9,1.0,1.1 \
      --log_per_n_frames 131072  \
      --max_len 1600 \
      --lctx 1 --rctx 1 \
      --model_lctx 21 --model_rctx 21 \
      --model_stride 4 \

What changes do you recommend to get good baseline results?

cweng6 commented 3 years ago

What's your final converged RNN loss? In general, with only 240 hours of data I don't recommend using the 9-layer TDNN-transformer model structure. We are actually working on a recipe that should work well with small-scale training data; it should be ready in a few weeks or so. Here are a few things you could try:

I. --encoder_type transformer -> rnn

II. --enc_layers 9 -> 6

III. Change --model_stride to 3. Note that you will also need to change the last convolution layer's stride to 3 here:

https://github.com/tencent-ailab/pika/blob/23c4cddef4392bc035207187d3b5653e9a3f083e/trainer/model/rnnt_tdnn_transformer.py#L56

IV. Also try reducing the model parameters further: tdnn_nhid*4 -> tdnn_nhid

https://github.com/tencent-ailab/pika/blob/23c4cddef4392bc035207187d3b5653e9a3f083e/trainer/model/rnnt_tdnn_transformer.py#L65
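To get a feel for what the stride change in III does to the encoder's frame rate, here is a rough sketch using the standard 1-D convolution output-length formula. The kernel size of 3 and the zero padding are assumptions for illustration, not values read from `rnnt_tdnn_transformer.py`; the 1600-frame input matches `--max_len 1600` above.

```python
def conv1d_out_len(n_frames, kernel, stride, padding=0, dilation=1):
    """Output length of a 1-D convolution (same formula torch.nn.Conv1d documents)."""
    return (n_frames + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# Hypothetical final convolution with kernel size 3 over a 1600-frame utterance.
frames_stride4 = conv1d_out_len(1600, kernel=3, stride=4)  # 400
frames_stride3 = conv1d_out_len(1600, kernel=3, stride=3)  # 533
```

If the features use a 10 ms frame shift (an assumption, since `conf/fbank.conf` is not shown), stride 4 gives roughly one encoder frame per 40 ms and stride 3 roughly one per 30 ms, so the encoder output sequence gets about a third longer for the same utterance.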