microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

About the SpeechT5 pre-training curve #29

Closed benyang0506 closed 1 year ago

benyang0506 commented 1 year ago

Hi, congratulations on this great work! I ran pre-training with the given configuration, but the loss converges quickly (within about 20k updates) and then rises. I don't know whether this is normal. Could you share your pre-training curve? Thanks.

Ajyy commented 1 year ago

Hi, thanks for your attention.

Could you provide the training script and the training log so that we can help check the problem? Also, do you pre-train the model on the same datasets (LibriSpeech and LibriSpeech-LM)?

benyang0506 commented 1 year ago

We use LibriSpeech 960h and LibriSpeech-LM for pre-training; dev-clean is the speech validation set, and the text of dev-clean is the text validation set. But the curve looks like this:

[training-loss curve image]

Here is my training script:

```bash
#!/bin/bash

DATA_ROOT=
SAVE_DIR=
LABEL_DIR=
TRAIN_SET="speech_train|text_train"
VALID_SET="speech_valid|text_valid"

echo $SAVE_DIR

fairseq-train ${DATA_ROOT} \
  --save-dir ${SAVE_DIR} \
  --train-subset ${TRAIN_SET} \
  --valid-subset ${VALID_SET} \
  --hubert-label-dir ${LABEL_DIR} \
  --distributed-world-size 1 \
  --distributed-port 0 \
  --ddp-backend legacy_ddp \
  --user-dir speecht5 \
  --log-format json \
  --seed 1337 \
  --amp \
  \
  --task speecht5 \
  --t5-task pretrain \
  --label-rates 50 \
  --sample-rate 16000 \
  --random-crop \
  \
  --num-workers 0 \
  --max-tokens 1000000 \
  --max-speech-sample-size 100000 \
  --update-freq 2 \
  --batch-ratio "[1,0.0086]" \
  \
  --criterion speecht5 \
  --optimizer adam \
  --reset-optimizer \
  --adam-betas "(0.9, 0.98)" \
  --adam-eps 1e-06 \
  --weight-decay 0.01 \
  --power 1 \
  --clip-norm 5.0 \
  --lr 0.0002 \
  --lr-scheduler polynomial_decay \
  \
  --max-update 800000 \
  --warmup-updates 64000 \
  --total-num-update 800000 \
  --save-interval-updates 3000 \
  --skip-invalid-size-inputs-valid-test \
  --required-batch-size-multiple 1 \
  \
  --arch t5_transformer_base \
  --share-input-output-embed \
  --find-unused-parameters \
  --bert-init \
  --relative-position-embedding \
  --use-codebook \
  --codebook-prob 0.1 \
  --loss-weights="[10,0.1]" \
  --max-text-positions 600 > log.txt
```

Ajyy commented 1 year ago

I think the value of --max-tokens and the number of GPUs are too small. The default value of --max-tokens is 1400000, and we use 32 GPUs to pre-train our model. A smaller batch size makes training unstable, and I think this is the main reason.
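For a rough sense of the gap, the effective batch per optimizer update is approximately max-tokens × number of GPUs × update-freq. A minimal sketch of that arithmetic, assuming the 32-GPU reference run used --update-freq 1 (not stated in this thread):

```bash
# Rough effective batch size (in tokens) per optimizer update:
#   effective_tokens ≈ max_tokens * num_gpus * update_freq

# Reference setup (assumed --update-freq 1):
echo $(( 1400000 * 32 * 1 ))   # 44800000 tokens per update

# Setup from the script above (1 GPU, --max-tokens 1000000, --update-freq 2):
echo $(( 1000000 * 1 * 2 ))    # 2000000 tokens per update

# i.e. roughly 22x fewer tokens per update, which can destabilize
# pre-training when the learning rate schedule is kept unchanged.
```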

Ajyy commented 1 year ago

You can try increasing --max-tokens, the number of GPUs, or --update-freq to get a larger effective batch size (see the sketch below).
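A minimal sketch of what that could look like on a single GPU. The specific values here are illustrative assumptions, not the released recipe; raising --update-freq trades wall-clock time for a larger effective batch via gradient accumulation:

```bash
# Illustrative only: approximate the 32-GPU effective batch on 1 GPU
# by raising --max-tokens (if GPU memory allows) and --update-freq.
MAX_TOKENS=1400000   # default pre-training value mentioned above
UPDATE_FREQ=32       # 1 GPU * 32 accumulation steps ≈ 32 GPUs * 1 step

fairseq-train ${DATA_ROOT} \
  --max-tokens ${MAX_TOKENS} \
  --update-freq ${UPDATE_FREQ}
  # ...plus the remaining flags from the training script above
```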