microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License
1.09k stars 113 forks

ASR fine-tuning loss goes to zero after several epochs #75

Closed yunigma closed 2 months ago

yunigma commented 2 months ago

Hello and thank you very much for your project! I want to fine-tune the pre-trained SpeechT5 model on the ASR task with LibriSpeech data (after that, I plan to fine-tune it on some other data). The fine-tuning runs for the set number of epochs (42) without throwing any errors, but after the 18th epoch the loss becomes zero (see the training logs). The fine-tuned model does not produce any meaningful hypotheses.

Here is the command I run (the training logs are also attached here: fine-tune-log):

DATA_ROOT=$ROOT/data/LibriSpeech
SAVE_DIR=$ROOT/exp/finetune_ls
TRAIN_SET=train
VALID_SET=valid
LABEL_DIR=$DATA_ROOT
HUBERT_LABEL_DIR=$DATA_ROOT/hubert_labels/
BPE_TOKENIZER=$ROOT/models/spm_char.model
USER_DIR=$ROOT/SpeechT5/SpeechT5/speecht5
PT_CHECKPOINT_PATH=$ROOT/models/speecht5_base.pt

$cmd -N ${_basename} ${output_dir}/${_basename}.log \
  fairseq-train ${DATA_ROOT} \
    --save-dir ${SAVE_DIR} \
    --tensorboard-logdir ${SAVE_DIR} \
    --train-subset ${TRAIN_SET} \
    --valid-subset ${VALID_SET} \
    --hubert-label-dir ${HUBERT_LABEL_DIR} \
    --user-dir ${USER_DIR} \
    --log-format json \
    --seed 1 \
    --device-id 0 \
    \
    --task speecht5 \
    --t5-task s2t \
    --sample-rate 16000 \
    --num-workers 4 \
    --max-tokens 1600000 \
    --update-freq 2 \
    --bpe-tokenizer ${BPE_TOKENIZER} \
    \
    --criterion speecht5 \
    --report-accuracy \
    --zero-infinity \
    --ce-weight 0.5 \
    --ctc-weight 0.5 \
    --sentence-avg \
    \
    --optimizer adam \
    --adam-betas "(0.9, 0.98)" \
    --adam-eps 1e-08 \
    --weight-decay 0.1 \
    --clip-norm 25.0 \
    --lr 0.00006 \
    --lr-scheduler tri_stage \
    --phase-ratio "[0.1, 0.4, 0.5]" \
    --final-lr-scale 0.05 \
    \
    --max-update 80000 \
    --max-text-positions 600 \
    --required-batch-size-multiple 1 \
    --save-interval-updates 3000 \
    --skip-invalid-size-inputs-valid-test \
    \
    --arch t5_transformer_base_asr \
    --share-input-output-embed \
    --find-unused-parameters \
    --bert-init \
    --relative-position-embedding \
    --freeze-encoder-updates 13000 \
    \
    --keep-last-epochs 10 \
    --feature-grad-mult 1.0 \
    --best-checkpoint-metric s2t_accuracy \
    --maximize-best-checkpoint-metric \
    --finetune-from-model ${PT_CHECKPOINT_PATH}

In the command above, I am mostly unsure about the --hubert-label-dir argument. For inference, it is simply set to the transcriptions (as discussed in issue_15). For fine-tuning, I set it to the labels extracted from HuBERT, but I do not know if that is correct. For the HuBERT labels, I got two types of files: data.len with the length per utterance and data.npy containing the features themselves. I set --hubert-label-dir to the .len files, as the script expected some text input. Could you please clarify these points and help me fix the ASR fine-tuning? Thanks in advance!

Ajyy commented 2 months ago

Hi, --hubert-label-dir should point to the transcriptions for ASR, so it should not be the HuBERT labels extracted by the k-means model. You can just follow issue_15 and libri-label to prepare them.
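
In other words, the label directory for s2t fine-tuning should contain plain-text transcription files aligned line-by-line with the audio manifests. A rough sketch of such a layout, assuming the HuBERT-style manifest convention that SpeechT5 builds on (the exact file names and extensions should be checked against issue_15 and libri-label):

```bash
# Illustrative layout only -- confirm the exact file names/extensions against
# the libri-label preparation scripts.
ls $DATA_ROOT
# train.tsv  valid.tsv   audio manifests: first line is the audio root dir,
#                        then "<relative_path>\t<num_samples>" per utterance
# train.wrd  valid.wrd   plain-text transcriptions, one utterance per line,
#                        in the same order as the corresponding .tsv
# dict.txt               text-side dictionary, if the recipe requires one

# Sanity check: every manifest entry needs exactly one transcription line
# (the .tsv has one extra header line for the audio root).
for split in train valid; do
  echo "$split: $(( $(wc -l < "$DATA_ROOT/$split.tsv") - 1 )) audio entries," \
       "$(wc -l < "$DATA_ROOT/$split.wrd") transcription lines"
done
```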

yunigma commented 2 months ago

Hi @Ajyy, thank you so much! I confirm that everything works fine now. I ran the fine-tuning with the data.wrd files, the training loss looked good, and at inference time I get results very close to those of the released model. Currently, I am running fine-tuning on the other data as well.