ictnlp / StreamSpeech

StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
https://ictnlp.github.io/StreamSpeech-site/
MIT License

Trained model can generate correct text but incorrect speech #13

Open · opened 1 month ago by chentuochao

chentuochao commented 1 month ago

I tried to reproduce the training of the fr-en simultaneous model. I followed the instructions to prepare the dataset and ran the script train.simul-s2st.sh. Training seems to go fine, but during evaluation of our trained model (using ./simuleval.simul-s2st.sh) weird behavior appears. Here is the training log:

[Screenshot: training log]

During inference, when I run the eval script on the example you provided, the weird thing happens: it outputs the correct text translation, but the output speech is incorrect (almost silent). I print the text output and speech unit output as follows:

[Screenshot: text and speech unit outputs]

Do you know what the problem may be?

Thank you
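
For reference, the evaluation step goes through SimulEval; below is a minimal sketch of the kind of call simuleval.simul-s2st.sh makes, assuming the released StreamSpeech agent, a 320 ms source segment size, and placeholder paths (the exact flags in the released script may differ):

# Hedged sketch of the SimulEval invocation behind simuleval.simul-s2st.sh;
# the agent-specific flags and paths here are assumptions, not the released script.
simuleval \
  --data-bin $DATA_ROOT/fr-en/fbank2unit \
  --agent agent/speech_to_speech.streamspeech.agent.py \
  --model-path checkpoints/streamspeech.simul-s2st.fr-en/checkpoint_best.pt \
  --config-yaml config_gcmvn.yaml \
  --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
  --source wav_list.txt --target target.txt \
  --source-segment-size 320 \
  --output results/streamspeech.simul-s2st.fr-en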

zhangshaolei1998 commented 1 month ago

Have you tried testing directly with the model we provide, and does this happen there as well?

If it does not, the problem may lie in the training setup. Perhaps you can share your training script?

chentuochao commented 1 month ago

Thank you for your kind reply! I also tried your provided pretrained model, and it works well. The weird issue only happens with my own trained model. Here is the training script I am using:

export CUDA_VISIBLE_DEVICES=0

LANG=fr
DATA_ROOT=/scr/data/zhangshaolei/datasets/cvss/cvss-c
DATA=$DATA_ROOT/${LANG}-en/fbank2unit
model=streamspeech.simul-s2st.${LANG}-en

fairseq-train $DATA \
  --user-dir researches/ctc_unity \
  --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
  --task speech_to_speech_ctc --target-is-code --target-code-size 1000 --vocoder code_hifigan  \
  --criterion speech_to_unit_2pass_ctc_asr_st --label-smoothing 0.1 --rdrop-alpha 0.0 \
  --arch streamspeech --share-decoder-input-output-embed \
  --encoder-layers 12 --encoder-embed-dim 256 --encoder-ffn-embed-dim 2048 --encoder-attention-heads 4 \
  --translation-decoder-layers 4 --synthesizer-encoder-layers 2 \
  --decoder-layers 2  --decoder-embed-dim 512 --decoder-ffn-embed-dim 2048 --decoder-attention-heads 8 \
  --k1 0 --k2 0 --n1 1 --n2 -1 \
  --chunk-size 8 --multichunk \
  --uni-encoder \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --ctc-upsample-rate 25 \
  --save-dir checkpoints/$model \
  --validate-interval 1000 --validate-interval-updates 1000 \
  --save-interval 1 --save-interval-updates 1000 \
  --keep-last-epochs 15 \
  --no-progress-bar --log-format json --log-interval 100 \
  --lr 0.001 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 1.0 \
  --max-tokens 22000 --max-target-positions 1200 --update-freq 2 \
  --attn-type espnet --pos-enc-type rel_pos \
  --keep-interval-updates 40 \
  --keep-best-checkpoints 20 \
  --seed 1 --fp16 --num-workers 8 

config_gcmvn.yaml

global_cmvn:
  stats_npz_path: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocoder:
  checkpoint: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000
  config: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json
  type: code_hifigan

config_mtl_asr_st_ctcst.yaml

target_unigram:
   decoder_type: transformer
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
   loss_weight: 8.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 4
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1
source_unigram:
   decoder_type: ctc
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000
   loss_weight: 4.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 0
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1
ctc_target_unigram:
   decoder_type: ctc
   dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
   data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
   loss_weight: 4.0
   rdrop_alpha: 0.0
   decoder_args:
      decoder_layers: 0
      decoder_embed_dim: 512
      decoder_ffn_embed_dim: 2048
      decoder_attention_heads: 8
   label_smoothing: 0.1

I have also attached the model we trained here (https://drive.google.com/file/d/1rdOEt1NSt8oxUBHL0WfM_CCtKczt6TzO/view?usp=share_link).
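
One quick sanity check, since the vocoder in config_gcmvn.yaml is what turns the predicted units into audio: make sure the CMVN stats and vocoder files actually exist where the config points. A small sketch using the paths from the configs above (adjust to your setup):

# Sketch: verify the files referenced by config_gcmvn.yaml are present.
for f in /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/gcmvn.npz \
         ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000 \
         ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json; do
  [ -f "$f" ] && echo "OK       $f" || echo "MISSING  $f"
done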

zhangshaolei1998 commented 1 month ago

There seems to be no problem with the training script. Problems with generating very short speech are often caused by the non-autoregressive text-to-unit generation module. Have you modified that part of the code?

chentuochao commented 1 month ago

Yeah, I think the problem is in the non-autoregressive text-to-unit generation module, but I did not change any part of the training code or model. Do you have any idea what is happening? I will re-download the GitHub repo and train again to see whether I still face the problem, and will update this issue.

zhangshaolei1998 commented 1 month ago

Sorry, I haven't encountered this problem before, so I don't have any experience with solving it yet.

Maybe you can retrain with the latest code and record the final loss; then we can see whether the loss after convergence is within the normal range.
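
Since the training script above logs with --log-format json --log-interval 100, the reported losses can be pulled out of the log directly; a small sketch, assuming training output was redirected to a file such as train.log:

# Sketch: show the last few reported loss values from fairseq's JSON-format log.
# train.log is a placeholder; point it at wherever the training output was saved.
grep -o '"loss": [^,}]*' train.log | tail -n 20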

Lili-q commented 1 month ago

Hello, I also trained a fr-en streaming S2ST model exactly according to the tutorial, without making any changes to the code, and encountered a similar problem.

The streaming ASR result is normal, but the simultaneous translation result is incorrect, the corresponding token sequence is abnormal (very short), and the synthesized audio is less than 1 s long, with almost no sound.

I tested the same source audio using my own trained model and the pre-trained model provided by the author. See the following pictures.

a. Result with my own trained model: [screenshot a_error]

b. Result with the pre-trained model provided by the author: [screenshot b_correct]

Did you solve your problem?

chentuochao commented 1 month ago

Dear authors, I redid the whole pipeline again, but I still have the issue. Here are all the commands we ran after installing the environment:

bash 0.download_pretrain_models.sh

# changed the env variables
bash preprocess.sh

# changed the paths in config_gcmvn.yaml

# copy and paste config_mtl_asr_st_ctcst.yaml to fbank2unit

# changed paths in train.simul-s2st.sh
bash train.simul-s2st.sh

# changed paths in simuleval.simul-s2st.sh
bash simuleval.simul-s2st.sh

Do you know what the potential problem is?

EmreOzkose commented 1 week ago

I have the same issue. Is there any update?

chentuochao commented 1 week ago

Hi Emre, I found that this bug is related to the loss function, and the author pushed the fixed loss in the most recent commit. Just pull it and the problem will be solved.
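
In case it helps others, picking up the fix looks roughly like this (a sketch; retraining is needed so the corrected criterion actually takes effect):

# Sketch: update an existing clone to the commit containing the loss fix, then retrain.
cd StreamSpeech
git pull                   # fetch the latest commit with the fixed loss
bash train.simul-s2st.sh   # retrain so the corrected criterion is used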

EmreOzkose commented 1 week ago

I am training on my own data and have applied the loss bug fix. ASR and translation seem okay (WER decreases to ~30%). However, I still cannot get meaningful audio outputs after the fix; they are very short and sound like noise.

EmreOzkose commented 1 week ago

I use a different HuBERT model to extract the source units. Does that affect this situation?

EmreOzkose commented 1 week ago

That was the problem :). I had misunderstood part of the model. When I switched back to the original HuBERT, the problem was solved.
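
For anyone hitting the same mismatch: the released unit vocoder (mHuBERT.layer11.km1000.en) expects units extracted from mHuBERT layer 11 and quantized with the matching 1000-cluster k-means model, so the HuBERT used for unit extraction has to match the vocoder. A rough sketch of the extraction step using fairseq's speech2unit example (the model filenames and flags here are assumptions; check your fairseq checkout and the StreamSpeech data preparation docs):

# Sketch: quantize target speech into units compatible with the mHuBERT-based vocoder.
# The .pt/.bin filenames are the publicly released mHuBERT and k-means models.
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
  --feature_type hubert \
  --acoustic_model_path mhubert_base_vp_en_es_fr_it3.pt \
  --layer 11 \
  --kmeans_model_path mhubert_base_vp_en_es_fr_it3_L11_km1000.bin \
  --manifest_path target_speech_manifest.tsv \
  --out_quantized_file_path target_units.txt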