Open chentuochao opened 1 month ago
Have you tried testing directly with the model we provide, and does the problem still occur?
If not, the problem may lie in the training scripts. Could you share them?
Thank you for your kind reply! I also tried your pretrained model, and it works well. The weird issue only happens with the model I trained myself. Here is the training script I am using:
export CUDA_VISIBLE_DEVICES=0
LANG=fr
DATA_ROOT=/scr/data/zhangshaolei/datasets/cvss/cvss-c
DATA=$DATA_ROOT/${LANG}-en/fbank2unit
model=streamspeech.simul-s2st.${LANG}-en
fairseq-train $DATA \
--user-dir researches/ctc_unity \
--config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
--task speech_to_speech_ctc --target-is-code --target-code-size 1000 --vocoder code_hifigan \
--criterion speech_to_unit_2pass_ctc_asr_st --label-smoothing 0.1 --rdrop-alpha 0.0 \
--arch streamspeech --share-decoder-input-output-embed \
--encoder-layers 12 --encoder-embed-dim 256 --encoder-ffn-embed-dim 2048 --encoder-attention-heads 4 \
--translation-decoder-layers 4 --synthesizer-encoder-layers 2 \
--decoder-layers 2 --decoder-embed-dim 512 --decoder-ffn-embed-dim 2048 --decoder-attention-heads 8 \
--k1 0 --k2 0 --n1 1 --n2 -1 \
--chunk-size 8 --multichunk \
--uni-encoder \
--dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
--train-subset train --valid-subset dev \
--ctc-upsample-rate 25 \
--save-dir checkpoints/$model \
--validate-interval 1000 --validate-interval-updates 1000 \
--save-interval 1 --save-interval-updates 1000 \
--keep-last-epochs 15 \
--no-progress-bar --log-format json --log-interval 100 \
--lr 0.001 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
--optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 1.0 \
--max-tokens 22000 --max-target-positions 1200 --update-freq 2 \
--attn-type espnet --pos-enc-type rel_pos \
--keep-interval-updates 40 \
--keep-best-checkpoints 20 \
--seed 1 --fp16 --num-workers 8
config_gcmvn.yaml
global_cmvn:
  stats_npz_path: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/gcmvn.npz
input_channels: 1
input_feat_per_channel: 80
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - global_cmvn
  _train:
  - global_cmvn
  - specaugment
vocoder:
  checkpoint: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/g_00500000
  config: ./pretrained_models/unit-based_HiFi-GAN_vocoder/mHuBERT.layer11.km1000.en/config.json
  type: code_hifigan
config_mtl_asr_st_ctcst.yaml
target_unigram:
  decoder_type: transformer
  dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
  data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
  loss_weight: 8.0
  rdrop_alpha: 0.0
  decoder_args:
    decoder_layers: 4
    decoder_embed_dim: 512
    decoder_ffn_embed_dim: 2048
    decoder_attention_heads: 8
  label_smoothing: 0.1
source_unigram:
  decoder_type: ctc
  dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000/spm_unigram_fr.txt
  data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/src_unigram6000
  loss_weight: 4.0
  rdrop_alpha: 0.0
  decoder_args:
    decoder_layers: 0
    decoder_embed_dim: 512
    decoder_ffn_embed_dim: 2048
    decoder_attention_heads: 8
  label_smoothing: 0.1
ctc_target_unigram:
  decoder_type: ctc
  dict: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000/spm_unigram_fr.txt
  data: /scr/data/zhangshaolei/datasets/cvss/cvss-c/fr-en/tgt_unigram6000
  loss_weight: 4.0
  rdrop_alpha: 0.0
  decoder_args:
    decoder_layers: 0
    decoder_embed_dim: 512
    decoder_ffn_embed_dim: 2048
    decoder_attention_heads: 8
  label_smoothing: 0.1
I also attach the model we trained here (https://drive.google.com/file/d/1rdOEt1NSt8oxUBHL0WfM_CCtKczt6TzO/view?usp=share_link)
There seems to be no problem with training scripts. Problems with generating short speech are often caused by the non-autoregressive text-to-unit generation module. I wonder if you have modified this part of the code?
Yeah, I think the problem is in the non-autoregressive text-to-unit generation module. I did not change any part of the training code or the model. Do you have any idea what is happening? I will re-download the GitHub repo and train again to see whether I am still facing the problem, and I will update this issue.
Sorry, I haven't encountered this problem before, so I don't yet know how to solve it.
Maybe you can retrain with the latest code and record the final loss. We can see whether the loss after convergence is within the normal range.
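One way to record the loss, as suggested above, is to pull it out of the training log. The training command uses `--log-format json`, so the sketch below assumes each relevant log line contains a JSON object, possibly preceded by a timestamp prefix. The key names `loss` and `num_updates` and the sample lines are assumptions for illustration, not taken from an actual log in this thread.

```python
import json


def extract_losses(log_lines, key="loss"):
    """Return (num_updates, loss) pairs from log lines that embed a JSON object.

    Assumption: the JSON object starts at the first "{" on the line and runs
    to the end of the line. Lines that do not fit this shape are skipped.
    """
    points = []
    for line in log_lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or key not in record:
            continue
        try:
            loss = float(record[key])  # values may be logged as strings
        except (TypeError, ValueError):
            continue
        points.append((record.get("num_updates"), loss))
    return points


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    '2024-01-01 00:00:00 | INFO | train_inner | {"epoch": 1, "loss": "12.5", "num_updates": "100"}',
    "some non-JSON line",
    '{"epoch": 1, "loss": "9.25", "num_updates": "200"}',
]

for step, loss in extract_losses(sample):
    print(step, loss)
```

Running the same function on the logs of a working and a failing run would make it easier to compare the converged loss values side by side.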
Hello, I also trained a fr-en streaming S2ST model exactly as described in the tutorial, without making any changes to the code, and ran into a problem similar to yours.
The streaming ASR result is normal, but the simultaneous translation result is incorrect: the corresponding token sequence is abnormal (very short), and the synthesized audio is less than 1 s long, with almost no sound.
I tested the same source audio using my own trained model and the pre-trained model provided by the author. See the following pictures.
a. Result on my own trained model:
b. Results on the pre-trained model provided by the author
Did you solve your problem?
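The symptom described above (a very short unit sequence and under 1 s of audio) can be checked numerically before listening to the output. The sketch below assumes each discrete unit covers about 20 ms, which is the usual HuBERT frame stride, and it ignores any duration prediction inside the vocoder, so the result is only a rough estimate. The threshold `min_ratio` and the example unit IDs are hypothetical.

```python
FRAME_SECONDS = 0.02  # assumption: one unit per 20 ms frame


def estimate_unit_duration(unit_string, frame_seconds=FRAME_SECONDS):
    """Return (number of units, rough duration in seconds) for a unit string."""
    units = [int(tok) for tok in unit_string.split()]
    return len(units), len(units) * frame_seconds


def looks_truncated(unit_string, source_seconds, min_ratio=0.3):
    """Flag outputs whose estimated duration is far below the source duration.

    min_ratio is a hypothetical threshold chosen for illustration.
    """
    _, estimated = estimate_unit_duration(unit_string)
    return estimated < min_ratio * source_seconds


# Made-up unit IDs, only to show the input format (space-separated integers).
count, seconds = estimate_unit_duration("63 991 162 73 338")
print(count, round(seconds, 2))
print(looks_truncated("63 991 162 73 338", source_seconds=5.0))
```

A translation of a multi-second utterance that yields only a handful of units would be flagged here, which matches the near-silent audio reported in this thread.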
Dear authors, I redid the whole pipeline, but I still have the same issue. Here are all the commands we ran after installing the environment:
bash 0.download_pretrain_models.sh
# changed the env variables
bash preprocess.sh
# changed the paths in config_gcmvn.yaml
# copy and paste config_mtl_asr_st_ctcst.yaml to fbank2unit
# changed paths in train.simul-s2st.sh
bash train.simul-s2st.sh
# changed paths in simuleval.simul-s2st.sh
bash simuleval.simul-s2st.sh
Do you know what the potential problem is?
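Since several of the steps above involve editing paths by hand, one quick sanity check is to confirm that every path named in the config files exists on disk. The following is a minimal sketch that scans the text of a config for the path-like keys that appear in the configs posted earlier (`stats_npz_path`, `checkpoint`, `config`, `dict`, `data`). It uses a simple line-based match rather than a YAML parser, and the function name is hypothetical, not part of the repository.

```python
import os
import re

# Keys whose values are file or directory paths in the configs posted above.
PATH_KEYS = ("stats_npz_path", "checkpoint", "config", "dict", "data")

_LINE = re.compile(r"^\s*(\w+):\s*(\S+)\s*$")


def find_missing_paths(config_text, keys=PATH_KEYS):
    """Return (key, path) pairs whose path does not exist on this machine."""
    missing = []
    for line in config_text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key in keys and not os.path.exists(value):
            missing.append((key, value))
    return missing


# Illustrative input: the first path is deliberately bogus, "." always exists.
example = (
    "dict: /nonexistent_dir_for_demo/spm_unigram_fr.txt\n"
    "data: .\n"
    "loss_weight: 8.0\n"
)
print(find_missing_paths(example))
```

In practice the text would come from reading `config_gcmvn.yaml` and `config_mtl_asr_st_ctcst.yaml`. Relative paths such as `./pretrained_models/...` are resolved against the current working directory, so the check should be run from the same directory as the training script.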
I have the same issue. Is there any update?
Hi Emre, I found that this bug is related to the loss function, and the author pushed a fix for the loss in the most recent commit. Just pull it and the problem should be solved.
I am training on my own data and have applied the loss bug fix. ASR and translation seem okay (WER decreases to ~30%). However, I still cannot get meaningful audio output after the fix: the outputs are very short and sound like noise.
I use another HuBERT model to extract source units. Does that affect this?
That was the problem :). I had misunderstood part of the model. When I switched back to the original HuBERT, the problem was solved.
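A partial check for this kind of mismatch is to confirm that the unit IDs in the data fall inside the range the model and vocoder expect. The training command above uses `--target-code-size 1000` with a `km1000` vocoder, so the sketch below assumes valid IDs are 0 to 999. This is only a necessary condition: units from a different HuBERT model with the same number of clusters would pass the check and still be incompatible. The function name and sample strings are hypothetical.

```python
def check_unit_ids(unit_strings, code_size=1000):
    """Summarize unit IDs found in space-separated unit strings.

    Returns the number of distinct IDs and any IDs outside [0, code_size).
    """
    seen = set()
    out_of_range = set()
    for unit_string in unit_strings:
        for token in unit_string.split():
            unit = int(token)
            seen.add(unit)
            if not 0 <= unit < code_size:
                out_of_range.add(unit)
    return {"distinct": len(seen), "out_of_range": sorted(out_of_range)}


# Made-up unit strings, only to show the input format.
report = check_unit_ids(["1 2 2", "999 1000"])
print(report)
```

Any out-of-range ID points to a unit inventory that differs from the one the vocoder was trained on.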
I tried to reproduce the training of the fr-en simultaneous model. I followed the instructions to prepare the dataset and ran the script train.simul-s2st.sh. Training seems to go fine, but during evaluation of our trained model (using ./simuleval.simul-s2st.sh), weird behavior occurs. Here is the training log:
During inference, when I ran the eval scripts on the example you provided, the model output the correct text translation, but the output speech was incorrect (almost silent). I printed the text output and the speech unit output as follows:
Do you know what the problem may be?
Thank you