ictnlp / StreamSpeech

StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
https://ictnlp.github.io/StreamSpeech-site/
MIT License

Train on other languages #3

Open yiwei0730 opened 3 weeks ago

yiwei0730 commented 3 weeks ago

Hello, this is amazing. I want to ask whether it can be trained on other languages, or even on multiple languages at the same time.

zhangshaolei1998 commented 3 weeks ago

Hi, thanks for your attention. The StreamSpeech architecture can support multilingual speech-to-speech translation, which we have also explored. Since multilinguality is not the core highlight of this work, we did not cover it in our paper.

If you want to train a multilingual StreamSpeech on CVSS-C, you only need to modify the data processing part. The training part is the same.
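
For illustration only, here is a minimal sketch of the kind of data-processing change this implies: concatenating the per-language-pair CVSS-C manifests into one multilingual training manifest. The directory layout, TSV columns, and the added `src_lang` field are assumptions rather than the repo's actual scripts; adapt them to whatever the data preparation in this repository produces.

```python
# Minimal sketch (not from the repo): merge per-language-pair CVSS-C manifests
# into one multilingual training manifest. Paths and the TSV layout are
# assumptions; adapt them to the output of the repo's data-preparation scripts.
import csv
from pathlib import Path

LANG_PAIRS = ["fr-en", "de-en", "es-en"]   # hypothetical choice of source languages
DATA_ROOT = Path("data/cvss-c")            # hypothetical layout: data/cvss-c/<pair>/train.tsv
OUT = DATA_ROOT / "multilingual" / "train.tsv"
OUT.parent.mkdir(parents=True, exist_ok=True)

header = None
rows = []
for pair in LANG_PAIRS:
    with open(DATA_ROOT / pair / "train.tsv", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        header = header or reader.fieldnames  # assume all pairs share the same columns
        for row in reader:
            row["src_lang"] = pair.split("-")[0]  # keep the source language for bookkeeping
            rows.append(row)

with open(OUT, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=header + ["src_lang"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```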

Hope these can help you.

arararz commented 2 weeks ago

Hi, what changes should be made for speech translation into a language other than English? Which parts need to be modified apart from data processing?

Thanks!

zhangshaolei1998 commented 2 weeks ago

@arararz Hi, if you want to train StreamSpeech to translate speech into languages other than English, then in addition to data preparation, there are two points to note:

  1. To extract the units of the target speech, you need to use the Vocoder of the corresponding language, which can be found here.
  2. Appropriately adjust --ctc-upsample-rate. You can refer to Appendix D of our paper and adjust it to 2-3 times the unit/word sequence length ratio (see the sketch after this list).
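
As a rough aid for point 2, the sketch below estimates the unit/word sequence length ratio from the training targets, so --ctc-upsample-rate can be set to about 2-3 times that ratio. The file names and formats (space-separated unit IDs and plain target text, one line per utterance) are assumptions; substitute the actual files from your data preparation.

```python
# Minimal sketch (assumed file formats): estimate the unit/word length ratio
# so --ctc-upsample-rate can be set to roughly 2-3x this ratio, as suggested above.
# "target.unit" is assumed to hold space-separated unit IDs per line and
# "target.txt" the corresponding target sentences; both paths are hypothetical.

def avg_len(path):
    total_tokens, total_lines = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total_tokens += len(line.split())
            total_lines += 1
    return total_tokens / max(total_lines, 1)

unit_len = avg_len("data/train/target.unit")
word_len = avg_len("data/train/target.txt")
ratio = unit_len / word_len
print(f"unit/word length ratio: {ratio:.2f}")
print(f"suggested --ctc-upsample-rate: {round(2 * ratio)} to {round(3 * ratio)}")
```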

Hope these can help you~

thetushargoyal commented 1 week ago

@zhangshaolei1998 hey, very interesting work. I was wondering about the training time and what system configuration you used. Thanks!

zhangshaolei1998 commented 1 week ago

@thetushargoyal Hi, the training takes less than 1 day on 8 NVIDIA 3090 GPUs.