open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
4.28k stars 365 forks

Add Jets implementation #231

Closed hansheng-zhang closed 2 days ago

hansheng-zhang commented 2 weeks ago

✨ Description

We release the JETS (Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech) model in Amphion. JETS has a simplified training pipeline and outperforms a cascade of separately trained models. Specifically, JETS trains FastSpeech2 and HiFi-GAN jointly, together with an alignment module.

How to test: see egs/Jets/README.md

Major contributors to this PR: @hansheng-zhang @chenjianzhen666 @So1a

👨‍💻 Changes Proposed

🧑‍🤝‍🧑 Who Can Review?

@lmxue @RMSnow

✅ Checklist

hansheng-zhang commented 4 days ago

Thanks for the contribution! Some questions

  1. I notice that you added msd and mpd implementations for JETS instead of using the existing ones. Is it possible to reuse the existing discriminators to improve readability?
  2. If some code is adapted from other repos, please make sure to add acknowledgements in the README and at the top of each file.
  3. Demos of your reproduction would be welcome.
  1. The original mpd and msd use y_hat and other variables, while ours do not. Reusing them would require changing the original code, so we prefer to add JETS-specific versions of mpd and msd.
  2. I have added more acknowledgements in the code. The JETS copyright notice is at the end of the README.
  3. Here are some demos. I trained on LJSpeech for about 250 epochs. demo.zip