Add Jets implementation

open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.

MIT License

4.28k stars, 365 forks

✨ Description

We release the JETS (Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech) model in Amphion. JETS has a simplified training pipeline and outperforms a cascade of separately trained models. Specifically, JETS trains FastSpeech2 and HiFi-GAN jointly, together with an alignment module.
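As a rough illustration of what joint training means here, the generator side can be thought of as optimizing a single weighted sum of loss terms, rather than training the acoustic model and the vocoder against separate objectives. The sketch below is a toy under stated assumptions: the term names and weights are hypothetical placeholders, not values taken from this PR or its config.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LossWeights:
    # Placeholder weights for illustration only (assumption, not from this PR).
    adv: float = 1.0         # adversarial term from the discriminators
    feat_match: float = 2.0  # feature-matching term
    mel: float = 45.0        # mel-spectrogram reconstruction term
    variance: float = 1.0    # duration/pitch/energy predictor term
    align: float = 2.0       # alignment-module term


def joint_generator_loss(terms: Mapping[str, float],
                         w: LossWeights = LossWeights()) -> float:
    """Combine per-term losses into one scalar that is optimized end to end."""
    return (
        w.adv * terms["adv"]
        + w.feat_match * terms["feat_match"]
        + w.mel * terms["mel"]
        + w.variance * terms["variance"]
        + w.align * terms["align"]
    )


# Example with made-up loss values of 1.0 for every term.
example_terms = {"adv": 1.0, "feat_match": 1.0, "mel": 1.0,
                 "variance": 1.0, "align": 1.0}
total = joint_generator_loss(example_terms)
```

In a real training loop each entry of `terms` would be a tensor produced by the model, so one backward pass updates the text-to-mel and waveform parts together.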

How to test: see egs/Jets/README.md

Major contributors to this PR: @hansheng-zhang @chenjianzhen666 @So1a

👨‍💻 Changes Proposed

[ ] Add the Jets model in the tts section
[ ] Add Jets' version of mpd and msd in vocoders/gan/discriminator

🧑‍🤝‍🧑 Who Can Review?

@lmxue @RMSnow

✅ Checklist

[ ] Code has been reviewed
[ ] Code complies with the project's code standards and best practices
[ ] Code has passed all tests
[ ] Code does not affect the normal use of existing features
[ ] Code has been commented properly
[ ] Documentation has been updated (if applicable)
[ ] Demo/checkpoint has been attached (if applicable)

Thanks for the contribution! Some questions:

I noticed that you added msd and mpd implementations for JETS instead of using the existing ones. Is it possible to reuse the existing discriminators to improve readability?

If some of the code references other repos, please make sure to add acknowledgements in the README and at the top of each file.

Demos of your reproduction would be welcome.

The original mpd and msd use y_hat and other variables, while ours do not. Reusing them would require changes to the original code, so we prefer to add JETS' versions of mpd and msd.
I have added some more acknowledgements in the code. The JETS copyright notice is at the end of the README.
Here are some demos. I used LJSpeech and trained for about 250 epochs. demo.zip
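
The interface mismatch discussed in this thread (discriminators called with both y and y_hat versus discriminators called with a single input) can be sketched as a small adapter. This is only an illustration of one possible way to bridge the two call styles, not code from this PR. `PairedAdapter` and `toy_disc` are hypothetical names, the four-tuple return layout is assumed to follow the common HiFi-GAN-style `forward(y, y_hat)` convention, and the single-input return format (a list of `(score, feature_maps)` pairs) is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed output of one sub-discriminator for one input: (score, feature_maps).
SingleOutput = Tuple[object, List[object]]


class PairedAdapter:
    """Expose a single-input discriminator through a paired (y, y_hat) call."""

    def __init__(self, single_input_disc: Callable[[object], Sequence[SingleOutput]]):
        self.disc = single_input_disc

    def __call__(self, y, y_hat):
        real = self.disc(y)      # outputs for the ground-truth waveform
        fake = self.disc(y_hat)  # outputs for the generated waveform
        y_d_rs = [score for score, _ in real]
        y_d_gs = [score for score, _ in fake]
        fmap_rs = [fmaps for _, fmaps in real]
        fmap_gs = [fmaps for _, fmaps in fake]
        return y_d_rs, y_d_gs, fmap_rs, fmap_gs


def toy_disc(x):
    # Stand-in for a multi-scale/multi-period discriminator: two fake
    # "sub-discriminators", each returning (score, feature_maps).
    return [(sum(x), [x]), (max(x), [x, x])]


paired = PairedAdapter(toy_disc)
y_d_rs, y_d_gs, fmap_rs, fmap_gs = paired([1, 2], [3, 4])
```

A wrapper in this direction leaves both implementations untouched. Whether it is preferable to keeping separate JETS-specific discriminators depends on how much the two code paths actually differ, which this sketch does not settle.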
