huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add FastSpeech2 #15166

Open jaketae opened 2 years ago

jaketae commented 2 years ago

🌟 New model addition

Model description

FastSpeech2 is a TTS model that outputs mel-spectrograms given some input text. From the paper abstract:

Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at this https URL.
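The duration-based upsampling the abstract refers to (FastSpeech's length regulator, which FastSpeech 2 keeps but trains on ground-truth durations) can be sketched in a few lines. This is a minimal pure-Python illustration with hypothetical names, not the authors' implementation: each phoneme-level feature vector is repeated by its predicted duration so the expanded sequence aligns frame-by-frame with the target mel-spectrogram.

```python
def length_regulate(phoneme_feats, durations):
    """Expand phoneme-level features to frame level.

    phoneme_feats: list of feature vectors, one per phoneme
    durations: list of ints, predicted number of mel frames per phoneme
    Returns a frame-level sequence whose length equals sum(durations).
    """
    frames = []
    for feat, dur in zip(phoneme_feats, durations):
        frames.extend([feat] * dur)  # repeat each phoneme's features `dur` times
    return frames
```

In the real model this expansion happens on batched tensors inside the variance adaptor, and the pitch and energy predictors add their conditional inputs at the same frame-level resolution.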

Open source status

The authors have not open-sourced their code implementation. However, the first author replied to an email inquiry and pointed me to the official implementation of DiffSinger, which includes FastSpeech2 code. This is likely the closest original implementation we can access.

LJ Speech model weights are available here.

Other notable unofficial implementations include:

Additional Context

This issue is a revisiting of https://github.com/huggingface/transformers/pull/11135.

cc @anton-l @patrickvonplaten

jaketae commented 2 years ago

This thread is a summary of some details discussed in today's meeting with Patrick and Anton.

Below are some relevant observations made thus far.

This implementation is closer to version 1 of the paper, which uses raw F0 values as the pitch features. The later versions instead use pitch spectrograms extracted via continuous wavelet transform.
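For context, F0 (fundamental frequency) is the raw per-frame pitch contour. A toy autocorrelation-based estimator can be sketched as follows; this is purely illustrative (real pipelines use dedicated tools such as pyworld or, in the later paper versions, CWT-based pitch spectrograms), and all names here are hypothetical:

```python
import math

def estimate_f0(frame, sr, fmin=80.0, fmax=400.0):
    """Naive autocorrelation pitch estimator for one audio frame.

    Picks the lag with the highest autocorrelation inside the plausible
    pitch range and converts it back to Hz. Returns 0.0 if no lag wins.
    """
    lag_min = int(sr / fmax)  # shortest period to consider
    lag_max = int(sr / fmin)  # longest period to consider
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sr / best_lag if best_lag else 0.0
```

On a clean 200 Hz sine sampled at 8 kHz this recovers roughly 200 Hz; noisy speech needs the more robust estimators mentioned above.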

My personal vote would be to use DiffSinger's weights and code given that it reflects the most up-to-date version of the paper. What do you think @patrickvonplaten @anton-l?

patrickvonplaten commented 2 years ago

Sorry for the late reply here @jaketae.

@anton-l, could you take over the guidance of the PR here? :-) I'm drowning a bit in GitHub issues at the moment :sweat_smile:

My 2 cents: if inference works well with DiffSinger's weights, let's go with that code as the official FastSpeech2 implementation.

anton-l commented 2 years ago

@patrickvonplaten, @jaketae and I agreed to take a brief break until the speech sprint ends, so that we can make a more informed decision about which version of the model to implement first (or whether to focus on a simpler architecture altogether).

jaketae commented 2 years ago

@patrickvonplaten, thank you for looping back to this issue! Adding on top of what Anton said, I might try to port more low hanging fruit, such as FastPitch or TransformerTTS in the meantime. I'll be coordinating with Anton throughout. Thank you both!

ArEnSc commented 2 years ago

@patrickvonplaten @anton-l @jaketae I would be interested in helping with this, as I am genuinely interested in speech synthesis; it's been a hobby of mine.

I am eventually looking to return to (local / serverless) machine learning engineering from mobile software development in the payments domain.

I was mainly working in the mobile machine learning space before DL was a thing, doing feature extraction from gyroscope and accelerometer motion signals. I am a bit rusty, however, since more recently I have been pushed into a leadership role.

Anyways, talk is cheap and delivery means more to me. I have gotten FastPitch 1.1 working on a custom dataset and fine-tuned it on a popular actress's voice.

Here's a sample: https://vocaroo.com/14E4KeW0ymXI. Yeah, we could be living in a "Her"-based universe in 10 years, powered by Hugging Face =)
In case you missed the movie:
https://en.wikipedia.org/wiki/Her_(film)

I will be looking to apply to the wild card position or the PyTorch engineer position once I get about 4 weeks of LeetCode practice and read and memorize a few things from the PyTorch book.

Let me know how I can help thanks!

jaketae commented 2 years ago

Hey @ArEnSc, thank you for your interest in integrating FastSpeech2 into HF transformers! Glad to see someone else who is interested in TTS.

At the moment, the plan is to integrate FastSpeech2 using weights from fairseq. I also briefly considered FastPitch, but FS2 is the priority for now. If you would like to contribute FastPitch, please feel free to do so! The FastSpeech2 PR is a WIP at the moment, but as it matures it will likely introduce a number of TTS-specific APIs that you might want to reference in your own work on FastPitch. If you have any questions, ideas, or feedback, they are always welcome.

I'm not involved in hiring, but I believe the door is always open. Feel free to apply and wish you the best of luck!

ArEnSc commented 2 years ago

@jaketae I'll take a look at that FastSpeech2 PR. I am going through the Hugging Face course to get an idea of the API surface right now; it seems straightforward. I'll get some of the groundwork started and then wait for the TTS-specific APIs to mature =)