jaketae opened this issue 2 years ago
This thread is a summary of some details discussed in today's meeting with Patrick and Anton.
Below are some relevant observations made thus far.
This implementation is closer to version 1 of the paper, which uses F0 values directly as the pitch features; the later versions instead use pitch spectrograms extracted via continuous wavelet transform.
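To make the distinction concrete, here is a minimal sketch of the two pitch-feature styles on a toy F0 contour. All of this is illustrative: the contour is synthetic, and the naive Mexican-hat CWT below stands in for whatever wavelet implementation the papers actually use; none of these names come from fairseq or DiffSinger.

```python
import numpy as np

# Toy F0 contour: 100 frames of a gently modulated ~220 Hz pitch track.
frames = 100
f0 = 220.0 + 20.0 * np.sin(np.linspace(0, 4 * np.pi, frames))

# Version 1 style: use the (log-scaled) frame-level F0 contour directly.
f0_feature = np.log(f0)  # shape: (frames,)

def mexican_hat(t, scale):
    """Mexican-hat (Ricker) mother wavelet sampled at times t."""
    x = t / scale
    return (1.0 - x ** 2) * np.exp(-0.5 * x ** 2)

def cwt(signal, scales):
    """Naive continuous wavelet transform via convolution at each scale."""
    t = np.arange(len(signal)) - len(signal) // 2
    return np.stack(
        [np.convolve(signal, mexican_hat(t, s), mode="same") for s in scales]
    )

# Later-version style: decompose the normalized contour into a
# multi-scale "pitch spectrogram".
normalized = (f0_feature - f0_feature.mean()) / f0_feature.std()
scales = 2.0 ** np.arange(10)                 # 10 dyadic scales
pitch_spectrogram = cwt(normalized, scales)   # shape: (10, frames)
```

The per-frame F0 feature is a single scalar track, while the CWT version gives the variance adaptor a 2-D time-scale representation of the same contour.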
My personal vote would be to use DiffSinger's weights and code given that it reflects the most up-to-date version of the paper. What do you think @patrickvonplaten @anton-l?
Sorry for the late reply here @jaketae.
@anton-l, could you take over the guidance of the PR here? :-) I'm drowning a bit in GitHub issues at the moment :sweat_smile:
My 2 cents, if the inference works well with DiffSinger's weights - let's go for this code as the official fastspeech2 code.
@patrickvonplaten, @jaketae and I agreed to take a brief break until the speech sprint ends, so that we can make a more informed decision about which version of the model to implement first (or focus on a more simple architecture altogether)
@patrickvonplaten, thank you for looping back to this issue! Adding on top of what Anton said, I might try to port more low-hanging fruit, such as FastPitch or TransformerTTS, in the meantime. I'll be coordinating with Anton throughout. Thank you both!
@patrickvonplaten @anton-l @jaketae I would be interested in helping with this, as I am genuinely interested in speech synthesis and it's been a hobby of mine.
I am eventually looking to return to (local / serverless) machine learning engineering from mobile software development in the payments domain.
I mainly worked in the mobile machine learning space before DL was a thing, doing feature extraction from gyroscope and accelerometer motion signals. I am a bit rusty, however, since more recently I have been pushed into a leadership role.
Anyway, talk is cheap; delivery means more to me. I have gotten FastPitch 1.1 working on a custom dataset and fine-tuned it on a popular actress's voice.
Here's a sample: https://vocaroo.com/14E4KeW0ymXI. Maybe we'll be living in a "Her"-style universe in 10 years, powered by Hugging Face =)
In case you missed the movie:
https://en.wikipedia.org/wiki/Her_(film)
I will be looking to apply to the wild card position or the PyTorch engineer position once I get about 4 weeks of LeetCode practice in and read and memorize a few things from the PyTorch book.
Let me know how I can help thanks!
Hey @ArEnSc, thank you for your interest in the FastSpeech2 integration into HF transformers! Glad to see someone else who is interested in TTS.
At the moment, the plan is to integrate FastSpeech2 using weights from fairseq. I also briefly considered FastPitch, but FS2 is the priority right now. If you would like to contribute FastPitch, please feel free to do so! The FastSpeech2 PR is a WIP at the moment, but as it matures it will likely introduce a number of TTS-specific APIs that you might want to reference in your own work on FastPitch. Questions, ideas, and feedback are always welcome.
I'm not involved in hiring, but I believe the door is always open. Feel free to apply and wish you the best of luck!
@jaketae I'll take a look at that FastSpeech2 PR. I'm going through the Hugging Face course to get an idea of the API surface right now; it seems straightforward, I believe. I'll get some of the groundwork started and then wait for the TTS-specific APIs to mature =)
🌟 New model addition
Model description
FastSpeech2 is a TTS model that outputs mel-spectrograms given some input text. From the paper abstract:
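As context for the text-to-mel claim above, here is a minimal sketch of FastSpeech2's non-autoregressive pipeline: encode phonemes, expand hidden states to frame resolution via predicted durations (the length regulator), then project to mel bins. Random matrices stand in for the trained encoder and decoder, and every name here is illustrative rather than the fairseq or DiffSinger API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real models use phoneme vocabularies and 80-bin mels.
vocab_size, hidden, n_mels = 40, 16, 80

# Random projections stand in for the trained encoder and mel decoder.
embedding = rng.normal(size=(vocab_size, hidden))
decoder_proj = rng.normal(size=(hidden, n_mels))

phoneme_ids = np.array([3, 7, 12, 5])       # encoded input text
h = embedding[phoneme_ids]                  # "encoder" output: (4, hidden)

# Variance adaptor: predicted per-phoneme durations drive the length
# regulator, which repeats each hidden state to frame resolution.
durations = np.array([2, 3, 1, 4])
h_frames = np.repeat(h, durations, axis=0)  # (10, hidden)

mel = h_frames @ decoder_proj               # "decoder" output: (10, n_mels)
```

Because durations are predicted up front, all mel frames are produced in parallel, which is what makes FastSpeech2 fast relative to autoregressive TTS models.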
Open source status
The authors have not open-sourced their code implementation. However, the first author replied to an email inquiry and pointed me to the official implementation of DiffSinger, which includes FastSpeech2 code. This is likely the closest original implementation we can access.
LJ Speech model weights are available here.
Other notable unofficial implementations include:
Additional Context
This issue is a revisiting of https://github.com/huggingface/transformers/pull/11135.
cc @anton-l @patrickvonplaten