huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[WIP] New Model Add FastPitch 1.1 #16349

Open ArEnSc opened 2 years ago

ArEnSc commented 2 years ago

🌟 New model addition

Model description

What type of model is FastPitch 1.1? It is a mel-spectrogram generator (part of a text-to-speech engine) that mainly comprises two feed-forward Transformer stacks. FastPitch is used to transform text into a spectrogram, which is then used for waveform generation in speech synthesis. FastPitch 1.1 adds multi-speaker embeddings. What is the novel feature of the model making it different from other spectrogram generators? It is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be made more expressive, better match the semantics of the utterance, and ultimately be more engaging to the listener. Uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles the voluntary modulation of voice.

FastPitch is meant to be used with a neural vocoder such as WaveNet or WaveGlow.

Text (Feature Extraction) → Audio Synthesis (spectrogram) → Waveform Synthesis (waveform)
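The three stages above can be sketched as plain functions. This is only an illustrative skeleton with made-up shapes; `extract_features`, `fastpitch_mel`, and `vocoder_waveform` are hypothetical names, not real transformers or NVIDIA APIs, and the stand-in bodies just show how data flows between the stages (symbol IDs → mel frames → audio samples).

```python
def extract_features(text):
    """Text front end: map characters to integer symbol IDs (toy version)."""
    return [ord(c) for c in text.lower()]

def fastpitch_mel(symbol_ids, n_mels=80):
    """Stand-in for the FastPitch acoustic model: emit one mel frame
    of n_mels bins per input symbol (a real model predicts durations
    and expands to many frames per symbol)."""
    return [[float(s % 10)] * n_mels for s in symbol_ids]

def vocoder_waveform(mel, hop_length=256):
    """Stand-in for a neural vocoder (e.g. WaveGlow): hop_length
    audio samples per mel frame."""
    return [0.0] * (len(mel) * hop_length)

# Text → spectrogram → waveform
mel = fastpitch_mel(extract_features("hello"))
audio = vocoder_waveform(mel)
```

The point of the split is that FastPitch only covers the middle stage; the vocoder is a separate model that turns the spectrogram into audio.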

From the paper abstract:

We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive, better match the semantic of the utterance, and in the end more engaging to the listener. Uniformly increasing or decreasing pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves the overall quality of synthesized speech, making it comparable to state-of-the-art. It does not introduce an overhead, and FastPitch retains the favorable, fully-parallel Transformer architecture, with over 900× real-time factor for mel-spectrogram synthesis of a typical utterance.
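The "altering these predictions" part of the abstract can be made concrete with a minimal sketch: since FastPitch predicts an explicit F0 contour, uniformly scaling it before mel generation shifts the pitch of the synthesized speech. The helper below is hypothetical (not from the FastPitch codebase), and it assumes the common convention that unvoiced frames are marked with an F0 of 0.0.

```python
def shift_pitch(f0_contour, factor):
    """Uniformly scale a predicted F0 contour (Hz) by `factor`,
    leaving unvoiced frames (f0 == 0.0) untouched."""
    return [f0 * factor if f0 > 0 else 0.0 for f0 in f0_contour]

# Raise the pitch of a toy contour by 50%; the unvoiced frame stays at 0.0.
raised = shift_pitch([110.0, 0.0, 220.0], 1.5)
```

In the real model the (possibly edited) contour is added back into the hidden representation before the mel decoder runs, which is why the edit changes the output audio without retraining.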

Open source status

Samples: https://fastpitch.github.io/ My own generated samples: https://voca.ro/1eYmqidRhGi6 Pros of the model: it achieves a high MOS score, comparable to Tacotron 2, without the high cost of inference.

Its high performance and high quality will be useful for giving a voice (or soul) to digital assistants or metaverse assistants.

Training isn't complicated: unlike FastPitch 1.0, this model does not require durations or alignments to be generated by Tacotron 2 or the Montreal Forced Aligner.

Cons of the model: it isn't in the Hugging Face repository yet, where it could be easily adapted to product use cases. =)

cc @anton-l @patrickvonplaten. Will assign to whoever is available once the draft is complete.

ArEnSc commented 2 years ago

Preliminary Tasks:

patrickvonplaten commented 2 years ago

Very cool! @anton-l do you know whether we are allowed to use checkpoints from Microsoft regarding licensing?

anton-l commented 2 years ago

@patrickvonplaten looks like there's no custom NVIDIA license this time, the checkpoint's license refers to the BSD-3 bundled with the code: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechSynthesis/FastPitch/LICENSE I think we're able to concatenate it with the transformers' Apache license?

Also FYI @jaketae as you were interested in porting FastPitch too :slightly_smiling_face:

patrickvonplaten commented 2 years ago

Cool - if @anton-l and @jaketae think it's worth adding this model, I'm happy to give it a try.