huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

Could this theoretically be retrained from scratch to generate singing vocals? #8

Open Saltb0xApps opened 1 month ago

Saltb0xApps commented 1 month ago

Given a 10k hour dataset of singing vocals (instead of the current audiobook-style reading content), could this model be retrained to sing / generate vocals?

adamfils commented 1 month ago

@Saltb0xApps I was thinking the same thing, and also adding another conditioning input, such as background music for the generated audio to follow. I am not exactly an ML engineer, but here is my rough thinking.

  1. Text Encoder (Unchanged): Continues to map text descriptions to a sequence of hidden-state representations using a frozen text encoder initialized from Flan-T5.

  2. Music Encoder: A new component that takes background music as input and generates a music-conditioned representation using a pretrained music autoencoder (DAC or EnCodec). This encoder analyses the background music to extract features such as tempo, key, mood, and rhythm, which will be used to condition the generated speech.

  3. Parler-TTS Decoder (Modified): The decoder now autoregressively generates audio tokens conditioned not only on the encoder hidden-state representations (from text) but also on the music-conditioned representation. To incorporate the music-conditioned representation, you could either (see the sketch after this list):

    • Concatenate: Directly concatenate the music-conditioned representation with the text-conditioned hidden states before feeding them into the decoder.
    • Cross-Attention Modification: Integrate the music-conditioned representation into the cross-attention layers of the decoder, allowing the decoder to attend to both text and music features simultaneously.
  4. Audio Codec (Unchanged): Continues to recover the audio waveform from the audio tokens predicted by the decoder using the DAC model or EnCodec as preferred.
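
A rough sketch of the "concatenate" option is below. This is only an illustration under assumed interfaces, not the actual Parler-TTS API: `text_encoder`, `music_encoder`, and the projection layer are hypothetical placeholders.

```python
# Sketch: append music-conditioned embeddings to the text encoder's hidden states
# so the decoder can cross-attend to both. Module names are placeholders.
import torch
import torch.nn as nn

class MusicConditionedDecoderInputs(nn.Module):
    def __init__(self, text_encoder, music_encoder, music_dim, hidden_dim):
        super().__init__()
        self.text_encoder = text_encoder      # e.g. frozen Flan-T5 encoder
        self.music_encoder = music_encoder    # e.g. pretrained audio encoder
        # Project music features into the same space as the text hidden states
        self.proj = nn.Linear(music_dim, hidden_dim)

    def forward(self, input_ids, attention_mask, music_waveform):
        text_hidden = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                             # (B, T_text, hidden)
        music_hidden = self.proj(self.music_encoder(music_waveform))  # (B, T_music, hidden)
        # Concatenate along the sequence axis; the decoder's cross-attention mask
        # would also need to be extended to cover the music frames.
        return torch.cat([text_hidden, music_hidden], dim=1)
```

The cross-attention variant would instead add a second cross-attention block in each decoder layer that attends to `music_hidden` separately.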

@sanchit-gandhi Does this sound feasible?

sanchit-gandhi commented 1 month ago

Hey @Saltb0xApps @adamfils - this sounds like it would work. The only change I would make is using a more powerful audio encoder to extract more meaningful representations from the music conditioning (e.g. warm-starting an audio encoder from the HuBERT model to extract music embedding representations). Using DAC or EnCodec alone will only give you a down-sampled version of the music inputs, rather than features that encode tempo, key, mood, rhythm, etc. This is then analogous to what the Flan-T5 encoder does for the text conditioning.
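
A minimal sketch of what extracting continuous music embeddings with a pretrained HuBERT encoder could look like via `transformers`; the checkpoint name is illustrative, and in practice you would warm-start and fine-tune the encoder on music data:

```python
# Extract continuous music features with HuBERT instead of DAC/EnCodec codes.
import torch
from transformers import AutoFeatureExtractor, HubertModel

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")

def music_embeddings(waveform_16khz):
    # waveform_16khz: 1-D float array sampled at 16 kHz
    inputs = feature_extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # (1, frames, hidden_size) features to condition the decoder on
    return outputs.last_hidden_state
```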

Note that you could use something similar to train a TTS model that has text and voice conditioning as well (just replace the music conditioning with a voice sample in the flowchart above). You could then give it a 2-second voice prompt to control the style of generated voice, and then control how fast/slow or animated/monotonous the speech is using the text prompt.
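
For the voice-conditioning variant, the prompt preparation could be as simple as the sketch below, which assumes the hypothetical `music_embeddings()` helper from the previous sketch and standard `torchaudio` loading/resampling:

```python
# Sketch: embed a 2-second voice prompt with the same audio encoder,
# then feed it to the decoder in place of the music features.
import torchaudio

def voice_prompt_embedding(path, seconds=2.0):
    waveform, sr = torchaudio.load(path)                     # (channels, samples)
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
    prompt = waveform.mean(dim=0)[: int(seconds * 16000)]    # mono, first 2 s
    return music_embeddings(prompt.numpy())
```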

Saltb0xApps commented 2 weeks ago

@sanchit-gandhi / @adamfils Thank you for providing a detailed response! I have a large dataset of around 1k hours of vocals only (separated out of SoundCloud songs using Demucs), along with their lyrics/transcriptions generated with Whisper. I was wondering if I could take that vocals-only dataset, combined with dataspeech info for style, and retrain Parler-TTS to output only singing vocals.

The idea is to create a robust singing-vocals version of Parler-TTS that generates only singing vocals instead of regular speech.

Would this require code-level changes as mentioned above, or would simply retraining Parler-TTS on this singing-vocals dataset be a good enough starting point for something that can generate vocals only?
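
For reference, a rough sketch of the data pipeline described above, separating vocals with the Demucs CLI and transcribing them with openai-whisper. Paths, model sizes, and the `--two-stems` option are illustrative assumptions; check the Demucs/Whisper docs for the exact flags of the versions you use.

```python
# 1. Separate vocals from full mixes with Demucs, 2. transcribe them with Whisper.
import subprocess
from pathlib import Path
import whisper

songs_dir = Path("songs")          # hypothetical input directory of full mixes
stems_dir = Path("separated")      # Demucs output directory

# Vocal separation (recent Demucs versions support two-stem vocals/no_vocals output)
for song in songs_dir.glob("*.mp3"):
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", str(stems_dir), str(song)],
        check=True,
    )

# Transcription of the separated vocal stems
model = whisper.load_model("large-v3")
dataset_rows = []
for vocal_path in stems_dir.rglob("vocals.wav"):
    result = model.transcribe(str(vocal_path))
    dataset_rows.append({"audio": str(vocal_path), "text": result["text"]})
```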

ylacombe commented 1 week ago

Hey @Saltb0xApps, this would totally work, as you only need three things from your dataset for training to work:

  1. Audio samples
  2. Transcriptions
  3. Text conditioning

Parler-TTS is agnostic to the text conditioning and audio samples you're using!
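
As a minimal sketch, those three ingredients could be laid out as a Hugging Face dataset like the one below. The column names and sampling rate are illustrative assumptions; the training script lets you point it at whichever columns your dataset actually uses.

```python
# Toy dataset with the three required ingredients: audio, transcription, description.
from datasets import Dataset, Audio

ds = Dataset.from_dict({
    "audio": ["vocals/track_001.wav", "vocals/track_002.wav"],    # 1. audio samples
    "text": ["ooh baby I love you", "walking down the line"],     # 2. transcriptions (lyrics)
    "description": [                                              # 3. text conditioning
        "A female singer performs an upbeat pop melody with bright, clear vocals.",
        "A male singer delivers a slow, melancholic folk verse in a low register.",
    ],
}).cast_column("audio", Audio(sampling_rate=44100))
```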

Also, 1k hours should be enough to train a good-enough model from scratch. You can also explore fine-tuning the current model, as it has already learned some acoustic features and how to associate text tokens with acoustic sounds!