huggingface / speech-to-speech

Speech To Speech: an effort for an open-sourced and modular GPT4-o
Apache License 2.0

Setting up a voice conversion pipeline #117

Open holmbuar opened 3 weeks ago

holmbuar commented 3 weeks ago

I successfully made your pipeline example run on my Mac. I did not expect to meet an assistant, but understand a bit more now about the intention of this project.

I would like to build a pipeline for voice conversion, similar to the product that ElevenLabs are offering. In their app you can upload a sound file up to 50 MB, and get a configurable voice conversion of the original speech sample. Microsoft SpeechT5 also offers voice conversion, but one would have to build a custom framework around that model.

Is speech-to-speech a relevant tool for such a task, or should I look at other s2s models or frameworks?

EDIT: After writing this, I realized that GPT4-o is an AI voice-controlled assistant. My bad. It would still be nice to know whether this pipeline can easily be modified to accept sound files and convert voices.

EDIT2: I found this HuggingFace audio course, which I guess pretty much covers the basics. However, the ElevenLabs voice conversion outputs an audio file where the converted words are synced to the spoken words on a timeline, in practical terms mimicking the pace and style of the speaker. Unless I am missing something obvious, it seems my best option is to build a custom framework around the SpeechT5 VC model.

EDIT3: I think this problem is solved, for example by WhisperX. If one wishes to build a framework from scratch, it would involve:

  1. A STT model like whisper-distil-large for speech transcription
  2. An aligner like PyTorch Audio forced alignment
  3. A TTS model like parler-tts
  4. Finally a custom framework for syncing the converted speech chunks to the original voice recording
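
Step 4 is the part without an off-the-shelf component, so here is a rough sketch of what the syncing could look like. It assumes the aligner yields start and end times in seconds for each segment and that the TTS model returns mono samples at a known sample rate. The names `Segment` and `place_on_timeline` are hypothetical and not part of speech-to-speech, WhisperX, or parler-tts. Chunks that run longer than their original slot are simply truncated here; a real pipeline would more likely time-stretch them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One aligned chunk: timestamps from the aligner, audio from the TTS model."""
    start: float          # seconds, position in the original recording
    end: float            # seconds
    samples: List[float]  # synthesized mono audio for this chunk


def place_on_timeline(segments: List[Segment], total_duration: float,
                      sample_rate: int = 16000) -> List[float]:
    """Lay synthesized chunks onto a silent track at their original start times."""
    timeline = [0.0] * int(round(total_duration * sample_rate))
    for seg in segments:
        begin = int(round(seg.start * sample_rate))
        slot = int(round(seg.end * sample_rate)) - begin
        if slot <= 0 or begin < 0 or begin >= len(timeline):
            continue  # skip malformed or out-of-range segments
        # Assumption: truncate audio that overruns its slot instead of stretching it.
        chunk = seg.samples[:slot]
        end = min(begin + len(chunk), len(timeline))
        timeline[begin:end] = chunk[:end - begin]
    return timeline
```

Calling `place_on_timeline(segments, total_duration)` with the duration of the source file gives a track of the same length as the original, with silence wherever the speaker paused.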

andimarafioti commented 4 days ago

Hi! I think this could be relevant for this project. So far we have focused mostly on chatting with LLMs, but voice conversion is just around the corner, and I don't see a reason why we wouldn't support it here.