I successfully got your pipeline example running on my Mac. I did not expect to meet an assistant, but I now understand a bit more about the intention of this project.
I would like to build a pipeline for voice conversion, similar to the product ElevenLabs offers. In their app you can upload a sound file of up to 50 MB and get a configurable voice conversion of the original speech sample. Microsoft SpeechT5 also offers voice conversion, but one would have to build a custom framework around that model.
Is speech-to-speech a relevant tool for such a task, or should I look at other s2s models or frameworks?
EDIT: After writing this, I realized that GPT-4o is an AI voice-controlled assistant. My bad. It would still be nice to know if this pipeline can easily be modified to accept sound files and convert voices.
EDIT2: I found this HuggingFace audio course, which I guess pretty much covers the basics. However: the ElevenLabs voice conversion outputs an audio file where the converted words are synced to the spoken words on a timeline, in practical terms mimicking the pace and style of the speaker. Unless I am missing something obvious, it seems my best option is to build a custom framework around the SpeechT5 vc model.
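For what it's worth, `transformers` ships SpeechT5's voice-conversion checkpoint directly, so the "custom framework" is mostly plumbing around a single call. A minimal sketch (the function name is mine; the speaker embedding is assumed to be a 512-dim x-vector, e.g. extracted with speechbrain's `spkrec-xvect-voxceleb`, and the input is assumed to be 16 kHz mono):

```python
def speecht5_convert(audio_16k, speaker_embedding):
    """Convert a 16 kHz mono waveform to the voice described by a 512-dim x-vector.

    Heavy dependencies are imported inside the function so merely defining it
    costs nothing; a real service would load the models once at startup.
    """
    from transformers import (SpeechT5ForSpeechToSpeech, SpeechT5HifiGan,
                              SpeechT5Processor)

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
    model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(audio=audio_16k, sampling_rate=16_000, return_tensors="pt")
    # generate_speech runs the seq2seq model plus vocoder and returns a waveform tensor
    return model.generate_speech(inputs["input_values"], speaker_embedding,
                                 vocoder=vocoder)
```

Note that SpeechT5 VC on its own does not do the timeline/pacing sync described above; it converts the utterance as a whole.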
EDIT3: I think this problem is solved, for example by WhisperX. If one wishes to build a framework from scratch, it would involve:
- A STT model like whisper-distil-large for speech transcription
- A TTS model like parler-tts for speech synthesis
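A from-scratch pipeline along those lines could be sketched as below. This is only a sketch under assumptions, not a tested implementation: the checkpoint names (`distil-whisper/distil-large-v3`, `parler-tts/parler-tts-mini-v1`) and the naive linear resampler are illustrative choices of mine.

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Naive linear-interpolation resample to the 16 kHz most speech models expect."""
    if orig_sr == 16_000:
        return audio
    n_out = round(len(audio) * 16_000 / orig_sr)
    t_out = np.linspace(0.0, len(audio) - 1, n_out)
    return np.interp(t_out, np.arange(len(audio)), audio).astype(np.float32)

def transcribe(audio: np.ndarray, sr: int) -> str:
    """STT step: distilled Whisper via the transformers pipeline API."""
    from transformers import pipeline  # deferred: heavy import
    asr = pipeline("automatic-speech-recognition",
                   model="distil-whisper/distil-large-v3")
    return asr({"array": resample_to_16k(audio, sr), "sampling_rate": 16_000})["text"]

def synthesize(text: str, voice_description: str) -> np.ndarray:
    """TTS step: parler-tts, steered by a free-text description of the target voice."""
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    model = ParlerTTSForConditionalGeneration.from_pretrained(
        "parler-tts/parler-tts-mini-v1")
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
    desc_ids = tokenizer(voice_description, return_tensors="pt").input_ids
    prompt_ids = tokenizer(text, return_tensors="pt").input_ids
    audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
    return audio.cpu().numpy().squeeze()
```

Reproducing the timeline sync mentioned in EDIT2 would additionally require word-level timestamps (e.g. from WhisperX) to be fed into the synthesis step; the sketch above does not attempt that.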
Hi! I think this could be relevant for this project. Right now we have focused mostly on chatting with LLMs, but voice conversion is just around the corner; I don't see a reason why we wouldn't support it here.