I successfully got your pipeline example running on my Mac. I did not expect to meet an assistant, but I now understand a bit more about the intention of this project.
I would like to build a pipeline for voice conversion, similar to the product ElevenLabs offers. In their app you can upload a sound file of up to 50 MB and get a configurable voice conversion of the original speech sample. Microsoft SpeechT5 also offers voice conversion, but one would have to build a custom framework around that model.
Is speech-to-speech a relevant tool for such a task, or should I look at other s2s models or frameworks?
EDIT: After writing this, I realized that GPT-4o is an AI voice-controlled assistant. My bad. It would still be nice to know if this pipeline can easily be modified to accept sound files and convert voices.
EDIT2: I found this HuggingFace audio course, which I guess pretty much covers the basics. However: the ElevenLabs voice conversion outputs an audio file where the converted words are synced to the spoken words on a timeline, in practical terms mimicking the pace and style of the speaker. Unless I am missing something obvious, it seems my best option is to build a custom framework around the SpeechT5 vc model.
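For what it's worth, `transformers` ships SpeechT5's voice-conversion checkpoint directly, so the "custom framework" is mostly plumbing around a single call. A minimal sketch (the function name is mine; the speaker embedding is assumed to be a 512-dim x-vector, e.g. extracted with speechbrain's `spkrec-xvect-voxceleb`, and the input is assumed to be 16 kHz mono):

```python
def speecht5_convert(audio_16k, speaker_embedding):
    """Convert a 16 kHz mono waveform to the voice described by a 512-dim x-vector.

    Heavy dependencies are imported inside the function so merely defining it
    costs nothing; a real service would load the models once at startup.
    """
    from transformers import (SpeechT5ForSpeechToSpeech, SpeechT5HifiGan,
                              SpeechT5Processor)

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
    model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(audio=audio_16k, sampling_rate=16_000, return_tensors="pt")
    # generate_speech runs the seq2seq model plus vocoder and returns a waveform tensor
    return model.generate_speech(inputs["input_values"], speaker_embedding,
                                 vocoder=vocoder)
```

Note that SpeechT5 VC on its own does not do the timeline/pacing sync described above; it converts the utterance as a whole.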
EDIT3: I think this problem is solved, for example by WhisperX. If one wishes to build a framework from scratch, it would involve:
- A STT model like whisper-distil-large for speech transcription
- A TTS model like parler-tts for speech synthesis
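A from-scratch pipeline along those lines could be sketched as below. This is only a sketch under assumptions, not a tested implementation: the checkpoint names (`distil-whisper/distil-large-v3`, `parler-tts/parler-tts-mini-v1`) and the naive linear resampler are illustrative choices of mine.

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Naive linear-interpolation resample to the 16 kHz most speech models expect."""
    if orig_sr == 16_000:
        return audio
    n_out = round(len(audio) * 16_000 / orig_sr)
    t_out = np.linspace(0.0, len(audio) - 1, n_out)
    return np.interp(t_out, np.arange(len(audio)), audio).astype(np.float32)

def transcribe(audio: np.ndarray, sr: int) -> str:
    """STT step: distilled Whisper via the transformers pipeline API."""
    from transformers import pipeline  # deferred: heavy import
    asr = pipeline("automatic-speech-recognition",
                   model="distil-whisper/distil-large-v3")
    return asr({"array": resample_to_16k(audio, sr), "sampling_rate": 16_000})["text"]

def synthesize(text: str, voice_description: str) -> np.ndarray:
    """TTS step: parler-tts, steered by a free-text description of the target voice."""
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    model = ParlerTTSForConditionalGeneration.from_pretrained(
        "parler-tts/parler-tts-mini-v1")
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")
    desc_ids = tokenizer(voice_description, return_tensors="pt").input_ids
    prompt_ids = tokenizer(text, return_tensors="pt").input_ids
    audio = model.generate(input_ids=desc_ids, prompt_input_ids=prompt_ids)
    return audio.cpu().numpy().squeeze()
```

Reproducing the timeline sync mentioned in EDIT2 would additionally require word-level timestamps (e.g. from WhisperX) to be fed into the synthesis step; the sketch above does not attempt that.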
Hi! I think this could be relevant for this project. Right now we have focused mostly on chatting with LLMs, but voice conversion is just around the corner; I don't see a reason why we wouldn't support it here.