Open Hassan-jr opened 1 month ago
Hi @Hassan-jr! I'm here to assist you with any bugs, questions, or contributions you have. Let's solve this together!
Yes, it is possible to use a different TTS model like Parler TTS as the base model. You can refer to a similar open issue where replacing the base model with a different one was suggested [1].
Here is a summarized approach to achieve this:

**Generate audio with Parler TTS**: Use Parler TTS to generate an audio file of a sentence (preferably longer than 10 seconds).

**Extract the tone color vector**: Use the `se_extractor` function to extract the tone color vector from the generated audio. The `se_extractor` function is implemented in `openvoice/se_extractor.py`, and its input and output specifications are as follows:

- `audio_path` (str): The path to the audio file.
- `vc_model` (object): The voice cloning model object.
- `target_dir` (str, optional): The directory where the processed files will be saved. Default is `'processed'`.
- `vad` (bool, optional): A flag indicating whether to use Voice Activity Detection (VAD) for splitting the audio. Default is `True`.

It returns the tone color embedding extracted by `vc_model`, together with `audio_name`, a string derived from the audio file name, the model version, and a hash of the audio content:

```python
def get_se(audio_path, vc_model, target_dir='processed', vad=True):
    device = vc_model.device
    version = vc_model.version
    print("OpenVoice version:", version)

    audio_name = f"{os.path.basename(audio_path).rsplit('.', 1)[0]}_{version}_{hash_numpy_array(audio_path)}"
    se_path = os.path.join(target_dir, audio_name, 'se.pth')

    # Split the reference audio into speech segments, with or without VAD.
    if vad:
        wavs_folder = split_audio_vad(audio_path, target_dir=target_dir, audio_name=audio_name)
    else:
        wavs_folder = split_audio_whisper(audio_path, target_dir=target_dir, audio_name=audio_name)

    audio_segs = glob(f'{wavs_folder}/*.wav')
    if len(audio_segs) == 0:
        raise NotImplementedError('No audio segments found!')

    return vc_model.extract_se(audio_segs, se_save_path=se_path), audio_name
```
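As a side note, the `audio_name`/`se_path` layout that `get_se` builds can be sketched with the standard library alone. `hash_numpy_array` is OpenVoice's own helper (it hashes the audio content), so a `hashlib` digest of the path stands in for it here purely for illustration:

```python
import hashlib
import os

def make_se_path(audio_path: str, version: str, target_dir: str = 'processed') -> str:
    # Mirrors get_se's naming scheme: <target_dir>/<stem>_<version>_<hash>/se.pth
    stem = os.path.basename(audio_path).rsplit('.', 1)[0]
    # Stand-in for OpenVoice's hash_numpy_array, which hashes the audio content;
    # here we hash the path string just to show the shape of the name.
    digest = hashlib.md5(audio_path.encode()).hexdigest()[:16]
    audio_name = f"{stem}_{version}_{digest}"
    return os.path.join(target_dir, audio_name, 'se.pth')

print(make_se_path('parler_sentence.wav', 'v2'))
```

So an extraction run on `parler_sentence.wav` with a v2 model would cache its embedding under `processed/parler_sentence_v2_<hash>/se.pth`.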
**Use the tone color vector**: Once you have the tone color vector, pass the audio of a new sentence read by your base speaker (Parler TTS), along with the base speaker's tone color vector and the tone color vector of the target speaker, to the tone color converter.

For more detailed steps, refer to `demo_part2.ipynb` in the OpenVoice repository, which provides an example of using an external TTS model as the base speaker.
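Putting the three steps together, the flow can be sketched as below. To keep the sketch self-contained, the model calls are stand-in stubs: in a real script, step 1 would use Parler TTS generation, and steps 2–3 would use `se_extractor.get_se` and the tone color converter's `convert` from the OpenVoice repository. All file names and stub signatures here are assumptions for illustration, not the real APIs.

```python
# Structural sketch of the Parler-TTS-as-base-speaker pipeline.
# Each stub marks where a real model call would go.

def parler_generate(text: str, wav_path: str) -> str:
    """Stub for Parler TTS synthesis: pretend we wrote `text` as audio to wav_path."""
    return wav_path

def extract_tone_color(wav_path: str) -> str:
    """Stub for se_extractor.get_se, which returns a tone color embedding."""
    return f"se({wav_path})"

def tone_color_convert(src_wav: str, src_se: str, tgt_se: str, out_path: str) -> str:
    """Stub for the tone color converter: re-voices src_wav from src_se to tgt_se."""
    return out_path

# Step 1: a long-ish sentence from Parler TTS, used to characterize the base voice.
base_wav = parler_generate("A sentence of more than ten seconds of speech.", "parler_base.wav")

# Step 2: tone color vectors for the Parler TTS base voice and the target speaker.
source_se = extract_tone_color(base_wav)
target_se = extract_tone_color("reference_speaker.wav")

# Step 3: synthesize the sentence you actually want, then convert its tone color.
new_wav = parler_generate("The sentence you want in the cloned voice.", "parler_new.wav")
output = tone_color_convert(new_wav, source_se, target_se, "cloned_output.wav")
print(output)
```

The key point the sketch makes is that Parler TTS only replaces the base speaker; the tone color extraction and conversion stages stay exactly as in OpenVoice.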
Does this help with better voice cloning? Using MeloTTS's English model as the base, the cloned voice differs vastly from the reference voice.
@meet1919 I am not sure. I wanted to experiment with it.
OpenVoice is amazing and I like it a lot. However, I wanted to experiment with different TTS models, like Parler TTS from Hugging Face, as the base model instead of MeloTTS.
How can I achieve that? Thanks.