myshell-ai / OpenVoice

Instant voice cloning by MIT and MyShell.
https://research.myshell.ai/open-voice
MIT License

I want to use Parler TTS as the base model, is it possible? #312

Open Hassan-jr opened 2 months ago

Hassan-jr commented 2 months ago

OpenVoice is amazing and I like it a lot. However, I want to experiment with different TTS models, such as Parler TTS from Hugging Face, as the base model instead of MeloTTS.

How can I achieve that? Thanks.

dosubot[bot] commented 2 months ago

Hi @Hassan-jr! I'm here to assist you with any bugs, questions, or contributions you have. Let's solve this together!

Yes, it is possible to use a different TTS model like Parler TTS as the base model. You can refer to a similar open issue where a solution was suggested for replacing the base model with something different [1].

Here is a summarized approach to achieve this:

  1. Generate Audio with Parler TTS: Use Parler TTS to generate an audio file of a sentence (preferably longer than 10 seconds); a generation sketch follows this list.

  2. Extract Tone Color Vector: Use the get_se function to extract the tone color vector from the generated audio. get_se is implemented in openvoice/se_extractor.py, and its input and output specifications are as follows:

    • Input:
      • audio_path (str): The path to the audio file.
      • vc_model (object): The voice cloning model object.
      • target_dir (str, optional): The directory where the processed files will be saved. Default is 'processed'.
      • vad (bool, optional): A flag indicating whether to use Voice Activity Detection (VAD) for splitting the audio. Default is True.
    • Output:
      • Returns a tuple containing:
        • The speaker embedding extracted by the vc_model.
        • The audio_name, a string derived from the audio file name, the model version, and a hash of the audio content.
    def get_se(audio_path, vc_model, target_dir='processed', vad=True):
        device = vc_model.device
        version = vc_model.version
        print("OpenVoice version:", version)
    
        # Name the output directory after the file name, model version, and a hash of the audio content.
        audio_name = f"{os.path.basename(audio_path).rsplit('.', 1)[0]}_{version}_{hash_numpy_array(audio_path)}"
        se_path = os.path.join(target_dir, audio_name, 'se.pth')
    
        # Split the reference audio into short segments, using either VAD or Whisper.
        if vad:
            wavs_folder = split_audio_vad(audio_path, target_dir=target_dir, audio_name=audio_name)
        else:
            wavs_folder = split_audio_whisper(audio_path, target_dir=target_dir, audio_name=audio_name)
    
        audio_segs = glob(f'{wavs_folder}/*.wav')
        if len(audio_segs) == 0:
            raise NotImplementedError('No audio segments found!')
    
        # Extract the speaker embedding (tone color vector) from the segments and save it to se_path.
        return vc_model.extract_se(audio_segs, se_save_path=se_path), audio_name
  3. Use the Tone Color Vector: Pass the tone color converter the audio of a new sentence read by your base speaker (Parler TTS), the tone color vector extracted from that audio, and the tone color vector of the target speaker; a conversion sketch is shown below.
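
For step 1, here is a minimal sketch of generating the base audio with Parler TTS, following the usage shown in the parler-tts README; the model name, description, and output file name are illustrative placeholders, not something prescribed by OpenVoice:

    import torch
    import soundfile as sf
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # Example checkpoint from the parler-tts project (placeholder; any Parler TTS model works).
    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

    # Parler TTS conditions on a free-text description of the voice plus the sentence to read.
    description = "A clear, neutral voice speaking at a moderate pace in a quiet room."
    prompt = ("OpenVoice separates base speech generation from tone color conversion, "
              "so any text-to-speech system can provide the base audio.")

    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Generate the waveform and save it; this file becomes the base speaker audio.
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio = generation.cpu().numpy().squeeze()
    sf.write("parler_base.wav", audio, model.config.sampling_rate)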

For more detailed steps, you can refer to demo_part2.ipynb in the OpenVoice repository, which provides an example of using an external TTS model as the base speaker.
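
Concretely, steps 2 and 3 might look like the following sketch, which mirrors the flow of the demo notebooks; the checkpoint paths and file names are assumptions to adapt to your setup, and parler_base.wav is the Parler TTS output from the sketch above:

    import torch
    from openvoice import se_extractor
    from openvoice.api import ToneColorConverter

    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # Load the tone color converter (checkpoint path is an assumption; adjust to your install).
    ckpt_converter = 'checkpoints_v2/converter'
    tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
    tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

    # Step 2: extract tone color vectors for the base speaker (the Parler TTS output)
    # and for the target speaker whose voice should be cloned.
    source_se, _ = se_extractor.get_se('parler_base.wav', tone_color_converter, vad=True)
    target_se, _ = se_extractor.get_se('reference_speaker.mp3', tone_color_converter, vad=True)

    # Step 3: convert the tone color of the base audio to the target speaker.
    tone_color_converter.convert(
        audio_src_path='parler_base.wav',
        src_se=source_se,
        tgt_se=target_se,
        output_path='parler_cloned.wav',
        message="@MyShell")  # watermark string, as used in the official demos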

To continue talking to Dosu, mention @dosu.

meet1919 commented 2 months ago

Does this help produce better voice cloning? With MeloTTS's English model as the base, the cloned voice differs a lot from the reference voice.

Hassan-jr commented 2 months ago

@meet1919 I am not sure. I wanted to experiment with it.