jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Tips to improve the quality of text to speech #69

Open tweeter0830 opened 7 months ago

tweeter0830 commented 7 months ago

Thanks for the great model!

Do you have any tips when using the model to clone voices for text to speech?

I'm converting the reference wav files to a 16000 Hz sample rate and the same format as the example wav file in the repo.

However, the results don't seem that great. The model often only mimics the general tone and gender of the reference, and the output often has pauses or slurring.

I'm calling it like this:

    def generate(self, wav_audio_file: pathlib.Path, audio_file_transcript: str, target_transcript: str) -> bytes:
        # take a look at demo/temp/mfa_alignment, decide which part of the audio to use as prompt
        target_transcript = f"{audio_file_transcript} {target_transcript}"
        print(target_transcript)

        # NOTE: 3 sec of reference is generally enough for high quality voice cloning, but longer is generally better, try e.g. 3~6 sec.
        audio_file_path = str(wav_audio_file)
        info = torchaudio.info(audio_file_path)
        audio_dur = info.num_frames / info.sample_rate

        # cut_off_sec = 4.01 # NOTE: according to forced-alignment file demo/temp/mfa_alignments/84_121550_000074_000000.csv, the word "common" stops at 3.01 sec; this will be different for different audio
        # assert cut_off_sec < audio_dur, f"cut_off_sec {cut_off_sec} is larger than the audio duration {audio_dur}"
        # prompt_end_frame = int(cut_off_sec * info.sample_rate)
        prompt_end_frame = -1

        # run the model to get the output
        # hyperparameters for inference
        codec_audio_sr = 16000
        codec_sr = 50
        top_k = 0
        top_p = 0.8
        temperature = 1
        silence_tokens=[1388,1898,131]
        kvcache = 1 # NOTE if OOM, change this to 0, or try the 330M model

        # NOTE adjust the below three arguments if the generation is not as good
        stop_repetition = 3 # NOTE: if the model generates long silences, reduce stop_repetition to 3, 2, or even 1
        sample_batch_size = 4 # NOTE: if there are long silences or unnaturally stretched words, increase sample_batch_size to 5 or higher. The model will generate sample_batch_size samples of the same audio and keep the shortest one, so if the generated speech is too fast, change it to a smaller number.
        seed = 1 # change seed if you are still unhappy with the result

        seed_everything(seed)
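        # --- sketch of the missing model call, following the repo's TTS demo notebook ---
        # (assumed: self.model, self.model_args, self.phn2num, self.text_tokenizer,
        #  self.audio_tokenizer, and self.device are set up elsewhere in this class,
        #  and `io` plus inference_one_sample from inference_tts_scale.py are imported)
        decode_config = {
            "top_k": top_k, "top_p": top_p, "temperature": temperature,
            "stop_repetition": stop_repetition, "kvcache": kvcache,
            "codec_audio_sr": codec_audio_sr, "codec_sr": codec_sr,
            "silence_tokens": silence_tokens, "sample_batch_size": sample_batch_size,
        }
        concated_audio, gen_audio = inference_one_sample(
            self.model, self.model_args, self.phn2num, self.text_tokenizer,
            self.audio_tokenizer, audio_file_path, target_transcript,
            self.device, decode_config, prompt_end_frame,
        )
        # serialize the generated waveform to bytes to match the return annotation
        buffer = io.BytesIO()
        torchaudio.save(buffer, gen_audio[0].cpu(), codec_audio_sr, format="wav")
        return buffer.getvalue()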

Am I missing something? Thank you!

dillionverma commented 7 months ago

:+1: same question here!

I was able to produce some sounds but the quality is... mediocre? How can we improve it?

edit: changing the seed parameter and making the target transcript only 1-2 sentences helped a bit (longer sentences cause the pitch to change for some reason).
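For anyone else hitting the pitch drift on long targets: a workaround (my own sketch, not from the repo; `generate_short` is a hypothetical wrapper around a single VoiceCraft call returning a (1, T) waveform tensor) is to split the target into sentences, generate each against the same reference, and concatenate:

    import re

    import torch

    def tts_long_form(generate_short, reference_wav, reference_transcript, long_text):
        """Stitch long-form TTS together from sentence-sized chunks."""
        # naive sentence splitter; swap in nltk/spacy if punctuation gets messy
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_text) if s.strip()]
        # each chunk stays in the short-utterance regime the model was trained on
        chunks = [generate_short(reference_wav, reference_transcript, s) for s in sentences]
        return torch.cat(chunks, dim=-1)  # concatenate along the time axis

Pitch can still drift between chunks, but each chunk itself stays short.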

jasonppy commented 7 months ago

How long is the target transcript? The model is trained on short sentences (average length 5 sec, although the longest training data goes to 20 sec), so you might want to finetune it on long utterances if that's your testing scenario.

Without finetuning, you could try increasing sample_batch_size and decreasing stop_repetition.
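Concretely, relative to the defaults in the snippet above, that amounts to something like (illustrative starting points, not tested values):

    stop_repetition = 1    # lower -> cut repeated codec tokens (long silences) off sooner
    sample_batch_size = 8  # higher -> draw more candidates and keep the shortest one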

In general, the current model is not trained to do TTS; it's trained to do speech editing, but it happens to generalize to TTS. I'm finetuning the model on a TTS objective and will release that model soon.

tweeter0830 commented 7 months ago

Thank you! I was using reference audio up to 12 seconds long + target transcript which is about 4 seconds long.

I’ll try using a reference of about 4 seconds + a target of about 4 seconds. Does that sound OK?

Also, when doing text to speech, I just concatenate the reference transcript and target transcript together and set prompt_end_frame to -1. Is that the correct thing to do?

jasonppy commented 7 months ago

> Thank you! I was using reference audio up to 12 seconds long + target transcript which is about 4 seconds long.
>
> I’ll try using a reference of about 4 seconds + a target of about 4 seconds. Does that sound OK?
>
> Also, when doing text to speech, I just concatenate the reference transcript and target transcript together and set prompt_end_frame to -1. Is that the correct thing to do?

All sound good.

jasonppy commented 7 months ago

Sometimes the speaker similarity can be a bit off; it's like the model uses a different voice than the prompt.

One thing I found can improve speaker similarity in those situations is to make sure the prompt is not an entire sentence but an unfinished one; the model will then follow the prompt voice more closely (see the sketch below for picking such a cut-off).

This is likely due to the noisy nature of GigaSpeech: some of the training utterances contain a speaker switch, i.e. two speakers take turns speaking in the same training utterance, so a completed sentence can cue the model to switch voices.
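To act on this, one option is to pick the cut-off from the forced-alignment CSV the demo writes under demo/temp/mfa_alignments, stopping after a word that leaves the sentence unfinished. A rough sketch, assuming word rows carry Begin/End/Label/Type columns as in the demo file (check your CSV's actual header):

    import csv

    def mid_sentence_cutoff(alignment_csv, max_words=8):
        """Return a cut_off_sec that ends right after a word, mid-sentence."""
        with open(alignment_csv, newline="") as f:
            # keep word-level rows only; column names assumed from the demo CSV
            words = [row for row in csv.DictReader(f) if row["Type"] == "words"]
        # stopping after the first few words almost always leaves the sentence unfinished
        return float(words[:max_words][-1]["End"])

    # cut_off_sec = mid_sentence_cutoff("demo/temp/mfa_alignments/84_121550_000074_000000.csv")
    # prompt_end_frame = int(cut_off_sec * info.sample_rate)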

jasonppy commented 7 months ago

The TTS-finetuned 330M model is up; it should be better than the 830M one.
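For anyone looking for it: the checkpoints are hosted on Hugging Face under pyp1/VoiceCraft. A minimal download sketch; the filename here is my guess from the repo's naming scheme, so verify it against the hub listing:

    import torch
    from huggingface_hub import hf_hub_download

    # filename assumed from the repo's naming convention; check the hub page
    ckpt_path = hf_hub_download(
        repo_id="pyp1/VoiceCraft",
        filename="gigaHalfLibri330M_TTSEnhanced_max16s.pth",
    )
    ckpt = torch.load(ckpt_path, map_location="cpu")  # weights + model args, as in the demo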

clearpathai commented 7 months ago

> The TTS-finetuned 330M model is up; it should be better than the 830M one.

Thank you for the release of the fine-tuned 330M TTS model. Its performance and efficiency are impressive. Your work is greatly appreciated, and I'm keen to see how it evolves to further support real-time use cases. Are there plans to develop future models with an emphasis on optimizing for real-time TTS applications?