rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

Finished Training, infer sounds better than output #107

Open · 97Cweb opened this issue 1 year ago

97Cweb commented 1 year ago

Me again. I managed to finish CPU training finally, and it sounds good when using infer. I exported the ONNX file, but I do not have an .onnx.json file. Where do I export that from? I tried using the one from Kathleen's voice, but it sounds terrible. Also, the player calls for a speaker ID. For me, 0 sounds terrible, but I guessed 20 and, other than being pitch shifted, it is less terrible.

I am trying to write a Python file, like the following code, that acts like the infer step shown in the training example on the main page.

Please help

```python
from python_run.piper import Piper
from functools import partial
from pathlib import Path

soxVoicePath = "voice-sox/sox.onnx"
voice = Piper(soxVoicePath)
synthesize = partial(
    voice.synthesize,
    speaker_id=20,
    length_scale=1,
    noise_scale=0.667,
    noise_w=0.8,
)

def main(text):
    wav_bytes = synthesize(text)

    wav_path = "output.wav"
    output_path = Path(wav_path)
    # output_path.mkdir(parents=True, exist_ok=True)

    output_path.write_bytes(wav_bytes)

if __name__ == '__main__':
    main("I bought you five minutes")
```

97Cweb commented 1 year ago

I don't know how to format the code, that should all be one file

kbickar commented 1 year ago

The json file should be in your training dir, you just need to rename it
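
A minimal sketch of that rename, assuming the paths from the script above — preprocessing writes a config.json into the training dir, and, if I recall correctly, the Python loader looks for a `<model>.onnx.json` next to the model by default:

```bash
# Copy the training config next to the exported model so it can be found.
# Paths are assumptions based on this thread; adjust to your own layout.
cp /path/to/training_dir/config.json voice-sox/sox.onnx.json
```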

97Cweb commented 1 year ago

@kbickar thank you for that. It still seems to be pitch-shifted down and scratchier than the infer output. Do you know what is causing that?

synesthesiam commented 1 year ago

Did you train a single or multi speaker model?

97Cweb commented 1 year ago

I am unsure which I trained. I followed the example commands on the main page for training:

```bash
python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --accelerator 'gpu' \
    --devices 1 \
    --batch-size 32 \
    --validation-split 0.05 \
    --num-test-examples 5 \
    --max_epochs 10000 \
    --precision 32
```

modified by removing the GPU options and setting the batch size to 4.

The only thing I found that mentions speaker id is in the jsonl file:

```json
{"text": "I bought you five minutes.", "audio_path": "/home/ubuntu/Documents/Rhasspy/python/Sox_Voice/sox-21001.wav", "speaker": "I bought you five minutes.", "speaker_id": 20, "phonemes": ["a", "ɪ", " ", "b", "ˈ", "ɔ", "ː", "t", " ", "j", "u", "ː", " ", "f", "ˈ", "a", "ɪ", "v", " ", "m", "ˈ", "ɪ", "n", "ɪ", "t", "s", "."], "phoneme_ids": [1, 0, 14, 0, 74, 0, 3, 0, 15, 0, 120, 0, 54, 0, 122, 0, 32, 0, 3, 0, 22, 0, 33, 0, 122, 0, 3, 0, 19, 0, 120, 0, 14, 0, 74, 0, 34, 0, 3, 0, 25, 0, 120, 0, 74, 0, 26, 0, 74, 0, 32, 0, 31, 0, 10, 0, 2], "audio_norm_path": "/home/ubuntu/Documents/Rhasspy/python/training_dir/cache/22050/fe76c655a2586c3419fdc8e27c6766da57554c1b8ab66c7570551cc52876e83c.pt", "audio_spec_path": "/home/ubuntu/Documents/Rhasspy/python/training_dir/cache/22050/fe76c655a2586c3419fdc8e27c6766da57554c1b8ab66c7570551cc52876e83c.spec.pt"}
{"text": "Hello Izzy!", "audio_path": "/home/ubuntu/Documents/Rhasspy/python/Sox_Voice/sox-32001.wav", "speaker": "Hello Izzy!", "speaker_id": 30, "phonemes": ["h", "ə", "l", "ˈ", "o", "ʊ", " ", "ˈ", "ɪ", "z", "i", "!"], "phoneme_ids": [1, 0, 20, 0, 59, 0, 24, 0, 120, 0, 27, 0, 100, 0, 3, 0, 120, 0, 74, 0, 38, 0, 21, 0, 4, 0, 2], "audio_norm_path": "/home/ubuntu/Documents/Rhasspy/python/training_dir/cache/22050/2464e0c00c853b46d4eea072a08f6f4e2a6869b3e47ecc8c1b1201fe7ae7b66f.pt", "audio_spec_path": "/home/ubuntu/Documents/Rhasspy/python/training_dir/cache/22050/2464e0c00c853b46d4eea072a08f6f4e2a6869b3e47ecc8c1b1201fe7ae7b66f.spec.pt"}
```

etc.

Lauler commented 1 year ago

If your data consists of only a single speaker, then you should include the argument --single-speaker in your command for launching training.

@synesthesiam I think the README would benefit from mentioning the --single-speaker arg. I too accidentally trained a multi-speaker model the first time I used Piper, because I didn't understand that the preprocessing would interpret my LJSpeech dataset as having multiple speakers.

I don't understand what logic is used to partition the data into multiple speakers based on the id column in LJSpeech. There's no mention at all of how to format multiple speakers in the linked LJSpeech dataset format description. If you know how to properly format LJSpeech with multiple speakers, it would be nice to include a paragraph about it in the README!
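
For what it's worth, my understanding is that --single-speaker is an argument of the preprocessing step (piper_train.preprocess) rather than of piper_train itself — a sketch, with the paths and language code as assumptions:

```bash
python3 -m piper_train.preprocess \
    --language en-us \
    --input-dir /path/to/dataset_dir/ \
    --output-dir /path/to/training_dir/ \
    --dataset-format ljspeech \
    --single-speaker \
    --sample-rate 22050  # match the sample rate of any checkpoint you plan to fine-tune
```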

97Cweb commented 1 year ago

@Lauler Thank you, that is the flag I was missing. Time to start retraining. Are there any other flags I should know about, considering this will take me another month to train since my GPU does not work?

synesthesiam commented 1 year ago

Are you training from scratch or fine-tuning from an existing voice? Fine-tuning will be much faster.

97Cweb commented 1 year ago

@synesthesiam I am training from scratch. I am the one trying to clone the Sox voice from Lightyear, as seen in other issues I have posted.

synesthesiam commented 1 year ago

I'd recommend using --resume_from_checkpoint to finetune an existing model instead of training from scratch.

I want to help, but I need to be careful around audio data that is copyrighted. Especially something associated with the mouse company.

97Cweb commented 1 year ago

I am using this solely for myself, and it will not be released anywhere publicly. I am using --resume_from_checkpoint to resume training of the system, but I started it from scratch. I don't know how to fine-tune an existing model to match the voice, as the instructions do not include that. Hopefully this is enough info for you to decide whether you can help.

synesthesiam commented 1 year ago

I would start over completely and use --resume_from_checkpoint with the "lessac" medium quality voice (22050 Hz sample rate). This will get you something in a few dozen epochs rather than thousands.

If you're using 16 kHz (low quality), you'll need to wait a bit as I'm retraining the lessac voice.

kbickar commented 1 year ago

@synesthesiam Are you saying you're going to upload a checkpoint from 16 kHz? Does it work to resume training a medium voice with 16 kHz audio?

synesthesiam commented 1 year ago

I'll be uploading a 16 kHz checkpoint tomorrow for lessac, yes. You cannot switch sample rates when resuming, unfortunately. The only difference between low and medium quality is the 16000 Hz vs. 22050 Hz sample rate.

97Cweb commented 1 year ago

I am back. I did not see your previous message. How would I train based on Lessac? I finished another batch of training, and while it is better than the last one, it still sounds pretty terrible.

kbickar commented 1 year ago

Basically do all the training the same, except when you run the train command use the flag --resume_from_checkpoint and pass the path to the lessac checkpoint downloaded from here: https://huggingface.co/datasets/rhasspy/piper-checkpoints/tree/main/en/en_US/lessac. It should start sounding good after very few epochs (50 was pretty good when I tested).

When I tried from scratch, it took around 20,000 epochs to get something that sounded good for certain phrases (but still garbled for words where my script didn't cover the phonemes used).
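
Concretely, that is just the training command from earlier in this thread plus the extra flag — a sketch, where the checkpoint path and the CPU-sized batch size are assumptions:

```bash
python3 -m piper_train \
    --dataset-dir /path/to/training_dir/ \
    --batch-size 4 \
    --validation-split 0.05 \
    --num-test-examples 5 \
    --max_epochs 10000 \
    --precision 32 \
    --resume_from_checkpoint /path/to/lessac/checkpoint.ckpt  # the .ckpt downloaded from the Hugging Face link above
# Add --accelerator 'gpu' --devices 1 back if you have a working GPU.
```

The dataset has to be preprocessed at the same sample rate as the checkpoint (22050 Hz for the medium lessac voice), since, as noted above, you cannot switch sample rates when resuming.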

97Cweb commented 1 year ago

Ok, I'll try that next. Thank you. If possible, could this be added to the main page/wiki for the train your own section?

kbickar commented 1 year ago

It was updated fairly recently with the instructions: https://github.com/rhasspy/piper/blob/master/TRAINING.md

97Cweb commented 1 year ago

Thank you for your help. It sounds soo much better! One last question. Is it possible to increase the pause between sentences? As in using the . in a sentence to delay longer? Right now it runs them together.

aaronnewsome commented 10 months ago

> Thank you for your help. It sounds soo much better! One last question. Is it possible to increase the pause between sentences? As in using the . in a sentence to delay longer? Right now it runs them together.

I too would like to know how to get more dead air between sentences. They tend to just get smashed right against the previous sentence without a rest.

97Cweb commented 10 months ago

> Thank you for your help. It sounds soo much better! One last question. Is it possible to increase the pause between sentences? As in using the . in a sentence to delay longer? Right now it runs them together.
>
> I too would like to know how to get more dead air between sentences. They tend to just get smashed right against the previous sentence without a rest.

What I ended up doing is manually adding silence using pydub's AudioSegment:

```python
from pathlib import Path

from nltk.tokenize import sent_tokenize  # requires nltk and its "punkt" data
from pydub import AudioSegment           # requires pydub (and ffmpeg for non-wav formats)

# `self.synthesize` is presumably the Piper synthesize wrapper from the
# earlier script in this thread, wrapped in a class.
total = AudioSegment.silent(duration=1)
sentences = sent_tokenize(text)
for i in range(len(sentences)):
    print(sentences[i])
    # Synthesize one sentence at a time and write it to a temporary wav file.
    wav_bytes = self.synthesize(sentences[i])
    tempPath = Path("temp.wav")
    tempPath.write_bytes(wav_bytes)

    # Append the sentence audio, then 250 ms of silence as the pause.
    wav = AudioSegment.from_wav("temp.wav")
    total += wav
    # if i < len(sentences) - 1:
    #     append silence only between sentences
    total += AudioSegment.silent(duration=250)
```

aaronnewsome commented 9 months ago

Thanks for the hint @97Cweb. I ended up going a different direction before I saw your post, although I think I like your method better. The web frontend I built for doing Piper TTS supports a bunch of different output formats: mp3, wav, aif, m4a, flac, ogg, etc. It uses ffmpeg to do all the work, so it was easier to just plug into that framework I built.

My web frontend also has a speed control, since some of the voices, even my own custom voice, sound a little slow to me. It also uses an external program to speed up or slow down the Piper-generated audio. I went that route before I realized Piper can speed up the audio itself, without external programs. But I've already built my speed control, so it is what it is.

My web frontend has two options for adding "rest" between sentences, really just silence. Fixed rest inserts a fixed amount of silence, and Random rest randomizes the length from 0.5 s to a maxlength that is configurable in my WebUI. I think the random rest sounds more natural; the only downside is that regenerating the same text twice results in an output file that's slightly longer or shorter than the last time it was run.

I've also enabled mp4 output in my WebUI, which shows a video of scrolling text and waveform display synced with the Piper generated TTS. I'm debating cleaning up the code and posting it for all to use. Either way, I have a few YouTube videos planned where I'll use the mp4 videos with Piper and my custom voice to narrate some videos, so you'll see it in action there.

Anyway, just wanted to give you a proper thanks for taking the time to share your code. Considering how amazing Piper is, it's a bit of a head scratcher as to why there's such low traffic here in the Issues. It's very lonely.