neonbjb / tortoise-tts

A multi-voice TTS system trained with an emphasis on quality
Apache License 2.0

is there any way to make it read longer like 20-30mins? #280

Closed philsouth closed 1 year ago

philsouth commented 1 year ago

I can get 23 seconds out of it, but you'd think longer would be possible. Also, are the emotional stresses automatic?

sbersier commented 1 year ago

For long generations: have a look at read.py at https://github.com/neonbjb/tortoise-tts Note: passages of several sentences might give bad results if they are too long. In that case, you should break up your text with the "|" symbol (in the case of read.py).

For the emotional stresses: if you give a sentence like "Look at that! What is it?", the exclamation mark and the question mark are taken into account. But as said on the main page under "prompt engineering", you can help tortoise by giving "scenic" indications between [] like: "[She said with a sad voice,] I can't stand it anymore." What is between brackets won't be said but seems to help tortoise.

philsouth commented 1 year ago

Interesting, thank you. So what's the syntax to get it to read a text file? I'm using the Colab version.

sbersier commented 1 year ago

I haven't tried it on Colab but just looking at the notebook: if you go to the last cell, you'll find these 2 lines:

!python3 tortoise/read.py --voice=train_atkins --textfile=tortoise/data/riding_hood.txt --preset=ultra_fast --output_path=.
IPython.display.Audio('train_atkins/combined.wav')

If you run this cell, it will read riding_hood.txt with atkins's voice. Now all you have to do is replace the text file with your text and the voice with the voice you want (and possibly change the preset to "standard"). But be aware that it will take a LOT of time... And the result won't be OK throughout the text. You will probably have to split the text with a "|" between sentences and remove empty lines...

NOTE: To see the possible options, create a new cell and paste the following line: !python3 tortoise/read.py --help
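The text clean-up described above (remove empty lines, put a "|" between sentences or small groups of sentences) could be scripted roughly as follows. This is only a sketch: the function name prepare_text, the 200-character limit and the naive sentence splitting are assumptions made for illustration, not something taken from read.py.

```python
import re


def prepare_text(raw: str, max_chars: int = 200) -> str:
    """Drop empty lines and join groups of sentences with '|' separators."""
    # Put the text on one line, discarding empty lines.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    text = " ".join(lines)

    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would make the chunk too long: start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return "|".join(chunks)
```

You would write the returned string to a text file and pass that file to --textfile. The right chunk length probably depends on the voice and the text, so expect to adjust max_chars by trial and error.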

sbersier commented 1 year ago

If you want to get something good, you'll need to spend quite some time trying different seeds and editing the result. In my case, I didn't spend that much time, so the result is a bit janky... But if you want an idea of the kind of things that can be done, here is Little Red Riding Hood with Stephen Fry's voice: https://drive.google.com/file/d/1VgZToukf7b3gBhZtH9Tf0YvGJ60JTsFC/view?usp=sharing (p.s.: I forgot a sentence... the second "Who's there?" (which you can hear as a long silence). I apparently lost it in the editing needed to get a reasonable result.)

Or, a try on "Viking tales by Jennie Hall, Part I, The Baby": https://www.gutenberg.org/cache/epub/24811/pg24811-images.html (with royalty-free music from Mystika: https://www.youtube.com/watch?v=rd1tyqLuoOo )

https://drive.google.com/file/d/1rkRC48F1JCytqxn5miWbnuVJAIvK6Fp0/view?usp=sharing (note: the text just goes up to 1'45")

philsouth commented 1 year ago

Wow, that's truly impressive. I'm especially impressed with the Stephen Fry voice. I have trouble getting it to do British English. The project I'm trying to do (and I've had no luck anywhere) is to make an AI voice of my co-writer (who sadly passed away in 2018) as part of the materials for the sequel to a game we wrote in the 80s. All my attempts have come to nothing, and although I'm not non-technical by any means, the methods of tweaking the voices to get them sounding good have so far eluded me.

I haven't tried it on Colab but just looking at the notebook: if you go to the last cell, you'll find these 2 lines:

!python3 tortoise/read.py --voice=train_atkins --textfile=tortoise/data/riding_hood.txt --preset=ultra_fast --output_path=.
IPython.display.Audio('train_atkins/combined.wav')

If you run this cell, it will read riding_hood.txt with atkins's voice. Now all you have to do is replace the text file with your text and the voice with the voice you want (and possibly change the preset to "standard"). But be aware that it will take a LOT of time... And the result won't be OK throughout the text. You will probably have to split the text with a "|" between sentences and remove empty lines... NOTE: To see the possible options, create a new cell and paste the following line: !python3 tortoise/read.py --help

Hmm, I can't see those lines. Am I missing something?

philsouth commented 1 year ago

Yep, looked again: those lines are not in the last cell of the Colab. Am I on the wrong one? Sorry, I'm sure this is really annoying :)

sbersier commented 1 year ago

Ah... I was talking about the notebook tortoise_tts.ipynb in the repository. But no problem:

1) Open the shared notebook.

2) Run cell 1 (just cell 1! Don't run cell 2 with the imports).

3) Select cell 1 and add a new cell below it with the [+ Code] button.

4) In the newly created cell, paste the following:

!python3 tortoise/read.py --voice=train_atkins --textfile=tortoise/data/riding_hood.txt --preset=ultra_fast --output_path=results

Note: you can change the preset to fast or standard. It takes longer, but the result is better.

After downloading the models, it will start generating the sentences from the riding_hood.txt file and put them into the tortoise-tts/results folder. If you double click on the generated samples (0.wav, 1.wav, ... combined.wav) it downloads them.

philsouth commented 1 year ago

Thank you! That seems to be working. I keep getting the "stop tokens" error but does that mean I just have to insert more "|"?

sbersier commented 1 year ago

Yes, splitting the text should help. And in some cases, you might have to be a bit creative in order to get the result.

A note regarding using Colab: I think the best way of using it would be to copy the notebook to your own Google Drive in MyDrive/Colab Notebooks and run it from there, because Google Colab doesn't much like downloading the same models/data again and again every time you use it. It takes an unnecessary amount of bandwidth.

To install it on your Google Drive:

0) Open the shared notebook, then: File --> Save a copy in Drive --> Runtime --> Manage sessions --> Terminate --> Close window

1) Go to your google drive and open the notebook from your google drive

2) Click on the Mount Drive button (next to the "eye") and follow the instructions

3) Add a line after the cell containing "from google.colab.... drive mount" In that line add: cd 'drive/MyDrive/Colab Notebooks'

4) Run the installation (the cell with the !pip3 install commands). It will create a tortoise-tts folder in your Colab Notebooks folder on your drive.

5) Add a new line:

!python3 tortoise/read.py --voice=train_atkins --textfile=tortoise/data/riding_hood.txt --preset=ultra_fast --output_path=results

And don't forget to close your session (Runtime --> Manage Session --> Terminate)

philsouth commented 1 year ago

That's amazing, thank you for your time. Plenty to keep me busy there. :) One thing, how do you build pauses in? I could cut the audio and edit it, but of course the background room tone wouldn't be there so it would sound disjointed.

sbersier commented 1 year ago

I'm not sure I understand your problem. What background room tone? Are you talking about creating a new voice using your own recordings?

philsouth commented 1 year ago

Sorry, I wasn't clear. In the "weaver" voice, for example, the training audio is from a conference, obviously spoken over a microphone, and the background tone of the room is slightly noisy. This "room tone" has transferred to the voice when generated. If I add space between the words in editing, the spaces lack the same room tone and so will be silent, causing noticeable gaps. So can I add gaps or pauses in the generated speech with notations in the TXT file it is reading?

sbersier commented 1 year ago

Well, I listened to the input audio in the tortoise-tts/tortoise/voices/weaver folder and indeed the audio isn't good enough (in my opinion) to generate a good output. Too much noise, reverberation, and there is even a moment where you hear the interviewer... There is nothing (meaningful) you can do with it. In order to get good results you have to be picky with the audio you put in.

philsouth commented 1 year ago

Well, I listened to the input audio in the tortoise-tts/tortoise/voices/weaver folder and indeed the audio isn't good enough (in my opinion) to generate a good output. Too much noise, reverberation, and there is even a moment where you hear the interviewer... There is nothing (meaningful) you can do with it. In order to get good results you have to be picky with the audio you put in.

Understood. Thanks again for your time, I really appreciate it. I'm still not sure how the Stephen Fry one was achieved, because all my attempts at getting an accurate British accent with my samples have rendered American results. Can you tell me how you tweaked it to get the result so true to the original? Also, did you program in those changes of voice for the granny etc.? Those were amazing.

sbersier commented 1 year ago

For Stephen Fry, I downloaded the video: https://www.youtube.com/watch?v=r1BeLqDen70 I extracted the audio and split it into utterances, which I put into a subfolder named Stephen_Fry in the tortoise/voices folder. Note that it gave me 1h22 min of audio (471 audio clips). That's certainly massive overkill, but since I was able to automate this process it was OK for me. You can also get good results with much less.

Then I extracted the voice latents with tortoise/get_conditioning_latents.py (cf. doc.). I moved the audio clips into a "Clips" folder (just to keep them) and copied the resulting latent (.pth file) into my voices/Stephen_Fry folder.

You will note that in the video Stephen Fry also "makes voices" for the characters. I think it helped a lot when it came to making the wolf/riding hood/grandmother voices. Of course you have to generate the same lines multiple times and select the ones you like. It is a slow, very slow process. But, strangely enough, it found the right voices pretty easily... As if it knew that a wolf should have a low booming voice and so on. Almost as if it knew the Little Red Riding Hood story... I can't explain it.

So, can Tortoise generate British-accented speech? Of course, it just depends on the audio you put in. If you feed it with Scottish, Irish, Texan, New Yorker, ... speaker audio content, it will generate a voice with these accents. If you want a voice with a Finnish accent, you can also do it. Just find the right audio.

philsouth commented 1 year ago

Fascinating how it "knows" stuff like that. I suppose the same way it "knows" prosody.

Is any of that possible with the Colab notebook or would I have to have a local install?

philsouth commented 1 year ago

(That said I don't have a very good Nvidia GPU and the installation process is by no means trivial. Maybe that's just me. :) )

philsouth commented 1 year ago

Well I can give it a try. Thanks for all your pointers you've been very helpful.

sbersier commented 1 year ago

(Sorry, I deleted my previous comment because it was inaccurate.) Colab or local install? I prefer a local install because your time on Colab is limited and playing with tortoise takes a LOT of time. Also, I prefer to have everything on my local drive. It's simpler. I have an RTX 3060 (12 GB VRAM) and it is perfectly fine. But, apparently, you can get by with much less: Tortoise adapts to the GPU size (see lines 179-193 in https://github.com/neonbjb/tortoise-tts/blob/main/tortoise/api.py ). But the larger the VRAM, the faster it will run. In my case, Tortoise tops out at about 9 GB of used VRAM and ~7 GB of RAM.

Now, installation might indeed be a bit tricky. But if you closely follow the installation steps, it should be (more or less) OK...

philsouth commented 1 year ago

Then I extracted the voice latents with tortoise/get_conditioning_latents.py (cf. doc.). I moved the audio clips into a "Clips" folder (just to keep them) and copied the resulting latent (.pth file) into my voices/Stephen_Fry folder.

Sorry can you express that in more detail? What's the syntax for "tortoise/get_conditioning_latents.py" for example. I have 10 second clips (all 22050hz mono) but I keep getting errors when I try to process them. Like so:

I use python tortoise/get_conditioning_latents.py --voice john_new and get

C:\Users\User\tortoise-tts>python tortoise/get_conditioning_latents.py --voice john_new
Traceback (most recent call last):
  File "tortoise/get_conditioning_latents.py", line 23, in <module>
    cond_paths = voices[voice]
KeyError: 'john_new'

And

C:\Users\User\tortoise-tts\tortoise\utils\audio.py:17: WavFileWarning: Chunk (non-data) not understood, skipping it.

But I have made a local install. It seems to be working! At least I ran the test phrase and it made voices. It made it three times for some reason I'm not clear on, but it did it!

sbersier commented 1 year ago

If you type: python tortoise/get_conditioning_latents.py --help

It will show the "help" message :

usage: get_conditioning_latents.py [-h] [--voice VOICE] [--output_path OUTPUT_PATH]

options:
  -h, --help            show this help message and exit
  --voice VOICE         Selects the voice to convert to conditioning latents
  --output_path OUTPUT_PATH
                        Where to store outputs.

So, in your case:

1) You need a folder named john_new placed in the tortoise/voices/ folder.

2) The john_new folder should contain the audio clips.

3) From the tortoise-tts folder, run: python tortoise/get_conditioning_latents.py --voice john_new --output_path LATENTS

It will create a folder named LATENTS in the current folder (i.e. in tortoise-tts) containing a file named john_new.pth. Now copy this file to the voices/john_new folder and move the audio clips into a subfolder, let's say "audio_clips" (to keep them for further experiments; they are not required anymore, so you can also just delete them). So now you should have a .pth file and a folder (or just the .pth file).
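The file shuffling described above can also be scripted. Here is a minimal sketch using only the Python standard library; the helper name install_latent and the "audio_clips" default are just illustrative choices, and it assumes the clips are .wav files sitting directly in the voice folder.

```python
import shutil
from pathlib import Path


def install_latent(latent_file: Path, voice_dir: Path, clips_subdir: str = "audio_clips") -> None:
    """Move the audio clips into a subfolder and copy the latent into the voice folder."""
    archive = voice_dir / clips_subdir
    archive.mkdir(exist_ok=True)

    # Take a snapshot of the clip list first, since we move files while looping.
    for clip in list(voice_dir.glob("*.wav")):
        shutil.move(str(clip), str(archive / clip.name))

    # Copy (not move) the latent, so the original stays in the output folder.
    shutil.copy2(latent_file, voice_dir / latent_file.name)
```

For the case above, the call would look something like install_latent(Path("LATENTS/john_new.pth"), Path("tortoise/voices/john_new")), run from the tortoise-tts folder.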

From there (cd to the tortoise-tts folder): python tortoise/read.py --help You'll see a lot of options:

So, you could generate audio with something like: python tortoise/read.py --textfile tortoise/data/riding_hood.txt --voice john_new --output_path john_new_sentences --preset standard --candidates 3

philsouth commented 1 year ago

Wow that's great thanks. Sorry for monopolising your time. I'm getting some great results already.

https://audio.com/philsouth/john-voice-clone-test

It sounds about 60-70% there already and I haven't even done the latents yet. Thank you so much this has been frustrating me for weeks.

sbersier commented 1 year ago

Yeah! It looks great already! Well done! Note that generating the conditioning latents (with the get_conditioning_latents.py script) won't improve the quality. Generation will just be a bit faster. If you want to improve the quality, then you have to improve the audio clips, e.g. by carefully denoising the clips, maybe applying a bit of equalization, and increasing the number of clips (if possible). You can do all that with software like Audacity.

philsouth commented 1 year ago

Not sure how much more audio I can get. I'll ask around. Are there other ways to improve the likeness?

sbersier commented 1 year ago

You can try generation with the tortoise_tts.py script located in the tortoise-tts/scripts folder. For the help: python scripts/tortoise_tts.py --help You'll see there are a lot of parameters you can give. In case the voice jumps "all over the place", you can try to set the "temperature" parameter to a lower value. The default is 0.8, but if you set it to 0.3 it will "tame the beast" a bit while retaining pretty good expressivity. Increasing the number of autoregressive samples is another way of improving the result. If it keeps repeating words, then you can play with the repetition-penalty parameter and increase it from its default value of 2.0 to, let's say, 5. And so on, and so on... Tortoise is a complex piece of software, and what makes it great (the "randomness") is also what makes it difficult and time-consuming to use. But in the end, the quality of the audio clips you have is what matters the most.
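To give an intuition for why a lower temperature "tames the beast": in sampling-based generation, the temperature usually rescales the model's scores before they are turned into probabilities. The snippet below is a generic illustration of that idea with made-up scores. It is not Tortoise's actual code, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0.8, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature the top candidate takes a larger share of the probability, so sampling strays less often into unlikely choices. That would be consistent with the voice being less "all over the place", at some cost in variety.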

philsouth commented 1 year ago

Fabulous, thanks. That's great info. You've saved me a ton of time trying to work it out. You've got me like 80% towards where I wanted to be and I'm much obliged. :)

s-b-repo commented 1 year ago

In order to make a TTS program read text for 20-30 minutes, you will need to make sure that your program can handle long text inputs efficiently and generate the speech output in real-time. Here are some tips that you can use to achieve this:

Text Pre-processing: Make sure to pre-process the text input to remove any irrelevant or redundant information that is not necessary for generating speech output. This can help reduce the size of the input and improve the performance of the TTS engine.

Text Segmentation: If the text input is very long, you may want to segment it into smaller chunks and generate speech output for each chunk separately. This can help prevent memory overflow and improve the stability of your program.

Speech Generation Optimization: Consider optimizing the speech generation process to reduce the time required to generate speech output. For example, you can use parallel processing or GPU acceleration to speed up the generation process.

TTS Engine Selection: Choose a TTS engine that is designed to handle long text inputs efficiently and generate high-quality speech output in real-time. You can consider using commercial TTS engines or open-source TTS engines such as Festival, eSpeak, or gTTS.

Note: The performance of your TTS program will also depend on the hardware and software configurations of your system, so make sure to use a system with adequate memory, processing power, and storage capacity.

philsouth commented 1 year ago

Well the latents thing is still going wonky :)

C:\Users\User\tortoise-tts>python tortoise/get_conditioning_latents.py --voice john_new --output_path LATENTS

Traceback (most recent call last):
  File "tortoise/get_conditioning_latents.py", line 23, in <module>
    cond_paths = voices[voice]
KeyError: 'john_new'

Dunno what's going wrong there. How odd.

sbersier commented 1 year ago

What do you have in the tortoise/voices/john_new folder?

philsouth commented 1 year ago

The sample clips, the ones I used for training.

sbersier commented 1 year ago

So, something like: clip1.wav, clip2.wav, .... ? And nothing else? No misspelling? Can you try with: python tortoise/get_conditioning_latents.py --voice deniro --output_path LATENTS Does it work? EDIT: output-path --> output_path

philsouth commented 1 year ago

So, something like: clip1.wav, clip2.wav, .... ? And nothing else? No misspelling?

They are called jm-01.wav, jm-02.wav, ... etc. and the folder is indeed "voices/john_new"

I'm wondering if it's because I chopped the file up automatically with audible into 10 second chunks and it chops words off.

Can you try with: python tortoise/get_conditioning_latents.py --voice deniro --output-path LATENTS

I'm giving it a go now.

sbersier commented 1 year ago

Ooops... I meant: python tortoise/get_conditioning_latents.py --voice deniro --output_path LATENTS It should work.

Now, if the folder doesn't exist, it will generate the error you get. For example, python tortoise/get_conditioning_latents.py --voice kjhasdjhdasdd --output_path LATENTS

will generate the same error (KeyError...) because there is no folder named "kjhasdjhdasdd" in the voices folder.

philsouth commented 1 year ago

AH ok LOL now I've got you doing it :) hmm, it's entirely possible I made an error. I'll check.

sbersier commented 1 year ago

Note: The error will be the same if the folder exists but is empty.
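Since a missing folder and an empty folder both end in the same KeyError, a quick check before running the script can tell the two cases apart. A minimal sketch: the function name is hypothetical, and the list of accepted extensions (.wav, .mp3) is an assumption, so adjust it to whatever your clips are.

```python
from pathlib import Path


def check_voice_folder(voices_root: Path, voice: str) -> list:
    """Return the clip names found for a voice, or raise with a readable message."""
    folder = voices_root / voice
    if not folder.is_dir():
        raise FileNotFoundError(f"No folder named {voice!r} in {voices_root}")

    # Assumed clip extensions; other files (e.g. a .pth latent) are ignored here.
    clips = sorted(p.name for p in folder.iterdir() if p.suffix.lower() in {".wav", ".mp3"})
    if not clips:
        raise ValueError(f"Folder {folder} exists but contains no audio clips")
    return clips
```

For example, check_voice_folder(Path("tortoise/voices"), "john_new"), run from the tortoise-tts folder, would either list the clips or say which of the two problems it is. A misspelled folder name shows up as the first error.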

philsouth commented 1 year ago

Okay, that worked, so there must be an error in mine somewhere. I'll check. I'm a tiny bit dyslexic sometimes unless I concentrate hard, so it's entirely on the table that I might have fudged it up. :)

philsouth commented 1 year ago

Yes sure, I'm sure we've solved a few problems for people along the way, but yes. quite right. Talk soon.

htnha commented 1 month ago

https://drive.google.com/file/d/1rkRC48F1JCytqxn5miWbnuVJAIvK6Fp0/view?usp=sharing (note: the text just goes up to 1'45")

How can I make a sound like this? please help!

sbersier commented 1 month ago

https://drive.google.com/file/d/1rkRC48F1JCytqxn5miWbnuVJAIvK6Fp0/view?usp=sharing (note: the text just goes up to 1'45")

How can I make a sound like this? please help!

It was some time ago... I'm not sure I remember everything.

For the man's voice I think I used the voice from here: https://www.youtube.com/watch?v=-zgKyBlyjL8 A famous actor with a beautiful voice... Note that the original audio comes from librivox but then was enhanced (in a very nice way) to be published on the youtube channel. I give you the voice embedding : https://drive.google.com/file/d/16GniOo6YmmstJEk3kX4qxz7pfzhuYhf9/view?usp=sharing

For the woman's voice, I think I used the voice from Sonia, an experienced reader at librivox: https://librivox.org/the-house-of-orchids-and-other-poems-by-george-sterling/ The embedding: https://drive.google.com/file/d/1433fVDbNNvWt10W_UcflM4s9MHPaT1zY/view?usp=sharing

Now, the result doesn't come straight out of the box. I generated the audio, generally sentence by sentence, while keeping the same random seed. Then I chose the best utterances and edited/stitched them together in Audacity.
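If you only need the plain stitching part (no editing), it can be done without Audacity. Below is a rough sketch with Python's standard wave module. It assumes all the chosen clips are PCM WAV files with the same sample rate, sample width and channel count; the wave module does not read other encodings (for example 32-bit float WAV), so such files would need converting first. The function name is hypothetical.

```python
import wave


def stitch_wavs(paths, out_path):
    """Concatenate PCM WAV files that share the same format into one file."""
    params = None
    frames = []
    for path in paths:
        with wave.open(str(path), "rb") as clip:
            fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if params is None:
                params = fmt  # the first clip defines the expected format
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format")
            frames.append(clip.readframes(clip.getnframes()))
    if params is None:
        raise ValueError("no input files given")

    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(params[0])
        out.setsampwidth(params[1])
        out.setframerate(params[2])
        out.writeframes(b"".join(frames))
```

The clips are joined in the order given, so pass the paths of the selected utterances in reading order.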

The important thing is to have good audio and enough audio (~1 h). I would add that taking audio in the same "genre" will help. For example, I don't think I would have achieved the same result with Sonia reading the news on some random TV news channel.

I would also add that passing the audio through a denoiser (e.g. demucs) and possibly a voice enhancer might help you with (not too) poor audio.