myshell-ai / MeloTTS

High-quality multi-lingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese, and Korean.
MIT License

Training vs tensorboard metrics #211

Open · smlkdev opened this issue 2 weeks ago

smlkdev commented 2 weeks ago

Will my training yield better results over time? Currently, the training took about 9 hours. I have 1500 wav samples, with a total audio length of approximately 2 hours.

[Screenshot: TensorBoard training curves, 2024-11-08 11:53]

What other metrics should I pay attention to in TensorBoard?

smlkdev commented 1 week ago

Update after ~34h: a little improvement is visible, but I'm not sure whether I should keep it running longer given the flattening.

[Screenshots: TensorBoard training curves, 2024-11-09 10:41]

jeremy110 commented 1 week ago

We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.

smlkdev commented 1 week ago

> We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.

@jeremy110 Thank you for your response! I’m honestly a bit hooked on watching the progress as it keeps going down, so I can’t seem to stop checking in :-)

Currently at 68 hours.

[Screenshot: TensorBoard training curves, 2024-11-10 22:05]

I’m planning to create an 8-10 hour audio dataset for the next training session. Could you suggest what kind of text data I should gather for it? So far, I’ve used random articles and some ChatGPT-generated data, but I’ve heard that people sometimes read books, for example. Is there perhaps a publicly available dataset of quality English sentences that covers a variety of language phenomena? I tried to find one but had no luck.

jeremy110 commented 1 week ago

@smlkdev Basically, this training can be kept short since it’s just a fine-tuning session; no need to make it too long. Here’s my previous tensorboard log for your reference (https://github.com/myshell-ai/MeloTTS/issues/120#issuecomment-2105728981).

I haven’t specifically researched text types. My own dataset was professionally recorded, with sentences that resemble reading books. I’m not very familiar with English datasets—are you planning to train in English?

smlkdev commented 1 week ago

This is my first attempt at ML/training/voice cloning, and I decided to use English. I briefly read the Thai thread, and it was way too complex for me to start with.

Your training was 32 hours long, and to me (I'm not an expert) the inferred voice matched the original :) That's really nice. Is that the voice that had 8-10 hours of audio, as you mentioned earlier?

jeremy110 commented 1 week ago

Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours.

If this is your first time getting into it, I recommend you try F5-TTS. There are a lot of people in the forums who have trained their own models, and some even wrote a Gradio interface, which is very convenient.

smlkdev commented 1 week ago

@jeremy110 thank you for your responses.

Is F5-TTS better than MeloTTS in terms of quality?

I just realized that my cloned MeloTTS voice doesn’t add breaks between sentences. I have to add them manually: splitting the text into sentences, breaking it into smaller parts, generating each one, and then merging everything back together after adding pauses. This can be automated, of course, but it's still a bit of work. (I was focusing on single sentences before, and I liked the quality.)
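Here's a rough sketch of that manual workflow, assuming the melo.api.TTS interface from the MeloTTS README and pydub for joining the clips; the speaker key, pause length, and sentence-splitting regex are just placeholders:

```python
# Sketch: split text into sentences, synthesize each one with MeloTTS,
# then join the clips with a fixed pause in between.
# Assumes the melo.api.TTS interface shown in the README and pydub for audio concatenation.
import re
from melo.api import TTS
from pydub import AudioSegment

text = "First sentence. Second sentence! And a third one?"
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

model = TTS(language='EN', device='auto')      # pretrained English model
speaker_id = model.hps.data.spk2id['EN-US']    # pick any key from model.hps.data.spk2id

pause = AudioSegment.silent(duration=400)      # 400 ms pause between sentences (arbitrary)
combined = AudioSegment.empty()
for i, sentence in enumerate(sentences):
    path = f'sentence_{i}.wav'
    model.tts_to_file(sentence, speaker_id, path, speed=1.0)
    combined += AudioSegment.from_wav(path) + pause

combined.export('combined.wav', format='wav')
```

The fixed 400 ms pause is arbitrary; making it depend on the punctuation mark (longer after periods, shorter after commas) would be a natural refinement.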

jeremy110 commented 1 week ago

In terms of quality, I think F5-TTS is quite good. You can try it out on the Huggingface demo.

The pauses within sentences mainly depend on your commas (","). The program adds a space after punctuation to create a pause. However, if the audio files you trained on have very little silence before and after the speech, the generated audio will also have little silence. Of course, you can add the pauses manually, but you could also address it by adjusting the training data.
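If you go the route of adjusting the training data, a minimal sketch for normalizing the leading/trailing silence in the wavs could look like this (librosa and soundfile assumed; the 30 dB threshold and 200 ms of padding are arbitrary starting points):

```python
# Sketch: strip existing leading/trailing silence from each training wav,
# then pad a consistent short silence at both ends.
import glob
import numpy as np
import librosa
import soundfile as sf

PAD_MS = 200  # amount of silence to keep at each end, in milliseconds

for path in glob.glob('data/wavs/*.wav'):
    y, sr = librosa.load(path, sr=None)
    trimmed, _ = librosa.effects.trim(y, top_db=30)   # remove existing silence
    pad = np.zeros(int(sr * PAD_MS / 1000), dtype=trimmed.dtype)
    sf.write(path, np.concatenate([pad, trimmed, pad]), sr)
```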

kadirnar commented 3 days ago

@smlkdev I am training the MeloTTS model with sentiment data, but I couldn't get the TensorBoard graphs to work. Can you share your sample code?

kadirnar commented 3 days ago

These are written in the train.log file. Training is still ongoing. Are these important?

2024-11-19 09:22:10,339 example ERROR   enc_p.language_emb.weight is not in the checkpoint
2024-11-19 09:22:10,340 example ERROR   emb_g.weight is not in the checkpoint
smlkdev commented 3 days ago

> @smlkdev I am training the MeloTTS model with sentiment data, but I couldn't get the TensorBoard graphs to work. Can you share your sample code?

I used the simplest command possible:

tensorboard --logdir PATH, where PATH is the logs folder inside ...MeloTTS/melo/logs/checkpoint_name (i.e., the folder containing the checkpoints).
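If the graphs still don't show up, one way to check whether the run is writing scalars at all is to read the event files directly. This is only a sketch: the log path is a placeholder, and the loss/g/total tag name is an assumption based on the VITS-style logging MeloTTS seems to use.

```python
# Sketch: read the TensorBoard event files directly to confirm scalars are being logged.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

logdir = 'MeloTTS/melo/logs/checkpoint_name'  # placeholder: point at your run's log folder
acc = EventAccumulator(logdir)
acc.Reload()

print(acc.Tags()['scalars'])            # every scalar tag found in the event files
for event in acc.Scalars('loss/g/total'):
    print(event.step, event.value)      # the same curve TensorBoard would plot
```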

manhcuong17072002 commented 1 day ago

@jeremy110 Hello, I would like to inquire about the data preparation process when training on multiple speakers. Is it necessary for each speaker to have a comparable amount of data? For instance, if Speaker A has 10 hours of audio and Speaker B only has 1 hour, is it possible to create a good model, or does Speaker B also require approximately 10 hours of audio? Thank you

jeremy110 commented 1 day ago

@manhcuong17072002 Hello~ In my training, some speakers had 1 or 2 hours of audio, while others had 30 minutes, and in the end, there were about 10 hours of total data. I was able to train a decent model, but for speakers with less data, their pronunciation wasn't as accurate.

manhcuong17072002 commented 22 hours ago

@jeremy110 Oh, if that's the case, that's wonderful. Collecting data and training the model will become much easier with your idea. So, when training, you must have used many speaker IDs, right? And do you find their quality sufficient for deployment in a real-world environment? I'm really glad to hear your helpful feedback. Thank you very much!

jeremy110 commented 21 hours ago

@manhcuong17072002

Yes, there are about 15 speakers. Of course, if you have enough people, you can continue to increase the number. After 10 hours, the voice quality is quite close, but if you want better prosody, you might need more speakers and hours.

From the TTS systems I've heard, the voice quality is above average, but when it comes to deployment, you also need to consider inference time. For this, MeloTTS is quite fast.

manhcuong17072002 commented 20 hours ago

@jeremy110 Thank you for the incredibly helpful information. Let me summarize a few points:

However, I've experimented with various TTS models and noticed that if the text isn't broken down into smaller chunks, the generated speech quality degrades towards the end of longer passages. Have you tested this with MeloTTS? If so, could you share your experimental process? Thank you so much.

jeremy110 commented 10 hours ago

@manhcuong17072002 You're welcome, your conclusion is correct.

Normally, during training, long audio files are avoided to prevent GPU OOM (Out of Memory) issues. Therefore, during inference, punctuation marks are typically used to segment the text, ensuring that each sentence is closer to the length used during training for better performance. MeloTTS performs this segmentation based on punctuation during inference, and then concatenates the individual audio files after synthesis.

manhcuong17072002 commented 1 hour ago

@jeremy110 I'm so sorry, but I suddenly have a question about training on a multi-speaker dataset. Is it possible for Speaker A to pronounce words that appear in other speakers' data but not in Speaker A's? Because if not, dividing the dataset among multiple speakers would be pointless, and the model would not be able to cover the entire vocabulary of a language. Have you tried this before, and what are your thoughts? Thank you.

jeremy110 commented 33 minutes ago

@manhcuong17072002 If we consider 30 minutes of audio, assuming each word takes about 0.3 seconds, there would be around 5000–6000 words. These words would then be converted into phoneme format, meaning they would be broken down into their phonetic components for training. With 6000 words, the model would learn most of the phonemes. However, when a new word is encountered, it will be broken down into the phonemes it has already learned. I haven't done rigorous testing, but in my case, the model is able to produce similar sounds.
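To illustrate that decomposition, here's a quick check with the g2p_en package (which, as far as I know, is the grapheme-to-phoneme converter MeloTTS's English frontend uses); the example words are arbitrary:

```python
# Quick illustration: any word, seen or unseen, decomposes into a fixed phoneme
# inventory, so a model that has learned the phonemes can attempt new words.
from g2p_en import G2p

g2p = G2p()
print(g2p("hello"))             # ['HH', 'AH0', 'L', 'OW1']
print(g2p("flibbertigibbet"))   # a rare word still maps onto the same phoneme set
```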