Update after ~34h: little improvement is visible, but I'm not sure whether I should keep training longer, since the curves are flattening.
We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially used around 8 to 10 hours for training.
@jeremy110 Thank you for your response! I’m honestly a bit hooked on watching the progress as it keeps going down, so I can’t seem to stop checking in :-)
Currently at 68 hours.
I’m planning to create an 8-10 hour audio dataset for the next training session. Could you suggest what kind of text data I should gather for it? So far, I’ve used random articles and some ChatGPT-generated data, but I’ve heard that people sometimes read books, for example. Is there perhaps a dataset available with quality English sentences that covers a variety of language phenomena? I tried to find one but had no luck.
@smlkdev Basically, this training can be kept short since it’s just a fine-tuning session; no need to make it too long. Here’s my previous tensorboard log for your reference (https://github.com/myshell-ai/MeloTTS/issues/120#issuecomment-2105728981).
I haven’t specifically researched text types. My own dataset was professionally recorded, with sentences that resemble reading books. I’m not very familiar with English datasets—are you planning to train in English?
This is my first attempt at ML/training/voice cloning, and I decided to use English. I briefly read the Thai thread, and it was way too complex for me to start with.
Your training was 32 hours long, and to my ear (I'm not an expert) the inferred voice matched the original :) That's really nice. Is that the voice that had 8-10 hours of audio, as you mentioned earlier?
Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours.
If this is your first time getting into it, I recommend you try F5-TTS. There are a lot of people in the forums who have trained their own models, and some even wrote a Gradio interface, which is very convenient.
@jeremy110 thank you for your responses.
Is F5-TTS better than MeloTTS in terms of quality?
I just realized that my cloned MeloTTS voice doesn’t add breaks between sentences. I have to add them manually: split the text into sentences, break it into smaller parts, generate each part, and then merge everything back together after adding pauses. This can of course be automated, but it’s still a bit of work. (I was focusing on single sentences before, and I liked the quality.)
In terms of quality, I think F5-TTS is quite good. You can try it out on the Huggingface demo.
The pauses within sentences mainly depend on your commas (","). The program adds a space after punctuation to create a pause. However, if the audio files you trained on have very little silence before and after the speech, the generated audio will also have little silence. Of course, you can add the pauses manually, but you could also address it by adjusting the training data.
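For reference, here is a minimal sketch of the "generate per sentence, then merge with pauses" workaround mentioned above, assuming the per-sentence WAVs have already been produced by MeloTTS and that pydub is installed. The file names and the 300 ms pause are placeholders, not anything MeloTTS prescribes:

```python
# Sketch: stitch per-sentence WAV files together with a fixed silent pause.
# sentence_001.wav ... are hypothetical per-sentence outputs from MeloTTS;
# the 300 ms pause length is an arbitrary choice to tune by ear.
from pydub import AudioSegment

sentence_files = ["sentence_001.wav", "sentence_002.wav", "sentence_003.wav"]
pause = AudioSegment.silent(duration=300)  # 300 ms of silence between sentences

merged = AudioSegment.empty()
for path in sentence_files:
    merged += AudioSegment.from_wav(path) + pause

merged.export("merged.wav", format="wav")
```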
@smlkdev I am training the melotts model with sentiment data. But I couldn't get the tensorboard graphs to work. Can you share your sample code?
These are written in the train.log file. Training is still ongoing. Are these important?
2024-11-19 09:22:10,339 example ERROR enc_p.language_emb.weight is not in the checkpoint
2024-11-19 09:22:10,340 example ERROR emb_g.weight is not in the checkpoint
I used the simplest command possible:
tensorboard --logdir PATH
where PATH is the logs folder inside ...MeloTTS/melo/logs/checkpoint_name
(i.e., the folder containing the checkpoints).
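If the graphs still don't show up, a quick hedged check (using the standard tensorboard Python package, nothing MeloTTS-specific) is to confirm that the event files in that folder actually contain scalar tags before suspecting the TensorBoard UI:

```python
# Sketch: list the scalar tags stored in the log directory passed to --logdir.
# "PATH" is the same folder as above; the tag names printed are whatever the
# training script wrote (e.g. the g/total loss discussed in this thread).
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("PATH")  # e.g. .../MeloTTS/melo/logs/checkpoint_name
acc.Reload()                    # parse the event files on disk

print(acc.Tags()["scalars"])    # if this list is empty, no scalars were logged
```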
@jeremy110 Hello, I would like to inquire about the data preparation process when training on multiple speakers. Is it necessary for each speaker to have a comparable amount of data? For instance, if Speaker A has 10 hours of audio and Speaker B only has 1 hour, is it possible to create a good model, or does Speaker B also require approximately 10 hours of audio? Thank you
@manhcuong17072002 Hello~ In my training, some speakers had 1 or 2 hours of audio, while others had 30 minutes, and in the end, there were about 10 hours of total data. I was able to train a decent model, but for speakers with less data, their pronunciation wasn't as accurate.
@jeremy110 Oh, if that's the case, that's wonderful. Collecting data and training the model will become much easier with your idea. So, when training, you must have used many speaker IDs, right? And do you find their quality sufficient for deployment in a real-world environment? I'm really glad to hear your helpful feedback. Thank you very much!
@manhcuong17072002
Yes, there are about 15 speakers. Of course, if you have enough people, you can continue to increase the number. After 10 hours, the voice quality is quite close, but if you want better prosody, you might need more speakers and hours.
From the TTS systems I've heard, the voice quality is about above average, but when it comes to deployment, you need to consider inference time. For this, MeloTTS is quite fast.
@jeremy110 Thank you for the incredibly helpful information. Let me summarize a few points:
However, I've experimented with various TTS models and noticed that if the text isn't broken down into smaller chunks, the generated speech quality degrades towards the end of longer passages. Have you tested this with MeloTTS? If so, could you share your experimental process? Thank you so much.
@manhcuong17072002 You're welcome, your conclusion is correct.
Normally, during training, long audio files are avoided to prevent GPU OOM (Out of Memory) issues. Therefore, during inference, punctuation marks are typically used to segment the text, ensuring that each sentence is closer to the length used during training for better performance. MeloTTS performs this segmentation based on punctuation during inference, and then concatenates the individual audio files after synthesis.
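To illustrate the idea (this is not MeloTTS's exact internal splitter, just a sketch of punctuation-based segmentation as described above):

```python
# Sketch: split text on sentence-ending punctuation before synthesis.
# Each chunk would be synthesized separately and the audio concatenated afterwards.
import re

text = "First sentence. Second one, with a comma! And a question?"
chunks = [c.strip() for c in re.split(r"(?<=[.!?])\s+", text) if c.strip()]
print(chunks)
```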
@jeremy110 I'm so sorry, but I suddenly have a question about training on a multi-speaker dataset. Is it possible for Speaker A to pronounce words that appear in other speakers' data but not in Speaker A's? Because if not, dividing the dataset among multiple speakers would be pointless, and the model would not be able to cover the entire vocabulary of a language. Have you tried this before, and what are your thoughts on it? Thank you.
@manhcuong17072002 If we consider 30 minutes of audio, assuming each word takes about 0.3 seconds, there would be around 5000–6000 words. These words would then be converted into phoneme format, meaning they would be broken down into their phonetic components for training. With 6000 words, the model would learn most of the phonemes. However, when a new word is encountered, it will be broken down into the phonemes it has already learned. I haven't done rigorous testing, but in my case, the model is able to produce similar sounds.
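To make the phoneme point concrete, here is a small sketch using the g2p_en package (just one possible grapheme-to-phoneme front end, not necessarily the one MeloTTS uses internally). A word that never appeared in a speaker's training text is still decomposed into phonemes the model has already seen:

```python
# Sketch: decompose words into ARPAbet phonemes with g2p_en.
# An out-of-vocabulary word still maps onto familiar phoneme symbols.
from g2p_en import G2p

g2p = G2p()
print(g2p("hello"))     # e.g. ['HH', 'AH0', 'L', 'OW1']
print(g2p("frumious"))  # a made-up word is still split into known phonemes
```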
Will my training yield better results over time? Currently, the training took about 9 hours. I have 1500 wav samples, with a total audio length of approximately 2 hours.
What other metrics should I pay attention to in TensorBoard?