Closed — yiwei0730 closed this issue 4 months ago
A 3-second prompt is used for training, as in the P-Flow paper, so training audio samples must be longer than 3 seconds. I set min_duration to 3.5 for a margin, to ensure every sample produces a proper loss; the loss is calculated only on the non-prompt region. You can check the P-Flow paper for more details.
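As a rough illustration of why min_duration must exceed the 3-second prompt (the durations and filtering below are hypothetical, not the repo's actual code):

```python
# Sketch of the min_duration rationale (hypothetical values, not repo code):
# a 3 s prompt is cut from each sample and loss is computed only on the rest,
# so files must be strictly longer than 3 s to leave a non-prompt region.
durations = [2.8, 3.0, 3.4, 5.1, 7.2]  # hypothetical file durations (seconds)
PROMPT_SEC = 3.0
MIN_DURATION = 3.5  # small margin over the prompt length, as in the config

kept = [d for d in durations if d >= MIN_DURATION]
non_prompt = [round(d - PROMPT_SEC, 1) for d in kept]
print(kept)        # [5.1, 7.2]
print(non_prompt)  # [2.1, 4.2] -> seconds that actually contribute to the loss
```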
text2latent_rate adjusts for the inconsistent frame rates between the text duration and the EnCodec latent. The text duration is computed at a 50 Hz frame rate, since it is based on XLS-R, while the EnCodec latent has a 75 Hz frame rate. So it is necessary to upsample the text embedding to EnCodec's frame rate (75 / 50 = 1.5).
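A minimal sketch of that rate conversion, using nearest-frame upsampling (the repo's actual interpolation method and embedding layout are assumptions here):

```python
import numpy as np

# Upsample a 50 Hz text embedding to EnCodec's 75 Hz frame rate.
# Nearest-frame indexing is used for simplicity; this is an illustrative
# sketch, not the repository's actual implementation.
text2latent_rate = 1.5  # 75 Hz / 50 Hz

text_emb = np.random.randn(150, 256)                  # 3 s of text frames at 50 Hz
n_latent = int(text_emb.shape[0] * text2latent_rate)  # 225 frames at 75 Hz
src_idx = (np.arange(n_latent) / text2latent_rate).astype(int)
upsampled = text_emb[src_idx]

print(upsampled.shape)  # (225, 256)
```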
seed is just the random seed used during training, so it affects model initialization.
During the training loop, samples generated by the model are logged to TensorBoard every sample_freq steps.
This is a pytorch-lightning specific issue. AFAIK, if use_distributed_sampler is True, Lightning will inject its own distributed batch sampler, so this value must be False to use a custom DistributedSampler. Related issue: https://github.com/Lightning-AI/pytorch-lightning/issues/5145
I hit an error when testing the training code:

```
path = self.paths[idx]
IndexError: list index out of range
```

I added `if idx > self.max_length: print(idx, self.max_length)` and printed the idx and the max_length of the df: `1000 50`. I don't know why max_length is only 50 while idx jumps to 1000.
@yiwei0730 I need the full traceback. Also, what is max_length? There seems to be no max_length in this repo.
I added it in the TextLatentDataset: `self.max_length = len(df.index)`
Maybe I found the problem: it breaks if the dataset has fewer than 1000 samples (I use 64 samples for validation). If I just set the val data path to the same path as the train data (200K samples), then it works. But I still don't know why, hahaha.
Another question: can I increase batch_durations? I saw that a duration of 100 only fills my GPU to about 6000, but I have 48G. What would be a good setting?
Oh, I got it. sample_idx in the model config is used to log generated audio during the training loop, so you should set this value to an available index like [0, 1, 2, 3, 4]. idx 1000 appeared due to this setting. I'm going to add this info to the README too.
You can increase batch_durations to about 200~300; memory consumption increases roughly linearly as you increase batch_durations. I used 100 with 4 gradient accumulation steps, so the effective batch size I used was 400. I think a bigger batch would give better results.
By the way, I remember you are working on Mandarin and English. Is there any good (24K sample_rate, multi-speaker) public dataset for training Mandarin TTS? Which dataset are you using now?
OH, thank you. I will maybe use 4 GPUs to train with batch_duration 300; I hope it gives good results!! For Mandarin datasets, I think Aishell and aidatazang are good, and there may be more on OpenSLR. But I am training on the KingASR (100K) dataset (which I think is not public, by the way) and LibriTTS (100K, which is public). If the result is good enough, I will test with more datasets. I saw you use the lang id in the new commit; does it perform well? If you add a language setting, will it limit your generation to a certain language and prevent multi-language generation?
Thanks, I'm gonna try aishell or aidatazang. FYI, if you are using 4 GPUs with batch_durations=300, the effective batch_durations will be 1200; batch_durations is per device. Good luck!
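The effective batch arithmetic in this thread can be written out explicitly (this is my reading of the discussion; the exact accounting inside the repo may differ):

```python
def effective_batch_durations(batch_durations, accumulate_grad_batches=1, num_gpus=1):
    # batch_durations is per device, so the effective size scales with both
    # gradient accumulation steps and the number of GPUs.
    return batch_durations * accumulate_grad_batches * num_gpus

print(effective_batch_durations(100, accumulate_grad_batches=4))  # 400, as used above
print(effective_batch_durations(300, num_gpus=4))                 # 1200 on 4 GPUs
```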
For the language setting, I haven't tested much yet. I tried adding a language embedding and finetuning from the multilingual checkpoint, but the speaker embedding was still highly entangled, so the lang id setting is still experimental. Fortunately, the finetuned model could still generate code-switched speech.
Did you mean accumulate_grad_batches set to 4, so 4 * (duration = 300) = 1200? Hope 48G can eat it :>
"Fortunately, the finetuned model could still generate code-switched speech." -> Wow, that's surprising. Does that mean that even if you set EN, you can still synthesize Japanese and Korean speech?
I used language id drop while training, so it can run inference without a lang id. When I compared samples with and without a lang id, there was no significant difference. I think the model is not trained well; it needs more experiments.
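A minimal sketch of what language-id drop during training could look like (the token index, drop probability, and function name are assumptions, not the repo's actual code):

```python
import random

NO_LANG = 0  # assumed index reserved for the "no language id" condition

def maybe_drop_lang_id(lang_id: int, p_drop: float = 0.1) -> int:
    # With probability p_drop, replace the true language id with NO_LANG,
    # so the model also learns to run inference without a lang id.
    return NO_LANG if random.random() < p_drop else lang_id

random.seed(0)
n_dropped = sum(maybe_drop_lang_id(3, p_drop=0.5) == NO_LANG for _ in range(10_000))
print(n_dropped)  # roughly half of the 10,000 draws are dropped
```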
Hello @yiwei0730, nice to meet you here in this repository again. I am also training with datasets in Korean, Chinese, Japanese, and English. I will share the results once they are available.
I am using the same datasets for Korean, Chinese, and Japanese as @seastar105; for Chinese, I am using aishell and MAGICDATA. I hope we get good results.
AH! I forgot to tell you about some install errors.
@yiwei0730
It seems the original question was answered. Feel free to open a new issue, or reopen here, if you have any related questions.
Yes, thank you. If I have other findings or questions, I will immediately ask and discuss them with you! By the way, I found that different codecs have different effects. I tried the latest FACodec, and its reconstruction seems problematic: there is small noise in multiple raw P-Flow outputs.
I would like to ask about the use of these parameters in the data config and experiment config, and what their effects are:

```yaml
min_duration: 3.5      # minimum duration of files, this value MUST be bigger than 3.0
text2latent_rate: 1.5  # 50Hz:75Hz
seed: 998244353
sample_freq: 5000
```

and trainer.use_distributed_sampler set to False.