Closed: meriamOu closed this issue 10 months ago
We have updated the preprocessing code for TTV.
Through prepare_filelist.py, you can limit the length of input sequences and create the corresponding filelist for use.
Please refer to the TTV README.md and give it a try.
Hi. Just stumbled upon this repo, looks promising. I am looking at the TTV training code and couldn't understand what this condition in prepare_filelists.py is for:
if len_txt * 2 + 1 > data_len or
Hi,
Since we utilize a blank token, the text sequence has length len_txt * 2 + 1.
However, if this text token length is longer than the number of speech frames, it causes issues in MAS (monotonic alignment search).
Thanks
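As a concrete illustration of that arithmetic, here is a minimal sketch. The intersperse helper below follows the common VITS-style utility and is not necessarily this repo's exact code; the phoneme ids and frame count are made up.

def intersperse(seq, blank_id=0):
    # Insert blank_id between every token and at both ends: n tokens -> 2 * n + 1.
    result = [blank_id] * (len(seq) * 2 + 1)
    result[1::2] = seq
    return result

phonemes = [12, 7, 33, 5]        # 4 phoneme ids (made-up values)
padded = intersperse(phonemes)   # length 2 * 4 + 1 = 9
data_len = 8                     # e.g. number of speech frames for this utterance

# The same kind of check as in prepare_filelist.py: skip utterances whose
# blank-padded text is longer than the speech frame sequence, since MAS cannot
# align more text tokens than there are frames.
if len(phonemes) * 2 + 1 > data_len:
    print("skip this utterance")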
Hi @sh-lee-prml. I am trying to train a Ukrainian (ukr) TTV model with my own dataset. However, the CTC loss turns negative after a few epochs.
I've updated the '178' in both TextEncoder(178, out_channels=inter_channels, ...)
and self.phoneme_classifier = Conv1d(inter_channels, 178, 1, ..)
to the length of my symbol set (which is 38).
Any thoughts about this?
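As a side note for anyone hitting the same symptom: independent of the data itself, torch.nn.functional.ctc_loss expects log-probabilities (i.e. log_softmax output), and feeding it unnormalized logits is one generic way the reported loss can go below zero. A minimal, repo-independent sketch with made-up shapes:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, N, C, S = 50, 2, 38, 10                    # frames, batch, classes (blank = 0), target length

logits = torch.randn(T, N, C)
targets = torch.randint(1, C, (N, S))         # class 0 is reserved for the CTC blank
input_lengths = torch.full((N,), T)
target_lengths = torch.full((N,), S)

# With properly normalized log-probabilities, CTC loss is a negative log-likelihood
# and therefore cannot be negative.
loss_ok = F.ctc_loss(logits.log_softmax(-1), targets, input_lengths, target_lengths)

# Passing raw logits (or probabilities instead of log-probabilities) breaks that
# guarantee and typically yields negative values.
loss_bad = F.ctc_loss(logits, targets, input_lengths, target_lengths)

print(loss_ok.item(), loss_bad.item())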
Hi, @MikeMill789
If the phoneme symbols for the language you wish to train total 38, then you have made the correct adjustment. You need to verify that tokens have been properly extracted from the text transcript, and ensure that the data is neither too short nor contains any empty entries. We have experience in training models for other languages, including 7 Indian languages, Korean, and Russian, but we have not encountered a situation where the CTC loss becomes negative. However, we have found that the presence of empty data can lead to NaN values. Could you please check your data? Thanks
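For the "empty entries" part of that check, something along these lines may help. This is only a sketch: it assumes a pipe-separated filelist of the form path|transcript, and the filename is hypothetical; adjust the separator and field index to your actual filelist layout.

def find_empty_transcripts(filelist_path, text_field=1, sep="|"):
    # Return (line_number, line) pairs whose transcript field is missing or blank.
    bad = []
    with open(filelist_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            fields = line.rstrip("\n").split(sep)
            if len(fields) <= text_field or not fields[text_field].strip():
                bad.append((line_no, line.strip()))
    return bad

for line_no, entry in find_empty_transcripts("train_filelist.txt"):
    print(f"empty transcript at line {line_no}: {entry}")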
Thanks for the reply. I'll recheck my data.
@MikeMill789 Hi, regarding "CTC loss is returning negative after a few epochs": was the problem caused by empty data?
Hey, thank you so much for your great work. While trying to train the model on the LibriTTS dataset following your tips in https://github.com/sh-lee-prml/HierSpeechpp/issues/20#issuecomment-1870806287, I encountered the following issue:
w2v_slice, ids_slice = commons.rand_slice_segments(w2v, length, 60)
  File "/home/meri/HierSpeechpp-train_Libritts_460/commons.py", line 70, in rand_slice_segments
    ret = slice_segments(x, ids_str, segment_size)
  File "/home/meri/HierSpeechpp-train_Libritts_460/commons.py", line 53, in slice_segments
    ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (60) must match the existing size (30) at non-singleton dimension 1. Target sizes: [1024, 60]. Tensor sizes: [1024, 30]

and

RuntimeError: Expected input_lengths to have value at most 512, but got value 520 (while checking arguments for ctc_loss_gpu)
It seems the reason for this is that the length tensor contains values ([408, 420, 376, 328, 332, 400, 340, 288, 520, 124, 180, 260, 256, 100, 216, 124, 600, 192, 384, 296, 544, 480, 436, 384, 440, 252, 324, 152, 372, 336, 128, 288], device='cuda:0') that are larger than the size of the w2v tensor, [512, 32, 178].
Any tips on how to solve this issue?
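Not the intended fix (the earlier advice in this thread, i.e. rebuilding the filelist with prepare_filelist.py so sequence lengths are limited, is), but to illustrate the two constraints the traceback shows being violated: every value in the length tensor must fit within the padded w2v time axis for the CTC loss, and every utterance must have at least segment_size (here 60) frames for rand_slice_segments to cut from. A rough, hypothetical guard with made-up shapes:

import torch

def guard_lengths(w2v, lengths, segment_size=60):
    # w2v assumed to be [batch, channels, T_max] after padding.
    t_max = w2v.size(-1)
    # Clamp lengths that overshoot the padded time axis (e.g. 520 or 600 vs. 512);
    # an overshoot usually means the length computation and the feature hop disagree.
    lengths = torch.clamp(lengths, max=t_max)
    # Drop utterances too short to provide a segment_size-frame slice.
    keep = lengths >= segment_size
    return w2v[keep], lengths[keep]

w2v = torch.randn(4, 1024, 512)                 # toy batch: 4 items, 1024 channels, 512 frames
lengths = torch.tensor([520, 124, 600, 30])     # two values overshoot 512, one is shorter than 60
w2v, lengths = guard_lengths(w2v, lengths)
print(lengths)                                  # tensor([512, 124, 512]); the 30-frame item is gone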