sh-lee-prml / HierSpeechpp

The official implementation of HierSpeech++
MIT License

tensor mismatch size in commons.rand_slice_segments(w2v, length, 60) #29

Closed meriamOu closed 4 months ago

meriamOu commented 5 months ago

Hey, thank you so much for your great work. While trying to train the model on the LibriTTS dataset following your tips in https://github.com/sh-lee-prml/HierSpeechpp/issues/20#issuecomment-1870806287, I encountered the following issue:

w2v_slice, ids_slice = commons.rand_slice_segments(w2v, length, 60)
  File "/home/meri/HierSpeechpp-train_Libritts_460/commons.py", line 70, in rand_slice_segments
    ret = slice_segments(x, ids_str, segment_size)
  File "/home/meri/HierSpeechpp-train_Libritts_460/commons.py", line 53, in slice_segments
    ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (60) must match the existing size (30) at non-singleton dimension 1. Target sizes: [1024, 60]. Tensor sizes: [1024, 30]

and

RuntimeError: Expected input_lengths to have value at most 512, but got value 520 (while checking arguments for ctc_loss_gpu)

It seems the reason for this is that the length tensor contains values ([408, 420, 376, 328, 332, 400, 340, 288, 520, 124, 180, 260, 256, 100, 216, 124, 600, 192, 384, 296, 544, 480, 436, 384, 440, 252, 324, 152, 372, 336, 128, 288], device='cuda:0') that are larger than the size of w2v, which is [512, 32, 178].
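For context, a minimal sanity check matching that diagnosis (the helper name check_slice_lengths is hypothetical, and the [batch, channels, frames] layout of w2v is assumed from the traceback): if a length exceeds the actual frame count, the random start index presumably chosen by rand_slice_segments can run past the end of the tensor, so the slice comes back shorter than the requested segment_size of 60.

```python
import torch

def check_slice_lengths(w2v: torch.Tensor, lengths: torch.Tensor, segment_size: int = 60):
    # w2v assumed to be [batch, channels, frames]; lengths holds per-item frame counts
    max_frames = w2v.size(2)
    too_long = lengths > max_frames
    too_short = lengths < segment_size
    if too_long.any():
        print(f"{int(too_long.sum())} items report more frames than w2v holds ({max_frames}):",
              lengths[too_long].tolist())
    if too_short.any():
        print(f"{int(too_short.sum())} items are shorter than segment_size={segment_size}")
```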

Any tips on how to solve this issue?

hayeong0 commented 5 months ago

We have updated the preprocessing code for TTV.

Through prepare_filelist.py, you can limit the length of input sequences and create the corresponding filelist for use.

Please refer to the TTV README.md and give it a try.
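For anyone hitting the same error, a minimal sketch of that filtering idea (the path|text|len_txt|data_len filelist layout and the file names here are assumptions for illustration, not the repo's actual format):

```python
def keep_item(len_txt: int, data_len: int, segment_size: int = 60) -> bool:
    # blank-interleaved text must fit into the available speech frames
    if len_txt * 2 + 1 > data_len:
        return False
    # items shorter than one training segment cannot be sliced
    if data_len < segment_size:
        return False
    return True

with open("filelist_raw.txt", encoding="utf-8") as fin, \
     open("filelist_filtered.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        path, text, len_txt, data_len = line.rstrip("\n").split("|")
        if keep_item(int(len_txt), int(data_len)):
            fout.write(line)
```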

MikeMill789 commented 4 months ago

Hi. Just stumbled upon this repo, it looks promising. I am looking at the TTV training code and couldn't understand what this condition in prepare_filelists.py is for:

if len_txt * 2 + 1 > data_len or

sh-lee-prml commented 4 months ago

Hi

Since we utilize the blank token, the text sequence has a length of len_txt * 2 + 1.

However, if this text token length is longer than the number of speech frames, it causes issues in MAS (monotonic alignment search).
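For illustration, a quick sketch of that arithmetic (the intersperse helper follows the VITS convention; the exact name in this repo is an assumption):

```python
def intersperse(seq, blank_id=0):
    # [t1, t2, t3] -> [blank, t1, blank, t2, blank, t3, blank]
    result = [blank_id] * (len(seq) * 2 + 1)
    result[1::2] = seq
    return result

tokens = [17, 4, 29]            # 3 phoneme ids
padded = intersperse(tokens)    # length = 3 * 2 + 1 = 7
# MAS needs at least as many speech frames as (blank-interleaved) text tokens,
# so items with len(padded) > data_len are filtered out in prepare_filelists.py.
```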

Thanks

MikeMill789 commented 4 months ago

Hi @sh-lee-prml. I am trying to train a Ukrainian (ukr) TTV model with my own dataset. However, the CTC loss becomes negative after a few epochs.

I've updated the '178' value in both TextEncoder(178, out_channels=inter_channels, ...) and self.phoneme_classifier = Conv1d(inter_channels, 178, 1, ..) to my number of symbols (which is 38).

Any thoughts about this?
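As a side note, a minimal CTC sanity sketch (none of this is the repo's code; the sizes and blank index are assumptions): the 38 classes must include the blank index used by CTC, and torch.nn.functional.ctc_loss expects log-probabilities, with which the loss should not go negative.

```python
import torch
import torch.nn.functional as F

num_symbols = 38           # assumed to already include the blank at index 0
T, N, S = 200, 4, 50       # frames, batch size, max target length

logits = torch.randn(T, N, num_symbols)
log_probs = F.log_softmax(logits, dim=-1)     # ctc_loss expects log-probabilities

targets = torch.randint(1, num_symbols, (N, S))               # 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, S, (N,), dtype=torch.long)

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
print(loss.item())         # stays non-negative when the inputs are normalized log-probs
```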

hayeong0 commented 3 months ago

Hi, @MikeMill789

If the phoneme symbols for the language you wish to train total 38, then you have made the correct adjustment.

You also need to verify that tokens have been properly extracted from the text transcripts, and ensure that the data is neither too short nor contains any empty entries. We have experience training models for other languages, including 7 Indian languages, Korean, and Russian, but we have not encountered a situation where the CTC loss becomes negative. However, we have found that the presence of empty data can lead to NaN values. Could you please check your data? Thanks
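A quick data-sanity pass along those lines (the path|text filelist layout and the file name are assumptions for illustration):

```python
def audit_filelist(path, min_chars=3):
    empty, very_short = [], []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            fields = line.rstrip("\n").split("|")
            text = fields[1].strip() if len(fields) > 1 else ""
            if not text:
                empty.append(i)          # missing or empty transcript
            elif len(text) < min_chars:
                very_short.append(i)     # suspiciously short transcript
    print("empty entries at lines:", empty)
    print("very short entries at lines:", very_short)

audit_filelist("filelists/train_ukr.txt")
```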

MikeMill789 commented 3 months ago

Thanks for the reply. I'll recheck my data.

kunyao2015 commented 1 month ago

@MikeMill789 Hi, regarding "CTC loss is returning negative after a few epochs": was the problem caused by empty data?