Closed · segmentationFaults closed this 6 months ago
What if I set `add_blank = false`?
I have not experienced the negative duration loss yet...!
As you said, before training we filtered out data where dur/(num_token*2+1) < 20ms, as shown below. Additionally, removing very short audio makes training more robust:
```python
import glob

import torch
import torchaudio
import tqdm

# Frame-count bounds (1 frame = 320 samples = 20 ms at 16 kHz)
wav_min = 32
wav_max = 600
text_min = 1
text_max = 200

path = args.input_dir
wavs_train = sorted(glob.glob(path + '/**/*.wav', recursive=True))
print("wav num", len(wavs_train))

text_wav_pair_train = []
short_audio = 0
long_audio = 0
for wav in tqdm.tqdm(wavs_train):
    data, _ = torchaudio.load(wav)
    len_data = data.size(-1) // 320  # duration in 20 ms frames
    if len_data <= wav_min:
        short_audio += 1
        continue
    if len_data >= wav_max:
        long_audio += 1
        continue
    try:
        txt = torch.load(wav.replace('wave_folder_name', 'text_token_folder_name').replace('.wav', '.pt'))
    except FileNotFoundError:
        continue
    len_txt = txt.size(-1)
    # With blank tokens interleaved, the target length is 2*len_txt + 1;
    # drop samples whose audio is too short to cover it
    # (i.e. dur/(num_token*2+1) < 20 ms).
    if len_txt * 2 + 1 > len_data:
        continue
    if len_txt <= text_min or len_txt >= text_max:
        continue
    text_wav_pair_train.append(wav)
```
In my experience, a model trained with blank tokens performs better, so I recommend using blank tokens in the phoneme sequence.
Thanks!
Thank you very much for your answer~
@segmentationFaults would you mind sharing your training experience and exchanging usage tips with me? I recently started working with this project. Thanks a lot in advance.
I found that the duration loss becomes negative when dur/num_token < 20ms. I think this is caused by mms (20 ms per frame) combined with add_blank.
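To illustrate the point (my own sketch, not code from this repo; the function name and numbers are made up): with `add_blank` the target sequence length becomes `num_token*2 + 1`, so the average duration available per token can drop below one 20 ms frame even when dur/num_token alone looks fine, which is what drives the duration target negative in log space.

```python
FRAME_MS = 20  # mms uses one frame per 20 ms

def avg_frames_per_token(dur_ms: float, num_token: int, add_blank: bool) -> float:
    """Average number of acoustic frames available per target token."""
    # Interleaving blanks turns N tokens into 2*N + 1 targets.
    seq_len = num_token * 2 + 1 if add_blank else num_token
    return (dur_ms / FRAME_MS) / seq_len

# Hypothetical example: 1.0 s of audio, 40 phoneme tokens.
print(avg_frames_per_token(1000, 40, add_blank=False))  # 1.25 frames per token
print(avg_frames_per_token(1000, 40, add_blank=True))   # ~0.62: below one frame
```

Once the average falls below 1 frame per token, the per-token duration targets are fractional (< 1 frame), matching the filtering condition dur/(num_token*2+1) < 20ms used above.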