Closed · segmentationFaults closed this 6 months ago
What if I set `add_blank = false`?
I have not experienced the negative duration loss yet...!
As you said, before training we filtered out data where dur/(num_token*2+1) < 20ms, as shown below. Additionally, removing very short audio makes training more robust:
```python
import glob

import torch
import torchaudio
import tqdm

# Frame-count bounds (1 frame = 320 samples = 20 ms at 16 kHz)
wav_min = 32
wav_max = 600
text_min = 1
text_max = 200

path = args.input_dir
wavs_train = sorted(glob.glob(path + '/**/*.wav', recursive=True))
print("wav num", len(wavs_train))

text_wav_pair_train = []
short_audio = 0
long_audio = 0
for wav in tqdm.tqdm(wavs_train):
    data, _ = torchaudio.load(wav)
    len_data = data.size(-1) // 320  # duration in 20 ms frames
    if len_data <= wav_min:
        short_audio += 1
        continue
    if len_data >= wav_max:
        long_audio += 1
        continue
    try:
        txt = torch.load(wav.replace('wave_folder_name', 'text_token_folder_name').replace('.wav', '.pt'))
    except FileNotFoundError:
        continue
    len_txt = txt.size(-1)
    # With blank tokens interleaved, the target length is 2*len_txt + 1;
    # drop samples whose audio is too short to cover it
    # (i.e. dur/(num_token*2+1) < 20 ms).
    if len_txt * 2 + 1 > len_data:
        continue
    if len_txt <= text_min or len_txt >= text_max:
        continue
    text_wav_pair_train.append(wav)
```
In my experience, a model trained with blank tokens performs better, so I recommend using blank tokens in the phoneme sequence.
Thanks!
Thank you very much for your answer~
@segmentationFaults would you mind sharing your training experience and exchanging usage tips with me? I recently started working with this project. Thanks a lot in advance.
I found that the duration loss becomes negative when dur/num_token < 20ms. I think this is caused by mms (20 ms per frame) combined with add_blank.
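To illustrate the point (my own sketch, not code from this repo; the function name and numbers are made up): with `add_blank` the target sequence length becomes `num_token*2 + 1`, so the average duration available per token can drop below one 20 ms frame even when dur/num_token alone looks fine, which is what drives the duration target negative in log space.

```python
FRAME_MS = 20  # mms uses one frame per 20 ms

def avg_frames_per_token(dur_ms: float, num_token: int, add_blank: bool) -> float:
    """Average number of acoustic frames available per target token."""
    # Interleaving blanks turns N tokens into 2*N + 1 targets.
    seq_len = num_token * 2 + 1 if add_blank else num_token
    return (dur_ms / FRAME_MS) / seq_len

# Hypothetical example: 1.0 s of audio, 40 phoneme tokens.
print(avg_frames_per_token(1000, 40, add_blank=False))  # 1.25 frames per token
print(avg_frames_per_token(1000, 40, add_blank=True))   # ~0.62: below one frame
```

Once the average falls below 1 frame per token, the per-token duration targets are fractional (< 1 frame), matching the filtering condition dur/(num_token*2+1) < 20ms used above.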