jzi040941 / PercepNet

Unofficial implementation of PercepNet: A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech
BSD 3-Clause "New" or "Revised" License

Loss increases and becomes NaN #12

Open · YangangCao opened this issue 3 years ago

YangangCao commented 3 years ago

Hi, thanks for your excellent work. I extracted features from speech (PCM, 12 GB) and noise (PCM, 9 GB) and set count to 10000000. Then I ran run_train.py and got the following output:

(screenshot of the training output)

Can you help me? Thanks again!

jzi040941 commented 3 years ago

Can you tell me which dataset you use for training?

YangangCao commented 3 years ago

> Can you tell me which dataset you use for training?

Hi, I made some changes as follows: Firstly, I added some clean music data to the speech, since I want to keep the music when denoising. Secondly, the speech and noise were resampled and re-encoded from non-original 48k sources (such as 8k or 16k MP3). Maybe that impacts the training result?

YangangCao commented 3 years ago

I used the original 48k speech (concatenated into one PCM file, 15 GB) and noise (concatenated into one PCM file, 7.8 GB), set count to 10000000, and got an increasing loss and NaN again. When I set count to 100000, I got the following output: (screenshots of the training output)

It seems the loss also increases per iteration but decreases per epoch; is that normal? When count is large, the NaN seems inevitable.

YangangCao commented 3 years ago

Hi, I found the problem. The reason for the increasing loss is the following:

            # print statistics
            running_loss += loss.item()

            # for testing
            print('[%d, %5d] loss: %.3f' %
                    (epoch + 1, i + 1, running_loss))

Actually, I don't quite understand why you wrote it like this...

The reason for the NaN is CustomLoss.
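
For reference, a sketch of the usual fix for this kind of logging, reusing the variable names from the snippet above (log_interval is a hypothetical addition for illustration, not the repository's actual change):

    # print statistics: report the average loss over the last log_interval
    # batches instead of the ever-growing running sum, then reset it
    log_interval = 100  # hypothetical logging interval
    running_loss += loss.item()
    if (i + 1) % log_interval == 0:
        print('[%d, %5d] loss: %.3f' %
              (epoch + 1, i + 1, running_loss / log_interval))
        running_loss = 0.0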

jzi040941 commented 3 years ago

Hi @YangangCao, yes, I was dumb; I only checked iter=1, epoch=1, which is why I didn't notice that the printed loss increases across iterations. I fixed it in commit 9de28e0.

As for the NaN error: did you check that the extracted features (r, g) are in the range 0~1? If not, they will produce a NaN loss unless you clip them to 0~1.

Thanks
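
A quick sanity check along these lines might look like the sketch below ('targets' and the function name are illustrative assumptions for the loaded per-band g and r features, not the repository's code):

    import torch

    def check_band_features(targets: torch.Tensor) -> torch.Tensor:
        # The per-band gains g and filtering strengths r are expected to lie
        # in [0, 1]; NaNs or out-of-range values will poison the loss, so
        # flag them and clip to 0~1 as suggested above.
        if torch.isnan(targets).any():
            raise ValueError("NaN in extracted features")
        if targets.min() < 0 or targets.max() > 1:
            print("features out of range:",
                  targets.min().item(), targets.max().item())
            targets = torch.clamp(targets, 0.0, 1.0)
        return targets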

YangangCao commented 3 years ago

> Hi @YangangCao, yes, I was dumb; I only checked iter=1, epoch=1, which is why I didn't notice that the printed loss increases across iterations. I fixed it in commit 9de28e0.
>
> As for the NaN error: did you check that the extracted features (r, g) are in the range 0~1? If not, they will produce a NaN loss unless you clip them to 0~1.
>
> Thanks

I have checked the features extracted from the original 48k wav; they all range from 0 to 1, as floating-point values with lots of 0s and sparse 1s. When I set the count of extracted features to 1e5, no NaN appears (I tried more than once). However, when I set it to 1e6 or 1e7, NaN appears again. I am not sure about the relationship between count and NaN.

Chen1399 commented 3 years ago

The code has an error in 'rnn_train.py' that makes the loss NaN.

rb = targets[:,:,:34]
gb = targets[:,:,34:68]

but in 'denoise.cpp' the features are written in the order g then r:

fwrite(g, sizeof(float), NB_BANDS, f3);//gain    
fwrite(r, sizeof(float), NB_BANDS, f3);//filtering strength

Because the order is swapped, gb actually holds the r values, which can be negative, so torch.pow(gb, 0.5) is NaN.

You should change the code in 'rnn_train.py' to:

gb = targets[:,:,:34]
rb = targets[:,:,34:68]
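
A tiny illustration of why the wrong slicing blows up (a standalone sketch, not code from the repository): in PyTorch, a fractional power of a negative number is NaN, so feeding the possibly-negative r values through the gain path produces a NaN loss.

    import torch

    # A negative base with a fractional exponent yields nan:
    x = torch.tensor([-0.25, 0.5])
    print(torch.pow(x, 0.5))  # first entry is nan, second is ~0.7071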

jzi040941 commented 3 years ago

> The code has an error in 'rnn_train.py' that makes the loss NaN.

Thanks, I've fixed it in #24.

Chen1399 commented 2 years ago

There is a new cause of 'loss nan'. The pitch correlation feature can be NaN: the variable 'error' can be zero in the file 'celt_lpc.cpp', which makes the pitch correlation NaN:

r = -SHL32(rr,3)/error;

You can add a small bias to 'error' so the denominator can never be zero:

r = -SHL32(rr,3)/(error + 0.00001);