mimbres / neural-audio-fp

https://mimbres.github.io/neural-audio-fp
MIT License

Why is the loss computed during training NaN? #26

Closed · JCU777 closed this issue 2 years ago

JCU777 commented 2 years ago

I followed the steps in the README to set up the environment (creating a virtual environment via the .yml file) and downloaded Dataset-mini v1.1 to ../ . But the loss computed when running run.py for training is NaN. While debugging, I found that after the data passes through the front_conv layer of the FingerPrinter model, the values of the resulting tensor are all 0 or NaN. What's wrong, and why is this happening?
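For debugging, here is a minimal sketch of how the first NaN/Inf-producing op can be located, assuming TF 2.x (this is standard TensorFlow numeric checking, not part of this repo):

```python
import tensorflow as tf

# Raise an error at the first op that produces NaN or Inf,
# instead of silently propagating it through the network.
# Call this before the model is built / before training starts.
tf.debugging.enable_check_numerics()

# Alternatively, check a single tensor explicitly, e.g. the output
# of the front_conv block (the message string is just a label):
# x = tf.debugging.check_numerics(x, message="front_conv output")
```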

mimbres commented 2 years ago

Sorry I can't give you a good answer yet. Can you provide your environment information?

conda env export > my_env.yml

Alternatively, you may try creating a virtual environment without the .yml option. I recommend first installing TensorFlow and the other dependencies without faiss-gpu and then trying training. It's the safer option.
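As a quick sanity check after a plain install (the snippet below is generic TensorFlow, not part of this repo), you can confirm that the GPU is visible and that a small float32 forward pass stays finite:

```python
import numpy as np
import tensorflow as tf

print("TF version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# Tiny conv forward pass on random data; the output should be finite.
x = tf.random.normal([4, 256, 32, 1])                 # dummy spectrogram-like batch
y = tf.keras.layers.Conv2D(8, 3, padding="same")(x)   # illustrative layer only
print("Any NaN in output:", bool(np.isnan(y.numpy()).any()))
```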

Novicei commented 2 years ago

I had the same problem when I ran it on an RTX 3090 with batchsize=640.

Novicei commented 2 years ago

I don't know why, but it suddenly seems to work now. My configuration is CUDA 11.0, cuDNN 8.0, TensorFlow 2.4. I hope this helps you.
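In case it helps, the CUDA/cuDNN versions the installed TensorFlow wheel was built against can be printed like this (available in recent TF 2.x releases) and compared with the system's CUDA 11.0 / cuDNN 8.0:

```python
import tensorflow as tf

# Versions the installed TF wheel was compiled against; mismatches with
# the system CUDA/cuDNN are a common source of numerical problems on GPU.
info = tf.sysconfig.get_build_info()
print("built with CUDA:", info.get("cuda_version"))
print("built with cuDNN:", info.get("cudnn_version"))
```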

Novicei commented 2 years ago


However, the current results look abnormally high: the first epoch already reaches 83%, and I don't know why. Because I've been busy lately, I can't look into the reason for the time being.

mimbres commented 2 years ago

@Novicei

> However, the current results look abnormally high: the first epoch already reaches 83%, and I don't know why. Because I've been busy lately, I can't look into the reason for the time being.

Is the 83% for validation or an actual test? That is entirely normal for the validation accuracy with 1 s input, because the validation set consists of a database of only a few hundred 30 s songs. Also, this is not related to issue #18.
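For intuition about why the accuracy scale depends on the database size, here is a purely illustrative sketch of a top-1 evaluation over L2-normalized fingerprints using a brute-force inner-product search (the repo itself uses a FAISS-based search, so this is not its actual code):

```python
import numpy as np

def top1_hit_rate(queries, db, true_idx):
    """Fraction of queries whose nearest database fingerprint
    (by inner product on L2-normalized vectors) is the correct one."""
    sims = queries @ db.T                                  # (n_queries, n_db) similarities
    return float(np.mean(np.argmax(sims, axis=1) == true_idx))

# With only a few hundred 30 s songs in the validation database,
# even a 1 s query has relatively few distractors, so accuracy is high.
```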

Novicei commented 2 years ago

It is for validation. I followed the 640_lamb config file you posted for training. But looking at the accuracy you posted in #15, the first epoch starts at 65%.

mimbres commented 2 years ago

@Novicei It is possible that the validation accuracy is low in the first epoch with a larger batch size. However, by the 100th (or later) epoch, bsz=640 will reach better validation accuracy than bsz=120. The slow early training with bsz=640 suggests that a new scheduler for the learning rate and temperature would be useful. This topic was not discussed further in the paper.
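As a rough illustration only (not the scheduler actually used in this repo), large-batch training is often paired with a linear learning-rate warmup followed by cosine decay, e.g.:

```python
import math
import tensorflow as tf

class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup followed by cosine decay; illustrative sketch only."""
    def __init__(self, peak_lr, warmup_steps, total_steps):
        self.peak_lr = peak_lr
        self.warmup_steps = float(warmup_steps)
        self.total_steps = float(total_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * step / self.warmup_steps
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        cosine = 0.5 * self.peak_lr * (1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup, cosine)
```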

Novicei commented 2 years ago

@mimbres I don't understand what you mean by that. I mean, I used your 640 configuration file for the experiment, but the validation accuracy I get in the first epoch is 83% for 1 s and 100% for 5 s and above, and by the seventh epoch the 1 s accuracy is about 93%. This doesn't seem to match your results in #15; I don't know what's wrong, maybe it's my environment.

mimbres commented 2 years ago

@Novicei Sorry, I misunderstood your question last time. As you mentioned, the mini-test validation accuracy (~94%) of the current repo with the 640 configuration is higher than what I reported in #15 (~83%). Let me share a new 640_lamb_ep400 result.

[image: 640_lamb_400_current_result]

It is also noticeable that the val_loss scale is different from #15.

To explain: the scale of val_loss and val_acc depends on which validation set is used in mini-search-validation(). The value of max_n_samples and the size of the validation set have changed since the pre-release experiment mentioned in #15. max_n_samples is currently set to 3,000 for quick validation; you can set it up to 25,000 as needed.
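As a rough illustration of what that cap means (a hypothetical sketch, not the repo's actual selection code), it amounts to randomly subsampling the validation queries:

```python
import numpy as np

def subsample_validation(query_ids, max_n_samples=3000, seed=42):
    """Randomly keep at most max_n_samples query segments for quick validation."""
    rng = np.random.default_rng(seed)
    n = min(max_n_samples, len(query_ids))
    return rng.choice(query_ids, size=n, replace=False)
```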

Thanks for reporting this!