mimbres / neural-audio-fp

https://mimbres.github.io/neural-audio-fp
MIT License

Losses come out as NaN when training with default settings #20

Closed · ac-alpha closed this issue 2 years ago

ac-alpha commented 2 years ago
  1. I cloned this repo.
  2. Downloaded the mini dataset from the Google Drive link.
  3. Created the Docker container by pulling the provided image.
  4. Ran python run.py train test123 --max_epoch=100 -c default inside the Docker container.

The training loss shows up as all NaNs in the TensorBoard logs. I am running this on a machine with three NVIDIA GeForce RTX 3090 GPUs. What might be going wrong?
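
One way to narrow down this kind of failure is to check what TensorFlow actually sees inside the container and to enable numerics checking so training stops at the first op that produces a NaN. A minimal diagnostic sketch (not part of the repo, assuming a TF 2.x environment):

```python
import tensorflow as tf

# Confirm which GPUs TensorFlow actually sees inside the container.
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Raise an error at the first op that produces an Inf/NaN instead of
# letting NaNs silently propagate into the training loss.
tf.debugging.enable_check_numerics()

# ...then build and train as usual (e.g. via run.py).
```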

mimbres commented 2 years ago

@ac-alpha The current code doesn't support multi-GPU training. Please try: CUDA_VISIBLE_DEVICES=0 python run.py train test123
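
For completeness, the same restriction can be applied from inside Python before any GPU is initialized; this is just the in-process equivalent of the environment variable, not something run.py itself does (a sketch, assuming TF 2.x):

```python
import tensorflow as tf

# In-process equivalent of CUDA_VISIBLE_DEVICES=0: expose only the first
# physical GPU to TensorFlow. Must run before any GPU has been initialized.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[0], "GPU")
    print("Using:", tf.config.get_visible_devices("GPU"))
```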

mimbres commented 2 years ago

FYI, https://github.com/mimbres/neural-audio-fp/blob/main/model/fp/NTxent_loss_tpu.py is prepared for multi-GPU and TPU setups. However, the current version does not use it.
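
For readers unfamiliar with the loss being discussed: NT-Xent is the normalized temperature-scaled cross-entropy used for contrastive training, where each anchor's augmented replica is its positive and every other item in the batch is a negative. The sketch below is only a generic single-device illustration of that idea, with made-up names and a placeholder temperature, not the code in NTxent_loss_tpu.py:

```python
import tensorflow as tf

def ntxent_loss(emb_org, emb_rep, tau=0.05):
    """Generic NT-Xent sketch.

    emb_org, emb_rep: (N, d) L2-normalized embeddings of original clips and
    their augmented replicas; row i of each tensor forms a positive pair.
    tau is a placeholder temperature, not the repo's default.
    """
    n = tf.shape(emb_org)[0]
    emb = tf.concat([emb_org, emb_rep], axis=0)          # (2N, d)
    sim = tf.matmul(emb, emb, transpose_b=True) / tau    # (2N, 2N) scaled cosine sims
    sim -= 1e9 * tf.eye(2 * n)                           # mask out self-similarity
    # The positive of row i is row i + N, and vice versa.
    labels = tf.concat([tf.range(n) + n, tf.range(n)], axis=0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, sim, from_logits=True)
    return tf.reduce_mean(loss)
```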

ac-alpha commented 2 years ago

@mimbres I tried with the CUDA_VISIBLE_DEVICES=0 prefix, and I am still getting NaNs in the training loss.

mimbres commented 2 years ago

@ac-alpha The Docker image is built with CUDA 10.1, but an RTX 3090 requires CUDA >= 11.1, so I reckon that is causing the problem. You can build your own image based on CUDA 11.x and TF (>= 2.4). I can't promise when that will be, but I'll try to provide a CUDA 11.x-based image later.
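
A quick way to confirm a mismatch like this from inside the container is to print the CUDA/cuDNN versions the installed TensorFlow wheel was built against (a sketch, assuming TF >= 2.3 where tf.sysconfig.get_build_info() is available):

```python
import tensorflow as tf

# CUDA/cuDNN versions this TensorFlow build was compiled against. A CUDA 10.1
# build ships no kernels for the RTX 3090 (compute capability 8.6 needs
# CUDA >= 11.1), which is consistent with the NaN losses reported above.
info = tf.sysconfig.get_build_info()
print("TF:", tf.__version__,
      "CUDA:", info.get("cuda_version"),
      "cuDNN:", info.get("cudnn_version"))
print("GPUs:", tf.config.list_physical_devices("GPU"))
```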

mimbres commented 2 years ago

@ac-alpha Wait, I made a CUDA 11.2 image a few months ago: docker pull mimbres/neural-audio-fp:cuda11.2.0-cudnn8

ac-alpha commented 2 years ago

@mimbres Thank you so much. Using the CUDA 11.2 image you provided above solved the problem.