Closed: ac-alpha closed this issue 2 years ago

I am running

python run.py train test123 --max_epoch=100 -c default

inside the Docker container, and I am seeing all NaNs in the TensorBoard logs for the training loss. I am running this on a machine with three NVIDIA GeForce RTX 3090 GPUs. What might be going wrong?
@ac-alpha The current code doesn't support multi-GPU training. Please try:

CUDA_VISIBLE_DEVICES=0 python run.py train test123
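The same restriction can also be applied inside Python via tf.config; a minimal sketch, assuming it runs near the top of run.py before any op has touched the GPUs (the exact placement is an assumption):

```python
import tensorflow as tf

# Make only the first physical GPU visible to TensorFlow, equivalent
# in effect to launching with CUDA_VISIBLE_DEVICES=0. This must run
# before any GPU has been initialized.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')
    print(f'Using 1 of {len(gpus)} GPU(s): {gpus[0].name}')
```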
FYI, https://github.com/mimbres/neural-audio-fp/blob/main/model/fp/NTxent_loss_tpu.py is prepared for multi-GPU and TPU training. However, the current version does not use it.
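For reference, multi-GPU support would roughly mean creating the model and optimizer under a tf.distribute strategy scope; a minimal sketch with placeholder model and optimizer, not the repo's actual training code:

```python
import tensorflow as tf

# Hypothetical sketch of multi-GPU data parallelism; run.py does not
# currently do this. The model and optimizer here are placeholders.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables (model weights, optimizer slots) must be created
    # inside the scope so they are mirrored across GPUs.
    model = tf.keras.Sequential([tf.keras.layers.Dense(128)])
    optimizer = tf.keras.optimizers.Adam()
```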
@mimbres I tried with the CUDA_VISIBLE_DEVICES=0 prefix and I am still getting NaNs in the training loss.
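One way to pin down where the NaNs first appear is TensorFlow's numeric checking; a minimal sketch, assuming it is called early in run.py before any training step runs:

```python
import tensorflow as tf

# Raise an error at the first op that produces NaN or Inf, instead of
# letting NaNs propagate silently into the logged training loss. Must
# be enabled before the training functions are traced/executed.
tf.debugging.enable_check_numerics()
```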
@ac-alpha The Docker image is built with CUDA 10.1. However, the RTX 3090 requires CUDA >= 11.1, so I reckon this is what causes the problem. You can build your own image based on CUDA 11.x and TF (>= 2.4). I can't promise when, but I'll try to provide a CUDA 11.x-based image later.
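To verify the mismatch, you can print the CUDA/cuDNN versions a given TensorFlow build was compiled against (tf.sysconfig.get_build_info is available in TF >= 2.4):

```python
import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow binary was built
# with, to compare against what the RTX 3090 driver requires.
info = tf.sysconfig.get_build_info()
print('CUDA:', info.get('cuda_version'))
print('cuDNN:', info.get('cudnn_version'))
```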
@ac-alpha Wait, I made a CUDA 11.2 image a few months ago:
docker pull mimbres/neural-audio-fp:cuda11.2.0-cudnn8
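Then run it with GPU access enabled, e.g. something like `docker run --gpus all -it mimbres/neural-audio-fp:cuda11.2.0-cudnn8` (the exact volume mounts and options depend on your setup).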
@mimbres Thank you so much. Using the CUDA 11.2 image you provided above solved the problem.