os1a closed this issue 4 years ago
Hi again,
Actually, I just increased the CPU memory and it worked. However, training is slower than yours: each epoch takes approximately 1200, versus 800-900 in the provided training log.
Probably the reason is the type of GPU you are using.
The speed difference could be due to the GPUs. We are using RTX 5000 cards (slightly faster than the Titan Xp).
Hi,
Thanks for the great work.
I am trying to train using Horovod with 4 GPUs (RTX 2080 Ti) and 80 GB of CPU memory. However, after some time, and before the first epoch starts, I get the following error:
mpirun noticed that process rank 0 with PID 0 on node dagobert exited on signal 9 (Killed).
According to the Horovod GitHub, this looks like an out-of-memory issue. Could you tell me the system requirements you used to train on 4 GPUs: GPU memory, CPU memory, number of CPUs, etc.? Any advice on multi-GPU training would also help.
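
For anyone hitting the same signal 9, one way to check whether host (CPU) memory is the culprit is to log each worker's peak resident memory before the first epoch. This is only a minimal sketch using the Python standard library; the helper name `peak_rss_mb` is hypothetical and not part of Horovod or this repository, and it assumes Linux or macOS, where the `resource` module is available.

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    # Call this in each worker, e.g. right after the dataset is loaded.
    print(f"peak host memory: {peak_rss_mb():.1f} MiB")
```

Since each GPU gets its own process, the per-process figure has to be considered for all 4 workers together when comparing against the 80 GB available on the node.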