Memory issue for multi-gpu training

os1a commented 4 years ago

Hi,

Thanks for the great work.

I am trying to train using Horovod with 4 GPUs (RTX 2080 Ti) and 80 GB of CPU memory. However, after some time, and before the first epoch starts, I get the following error:

mpirun noticed that process rank 0 with PID 0 on node dagobert exited on signal 9 (Killed).

According to the Horovod GitHub, this seems to be an out-of-memory issue. Therefore, I would like to know the system requirements you used to train on 4 GPUs: GPU memory, CPU memory, number of CPUs, etc. Do you have any advice for training on multiple GPUs?
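
Signal 9 (SIGKILL) is what the Linux out-of-memory killer sends, so one way to check the host-memory hypothesis is to log each process's peak resident memory right after data loading. The following is a minimal sketch using only the Python standard library. It is not taken from the LaneGCN code; the helper names `peak_rss_gb` and `estimated_total_gb` are hypothetical, and the total is only a rough estimate that assumes every training process holds its own copy of the data.

```python
import resource
import sys


def peak_rss_gb():
    """Return the peak resident set size of the current process in GB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    if sys.platform == "darwin":
        return peak / 1024 ** 3
    return peak / 1024 ** 2


def estimated_total_gb(per_process_gb, num_processes):
    """Rough host-memory estimate, assuming each process keeps its own copy."""
    return per_process_gb * num_processes


if __name__ == "__main__":
    # Hypothetical usage: call this in each training process after the
    # dataset has been loaded, then compare the estimate with available RAM.
    per_process = peak_rss_gb()
    print(f"peak RSS of this process: {per_process:.2f} GB")
    print(f"estimate for 4 processes: {estimated_total_gb(per_process, 4):.2f} GB")
```

If the estimate for 4 processes is close to or above the available CPU memory, the kill is most likely the out-of-memory killer rather than a GPU problem.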

os1a commented 4 years ago

Hi again,

Actually, I just increased the CPU memory and it worked. However, training is slower than yours: an epoch takes approximately 1200 vs. 800-900 according to the provided training log.

Probably the reason is the type of GPU you are using.

chenyuntc commented 4 years ago

The speed difference could be due to the GPUs. We are using RTX 5000 (slightly faster than Titan Xp).

uber-research / LaneGCN
