uoguelph-mlrg / Theano-MPI

MPI Parallel framework for training deep learning models built in Theano

Weird 16 GPU training time #7

Closed. FredericMao closed this issue 7 years ago.

FredericMao commented 8 years ago

Hi He,

I am testing 16 GPUs on Mosaic; here is the timing I got:

29520 5.315070 0.925000 time per 5120 images: 4.89 (train 3.90 comm 0.87 wait 0.12)

29600 5.449820 0.932813 time per 5120 images: 4.88 (train 3.89 comm 0.88 wait 0.11)

29680 5.360006 0.945312 time per 5120 images: 4.91 (train 3.90 comm 0.89 wait 0.12)

The comm time seems correct, but the training time is the same as with 8 GPUs.

hma02 commented 8 years ago

@powerreactor It does look wrong, but notice the iteration index: it advances by 80 between log lines, so each reported figure is actually the time over 80 iterations, i.e. two reporting windows. There is an indexing problem in the logging, but the timing itself is correct if you divide it by 2.
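Spelled out as a small sketch (numbers taken from the log lines above; the factor of 2 is the two reporting windows the counter covers):

# The printed iteration index advances by 80 per line (29520 -> 29600 -> 29680),
# so each "time per 5120 images" figure was accumulated over two reporting
# windows; dividing by 2 recovers the true per-5120-image timing.
reported_total = 4.89  # seconds, from the first log line
reported_train = 3.90
windows = 2
print("corrected total per 5120 images: %.2f s" % (reported_total / windows))  # ~2.45
print("corrected train per 5120 images: %.2f s" % (reported_train / windows))  # ~1.95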

hma02 commented 8 years ago

@powerreactor Did you fix the segfault? How did you manage to run 16 GPUs?

FredericMao commented 8 years ago

I still don't know what causes the segfault. Sometimes it works and sometimes it doesn't.

FredericMao commented 8 years ago

Please try changing the compilation directory to somewhere under your /home. (Theano's default is actually ~/.theano.)
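For example, one way to relocate it is via Theano's base_compiledir flag, set before theano is first imported (the path below is just an arbitrary example):

import os
# Redirect Theano's compilation cache to a directory under /home.
# THEANO_FLAGS must be set before the first "import theano".
os.environ["THEANO_FLAGS"] = "base_compiledir=" + os.path.expanduser("~/theano_compiledir")
import theano

The same flag can also go in the [global] section of ~/.theanorc instead of an environment variable.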

hma02 commented 8 years ago

@powerreactor Changed. https://github.com/uoguelph-mlrg/Theano-MPI/commit/526639c8e96026e2fd22bf4291b9f5fca7332f48