My machine freezes with multi-gpu learning

junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch



My machine freezes with multi-gpu learning #685

Open sangrockEG opened 5 years ago

sangrockEG commented 5 years ago

I think this is similar to issues #327, #410, and #483.

When I use a single GPU, everything is fine. But when I use multiple GPUs, the machine freezes completely after a few iterations (around 200 to 300). In the issues above, the system freezes before the iterations start; in my case, it freezes after a few iterations.

Even verification examples such as torch.cuda.broadcast work fine. I know this kind of problem is hard to solve, but I really need some help.
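
Since a single broadcast passes while the freeze only shows up after a few hundred iterations, a longer check may be more telling. The following is a minimal sketch, not code from this repository: the function name `multi_gpu_smoke_test`, the tiny convolution model, and the tensor sizes are all hypothetical. It assumes the multi-GPU path goes through `torch.nn.DataParallel` and simply repeats forward/backward passes for roughly as many iterations as it takes the freeze to appear.

```python
try:
    import torch
    import torch.cuda.comm
    import torch.nn as nn
except ImportError:  # report "skipped" instead of crashing when torch is missing
    torch = None


def multi_gpu_smoke_test(num_iters=300, batch_size=8):
    """Broadcast once, then run repeated DataParallel training steps.

    Returns "ok" on success or a "skipped: ..." message when the
    environment cannot run the check. Hypothetical helper.
    """
    if torch is None:
        return "skipped: torch not installed"
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return "skipped: fewer than 2 GPUs"

    device_ids = list(range(torch.cuda.device_count()))

    # Step 1: copy a small tensor from GPU 0 to every GPU and compare.
    src = torch.arange(4, dtype=torch.float32, device="cuda:0")
    copies = torch.cuda.comm.broadcast(src, device_ids)
    assert all(torch.equal(c.cpu(), src.cpu()) for c in copies)

    # Step 2: many training-style iterations on a toy model (an assumption,
    # far smaller than the real generators and discriminators).
    net = nn.Conv2d(3, 8, kernel_size=3, padding=1).cuda(0)
    model = nn.DataParallel(net, device_ids=device_ids)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(num_iters):
        x = torch.randn(batch_size, 3, 64, 64, device="cuda:0")
        loss = model(x).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Wait for all queued GPU work so a hang surfaces here, not later.
    torch.cuda.synchronize()
    return "ok"


if __name__ == "__main__":
    print(multi_gpu_smoke_test())
```

If this loop also freezes the machine, the problem is probably below this repository's code (driver, hardware, or inter-GPU communication); if it completes, the training script's own components become the more likely suspects.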

fengyu19 commented 5 years ago

I have the same issue as you. When I try to use multiple GPUs to train 2 models, everything is fine at the beginning, but after about 10 epochs the GPU utilization drops to about 0 and training becomes really slow. Did you figure it out?

sangrockEG commented 5 years ago

Nope. I failed to fix it, so I just run on a single GPU.

And I think our issues are quite different. In my case, the whole system literally freezes and crashes. It is not a speed problem. But anyway, multi-GPU training with this code does not seem that stable.

junyanz commented 5 years ago

I suspect that visdom is not stable with multiple GPUs, but I haven't tested it. Could you disable visdom with --display_id 0?
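
As a rough illustration of that suggestion, here is a small sketch that assembles a multi-GPU `train.py` command with visdom turned off. The helper `build_train_command` is hypothetical, and the dataset path and experiment name in the usage lines are placeholders; only the flags themselves (`--dataroot`, `--name`, `--model`, `--gpu_ids`, `--display_id`, `--batch_size`) are options of the training script.

```python
import shlex


def build_train_command(dataroot, name, gpu_ids, model="cycle_gan", batch_size=None):
    """Return the argv list for train.py with visdom disabled.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = [
        "python", "train.py",
        "--dataroot", dataroot,
        "--name", name,
        "--model", model,
        # GPU ids are passed as one comma-separated string, e.g. "0,1".
        "--gpu_ids", ",".join(str(g) for g in gpu_ids),
        # 0 turns off the visdom display, as suggested above.
        "--display_id", "0",
    ]
    if batch_size is not None:
        cmd += ["--batch_size", str(batch_size)]
    return cmd


if __name__ == "__main__":
    # Placeholder dataset path and experiment name.
    cmd = build_train_command("./datasets/maps", "maps_cyclegan",
                              gpu_ids=[0, 1], batch_size=4)
    print(shlex.join(cmd))
```

Running the printed command from the repository root would start the same multi-GPU training without the visdom display, which helps isolate whether visdom is involved in the freeze.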

sangrockEG commented 5 years ago

OK, I'll try it and let you know. Thanks a lot!

jiashu-zhu commented 4 years ago

Hi, I ran into the same issue as you, @fengyu19 @sangrockEG. Have you figured it out?