sangrockEG opened 5 years ago
I have the same issue as you. When I try to use multiple GPUs to train two models, everything is fine at the beginning, but after about 10 epochs the GPU utilization drops to around 0 and training becomes really slow. Did you figure it out?
Nope. I failed to fix it and just run with a single GPU.
And I think our issues are quite different. In my case, the whole system literally freezes and crashes, so it's not a speed problem. Either way, multi-GPU training with this code doesn't seem that stable.
I suspect that visdom is not stable with multiple GPUs, but I haven't tested it. Could you disable visdom with `--display_id 0`?
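For example, something like this (a sketch; `--dataroot` and `--name` are placeholders for your own setup, and I'm assuming the usual `train.py` entry point and `--gpu_ids` flag):

```
python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids 0,1 --display_id 0
```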
OK, I'll try it and let you know. Thanks a lot!
Hi, I've run into the same issue as you @fengyu19 @sangrockEG. Have you figured it out?
I think this is a similar issue to #327, #410, and #483.
When I use a single GPU, everything is fine. But when I use multiple GPUs, it freezes completely after a few iterations (around 200~300). In the issues above, the system freezes before the first iteration starts, but in my case it freezes after a few iterations.
And even verification examples such as torch.cuda.broadcast work fine. I know this kind of problem is hard to debug, but I really need some help.
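To be concrete, this is roughly the verification I ran (a minimal sketch, assuming at least two visible GPUs; it uses `torch.cuda.comm.broadcast`, which is the broadcast call I mean above):

```python
import torch
import torch.cuda.comm as comm

# Sanity check: broadcast a tensor from GPU 0 to GPUs 0 and 1 and
# verify every copy matches the original. This passes on my machine,
# even though multi-GPU training still freezes after ~200 iterations.
assert torch.cuda.device_count() >= 2, "needs at least 2 GPUs"

x = torch.randn(4, 4, device="cuda:0")
copies = comm.broadcast(x, devices=[0, 1])  # returns one tensor per device

for i, t in enumerate(copies):
    assert t.device == torch.device(f"cuda:{i}")
    assert torch.equal(t.cpu(), x.cpu()), f"mismatch on GPU {i}"

print("broadcast OK on", torch.cuda.device_count(), "GPUs")
```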