petuum / adaptdl

Resource-adaptive cluster scheduler for deep learning training.
https://adaptdl.readthedocs.io/
Apache License 2.0
425 stars, 76 forks

Strange outputs when running dcgan example #125

Open zxmeng98 opened 2 years ago

zxmeng98 commented 2 years ago

When I ran dcgan.py from the examples (with batch size autoscaling turned off), the outputs looked very strange and training did not appear to converge: [image]

But when I remove the following two lines:

netD = adl.AdaptiveDataParallel(netD, optimizerD, scheduleD, name="netD")
netG = adl.AdaptiveDataParallel(netG, optimizerG, scheduleG, name="netG")

the results look better: [image 1655875134833]

Could you please help me solve this? Could this be caused by the warning related to zero_grad?
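
To make the zero_grad question concrete, here is a toy sketch in plain Python (no PyTorch, no adaptdl) of what happens in general when a gradient buffer is never cleared between optimizer steps. It only shows why stale gradients can stop training from converging. Whether AdaptiveDataParallel actually leaves gradients uncleared is an assumption I have not verified, and `run_sgd` and its parameters are hypothetical names made up for this illustration.

```python
def run_sgd(steps, lr, clear_grad):
    """Minimize f(w) = w**2 with plain SGD.

    `grad` stands in for a parameter's gradient buffer. When `clear_grad`
    is False the buffer is never reset, so each backward pass adds to the
    gradients left over from earlier steps.
    """
    w = 1.0
    grad = 0.0
    history = []
    for _ in range(steps):
        if clear_grad:
            grad = 0.0       # the effect a working zero_grad() would have
        grad += 2.0 * w      # backward pass: d/dw (w**2) accumulates into the buffer
        w -= lr * grad       # optimizer step
        history.append(w)
    return history


cleared = run_sgd(steps=20, lr=0.1, clear_grad=True)
stale = run_sgd(steps=20, lr=0.1, clear_grad=False)

print("final |w| with gradients cleared:", abs(cleared[-1]))
print("largest |w| over the last 10 steps without clearing:",
      max(abs(w) for w in stale[-10:]))
```

With the buffer cleared, w shrinks steadily toward the minimum at 0. Without clearing, w keeps oscillating around 0 instead of settling. If something similar is happening here, printing the gradient norms of netD and netG right before each optimizer step, with and without the wrapper, should show it.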