hitachinsk / FGT

[ECCV 2022] Flow-Guided Transformer for Video Inpainting
https://hitachinsk.github.io/publication/2022-10-01-Flow-Guided-Transformer-for-Video-Inpainting
MIT License

About DDP gradient aggregation on different GPUs #22

Closed: hwpengTristin closed this issue 1 year ago

hwpengTristin commented 1 year ago

In your FGT/FGT/networks/network.py module (see 'code mark 1' and 'code mark 2' below), I couldn't find an .all_reduce() call to aggregate gradients across different GPUs.

==============code mark 1===============
        dis_loss = (dis_real_loss + dis_fake_loss) / 2
        self.dist_optim.zero_grad()
        dis_loss.backward()
        self.dist_optim.step()
==============code mark 2===============
        loss = m_loss_valid + m_loss_masked + gen_loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

Should the code be rewritten in the following form (see 'rewritten 1' and 'rewritten 2' below) to aggregate gradients across the GPUs? If not, will each GPU be isolated and compute its gradient updates alone?

==============rewritten 1===============
        dis_loss = (dis_real_loss + dis_fake_loss) / 2
        self.dist_optim.zero_grad()
        dis_loss.backward()
        dis_loss = reduce_value(dis_loss, average=True)
        self.dist_optim.step()
==============rewritten 2===============
        loss = m_loss_valid + m_loss_masked + gen_loss
        self.optimizer.zero_grad()
        loss.backward()
        loss = reduce_value(loss, average=True)
        self.optimizer.step()
==============reduce_value helper function===============

import torch.distributed as dist
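
For completeness, a common implementation of a reduce_value helper with the call signature used in 'rewritten 1' and 'rewritten 2' (an assumption, not code taken from this repository) looks like:

    import torch
    import torch.distributed as dist

    def reduce_value(value, average=True):
        # Number of participating processes (one per GPU under DDP).
        world_size = dist.get_world_size()
        if world_size < 2:              # single process: nothing to reduce
            return value
        with torch.no_grad():
            dist.all_reduce(value)      # sums 'value' across all ranks, in place
            if average:
                value /= world_size     # turn the sum into a mean
        return value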

hitachinsk commented 1 year ago

To the best of my knowledge, it is not necessary to call the all_reduce function explicitly, because the aggregation of gradients is handled by PyTorch automatically.
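
For context, a minimal single-node sketch of this pattern (the toy model, the NCCL backend, and the optimizer settings are illustrative assumptions, not FGT's actual training code): once a model is wrapped in torch.nn.parallel.DistributedDataParallel, loss.backward() all-reduces and averages the parameter gradients across processes, so each rank's optimizer step applies the same aggregated update.

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Illustrative setup; in practice each process is launched with its own rank
    # (e.g. via torchrun) and the process group is initialized once per process.
    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()          # single-node sketch: global rank == GPU index
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).to(rank)   # placeholder model, not FGT's generator
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    x = torch.randn(8, 10, device=rank)
    target = torch.randn(8, 1, device=rank)

    loss = torch.nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()    # DDP hooks all-reduce (average) the gradients across ranks here
    optimizer.step()   # every rank applies the same averaged update

Because the gradient synchronization happens inside backward(), a reduce_value call placed after backward() would only average the scalar loss value (e.g. for logging) and would not change the gradients or the parameter updates.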

hwpengTristin commented 1 year ago

Noted, thank you very much!