ikostrikov / pytorch-a3c

PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".
MIT License

About ensure_shared_grads #25

Closed · hugemicrobe closed this 7 years ago

hugemicrobe commented 7 years ago

This is really great work! I have some questions about copying local gradients. In train.py, what is the purpose of adding the condition:

if shared_param.grad is not None

If I understand correctly, local gradients should always be copied to the global network. Is that right? Also, I'm confused about ._grad and .grad in train.py:18. Is there any difference between them?
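
For reference, the function in question looks roughly like this (paraphrased from train.py from memory, so details may differ slightly):

def ensure_shared_grads(model, shared_model):
    # Copy the worker's local gradients into the shared model,
    # unless the shared parameters already have gradients attached.
    for param, shared_param in zip(model.parameters(),
                                   shared_model.parameters()):
        if shared_param.grad is not None:
            return
        shared_param._grad = param.grad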

xuehy commented 7 years ago

I guess if shared_param.grad is not None, then some other thread must be updating the network, and the current thread should not update it until the others complete. But I have a question. As I understand it, if the grad is not None the code simply returns, which means the gradient of the current thread is just discarded. Is this really the case? So only one of the threads can update the network, and the others that finish at the same time just run in vain?

boscotsang commented 7 years ago

@xuehy It seems that the shared_param._grad = param.grad makes shared_param.grad reference the same content as param.grad. Therefore, once shared_param._grad is not None, it always has the same values as param.grad.
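
A minimal sketch to check the aliasing claim (not from the repo, just the old Variable/Parameter API):

import torch
from torch import nn

param = nn.Parameter(torch.ones(1))
shared_param = nn.Parameter(torch.ones(1))

param.sum().backward()                  # materializes param.grad
shared_param._grad = param.grad         # the assignment done in ensure_shared_grads
print(shared_param.grad is param.grad)  # True: both names refer to the same Variable

param.grad.data.add_(5)                 # later accumulation on the local grad...
print(shared_param.grad)                # ...is visible through shared_param.grad too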

xuehy commented 7 years ago

@boscotsang I am still confused.

Once the shared_param._grad is not None, it always has the same values as param.grad

But there are many threads, each owning a different param.grad. Assume there are two threads A and B. What if shared_param._grad is assigned A's param.grad? Then for thread B, shared_param.grad is always not None?

hugemicrobe commented 7 years ago

@xuehy It seems that grad or _grad is not shared among processes with global_network.share_memory(). Only the weights are shared. Therefore, each process has its own shared_param.grad.

xuehy commented 7 years ago

@hugemicrobe The documentation says [screenshot omitted]. Does it mean that shared_param.grad is also shared?

SYTMTHU commented 7 years ago

@xuehy I think the fact that shared_param.grad is shared is exactly why this function works; otherwise shared_param.grad would always be None. So it seems to me that when a process detects that some other process has already copied its local grad to shared_param.grad, it chooses to give up its own update, since it returns immediately.

What do you think?

ikostrikov commented 7 years ago

If you are not confident about A3C, I've just made my A2C code public: https://github.com/ikostrikov/pytorch-a2c.

xuehy commented 7 years ago

@SYTMTHU Yes, I understand how it works. But I think this way the processes waste a lot of time doing nothing. Over the same period of time, is the number of parameter updates with A3C actually the same as for a non-distributed version? Can I conclude that the only difference is that the updates of A3C come from different environments, while the updates of a non-distributed algorithm come from only one running environment?

hugemicrobe commented 7 years ago

I wrote a piece of code for testing, shown below. print_grad runs in 2 processes, and each process adds i+1 to the data of the network parameter. Since data is shared, the result is 2+1+2=5 for both processes. The gradient, however, behaves differently: each process has its own gradient initialized to 0, so the two processes end up with 0+1=1 and 0+2=2 respectively.

If I understand correctly, the gradient is allocated separately for each process, as mentioned in the following post.

https://github.com/pytorch/examples/issues/138

I think the point is that since grad is still None when we call share_memory(), the gradient inside each process ends up being allocated separately. One can instead set grad to 0 before calling share_memory(); in that case, the gradient will be shared (see the sketch after the output below).

from __future__ import print_function
import os
import torch.multiprocessing as mp
import torch
from torch import nn
from torch.autograd import Variable

os.environ['OMP_NUM_THREADS'] = '1'

def print_grad(shared_model, i):
    # Each worker bumps both the (shared) parameter data and its gradient by i+1.
    for p in shared_model.parameters():
        if p._grad is None:
            # grad is lazily allocated, so this happens separately inside each process
            p._grad = Variable(torch.FloatTensor([0]))
        p._grad += i + 1
        p.data += i + 1
        print(p.data)
        print(p.grad)

class TestNet(nn.Module):
    def __init__(self):
        super(TestNet, self).__init__()
        self.x = nn.Parameter(torch.Tensor([2]))

    def forward(self):
        return self.x

model = TestNet()
model.share_memory()

processes = [mp.Process(target=print_grad, args=(model, i)) for i in range(0, 2)]
[p.start() for p in processes]
[p.join() for p in processes]

 5
[torch.FloatTensor of size 1]

Variable containing:
 1
[torch.FloatTensor of size 1]

 5
[torch.FloatTensor of size 1]

Variable containing:
 2
[torch.FloatTensor of size 1]
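
To check the last point, one could materialize the gradients before calling share_memory(), reusing TestNet and print_grad from the script above (untested sketch):

# Allocate .grad up front so share_memory() also moves the gradient
# storage into shared memory, not just the weights.
model = TestNet()
for p in model.parameters():
    p._grad = Variable(torch.zeros(p.data.size()))
model.share_memory()

# Now p._grad is not None inside the workers, so both processes should
# accumulate into the same shared gradient buffer instead of their own copies.
processes = [mp.Process(target=print_grad, args=(model, i)) for i in range(0, 2)]
[p.start() for p in processes]
[p.join() for p in processes]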