thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

potential bug in the implementation of DDPG #29

Closed lmzhangucsd closed 3 years ago

lmzhangucsd commented 4 years ago

My environment: tianshou 0.2.1, Python 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0], Linux

Hello there, I'm using the DDPG algorithm in this library.

To my understanding, the gradient of the policy network should be (d(Q) / d(action)) * (d(action) / d(theta)). I don't see how this line of code will correctly propagate d(Q) / d(action).

I'm raising this issue because I was playing around with https://github.com/thu-ml/tianshou/blob/master/test/continuous/net.py, and when I added a Softmax layer to the Actor network, I got an exception: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
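
To spell out the gradient I mean, this is the standard deterministic policy gradient (as in the DDPG paper):

\nabla_\theta J \approx \mathbb{E}_{s}\left[ \nabla_a Q(s, a)\big|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \right]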

Trinkle23897 commented 4 years ago

Hi, sorry for the late response. Could you please provide your code snippet for the network? (When I add one line of F.softmax in the Actor net, it works fine without an exception:)

def forward(self, s, **kwargs):
    if not isinstance(s, torch.Tensor):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
    batch = s.shape[0]
    s = s.view(batch, -1)
    logits = self.model(s)
    logits = F.softmax(logits, dim=-1)  # <--- here
    logits = self._max * torch.tanh(logits)
    return logits, None
Trinkle23897 commented 4 years ago

I'll check all the implemented algorithms again this week, especially PPO.

fengredrum commented 4 years ago

Hi @liamz39, maybe there's some misunderstanding here.

  1. DDPG is meant to solve continuous control tasks, so there is no reason to use a Softmax layer for the output. You can check here: https://github.com/thu-ml/tianshou/blob/master/discrete/net.py. For discrete control tasks, they do add a Softmax layer, and it works fine for DQN.

  2. Regarding actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean(): when we call the backward() method, PyTorch calculates the corresponding partial derivatives for us automatically.

F.Y.I.

self(batch, eps=0) will call the forward() method. I don't like this style either; I prefer to use self.actor(batch) instead :)
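
A minimal, self-contained sketch (toy networks, not tianshou's actual classes) of how backward() pushes d(Q)/d(action) into the actor's parameters through the chain rule:

import torch
import torch.nn as nn

# Toy stand-ins for the actor and critic; shapes are made up for illustration.
actor = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 16), nn.ReLU(), nn.Linear(16, 1))

obs = torch.randn(32, 3)                      # batch of observations
act = actor(obs)                              # a = mu_theta(s), still attached to the graph
q = critic(torch.cat([obs, act], dim=-1))     # Q(s, a)

actor_loss = -q.mean()
actor_loss.backward()                         # chain rule: dQ/da flows through `act` into theta

print(actor[0].weight.grad is not None)       # True: the actor received gradients
# (The critic's parameters also get gradients here; in practice only the actor
#  optimizer is stepped for this loss.)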

junfanlin commented 4 years ago

I think there might be some potential bugs in the implementation. When I set the number of training environments to 1, the provided test_ddpg and test_td3 both fail to solve Pendulum... but SAC works well.

Trinkle23897 commented 4 years ago

@junfanlin I played with the test script; in fact, it can work with only one env:

Set reward_normalization=False first, because with a single env the reward bias is larger than with multiple envs. Then run:

python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 1
python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1

It takes about 4 epochs on my computer:

± % python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 2                                                                                                                      !10150
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 165.06it/s, len=200.00, loss/actor=74.273026, loss/critic=0.385957, n/ep=1.00, n/st=200.00, rew=-1514.12, v/ep=10.75, v/st=2150.88]                                                        
Epoch #1: test_reward: -1219.784197, best_reward: -1219.784197 in #1
Epoch #2: 2401it [00:14, 164.65it/s, len=200.00, loss/actor=119.941541, loss/critic=0.696872, n/ep=1.00, n/st=200.00, rew=-946.87, v/ep=10.78, v/st=2156.70]                                                        
Epoch #2: test_reward: -857.544738, best_reward: -857.544738 in #2
Epoch #3: 2401it [00:25, 95.51it/s, len=200.00, loss/actor=133.435159, loss/critic=1.195252, n/ep=1.00, n/st=200.00, rew=-254.75, v/ep=10.79, v/st=2158.83]                                                         
Epoch #3: test_reward: -283.552947, best_reward: -283.552947 in #3
Epoch #4:  29%|#########################3                                                             | 700/2400 [00:12<00:30, 55.36it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-248.40, v/ep=10.79, v/st=2157.22]
{'best_reward': -233.88520793467467,
 'duration': '70.94s',
 'test_episode': 1700.0,
 'test_speed': '14988.37 step/s',
 'test_step': 340000,
 'test_time': '22.68s',
 'train_episode': 80.0,
 'train_speed': '331.59 step/s',
 'train_step': 16000,
 'train_time/collector': '7.43s',
 'train_time/model': '40.82s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -123.6083534985761, length: 200.0
± % python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1                                                                                                                       !10151
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 161.66it/s, len=200.00, loss/actor=43.141165, loss/critic1=0.086161, loss/critic2=0.065396, n/ep=1.00, n/st=200.00, rew=-1258.21, v/ep=10.59, v/st=2118.76]                                
Epoch #1: test_reward: -1430.842098, best_reward: -1430.842098 in #1
Epoch #2: 2401it [00:14, 161.55it/s, len=200.00, loss/actor=79.333228, loss/critic1=0.171390, loss/critic2=0.161232, n/ep=1.00, n/st=200.00, rew=-1512.23, v/ep=10.63, v/st=2126.16]                                
Epoch #2: test_reward: -1194.858885, best_reward: -1194.858885 in #2
Epoch #3: 2401it [00:14, 161.87it/s, len=200.00, loss/actor=107.440990, loss/critic1=0.355920, loss/critic2=0.340280, n/ep=1.00, n/st=200.00, rew=-944.28, v/ep=10.67, v/st=2133.35]                                
Epoch #3: test_reward: -848.226526, best_reward: -848.226526 in #3
Epoch #4:  92%|#############################################################################9       | 2200/2400 [00:15<00:01, 145.75it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-127.55, v/ep=10.67, v/st=2134.07]
{'best_reward': -219.068775899375,
 'duration': '63.66s',
 'test_episode': 400.0,
 'test_speed': '15019.68 step/s',
 'test_step': 80000,
 'test_time': '5.33s',
 'train_episode': 48.0,
 'train_speed': '164.57 step/s',
 'train_step': 9600,
 'train_time/collector': '4.51s',
 'train_time/model': '53.83s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -352.5223017492268, length: 200.0
junfanlin commented 4 years ago

Thank you for the extremely quick reply :)

lmzhangucsd commented 4 years ago

Hey @Trinkle23897, thanks so much for your reply! Really appreciate it. Yeah, I added that softmax in the model's Sequential.

I understand putting a softmax in this continuous action space might not be a good option, but nevertheless, mathematically it should work since it is a differentiable function.

...
self.model += [nn.Linear(256, np.prod(action_shape))]
...
self.model += [nn.Softmax(dim=1)]
self.model = nn.Sequential(*self.model)

I didn't add F.softmax in the forward function like you did; I added it directly to self.model.

What's the difference between constructing the network in self.model versus adding the computation directly in the forward function? Why doesn't mine work?
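
(For comparison, a small standalone sketch with made-up sizes of the two placements; both should build the same computation graph, since nn.Softmax is a thin wrapper around F.softmax:)

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)
linear = nn.Linear(8, 5)

# Softmax as a module inside nn.Sequential ...
seq = nn.Sequential(linear, nn.Softmax(dim=-1))
out_module = seq(x)

# ... versus the functional call inside forward():
out_functional = F.softmax(linear(x), dim=-1)

print(torch.allclose(out_module, out_functional))  # True: same operation either way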

lmzhangucsd commented 4 years ago

Actually, adding F.softmax to my forward still raises the same exception...

here is my Actor network code:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape,
                 device='cpu'):
        super().__init__()
        self.device = device
        self.model = [
            nn.Linear(np.prod(state_shape), 256),
            nn.ReLU()]
        for i in range(layer_num):
            self.model += [nn.Linear(256, 256), nn.ReLU()]
        self.model += [nn.Linear(256, np.prod(action_shape))]
        self.model = nn.Sequential(*self.model)
        # self._max = max_action

    def forward(self, s, **kwargs):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
        batch = s.shape[0]
        s = s.view(batch, -1)
        score = self.model(s)
        score_norm = F.softmax(score, dim=-1)  # the softmax that triggers the exception

        # logits = self._max * torch.tanh(logits)
        return score_norm, None
Trinkle23897 commented 4 years ago

Could you please provide the exception log?

lmzhangucsd commented 4 years ago

sure.

Epoch #1:   0%|          | 0/2400 [00:00<?, ?it/s]Warning: Traceback of forward call that caused the error:
  File "***", line 118, in <module>
    test_ddpg()
  File "***", line 103, in test_ddpg
    args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
  File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
    losses = policy.learn(train_collector.sample(batch_size))
  File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 147, in learn
    actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
  File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 119, in forward
    logits, h = model(obs, state=state, info=batch.info)
  File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "***", line 28, in forward
    score_norm = F.softmax(score, dim=-1)
  File "***python3.7/site-packages/torch/nn/functional.py", line 1231, in softmax
    ret = input.softmax(dim)
 (print_stack at /opt/conda/conda-bld/pytorch_1579022060824/work/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Epoch #1:   0%|          | 0/2400 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "***", line 118, in <module>
    test_ddpg()
  File "***", line 103, in test_ddpg
    args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
  File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
    losses = policy.learn(train_collector.sample(batch_size))
  File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 149, in learn
    actor_loss.backward()
  File "***python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "***python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 21]], which is output 0 of SoftmaxBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
lmzhangucsd commented 4 years ago

I changed that line to:

actor_loss = -self.critic(batch.obs, self.actor(batch.obs)[0]).mean()

and it no longer raises the exception.

Trinkle23897 commented 4 years ago

You cannot use softmax when the action space has dimension 1. For example, if the logits' shape before softmax is torch.Size([128, 1]), then after the softmax operation it becomes a column of 128 ones.
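
(A quick standalone check of this, with a made-up tensor:)

import torch
import torch.nn.functional as F

logits = torch.randn(128, 1)         # single-dimensional action output
probs = F.softmax(logits, dim=-1)    # softmax over a single element per row
print(probs.unique())                # tensor([1.]): every entry becomes 1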

Trinkle23897 commented 4 years ago

By the way, softmax is only used for the Categorical distribution in discrete-action tasks. It is not recommended to use softmax over a continuous action space. For your reference: https://github.com/thu-ml/tianshou/issues/29#issuecomment-612884546
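
(For context, a minimal sketch of the usual discrete-action pattern; toy sizes, not tianshou code:)

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

logits = torch.randn(128, 4)                          # 4 discrete actions
dist = Categorical(probs=F.softmax(logits, dim=-1))   # or Categorical(logits=logits)
actions = dist.sample()                               # shape [128], integer action indices
log_probs = dist.log_prob(actions)                    # used by policy-gradient losses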

lmzhangucsd commented 4 years ago

This is what my forward returns:

def forward(self, s, **kwargs):
    ...
    return logits, None

so self.actor(batch.obs)[0] will return the logits; the shape will be [128, whatever my output dim size is].

junfanlin commented 4 years ago

When I used DDPG to train LunarLanderContinuous-v2, the reward does not seem to increase at all. Is there any modification needed for this environment?

BTW, the SAC algorithm might need a .sum(-1, keepdims=True) when calculating log_prob, and it does not seem to converge on LunarLanderContinuous-v2 either.
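
To illustrate the .sum(-1, keepdims=True) point: for an independent Gaussian policy, Normal(...).log_prob(a) returns one value per action dimension, so the joint log-probability of a multi-dimensional action is their sum over the last axis. A toy example (not tianshou's SAC code):

import torch
from torch.distributions import Normal

mu, sigma = torch.zeros(128, 2), torch.ones(128, 2)   # 2-dimensional continuous action
dist = Normal(mu, sigma)
a = dist.sample()

per_dim = dist.log_prob(a)                  # shape [128, 2]: one log-prob per action dim
joint = per_dim.sum(-1, keepdim=True)       # shape [128, 1]: log-prob of the whole action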

Thanks in advance.

Trinkle23897 commented 4 years ago

@junfanlin I'll try it later. Thanks for pointing that out! You can try observation & reward normalization first.

dylanqyuan commented 4 years ago

When I run /tianshou/test/discrete/test_dqn.py for testing, the following issue happened:

Traceback (most recent call last):
  File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 114, in <module>
    test_dqn(get_args())
  File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 76, in test_dqn
    train_collector.collect(n_step=args.batch_size)
  File "D:\Users\Administrator\anaconda3\envs\tianshou\lib\site-packages\tianshou\data\collector.py", line 296, in collect
    self.step_speed.add(cur_step / duration)
ZeroDivisionError: float division by zero

Process finished with exit code 1

In the first two episodes it works just fine; in the third episode, the issue above happened. Could you please tell me why?

Trinkle23897 commented 4 years ago

@dylan0828 I fixed this issue in #26; you can re-install the newest version of tianshou from GitHub (pip3 install git+https://github.com/thu-ml/tianshou.git@master). Further questions are welcome.