Hi, sorry for the late response. Could you please provide your code snippet for the network? (I added one line of F.softmax in the Actor net and it works fine without any exception.)
def forward(self, s, **kwargs):
    if not isinstance(s, torch.Tensor):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
    batch = s.shape[0]
    s = s.view(batch, -1)
    logits = self.model(s)
    logits = F.softmax(logits, dim=-1)  # <--- here
    logits = self._max * torch.tanh(logits)
    return logits, None
I'll check all the implemented algorithms again this week, especially PPO.
Hi @liamz39, maybe there is some misunderstanding here.
DDPG is meant to solve continuous control tasks, so there is no reason to use a SoftMax layer on the output. You can check here: https://github.com/thu-ml/tianshou/blob/master/discrete/net.py. For discrete control tasks a SoftMax layer is added, and it works fine for DQN.
actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
When we call the backward() method, PyTorch calculates the corresponding partial derivatives for us automatically.
FYI: self(batch, eps=0) will call the forward() method. I don't like this style either; I prefer to use self.actor(batch) instead :)
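For illustration, here is a minimal, self-contained sketch (hypothetical toy networks, not the tianshou implementation) of how backward() through -Q(s, actor(s)) chains dQ/d(action) with d(action)/d(theta):

# Minimal sketch with hypothetical toy networks: backpropagating the actor
# loss through the critic is what propagates dQ/da into the actor parameters.
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 1
actor = nn.Sequential(nn.Linear(obs_dim, 16), nn.ReLU(), nn.Linear(16, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 16), nn.ReLU(), nn.Linear(16, 1))

obs = torch.randn(8, obs_dim)
action = actor(obs)                               # a = mu_theta(s)
q = critic(torch.cat([obs, action], dim=-1))      # Q(s, a)
actor_loss = -q.mean()
actor_loss.backward()                             # fills the actor parameters' .grad

print(actor[0].weight.grad.shape)                 # gradients reached the actor via dQ/da * da/dtheta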
I think there might be some potential bugs in the implementation. When I set the number of training environments to 1, the provided test_ddpg and test_td3 both fail to solve Pendulum ... but SAC works well.
@junfanlin I played with the test script; in fact, it can work with only one env. Change reward_normalization=False first, because in a single env the reward bias is larger than with multiple envs. Then:
python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 1
python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1
It takes about 4 epochs on my computer:
± % python3 test/continuous/test_ddpg.py --seed 0 --training-num 1 --collect-per-step 2
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 165.06it/s, len=200.00, loss/actor=74.273026, loss/critic=0.385957, n/ep=1.00, n/st=200.00, rew=-1514.12, v/ep=10.75, v/st=2150.88]
Epoch #1: test_reward: -1219.784197, best_reward: -1219.784197 in #1
Epoch #2: 2401it [00:14, 164.65it/s, len=200.00, loss/actor=119.941541, loss/critic=0.696872, n/ep=1.00, n/st=200.00, rew=-946.87, v/ep=10.78, v/st=2156.70]
Epoch #2: test_reward: -857.544738, best_reward: -857.544738 in #2
Epoch #3: 2401it [00:25, 95.51it/s, len=200.00, loss/actor=133.435159, loss/critic=1.195252, n/ep=1.00, n/st=200.00, rew=-254.75, v/ep=10.79, v/st=2158.83]
Epoch #3: test_reward: -283.552947, best_reward: -283.552947 in #3
Epoch #4: 29%|#########################3 | 700/2400 [00:12<00:30, 55.36it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-248.40, v/ep=10.79, v/st=2157.22]
{'best_reward': -233.88520793467467,
'duration': '70.94s',
'test_episode': 1700.0,
'test_speed': '14988.37 step/s',
'test_step': 340000,
'test_time': '22.68s',
'train_episode': 80.0,
'train_speed': '331.59 step/s',
'train_step': 16000,
'train_time/collector': '7.43s',
'train_time/model': '40.82s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -123.6083534985761, length: 200.0
± % python3 test/continuous/test_td3.py --seed 0 --training-num 1 --collect-per-step 1
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Epoch #1: 2401it [00:14, 161.66it/s, len=200.00, loss/actor=43.141165, loss/critic1=0.086161, loss/critic2=0.065396, n/ep=1.00, n/st=200.00, rew=-1258.21, v/ep=10.59, v/st=2118.76]
Epoch #1: test_reward: -1430.842098, best_reward: -1430.842098 in #1
Epoch #2: 2401it [00:14, 161.55it/s, len=200.00, loss/actor=79.333228, loss/critic1=0.171390, loss/critic2=0.161232, n/ep=1.00, n/st=200.00, rew=-1512.23, v/ep=10.63, v/st=2126.16]
Epoch #2: test_reward: -1194.858885, best_reward: -1194.858885 in #2
Epoch #3: 2401it [00:14, 161.87it/s, len=200.00, loss/actor=107.440990, loss/critic1=0.355920, loss/critic2=0.340280, n/ep=1.00, n/st=200.00, rew=-944.28, v/ep=10.67, v/st=2133.35]
Epoch #3: test_reward: -848.226526, best_reward: -848.226526 in #3
Epoch #4: 92%|#############################################################################9 | 2200/2400 [00:15<00:01, 145.75it/s, len=200.00, n/ep=1.00, n/st=200.00, rew=-127.55, v/ep=10.67, v/st=2134.07]
{'best_reward': -219.068775899375,
'duration': '63.66s',
'test_episode': 400.0,
'test_speed': '15019.68 step/s',
'test_step': 80000,
'test_time': '5.33s',
'train_episode': 48.0,
'train_speed': '164.57 step/s',
'train_step': 9600,
'train_time/collector': '4.51s',
'train_time/model': '53.83s'}
/home/trinkle/.local/lib/python3.6/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Final reward: -352.5223017492268, length: 200.0
Thank you for the extremely quick reply :)
Hey @Trinkle23897, thanks so much for your reply! Really appreciate it. Yeah, I added that softmax in the model's Sequential.
I understand that putting a softmax on this continuous action space might not be a good option, but mathematically it should still work since it is a differentiable function:
...
self.model += [nn.Linear(256, np.prod(action_shape))]
...
self.model += [nn.Softmax(dim=1)]
self.model = nn.Sequential(*self.model)
I didn't add F.softmax in the forward function like you did, but added it directly to self.model. What's the difference between building the network in self.model versus adding the computation in the forward function? Why doesn't mine work?
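(For reference, the two placements compute the same function; a minimal sketch with toy shapes, not the actual Actor net:)

# nn.Softmax inside Sequential vs F.softmax in forward() compute the same thing.
import torch
import torch.nn as nn
import torch.nn.functional as F

linear = nn.Linear(8, 3)
x = torch.randn(4, 8)

model = nn.Sequential(linear, nn.Softmax(dim=-1))   # softmax as a module
out_module = model(x)
out_functional = F.softmax(linear(x), dim=-1)        # softmax applied in forward()

print(torch.allclose(out_module, out_functional))    # True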
Actually, adding F.softmax to my forward still raises the same exception...
Here is my Actor network code:
class Actor(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape, device='cpu'):
        super().__init__()
        self.device = device
        self.model = [
            nn.Linear(np.prod(state_shape), 256),
            nn.ReLU()]
        for i in range(layer_num):
            self.model += [nn.Linear(256, 256), nn.ReLU()]
        self.model += [nn.Linear(256, np.prod(action_shape))]
        self.model = nn.Sequential(*self.model)
        # self._max = max_action

    def forward(self, s, **kwargs):
        s = torch.tensor(s, device=self.device, dtype=torch.float)
        batch = s.shape[0]
        s = s.view(batch, -1)
        score = self.model(s)
        score_norm = F.softmax(score, dim=-1)
        # logits = self._max * torch.tanh(logits)
        return score_norm, None
Could you please provide the exception log?
Sure.
Epoch #1: 0%| | 0/2400 [00:00<?, ?it/s]Warning: Traceback of forward call that caused the error:
File "***", line 118, in <module>
test_ddpg()
File "***", line 103, in test_ddpg
args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
losses = policy.learn(train_collector.sample(batch_size))
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 147, in learn
actor_loss = -self.critic(batch.obs, self(batch, eps=0).act).mean()
File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 119, in forward
logits, h = model(obs, state=state, info=batch.info)
File "***python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "***", line 28, in forward
score_norm = F.softmax(score, dim=-1)
File "***python3.7/site-packages/torch/nn/functional.py", line 1231, in softmax
ret = input.softmax(dim)
(print_stack at /opt/conda/conda-bld/pytorch_1579022060824/work/torch/csrc/autograd/python_anomaly_mode.cpp:57)
Epoch #1: 0%| | 0/2400 [00:02<?, ?it/s]
Traceback (most recent call last):
File "***", line 118, in <module>
test_ddpg()
File "***", line 103, in test_ddpg
args.batch_size, stop_fn=stop_fn, save_fn=save_fn, writer=writer)
File "***python3.7/site-packages/tianshou/trainer/offpolicy.py", line 87, in offpolicy_trainer
losses = policy.learn(train_collector.sample(batch_size))
File "***python3.7/site-packages/tianshou/policy/modelfree/ddpg.py", line 149, in learn
actor_loss.backward()
File "***python3.7/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "***python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 21]], which is output 0 of SoftmaxBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
I changed that line to:
actor_loss = -self.critic(batch.obs, self.actor(batch.obs)[0]).mean()
and it no longer raises the exception.
You cannot use softmax over action_space=1. For example, the logits' shape before softmax is torch.Size([128, 1]), and after the softmax operation the output becomes [1, 1, 1, ..., 1] of length 128.
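A quick check of this degenerate case (hypothetical shape [128, 1] as described above):

# Softmax over a size-1 action dimension always returns 1.0, so the actor
# output carries no information.
import torch
import torch.nn.functional as F

logits = torch.randn(128, 1)
probs = F.softmax(logits, dim=-1)
print(probs.min().item(), probs.max().item())   # 1.0 1.0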
By the way, the softmax is only used for the Categorical distribution in discrete action tasks. It is not recommended to apply softmax over a continuous action space. For your reference: https://github.com/thu-ml/tianshou/issues/29#issuecomment-612884546
This is what my forward returns:
def forward(self, s, **kwargs):
    ...
    return logits, None
so self.actor(batch.obs)[0] will return the logits; it will be [128, whatever my output dim size].
When I used DDPG to train LunarLanderContinuous-v2, the reward did not seem to increase at all. Is any modification needed for this environment?
BTW, the SAC algorithm might need a .sum(-1, keepdims=True) when calculating log_prob, and it doesn't seem to converge on LunarLanderContinuous-v2 either.
Thanks in advance.
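A short sketch of the .sum(-1, keepdims=True) point above, assuming a diagonal Gaussian policy with toy shapes:

# For a diagonal Gaussian policy, Normal(...).log_prob is per-dimension, so it
# must be summed over the action dimension to get the joint log-probability.
import torch
from torch.distributions import Normal

mu, sigma = torch.zeros(128, 2), torch.ones(128, 2)   # e.g. 2-dim continuous actions
dist = Normal(mu, sigma)
a = dist.rsample()
log_prob = dist.log_prob(a).sum(-1, keepdim=True)      # shape [128, 1], not [128, 2]
print(log_prob.shape)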
@junfanlin I'll try it later. Thanks for pointing this out! You can try observation & reward normalization first.
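For illustration, a minimal, hypothetical reward-normalization wrapper built on plain gym (not tianshou's built-in reward_normalization flag):

# Hypothetical sketch: scale rewards by a running estimate of their std.
import gym
import numpy as np

class RunningRewardNorm(gym.RewardWrapper):
    """Normalize rewards with a running standard-deviation estimate."""
    def __init__(self, env, eps=1e-8):
        super().__init__(env)
        self.count, self.mean, self.m2 = eps, 0.0, 0.0

    def reward(self, reward):
        # Welford's online update of the running mean/variance
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + 1e-8
        return reward / std

env = RunningRewardNorm(gym.make('LunarLanderContinuous-v2'))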
When I run /tianshou/test/discrete/test_dqn.py for testing, the following issue happened:
Traceback (most recent call last):
File "E:/PycharmProjects/tianshou/tianshou/test/discrete/test_dqn.py", line 114, in <module>
Process finished with exit code 1
In the first two episodes it works just fine; in the third episode the above issue happened. Could you please tell me why?
@dylan0828 I fixed this issue in #26; you can re-install the newest version of tianshou from GitHub: pip3 install git+https://github.com/thu-ml/tianshou.git@master
Further questions are welcome.
My environment: tianshou 0.2.1, Python 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0], Linux
Hello there, I'm using the DDPG algorithm in this library.
To my understanding, the gradient of the policy network should be (dQ/d(action)) * (d(action)/d(theta)), and I don't feel like this line of code will correctly propagate dQ/d(action).
I'm raising this error because I was playing around with https://github.com/thu-ml/tianshou/blob/master/test/continuous/net.py, and when I added a Softmax layer to the Actor network, I got an exception:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
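For reference, the extra "Traceback of forward call that caused the error" block in the exception log above comes from PyTorch's anomaly detection, which can be enabled like this (a minimal sketch):

# Anomaly detection points at the forward op whose output was later modified in place.
import torch

torch.autograd.set_detect_anomaly(True)   # global switch
# or scoped to a single backward pass:
# with torch.autograd.detect_anomaly():
#     actor_loss.backward()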