No, because as you mentioned, zero state initialization was often used in previous works; here we mimic this workflow.
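For context, a minimal sketch (shapes and values are illustrative, not Tianshou's actual code) of what zero state initialization means in PyTorch:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=1, batch_first=True)
obs = torch.randn(2, 1, 4)  # (batch, seq_len, features); values are arbitrary

# Zero state initialization: every new episode starts from all-zero hidden/cell states.
h0 = torch.zeros(1, 2, 8)  # (num_layers, batch, hidden_size)
c0 = torch.zeros(1, 2, 8)
out, (h1, c1) = lstm(obs, (h0, c0))

# Passing no state at all is equivalent: nn.LSTM defaults to zeros when hx is None.
out_default, _ = lstm(obs)
assert torch.allclose(out, out_default)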
@Trinkle23897 Could you please point me to the exact workflow this RNN implementation is based on? I'm trying to figure out why it doesn't work (I have tested RNN with SAC and with DQN on 5 environments; it only worked with DQN on CartPole).
Hello @Trinkle23897, I am encountering a problem at lines 89-90 in https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/a2c.py#L89:
v_s.append(self.critic(minibatch.obs))
v_s_.append(self.critic(minibatch.obs_next))
I am wondering why the critic does not support a 'state' input the way the actor does. Since it is common for the critic to share the same RNN input network as the actor, how can I pass the same state that the actor used to the critic? I noticed there is a state batch stored in the 'policy' key of the minibatch, but that should be the output state if I understand correctly, right? Or am I not supposed to pass any state to the critic during training?
Did I miss anything here? Thanks a lot.
Or am I not supposed to pass any state to the critic during training?
Currently, yes.
Thanks. I finally put the input state into the model input dict to solve the problem.
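For anyone hitting the same issue, a rough sketch of the idea (illustrative names, not the exact code): give the critic a forward that accepts an optional state and feed it whatever hidden state was stashed in the batch at collection time.

import torch
import torch.nn as nn

class RecurrentCritic(nn.Module):
    """Illustrative critic that accepts the shared RNN hidden state as input."""

    def __init__(self, obs_dim, hidden_size):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_size, batch_first=True)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, obs, state=None):
        # obs: (batch, seq_len, obs_dim); state: (num_layers, batch, hidden_size) or None
        feat, new_state = self.rnn(obs, state)
        value = self.value_head(feat[:, -1])  # value estimate for the last step
        return value, new_state

# During the update, the hidden state saved at collection time can then be fed back in,
# e.g. (hypothetical key name) something like:
# v_s, _ = critic(minibatch.obs, state=minibatch.policy.get("state", None))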
Another thing: I found that torch.nn.utils.clip_grad_norm_
cannot deal with inf gradients, which frequently occur in RNNs. Would you mind adding a torch.nn.utils.clip_grad_value_
call before it in all policy implementations, like below? Thanks a lot.
if self._grad_value:  # clip overly large gradient values first (handles inf)
    nn.utils.clip_grad_value_(
        self._actor_critic.parameters(), clip_value=self._grad_value
    )
if self._grad_norm:  # then clip the global gradient norm
    nn.utils.clip_grad_norm_(
        self._actor_critic.parameters(), max_norm=self._grad_norm
    )
experiment:
>>> w.grad
tensor([[0.4000, 0.4000, 0.4000, 0.4000, 0.4000],
[ inf, inf, inf, inf, inf],
[2.3000, 2.3000, 2.3000, 2.3000, 2.3000]])
>>> torch.nn.utils.clip_grad_norm_([w], 0.5) # this cannot deal with inf
tensor(inf)
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
[nan, nan, nan, nan, nan],
[0., 0., 0., 0., 0.]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # and this cannot deal with nan
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
[nan, nan, nan, nan, nan],
[0., 0., 0., 0., 0.]])
========
>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
[ inf, inf, inf, inf, inf],
[1.8000, 1.8000, 1.8000, 1.8000, 1.8000]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # but this can deal with inf
>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
[0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
[0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])
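For reference, a self-contained snippet (tensor values arbitrary) that shows the proposed ordering: clip by value first so inf entries become finite, then clip the global norm as usual.

import torch
import torch.nn as nn

# A parameter whose gradient contains inf, as in the experiment above.
w = nn.Parameter(torch.zeros(3, 5))
w.grad = torch.tensor([[0.2] * 5, [float("inf")] * 5, [1.8] * 5])

# Clip by value first: the inf entries are capped at 0.5 instead of poisoning the norm.
nn.utils.clip_grad_value_([w], clip_value=0.5)
# Then clip the (now finite) global norm.
nn.utils.clip_grad_norm_([w], max_norm=0.5)
print(w.grad)  # every entry is finite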
Sure! Feel free to submit a PR
I see on the README that RNN support is on your TODO list. However, the module API seems to support RNNs (the forward(obs, state) method). Could you please provide some examples of how to train an RNN policy? Thanks!
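Not a complete training script, but a minimal sketch of a module that follows the forward(obs, state) convention mentioned above (the class name, dimensions, and obs handling are illustrative assumptions):

import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Illustrative actor: returns action logits plus the new hidden state."""

    def __init__(self, obs_dim, action_dim, hidden_size=128):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, action_dim)

    def forward(self, obs, state=None):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        if obs.dim() == 2:  # (batch, obs_dim) -> add a length-1 time axis
            obs = obs.unsqueeze(1)
        # state is None at the start of an episode (zero state initialization);
        # otherwise it is the (h, c) tuple carried over from the previous step.
        feat, new_state = self.rnn(obs, state)
        logits = self.head(feat[:, -1])
        return logits, new_state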