thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

RNN support #19

Closed: miriaford closed this issue 3 years ago

miriaford commented 4 years ago

I see in the README that RNN support is on your TODO list. However, the module API already seems to support RNNs (the forward(obs, state) method). Could you please provide some examples of how to train an RNN policy? Thanks!
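In case it is useful, here is a minimal sketch of the kind of recurrent module I have in mind, assuming the forward(obs, state) convention means the module receives the previous hidden state and returns (logits, new_state). The LSTM, the layer sizes, and the state dict keys are just my own placeholders, not tianshou API:

import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Hypothetical recurrent network following the forward(obs, state) convention."""
    def __init__(self, obs_dim, action_dim, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, action_dim)

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        if obs.dim() == 2:  # (batch, obs_dim) -> add a time axis of length 1
            obs = obs.unsqueeze(1)
        hc = None if state is None else (state["hidden"], state["cell"])
        out, (h, c) = self.lstm(obs, hc)
        logits = self.head(out[:, -1])
        # return the new state so the collector can feed it back on the next step
        return logits, {"hidden": h, "cell": c}

I am also unsure how the replay buffer should be configured so that sampled batches contain the state (or short frame stacks) the RNN needs, which is part of why I am asking for an example.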

BFAnas commented 2 years ago

> No, because, as you mentioned, zero state initialization was often used in previous works; here we mimic this workflow.

@Trinkle23897 Could you please point me to the exact workflow this RNN implementation is based on? I'm trying to figure out why it doesn't work (I have tested RNN with SAC and with DQN on 5 environments; it only worked with DQN on CartPole).

xiaoshenxian commented 2 years ago

Hello @Trinkle23897, I am running into a problem at lines 89-90 of https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/a2c.py#L89:

v_s.append(self.critic(minibatch.obs))
v_s_.append(self.critic(minibatch.obs_next))

I am wondering why the critic does not support a 'state' input the way the actor does. Since it is common for the critic to share the same RNN feature network as the actor, how can I pass the critic the same state that the actor used? I noticed there is a state batch stored under the 'policy' key of the minibatch, but if I understand correctly that is the output state, right? Or am I not supposed to pass any state to the critic during training?

Did I miss anything here? Thanks a lot.

Trinkle23897 commented 2 years ago

> Or am I not supposed to pass any state to the critic during training?

Currently, yes.

xiaoshenxian commented 2 years ago

> Or am I not supposed to pass any state to the critic during training?
>
> Currently, yes.

Thanks. I finally solved the problem by putting the input state into the model's input dict.
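Roughly what my workaround looks like, in case it helps anyone. This is a sketch with my own naming ('features', 'rnn_hidden'), not tianshou API, and it assumes the actor's hidden state has already been copied into the observation dict at collection time:

import torch
import torch.nn as nn

class StateAwareCritic(nn.Module):
    """Hypothetical critic that reads the actor's RNN state out of the obs dict."""
    def __init__(self, obs_dim, hidden_size=128):
        super().__init__()
        self.value_head = nn.Linear(obs_dim + hidden_size, 1)

    def forward(self, obs):
        # obs is a dict-like batch; "features" holds the raw observation
        # (batch, obs_dim) and "rnn_hidden" the stored actor hidden state
        # (batch, hidden_size), saved per sample during collection.
        feats = torch.as_tensor(obs["features"], dtype=torch.float32)
        hidden = torch.as_tensor(obs["rnn_hidden"], dtype=torch.float32)
        return self.value_head(torch.cat([feats, hidden], dim=-1))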

Another thing: I found that torch.nn.utils.clip_grad_norm_ cannot handle inf gradients, which happens quite often with RNNs. Would you mind adding a torch.nn.utils.clip_grad_value_ call before it in all the policy implementations, like below? Thanks a lot.

if self._grad_value:  # clip each gradient element first, so inf becomes finite
    nn.utils.clip_grad_value_(
        self._actor_critic.parameters(), clip_value=self._grad_value
    )
if self._grad_norm:  # then clip the global gradient norm
    nn.utils.clip_grad_norm_(
        self._actor_critic.parameters(), max_norm=self._grad_norm
    )

Experiment:

>>> w.grad
tensor([[0.4000, 0.4000, 0.4000, 0.4000, 0.4000],
        [   inf,    inf,    inf,    inf,    inf],
        [2.3000, 2.3000, 2.3000, 2.3000, 2.3000]])
>>> torch.nn.utils.clip_grad_norm_([w], 0.5) # this cannot deal with inf
tensor(inf)
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # and this cannot deal with nan
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])

========

>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [   inf,    inf,    inf,    inf,    inf],
        [1.8000, 1.8000, 1.8000, 1.8000, 1.8000]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # but this can deal with inf
>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])
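For reference, a self-contained snippet that reproduces the second transcript above (the gradient values are set by hand just to create the inf case):

import torch

# build a parameter with an artificial gradient containing an inf row
w = torch.zeros(3, 5, requires_grad=True)
w.grad = torch.tensor([[0.2] * 5, [float("inf")] * 5, [1.8] * 5])

# element-wise clipping clamps the inf entries down to the clip value
torch.nn.utils.clip_grad_value_([w], 0.5)
print(w.grad)  # the inf row is now 0.5

# a subsequent norm clip then operates on finite values only
torch.nn.utils.clip_grad_norm_([w], max_norm=0.5)
print(w.grad)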
Trinkle23897 commented 2 years ago

Sure! Feel free to submit a PR.