thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

RNN support #19

Closed: miriaford closed this issue 3 years ago

miriaford commented 4 years ago

I see in the README that RNN support is on your TODO list. However, the module API already seems to support RNNs (the forward(obs, state) method). Could you please provide some examples of how to train an RNN policy? Thanks!
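In case it is useful, here is a minimal sketch of the kind of recurrent module I have in mind, assuming the forward(obs, state) convention means the module receives the previous hidden state and returns (logits, new_state). The LSTM, the layer sizes, and the state dict keys are just my own placeholders, not tianshou API:

import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Hypothetical recurrent network following the forward(obs, state) convention."""
    def __init__(self, obs_dim, action_dim, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, action_dim)

    def forward(self, obs, state=None, info={}):
        obs = torch.as_tensor(obs, dtype=torch.float32)
        if obs.dim() == 2:  # (batch, obs_dim) -> add a time axis of length 1
            obs = obs.unsqueeze(1)
        hc = None if state is None else (state["hidden"], state["cell"])
        out, (h, c) = self.lstm(obs, hc)
        logits = self.head(out[:, -1])
        # return the new state so the collector can feed it back on the next step
        return logits, {"hidden": h, "cell": c}

I am also unsure how the replay buffer should be configured so that sampled batches contain the state (or short frame stacks) the RNN needs, which is part of why I am asking for an example.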

BFAnas commented 2 years ago

> No, because, as you mentioned, zero state initialization was often used in previous works; here we mimic this workflow.

@Trinkle23897 Could you please point me to the exact workflow this RNN implementation is based on? I'm trying to figure out why it doesn't work (I have tested RNN with SAC and with DQN on 5 environments; it only worked with DQN on CartPole).

xiaoshenxian commented 2 years ago

Hello @Trinkle23897, I am running into a problem at lines 89-90 of https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/a2c.py#L89:

v_s.append(self.critic(minibatch.obs))
v_s_.append(self.critic(minibatch.obs_next))

I am wondering why the critic does not support a 'state' input the way the actor does. Since it is common for the critic to share the same RNN feature network as the actor, how can I pass the critic the same state that the actor used? I noticed there is a state batch stored under the 'policy' key of the minibatch, but if I understand correctly that is the output state, right? Or am I not supposed to pass any state to the critic during training?

Did I miss anything here? Thanks a lot.

Trinkle23897 commented 2 years ago

> Or am I not supposed to pass any state to the critic during training?

Currently, yes.

xiaoshenxian commented 2 years ago

> Or am I not supposed to pass any state to the critic during training?
>
> Currently, yes.

Thanks. I finally solved the problem by putting the input state into the model's input dict.
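Roughly what my workaround looks like, in case it helps anyone. This is a sketch with my own naming ('features', 'rnn_hidden'), not tianshou API, and it assumes the actor's hidden state has already been copied into the observation dict at collection time:

import torch
import torch.nn as nn

class StateAwareCritic(nn.Module):
    """Hypothetical critic that reads the actor's RNN state out of the obs dict."""
    def __init__(self, obs_dim, hidden_size=128):
        super().__init__()
        self.value_head = nn.Linear(obs_dim + hidden_size, 1)

    def forward(self, obs):
        # obs is a dict-like batch; "features" holds the raw observation
        # (batch, obs_dim) and "rnn_hidden" the stored actor hidden state
        # (batch, hidden_size), saved per sample during collection.
        feats = torch.as_tensor(obs["features"], dtype=torch.float32)
        hidden = torch.as_tensor(obs["rnn_hidden"], dtype=torch.float32)
        return self.value_head(torch.cat([feats, hidden], dim=-1))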

Another thing: I found that torch.nn.utils.clip_grad_norm_ cannot handle inf gradients, which happens quite often with RNNs. Would you mind adding a torch.nn.utils.clip_grad_value_ call before it in all the policy implementations, like below? Thanks a lot.

if self._grad_value:  # clip each gradient element first, so inf becomes finite
    nn.utils.clip_grad_value_(
        self._actor_critic.parameters(), clip_value=self._grad_value
    )
if self._grad_norm:  # then clip the global gradient norm
    nn.utils.clip_grad_norm_(
        self._actor_critic.parameters(), max_norm=self._grad_norm
    )

Experiment:

>>> w.grad
tensor([[0.4000, 0.4000, 0.4000, 0.4000, 0.4000],
        [   inf,    inf,    inf,    inf,    inf],
        [2.3000, 2.3000, 2.3000, 2.3000, 2.3000]])
>>> torch.nn.utils.clip_grad_norm_([w], 0.5) # this cannot deal with inf
tensor(inf)
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # and this cannot deal with nan
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])

========

>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [   inf,    inf,    inf,    inf,    inf],
        [1.8000, 1.8000, 1.8000, 1.8000, 1.8000]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # but this can deal with inf
>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])
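For reference, a self-contained snippet that reproduces the second transcript above (the gradient values are set by hand just to create the inf case):

import torch

# build a parameter with an artificial gradient containing an inf row
w = torch.zeros(3, 5, requires_grad=True)
w.grad = torch.tensor([[0.2] * 5, [float("inf")] * 5, [1.8] * 5])

# element-wise clipping clamps the inf entries down to the clip value
torch.nn.utils.clip_grad_value_([w], 0.5)
print(w.grad)  # the inf row is now 0.5

# a subsequent norm clip then operates on finite values only
torch.nn.utils.clip_grad_norm_([w], max_norm=0.5)
print(w.grad)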
Trinkle23897 commented 2 years ago

Sure! Feel free to submit a PR.