thu-ml / tianshou

An elegant PyTorch deep reinforcement learning library.
https://tianshou.org
MIT License

RNN support #19

Closed miriaford closed 3 years ago

miriaford commented 4 years ago

I see on the README that RNN support is on your TODO list. However, the module API already seems to support RNNs (the forward(obs, state) method). Could you please provide some examples of how to train an RNN policy? Thanks!

Trinkle23897 commented 4 years ago

Oh yes, it actually already supports RNN policies, but I will write a small demonstration script after finishing the documentation. Thanks for pushing me.

miriaford commented 4 years ago

Thanks! Do you mind quickly posting a snippet here and explaining how to use the RNN policy? I need to run something ASAP, thanks a lot!

fengredrum commented 4 years ago

Actually, my experiments show that an RNN policy is nearly always inferior to an MLP policy in robotic control domains. Besides, RNNs suffer from slow training speed :)

miriaford commented 4 years ago

It varies from case to case, and I agree that an RNN might not be the best option. I do need it as a baseline, though, so I would still appreciate it if someone could explain how to run an RNN policy with Tianshou. Thanks!

Trinkle23897 commented 4 years ago

@miriaford I uploaded a script in test/discrete/test_drqn.py. What I changed in the codebase:

  1. modified the replay buffer to support frame_stack sampling;
  2. fixed the collector so that it clears the state when the corresponding episode finishes.

How to use it:

  1. add a stack_num argument to the training collector's replay buffer (code snippet);
  2. change your actor network to a recurrent one (code snippet);
  3. use the current git master branch instead of the PyPI wheel: pip3 install git+https://github.com/thu-ml/tianshou.git@master.

That's all. If you find anything wrong, please give me feedback.
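
For concreteness, a minimal sketch of those two steps for DQN, assuming the Recurrent Q-network from test/discrete/net.py is copied next to the script and using illustrative hyperparameters (the real reference is test/discrete/test_drqn.py):

import gym
import torch
from tianshou.data import Collector, ReplayBuffer
from tianshou.policy import DQNPolicy
from net import Recurrent  # i.e. test/discrete/net.py copied next to this script

env = gym.make('CartPole-v0')
state_shape = env.observation_space.shape
action_shape = env.action_space.n

# step 2: a recurrent Q-network instead of the plain MLP
net = Recurrent(layer_num=3, state_shape=state_shape,
                action_shape=action_shape, device='cpu')
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(net, optim, discount_factor=0.9, estimation_step=3)

# step 1: give the training collector's buffer a stack_num, so each sampled
# transition carries its previous frames as a short sequence for the RNN
train_collector = Collector(policy, env, ReplayBuffer(size=20000, stack_num=4))
test_collector = Collector(policy, env)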

miriaford commented 4 years ago

Cool, thanks a lot!

Does it also work out of the box for PPO and SAC? Do you have any benchmark scripts/results for them?

Trinkle23897 commented 4 years ago

@miriaford I haven't tried an RNN with PPO or SAC, but you can be the first. Follow the changes between test_dqn.py and test_drqn.py and you can construct your first RNN-style PPO and SAC!

miriaford commented 4 years ago

Hi, I might have missed something obvious. Could you please educate me on:

  1. In these lines (https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L90-L91), why do you add .detach()? What happens if you don't?
  2. In the PPO policy (lines), why doesn't the critic support the RNN interface like the actor does?

Thanks for your help!

Trinkle23897 commented 4 years ago
  1. That's my fault: previously the step in the collector was not wrapped in with torch.no_grad(). I'll remove it.
  2. Good catch! When I implemented GAE, I did not think about RNNs. I'll upgrade the interface.
miriaford commented 4 years ago

This is great! Would love to check out the new interface.

Another quick question: https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L70-L79 If I understand this correctly, the [bsz, len, dim] input only occurs when learning the parameters; in all other cases [bsz, dim] is passed in, because evaluation runs the RNN one forward step at a time? Then what's the practical impact of a longer len vs. a shorter len for training?

Thanks again!

Trinkle23897 commented 4 years ago

Yes, you are right. The longer the length, the smaller the difference in RNN performance between training and testing.
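
For reference, a rough sketch of the shape handling being discussed (not the exact net.py code; layer sizes and names are made up): the forward accepts [bsz, dim] during rollout and [bsz, len, dim] during learning, and feeds both to the RNN as a sequence.

import torch
import torch.nn as nn

class RecurrentSketch(nn.Module):
    def __init__(self, obs_dim=4, hidden_dim=128, act_dim=2):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, state=None):
        if obs.dim() == 2:                   # rollout: [bsz, dim] -> [bsz, 1, dim]
            obs = obs.unsqueeze(1)
        out, hidden = self.gru(obs, state)   # learning: obs is [bsz, len, dim]
        return self.head(out[:, -1]), hidden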

miriaford commented 4 years ago

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

Trinkle23897 commented 4 years ago

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

Sure! I'm currently fixing issue #38. I'll probably resolve your issue today or tomorrow.

miriaford commented 4 years ago

Thanks for reopening the issue. I'd be happy to test it when you finish, thanks!

miriaford commented 4 years ago

Any updates? Sorry for pinging again. Thanks!

Trinkle23897 commented 4 years ago

I'm coding it.

Trinkle23897 commented 4 years ago
diff --git a/test/continuous/test_ppo.py b/test/continuous/test_ppo.py
index e94a917..c496de2 100644
--- a/test/continuous/test_ppo.py
+++ b/test/continuous/test_ppo.py
@@ -12,7 +12,7 @@ from tianshou.trainer import onpolicy_trainer
 from tianshou.data import Collector, ReplayBuffer

 if __name__ == '__main__':
-    from net import ActorProb, Critic
+    from net import RecurrentActorProb, RecurrentCritic
 else:  # pytest
     from test.continuous.net import ActorProb, Critic

@@ -71,11 +71,11 @@ def test_ppo(args=get_args()):
     train_envs.seed(args.seed)
     test_envs.seed(args.seed)
     # model
-    actor = ActorProb(
+    actor = RecurrentActorProb(
         args.layer_num, args.state_shape, args.action_shape,
         args.max_action, args.device
     ).to(args.device)
-    critic = Critic(
+    critic = RecurrentCritic(
         args.layer_num, args.state_shape, device=args.device
     ).to(args.device)
     optim = torch.optim.Adam(list(
@@ -95,7 +95,7 @@ def test_ppo(args=get_args()):
         gae_lambda=args.gae_lambda)
     # collector
     train_collector = Collector(
-        policy, train_envs, ReplayBuffer(args.buffer_size))
+        policy, train_envs, ReplayBuffer(args.buffer_size, stack_num=4))
     test_collector = Collector(policy, test_envs)
     # log
     log_path = os.path.join(args.logdir, args.task, 'ppo')

The code has been pushed.

Trinkle23897 commented 4 years ago

I'm not sure whether or not to support stacked actions for the Q-network. This would require changing some lines of code in every policy that needs a Q-network (DDPG, TD3, SAC). Alternatively, you can create the Q-network as (stacked-s, single-a) -> Q value (this is done in https://github.com/thu-ml/tianshou/blob/c2a7caf806ceabcbb18fbd59dcec644a704350fa/test/continuous/net.py#L119-L144). But for PPO, the current version is enough.

miriaford commented 4 years ago

Thanks!! Will take a look and let you know!

Trinkle23897 commented 4 years ago

I'm not sure whether or not to support stacked actions for the Q-network. This would require changing some lines of code in every policy that needs a Q-network (DDPG, TD3, SAC). Alternatively, you can create the Q-network as (stacked-s, single-a) -> Q value

Aha, I know how to support it (Q(stacked-s, stacked-a)) without modifying any part of the codebase. Add a gym.Wrapper that changes the observation to {'original_state': state, 'action': action}. That means instead of saving (s, a, s', r, d) in the replay buffer the traditional way, we now save ([s, a], a, [s', a'], r, d). Then, on the network side, extract the [s, a] pair from batch.state.
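
A minimal sketch of such a wrapper (the class name and the zero action used at reset are illustrative, and it assumes the old 4-tuple gym step API):

import gym
import numpy as np

class StateActionWrapper(gym.Wrapper):
    """Attach the previous action to every observation."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        zero_action = np.zeros(self.env.action_space.shape)
        return {'original_state': obs, 'action': zero_action}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return ({'original_state': obs, 'action': np.asarray(action)},
                reward, done, info)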

Arthur-Null commented 4 years ago

I'm sorry, but I am still confused about how the RNN works. Take the PPO policy as an example: it seems that the frame_stack option only takes effect in learn(), and the actor is not called in learn(). So the actor never actually gets a sequence as input?

Trinkle23897 commented 4 years ago

And the actor is not called in learn().

No, the actor is called in learn(). For example, https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/policy/modelfree/ppo.py#L149 self(b) calls self.forward(), which uses the actor network. (Remember that for a PyTorch module, if we define forward, we can call the module directly as nn(input).)
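
To make that PyTorch convention concrete (a hypothetical toy actor, not tianshou code):

import torch
import torch.nn as nn

class TinyActor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, obs, state=None):
        # state would carry the RNN hidden state in the recurrent case
        return self.fc(obs), state

actor = TinyActor()
logits, state = actor(torch.zeros(1, 4))  # calling the module invokes forward()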

ChengshuLi commented 4 years ago

@Trinkle23897 Thank you for adding the RNN support.

If I understand correctly, hidden states are currently only used during policy rollout. https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/data/collector.py#L256

But they are not used during "learning". https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/policy/modelfree/dqn.py#L145 https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/policy/modelfree/dqn.py#L76

I am wondering whether we should also store the hidden states in the replay buffer and use them during learning as well (e.g. when computing the TD error).

Some examples might be helpful here: https://github.com/astooke/rlpyt/blob/master/rlpyt/replays/sequence/n_step.py https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/storage.py

Thank you in advance for your help!

Trinkle23897 commented 4 years ago

@ChengshuLi Thanks! I'll have a look after my CSAPP midterm exam in three days...

Arthur-Null commented 4 years ago

Thank you for your response. I still have one question: it seems that a sequence of length n generates n subsequences when stacked (stack_num > n). Does that mean the time complexity for an RNN model to produce an output for every timestep in the sequence is O(n^2) rather than O(n)?

Trinkle23897 commented 4 years ago

Nope. Suppose we have a trajectory of length 100 and stack_num is 4.

In the evaluation (or sampling) phase, where the policy uses the previous hidden state to generate its action, at timestep t it computes pi(o_t, h_t) => (a_t, h_{t+1}). So in this case the RNN model's runtime over one sequence is O(n), because each step is O(1).

In the training phase (in the current codebase), the policy is trained on bootstrap-sampled examples. For instance, the sampled trajectories may look like this (with batch size 64):

0: [3, 4, 5, 6]
1: [67, 68, 69, 70]
2: [54, 55, 56, 57]
...
63: [5, 6, 7, 8]

Each of them has length 4. To generate the latest action, you run pi(o_0, h_0) => (a_0, h_1), pi(o_1, h_1) => (a_1, h_2), pi(o_2, h_2) => (a_2, h_3), pi(o_3, h_3) => (a_3, h_4), where h_0 is the saved hidden state (which defaults to zero). So computing an action with the RNN policy costs O(stack_num); it does not depend on n.
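
A sketch of that training-time unroll with a GRU (dimensions and the layer choice are illustrative, not tianshou internals):

import torch
import torch.nn as nn

bsz, stack_num, obs_dim, hidden_dim = 64, 4, 8, 128
obs = torch.randn(bsz, stack_num, obs_dim)   # [bsz, len, dim] sampled from the buffer
gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)

h0 = torch.zeros(1, bsz, hidden_dim)         # saved hidden state, defaults to zero
out, h_last = gru(obs, h0)                   # unrolls stack_num steps: O(stack_num)
latest_features = out[:, -1]                 # used to produce the latest action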

Arthur-Null commented 4 years ago

I was actually considering the case n = stack_num, because I am dealing with an environment where the episode length is fixed. In this case, say we have a sequence of length 5: [1, 2, 3, 4, 5]; a generated batch could be:

0: [1, 1, 1, 1, 1]
1: [1, 1, 1, 1, 2]
2: [1, 1, 1, 2, 3]
3: [1, 1, 2, 3, 4]
4: [1, 2, 3, 4, 5]

If I understand correctly, 25 RNN steps are needed to generate all 5 actions for training in the current implementation. However, there is a lot of duplicated computation here, since only the last sequence is needed to get all the actions.

Please correct me if I'm wrong, as I am considering including seqlen in the batch and using PyTorch's PackedSequence to rewrite this part.

Trinkle23897 commented 4 years ago

Yes, the current code produces the behavior you describe. A simple solution is to add an argument to ReplayBuffer (changing the buffer.sample method) that lets it choose only the available indices. I'll add this feature today.

Trinkle23897 commented 4 years ago

@Arthur-Null it should be okay now. For your example, if you turn on sample_avail in ReplayBuffer, it will only get the last sample: 4: [1, 2, 3, 4, 5].

Trinkle23897 commented 4 years ago

@ChengshuLi when you are in policy.learn, the hidden state is stored in batch.policy._state, whose shape is [batch_size, stack_num, ...]. In other words, batch.policy._state[:, 0] is h_0. I would like to modify this behavior further, e.g. self(batch) currently lacks the hidden state, but that has to wait for the advanced slicing method in Batch (batch[:, i]).

Arthur-Null commented 4 years ago

I think there is a bug in the implementation of the framestack option. https://github.com/thu-ml/tianshou/blob/47e8e2686cde0cb54bf1bbdcefa2046b0c5e0392/tianshou/data/batch.py#L544 Note that if you apply symmetric_difference sequentially over a list of sets, you get the elements that appear in an odd number of sets. As a result, if stack_num is set to an odd number, keys_shared may end up with the same value as keys_partial, which causes an error. To my understanding, it should be changed to keys_partial = set.union(*keys_map) - keys_share.
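
A quick illustration of that set-operation pitfall (the keys here are made up):

from functools import reduce

keys_map = [{'obs', 'act'}, {'obs'}, {'obs'}]          # e.g. three stacked batches
keys_share = set.intersection(*keys_map)               # {'obs'}

chained = reduce(set.symmetric_difference, keys_map)   # {'obs', 'act'}: odd-count keys
partial = set.union(*keys_map) - keys_share            # {'act'}: the intended "partial" keys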

Also, it seems that if sample_avail is set, the batch returned by buffer.sample() does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

duburcqa commented 4 years ago

I think here is a bug in the implementation of the framestack option.

This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.

Also, it seems that if sample_avail is set, the batch returned by buffer.sample() does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

I can't help you on this point. @Trinkle23897 ?

Trinkle23897 commented 4 years ago

I think there is a bug in the implementation of the framestack option.

This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.

Should we merge the essential commit from dev to master at this point?

Trinkle23897 commented 4 years ago

Also, it seems that if sample_avail is set, the batch returned by buffer.sample() does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

What's your desired feature? Could you please provide a simple example?

Arthur-Null commented 4 years ago

I am trying to reach O(n) complexity in policy.learn(). In the current implementation, when sample_avail is not set, O(n) can't be reached. When sample_avail is set, policy.learn() only gets part of the rewards, which makes it impossible to calculate returns.

Arthur-Null commented 4 years ago

It is possible to use the original replay buffer without framestack and sample_avail and leave all the dirty work to the policy. I wonder if there is a better way?

Trinkle23897 commented 4 years ago

Do you mean sampling only the available indices with an on-policy method for RNN usage? I haven't figured it out clearly, so let's take an example:

(This is a buffer)
Index 0 1 2 3 4 5 6 7 8 9 a b c d e f
Done  0 0 1 0 0 0 1 0 0 0 1 0 1 / / /

If I understand correctly, you want to extract only [0-2], [3-6], [7-a], [b-c] without overlap. What I'm not sure about is the data format you expect (and "O(n)" is ambiguous to me). buffer.sample(0) can do something like this, since it extracts all of the data, but only once; I think that's not what you want.

Arthur-Null commented 4 years ago

Let me state my question more clearly and thoroughly. To me, the frame_stack solution has two problems:

  1. As mentioned before, some redundant computation is involved. To calculate the action distribution for the sequence [0, 1, 2], we have to run the RNN on at least three sequences: [0, 0, 0], [0, 0, 1], [0, 1, 2]. (This is what I mean by O(n^2); running RNN([0, 1, 2]) only once is O(n).)
  2. Sequences like [0, 0, 0] introduce bias, since we always take the last output of the RNN as the final output.

Setting sample_avail to True solves the first problem to some extent. However, sequences like [b, c, c] may still exist, so the second problem remains. Also, in the current implementation, only the reward at the end of an episode is returned by buf.sample(), which makes training impossible.

The data format I am expecting is something like PyTorch's PackedSequence, which includes the sequences and their lengths. It seems a bit complicated to implement this in the replay buffer, so I am now implementing it in the process_fn of my policy.
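
For reference, a sketch of what that format looks like with PyTorch's utilities (episode contents and dimensions are made up):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

episodes = [torch.randn(5, 8), torch.randn(3, 8)]      # two episodes, obs_dim = 8
lengths = torch.tensor([e.size(0) for e in episodes])  # [5, 3], sorted descending
padded = pad_sequence(episodes, batch_first=True)      # shape [2, 5, 8]
packed = pack_padded_sequence(padded, lengths, batch_first=True)

gru = torch.nn.GRU(8, 128, batch_first=True)
out, h_n = gru(packed)   # one pass over each episode: O(total timesteps)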

Trinkle23897 commented 4 years ago

So you want a new sample method: you specify the number of episodes you want, and it returns the sampled episodes, for example [[7-a], [3-6], [7-a], [b-c]] with episode_size=4. Am I right?

Arthur-Null commented 4 years ago

Yes

Trinkle23897 commented 4 years ago

Okay, I see. I think the simplest way is to implement it on the buffer side. Since we can easily maintain the done flag, we can do something like:

# begin_index stores the start index of every complete episode, e.g. [0, 3, 7, b];
# we also maintain the matching episode lengths, in the above case [3, 4, 4, 2]
episode_index = np.random.choice(begin_index, episode_size)  # assume we get [b, 7, 3, 7]
# sort by key=-episode_length to mimic torch's PackedSequence;
# after sorting, episode_index is [7, 7, 3, b]
data, seq_length = [], []
while not self._meta.done[episode_index][0]:  # the first episode is the longest
    # find the episodes that have not terminated yet
    available_episode_index = episode_index[self._meta.done[episode_index] == 0]
    data.append(self[available_episode_index])  # should set stack_num=0
    seq_length.append(len(available_episode_index))
    # advance the cursor of every episode that has not terminated
    episode_index += self._meta.done[episode_index] == 0
data = Batch.cat(data)
# now (data, seq_length) is something like a PackedSequence

I think maintaining begin_index and episode_length is not a big problem. It is very similar to my implementation of sample_avail.

I'd prefer to add a new argument to the Collector's initialization to indicate whether to use this mode.

duburcqa commented 4 years ago

Should we merge the essential commit from dev to master at this point?

Yes, I think it is appropriate.

Trinkle23897 commented 4 years ago

Should we merge the essential commit from dev to master at this point?

Yes, I think it is appropriate.

But should we release a new version like 0.2.4.rc1, since we want to keep master stable?

duburcqa commented 4 years ago

But should we release a new version like 0.2.4.rc1, since we want to keep master stable?

It's up to you; it's really a personal choice. For a Python-only project, I tend to prefer releasing patches like 0.2.4.rc1 for "critical" bug fixes, but there is no clear answer. I would say the minimum is to push them to master, not necessarily to create an actual release.

heartInsert commented 3 years ago

Has anyone tried a Transformer before? In NLP, Transformers seem to perform better than RNN units.

Trinkle23897 commented 3 years ago

But in RL, the bigger the model, the harder it is for it to converge to an optimal policy. :(

heartInsert commented 3 years ago

OK, I get it.

doylech commented 3 years ago

Is there a reason why, in the DQNPolicy.learn method, you don't pass an initial hidden state to the forward method during training? Wouldn't that let it train from better predictions?

From Recurrent Experience Replay in Distributed Reinforcement Learning by Kapturowski et al. in 2018 (https://openreview.net/pdf?id=r1lyTjAqYX)

...although zero state initialization was often used in previous works (Hausknecht & Stone, 2015; Gruslys et al., 2018), we have found that it leads to misestimated action-values, especially in the early states of replayed sequence

Trinkle23897 commented 3 years ago

No. As you mentioned, zero-state initialization was often used in previous works, and here we mimic that workflow.

cbellinger27 commented 2 years ago

Is it possible to use the recurrent network in the discrete soft actor critic?