Oh yes, in fact it already supports the RNN policy, but I will write a small demonstration script after finishing the documentation. Thanks for pushing me.
Thanks! Do you mind quickly posting a snippet here and explaining how to use the RNN policy? I need to run something ASAP, thanks a lot!
Actually, my experiments show that an RNN policy is nearly always inferior to an MLP policy in robotic control domains. Besides, RNNs suffer from slow training speed :)
It varies from case to case, but I agree that RNN might not be the best option. I do need it as a baseline though, so will still appreciate it if someone can explain how to run RNN policy with Tianshou. Thanks!
@miriaford I uploaded a script in test/discrete/test_drqn.py. What I have changed in the codebase:
How to use:
pip3 install git+https://github.com/thu-ml/tianshou.git@master
That's all. If you find anything wrong, please give me feedback.
Cool, thanks a lot!
Does it also work out of the box for PPO and SAC? Do you have any benchmark scripts/results for them?
@miriaford I haven't tried RNNs with PPO or SAC, but you can be the first. Follow the changes between test_dqn.py and test_drqn.py and you can construct your first RNN-style PPO and SAC!
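For orientation, the gist of the DQN-to-DRQN changes is roughly the sketch below. `Recurrent`, `env`, and `train_envs` are placeholders here, not the exact names in the test script; the only hard requirement is that the network follows the `forward(obs, state) -> (logits, new_state)` convention.

```python
import torch
from tianshou.policy import DQNPolicy
from tianshou.data import Collector, ReplayBuffer

# `Recurrent`, `env`, `train_envs` are stand-ins for the objects built in the test script.
net = Recurrent(layer_num=3,
                state_shape=env.observation_space.shape,
                action_shape=env.action_space.n,
                device='cpu')
optim = torch.optim.Adam(net.parameters(), lr=1e-3)
policy = DQNPolicy(net, optim, discount_factor=0.9, estimation_step=3)
# The key difference from the MLP version: stack_num makes the buffer return
# observation sequences of shape [bsz, stack_num, ...], which the RNN consumes.
train_collector = Collector(policy, train_envs, ReplayBuffer(20000, stack_num=4))
```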
Hi, I might have missed something obvious. Could you please educate me on:
1. Why use .detach()? What happens if you don't?
2. Will critic support an RNN interface like actor?
Thanks for your help!
The .detach() can be replaced by with torch.no_grad(). I'll remove it.
This is great! Would love to check out the new interface.
Another quick question:
https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L70-L79
If I understand this correctly, the [bsz, len, dim] input will only occur when learning the parameters; in the other cases, [bsz, dim] will be passed in, because evaluation runs the RNN one forward step at a time. Then what's the practical impact of a longer len vs. a shorter len for training?
Thanks again!
Yes, you are right. The longer the length, the smaller the difference in RNN performance between training and testing.
Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!
Sure! I'm currently fixing issue #38. Maybe today or tomorrow I'll resolve your issue.
Thanks for reopening the issue. I'd be happy to test it when you finish, thanks!
Any updates? Sorry for pinging again. Thanks!
I'm coding it.
```diff
diff --git a/test/continuous/test_ppo.py b/test/continuous/test_ppo.py
index e94a917..c496de2 100644
--- a/test/continuous/test_ppo.py
+++ b/test/continuous/test_ppo.py
@@ -12,7 +12,7 @@ from tianshou.trainer import onpolicy_trainer
 from tianshou.data import Collector, ReplayBuffer
 if __name__ == '__main__':
-    from net import ActorProb, Critic
+    from net import RecurrentActorProb, RecurrentCritic
 else: # pytest
     from test.continuous.net import ActorProb, Critic
@@ -71,11 +71,11 @@ def test_ppo(args=get_args()):
     train_envs.seed(args.seed)
     test_envs.seed(args.seed)
     # model
-    actor = ActorProb(
+    actor = RecurrentActorProb(
         args.layer_num, args.state_shape, args.action_shape,
         args.max_action, args.device
     ).to(args.device)
-    critic = Critic(
+    critic = RecurrentCritic(
         args.layer_num, args.state_shape, device=args.device
     ).to(args.device)
     optim = torch.optim.Adam(list(
@@ -95,7 +95,7 @@ def test_ppo(args=get_args()):
         gae_lambda=args.gae_lambda)
     # collector
     train_collector = Collector(
-        policy, train_envs, ReplayBuffer(args.buffer_size))
+        policy, train_envs, ReplayBuffer(args.buffer_size, stack_num=4))
     test_collector = Collector(policy, test_envs)
     # log
     log_path = os.path.join(args.logdir, args.task, 'ppo')
```
The code has been pushed.
I'm not sure whether or not to support stacked actions for the Q-network. This would need changes to some lines of code in every policy that uses a Q-network (DDPG, TD3, SAC). Or you can create the Q-network as (stacked-s, single-a) -> Q value (this is in https://github.com/thu-ml/tianshou/blob/c2a7caf806ceabcbb18fbd59dcec644a704350fa/test/continuous/net.py#L119-L144). But for PPO, the current version is enough.
Thanks!! Will take a look and let you know!
I'm not sure whether or not to support stacked actions for the Q-network. This would need changes to some lines of code in every policy that uses a Q-network (DDPG, TD3, SAC). Or you can create the Q-network as (stacked-s, single-a) -> Q value
Aha, I know how to support it (Q(stacked-s, stacked-a)) without modifying any part of the codebase. Try adding a gym.Wrapper which modifies the state into {'original_state': state, 'action': action}. That means that in the traditional way we save (s, a, s', r, d) in the replay buffer, but now we save ([s, a], a, [s', a'], r, d) in the buffer. And on your network side, extract the [s, a] pair from batch.state.
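A minimal sketch of such a wrapper, assuming the old gym step API (obs, reward, done, info); the class name and the zero action used at reset are made up for illustration:

```python
import gym
import numpy as np

class StateActionWrapper(gym.Wrapper):
    """Bundle the previous action into the observation, so the buffer
    effectively stores ([s, a], a, [s', a'], r, d)."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        zero_act = np.zeros(self.env.action_space.shape, dtype=np.float32)
        return {'original_state': obs, 'action': zero_act}

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        obs = {'original_state': obs, 'action': np.asarray(action, dtype=np.float32)}
        return obs, rew, done, info
```

The network side can then read the stacked 'original_state' and 'action' entries from the stored observation, as described above.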
I'm sorry, but I am still confused about how the RNN works. Take the PPO policy as an example: it seems that the frame_stack option only takes effect in learn(). And actor is not called in learn(). So the actor won't actually get a sequence as input?
And actor is not called in learn().
No, the actor is called in learn(). For example, in
https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/policy/modelfree/ppo.py#L149
self(b) calls self.forward() and uses the actor network. (Remember that for a PyTorch network, if we define forward on an nn.Module, we can call the module directly as module(input).)
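(A tiny illustration of that last point, not tied to Tianshou:)

```python
import torch
import torch.nn as nn

class Doubler(nn.Module):
    def forward(self, x):
        return 2 * x

net = Doubler()
print(net(torch.tensor(3.0)))  # __call__ dispatches to forward, so this prints tensor(6.)
```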
@Trinkle23897 Thank you for adding the RNN support.
If I understand correctly, currently hidden states are only used during policy rollout. https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/data/collector.py#L256
But they are not used during "learning". https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/policy/modelfree/dqn.py#L145 https://github.com/thu-ml/tianshou/blob/52be533d062f0f3565b6c0cd503e4d9199532e9e/tianshou/policy/modelfree/dqn.py#L76
I am wondering whether we should also store the hidden states into the replay buffer and use them during learning as well (e.g. computing TD error).
Some examples might be helpful here: https://github.com/astooke/rlpyt/blob/master/rlpyt/replays/sequence/n_step.py https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/storage.py
Thank you in advance for your help!
@ChengshuLi Thanks! I'll have a look after the CSAPP midterm exam three days from now...
Thank you for your response. I still have one question: it seems that a sequence of length n would generate n subsequences when stacked (stack_num > n). Does that mean that the time complexity for an RNN model to get an output for each timestep of the sequence is O(n^2) rather than O(n)?
Nope. Suppose we have a trajectory whose length is 100, and stack_num is equal to 4.
In the evaluation (or sampling) phase, where the policy uses the previous hidden state to generate its action, at timestep t it computes pi(o_t, h_t) => (a_t, h_{t+1}). So, in this case, the RNN model's runtime complexity over one sequence is O(n), because each step is O(1).
In the training phase, (in the current codebase) the policy is trained on bootstrap-sampled examples. For instance, the sampled trajectories may look like this (the batch size is 64):
0: [3, 4, 5, 6]
1: [67, 68, 69, 70]
2: [54, 55, 56, 57]
...
63: [5, 6, 7, 8]
Each of them has length = 4. If you want to generate the latest action, you should run pi(o_0, h_0) => (a_0, h_1), pi(o_1, h_1) => (a_1, h_2), pi(o_2, h_2) => (a_2, h_3), pi(o_3, h_3) => (a_3, h_4), where h_0 is the saved hidden state (it defaults to zero). Thus, in this phase, computing an action with an RNN policy costs O(stack_num). It is not related to n.
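Concretely, an RNN network that consumes such stacked observations can be written along these lines. This is a minimal sketch, not Tianshou's actual net code; the shapes follow the discussion above, and forward(obs, state) returns (logits, new_state):

```python
import torch
import torch.nn as nn

class SketchRecurrentActor(nn.Module):
    """Consumes stacked observations [bsz, stack_num, obs_dim] (or [bsz, obs_dim]
    during rollout) plus an optional hidden state; returns logits for the last frame."""

    def __init__(self, obs_dim, act_dim, hidden_size=128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, act_dim)

    def forward(self, obs, state=None):
        if obs.dim() == 2:              # rollout: a single frame, treat as stack_num == 1
            obs = obs.unsqueeze(1)
        out, h = self.rnn(obs, state)   # unrolls stack_num RNN steps: O(stack_num)
        logits = self.head(out[:, -1])  # only the last step produces the action
        return logits, h
```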
I was actually considering the case where n = stack_num, because I am dealing with an environment where the episode length is fixed. In this case, say we have a sequence of length 5: [1, 2, 3, 4, 5]. A generated batch could be:
0: [1, 1, 1, 1, 1]
1: [1, 1, 1, 1, 2]
2: [1, 1, 1, 2, 3]
3: [1, 1, 2, 3, 4]
4: [1, 2, 3, 4, 5]
If I understand correctly, 25 RNN steps are needed to generate all 5 actions for training in the current implementation. However, there is a lot of duplicated computation here, since only the last sequence is needed to get all the actions.
Please correct me if I'm wrong, as I am considering including seqlen in the batch and using PyTorch's PackedSequence to rewrite this part.
Yes, the current code will produce this behavior as mentioned.
A simple solution is to add an argument to ReplayBuffer (changing the buffer.sample method) that lets it choose only the available indices. I'll add this feature today.
@Arthur-Null it should be okay now. For your example, if you turn on sample_avail in ReplayBuffer, it will only get the last sample: 4: [1, 2, 3, 4, 5].
@ChengshuLi when you are in policy.learn, the hidden state is stored in batch.policy._state, where its shape is [batch_size, stack_num, ...]. In other words, batch.policy._state[:, 0] is h_0.
I would like to further modify this behavior, e.g. self(batch) currently lacks the hidden state, but this should wait for the advanced slicing method in Batch (batch[:, i]).
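So, if you want to condition learning on the stored hidden state rather than zeros, something along these lines should work inside learn(). This is a hedged sketch: compute_q_with_saved_state is a made-up helper, and it only assumes the batch.policy._state layout described above plus the forward(obs, state) -> (logits, new_state) convention:

```python
import torch

def compute_q_with_saved_state(model, batch):
    """Use the hidden state recorded at rollout time instead of a zero state."""
    # hidden state saved just before the first stacked frame, shape [batch_size, ...]
    h0 = torch.as_tensor(batch.policy._state[:, 0])
    logits, _ = model(batch.obs, state=h0)
    return logits
```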
I think there is a bug in the implementation of the framestack option. https://github.com/thu-ml/tianshou/blob/47e8e2686cde0cb54bf1bbdcefa2046b0c5e0392/tianshou/data/batch.py#L544 Note that if you apply symmetric_difference sequentially to a list of sets, the elements that exist in an odd number of sets are returned. As a result, if stack_num is set to an odd number, keys_shared may have the same value as keys_partial, which would cause an error. To my understanding, it should be changed to keys_partial = set.union(*keys_map) - keys_share.
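A quick illustration of the odd-count behavior, with three made-up key sets:

```python
# stack_num = 3 (odd): three per-frame key sets
keys_map = [{'obs', 'act'}, {'obs', 'rew'}, {'obs', 'done'}]

s = set()
for k in keys_map:
    s = s.symmetric_difference(k)
# 'obs' appears in all three sets (an odd count), so it wrongly survives:
print(sorted(s))  # ['act', 'done', 'obs', 'rew']

# the intended "keys missing from at least one frame":
print(sorted(set.union(*keys_map) - set.intersection(*keys_map)))  # ['act', 'done', 'rew']
```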
Also, it seems that if sample_avail is set, the batch that buffer.sample() returns does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?
I think there is a bug in the implementation of the framestack option.
This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.
Also, it seems that if sample_avail is set, the batch that buffer.sample() returns does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?
I can't help you on this point. @Trinkle23897 ?
I think there is a bug in the implementation of the framestack option.
This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.
Should we merge the essential commit from dev to master at this point?
Also, it seems that if sample_avail is set, the batch that buffer.sample() returns does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?
What's your desired feature? Could you please provide a simple example?
I am trying to reach O(n) complexity in policy.learn(). In the current implementation, when sample_avail is not set, O(n) can't be reached. When sample_avail is set, policy.learn() can only get part of the rewards, which makes it impossible to calculate the returns.
It is possible to use the original ReplayBuffer without frame stacking and sample_avail and leave all the dirty work to the policy. I wonder whether there is a better way?
Do you mean sampling only the available indices with an on-policy method for RNN usage? I haven't figured it out clearly, so let's take an example:
(This is a buffer)
Index 0 1 2 3 4 5 6 7 8 9 a b c d e f
Done 0 0 1 0 0 0 1 0 0 0 1 0 1 / / /
If I understand correctly, you want to extract only [0-2], [3-6], [7-a], [b-c] without overlap. What I'm not sure about is the data format that you expect (and, for me, "O(n)" is ambiguous).
buffer.sample(0) can do something like this, since it will extract all of the data, but only once. I think that is not what you want.
Let me state my question more clearly and thoroughly. To me, the frame_stack solution has two problems:
By setting sample_avail to True, the first problem is solved to some extent. However, sequences like [b, c, c] may still exist, so the second problem is still there. Also, in the current implementation, only the reward at the end of an episode will be returned by buf.sample(), which makes training impossible.
The data format I am expecting is something like PyTorch's PackedSequence, which includes the sequences and the lengths of the sequences. It seems a bit complicated to implement this in the replay buffer. I am now implementing it in the process_fn of my policy.
So you want a new sample method: it takes the number of episodes you want and returns the sampled episodes, for example [[7-a], [3-6], [7-a], [b-c]] with episode_size=4. Am I right?
Yes
Okay, I see. I think the simplest way is to implement it on the buffer side. Since we can easily maintain the done flag, we can do something like:
```python
# begin_index stores the start index of each episode (if available), e.g., [0, 3, 7, b]
episode_index = np.random.choice(begin_index, episode_size)  # assume we get [b, 7, 3, 7]
# do a sort with key=-episode_length, since we want to mimic torch's PackedSequence;
# after that, episode_index is [7, 7, 3, b]
# (we should also maintain a list of episode_length, in the above case: [3, 4, 4, 2])
data, seq_length = [], []
active = np.ones(len(episode_index), dtype=bool)  # which episodes still have steps left
while active.any():
    # the episodes that haven't finished contribute one more step each
    available_episode_index = episode_index[active]
    data.append(self[available_episode_index])  # should set stack_num=0
    seq_length.append(len(available_episode_index))
    # an episode stops only after its terminal (done == 1) step has been appended
    active[active] = self._meta.done[available_episode_index] == 0
    episode_index += active  # advance only the episodes that are still running
data = Batch.cat(data)
# for now, (data, seq_length) is something like a PackedSequence
```
I think maintaining begin_index and episode_length is not a big problem. It is very similar to my implementation of sample_avail.
I prefer to add a new argument in the initialization of Collector, to indicate whether to use this mode.
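For reference, here is a hedged sketch of how the (data, seq_length) pair above maps onto PyTorch's packed-sequence utilities; the tensors and dimensions below are made up for illustration:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# pretend the four sampled episodes were padded to the longest length:
# obs has shape [episode_size, max_len, obs_dim]; lengths are the true episode lengths
obs = torch.randn(4, 4, 3)
lengths = torch.tensor([4, 4, 4, 2])  # already sorted in decreasing order

packed = pack_padded_sequence(obs, lengths, batch_first=True, enforce_sorted=True)
rnn = torch.nn.GRU(input_size=3, hidden_size=16, batch_first=True)
packed_out, h_n = rnn(packed)   # the GRU skips the padded steps
# packed.batch_sizes plays the same role as seq_length in the buffer sketch above
```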
Should we merge the essential commit from dev to master at this point?
Yes I think it is appropriate.
Should we merge the essential commit from dev to master at this point?
Yes I think it is appropriate.
But should we release a new version like 0.2.4.rc1? Since we want to keep master stable.
But should we release a new version like 0.2.4.rc1? Since we want to keep master stable.
It's up to you. It's really a personal choice. For a Python-only project, I tend to prefer releasing patches like 0.2.4.rc1 in the case of "critical" bug fixes. But there is no clear answer. I would say the minimum is to push them to master, but not necessarily to create an actual release.
Did someone try a Transformer before? In the NLP area, Transformers seem to perform better than RNN units.
But in RL, the bigger the model, the harder it is for it to converge to an optimal policy. :(
OK, I get it.
Is there a reason why, in the DQNPolicy.learn method, you don't pass an initial hidden state to the forward method during training? Wouldn't this allow it to train from better predictions?
From Recurrent Experience Replay in Distributed Reinforcement Learning by Kapturowski et al. in 2018 (https://openreview.net/pdf?id=r1lyTjAqYX)
...although zero state initialization was often used in previous works (Hausknecht & Stone, 2015; Gruslys et al., 2018), we have found that it leads to misestimated action-values, especially in the early states of replayed sequence
No. As you mentioned, "zero state initialization was often used in previous works", and here we mimic that workflow.
Is it possible to use a recurrent network in the discrete soft actor-critic?
I see on the README that RNN support is on your TODO list. However, the module API seems to support RNNs (the forward(obs, state) method). Could you please provide some examples of how to train an RNN policy? Thanks!