Closed ericl closed 3 years ago
Currently working on this (for some reason I can't assign the issue to myself?)
Yeah you need to be part of the org, but it's probably ok - if you open a PR people will know
For todo (1), "Wire up rnn in/out to the models of the DQN policy graph", isn't this just specifying "use_lstm": True in the model config?
Currently, the "use_lstm": True option isn't supported by DQN, so a bit of work needs to be done to allow using it.
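For reference, enabling the LSTM wrapper in RLlib is just a model-config flag. A minimal sketch (the auxiliary keys such as "max_seq_len" and "lstm_cell_size" are the common RLlib model options and may differ by Ray version):

```python
# Sketch of an RLlib model config enabling the LSTM wrapper.
# Exact option names may vary across Ray versions.
config = {
    "model": {
        "use_lstm": True,       # wrap the base model's output with an LSTM
        "max_seq_len": 20,      # truncated-BPTT sequence length
        "lstm_cell_size": 256,  # LSTM hidden state size
    },
}
```

The point of this issue is that DQN's loss and replay path would need to thread the recurrent state through before this flag does anything for DQN.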
any progress here?
Ah not yet; no one is currently working on this. Open to contributions!
Looking at the results in the paper, there is a surprising performance gap between full recurrent R2D2 and a feed-forward version, considering the fact that most Atari envs are essentially MDPs. They claim in the paper that on these envs the recurrent agent "learns better representations".
Could it be partly because the LSTM agent simply has a lot more parameters? They never mention whether they keep the number of weights the same or just remove the LSTM layer. Besides, they say that the LSTM agent receives the previous action and reward as input to the RNN layer, but they never mention whether the feed-forward agent gets the same treatment.
Finally, citing the paper: [RNN] improves performance even on domains that are fully observable and do not obviously require memory (cf. BREAKOUT results in the feed-forward ablation). But if you look at the full results on page 18, their recurrent agent is actually worse than previous SOTA feed-forward on Breakout!
I wonder if their claims that LSTM is the main performance factor are actually correct.
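To put rough numbers on the parameter-count point: a single LSTM layer carries roughly 8x the weights of a same-width dense layer, because each of its four gates has both input and recurrent weight matrices. A back-of-the-envelope sketch (the 512-wide layer sizes are illustrative, not taken from the paper):

```python
def lstm_params(n_in, n_hidden):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

def dense_params(n_in, n_out):
    # Weights plus biases of a fully connected layer.
    return n_in * n_out + n_out

# Illustrative 512-wide layer on a 512-dim encoder output:
print(lstm_params(512, 512))   # -> 2099200
print(dense_params(512, 512))  # -> 262656
```

So unless the ablation matches total parameter count, "better representations" and "more capacity" are confounded.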
Is there a good reason why R2D2 should be a Q-learning algorithm? I am approaching this from the viewpoint of someone wanting the most sample-efficiency out of my algorithms. I understand that experience replay does not tend to go well with actor-critic algorithms, for the reason that experience gathered from policy-based algorithms tend to get stale quickly.
However, there exist fixes for this such as soft actor-critic or ACER, though I am not sure how competitive the latter is at present; it was already barely competitive with prioritised DDQN when it first came out. As for the former, I still don't understand the differences between energy-based policies, PGQL, and, for instance, normalised actor-critic, which seem to repackage the same idea in various forms. I have yet to understand their relative advantages and shortcomings.
It is also quite odd that the average experience utilisation of R2D2 is 80%, while Ape-X's ratio is something like 130%. I wish there were more information about the distribution of rollouts consumed: for instance, whether prioritisation causes certain rollouts to be consumed tens or hundreds of times, which would mean that a majority of rollouts are basically tossed out and never seen by the optimisation loop, but also that prioritisation is (perhaps) doing its job.
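The "many rollouts never seen" suspicion is easy to check in a toy simulation: draw priority-proportional samples from a buffer and count items that are never drawn. This sketch is illustrative only (the buffer size, Pareto priority distribution, and the 0.8 draw ratio are all made up to mirror the 80% figure):

```python
import random

random.seed(0)
buffer_size = 10_000
# Heavy-tailed priorities: a few transitions dominate sampling.
priorities = [random.paretovariate(1.5) for _ in range(buffer_size)]

# Draw as many samples as an 80% utilisation ratio would imply.
draws = random.choices(range(buffer_size), weights=priorities,
                       k=int(0.8 * buffer_size))

counts = {}
for i in draws:
    counts[i] = counts.get(i, 0) + 1

never_seen = buffer_size - len(counts)   # items sampled zero times
most_reused = max(counts.values())       # hottest item's sample count
print(never_seen, most_reused)
```

Even with uniform sampling a large fraction would go unseen (roughly e^-0.8 of the buffer); skewed priorities make the concentration much worse.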
Is this issue still open? If so, can I work on it?
Hey @chandramoulirajagopalan ! Sorry for the late response. Writing a short design doc right now. Will get this work done this quarter.
@sven1977 Will you work on this? Or can I also help to implement R2D2 once you draft the design documentation?
Hi!
I'm trying to build my project on top of your R2D2 implementation. I found that the replay buffer is still applying importance weighting, but aren't we supposed to use the unprioritized replay buffer version? Btw, what is the current status of the prioritized version of R2D2? The code seems pretty much ready for it; what am I missing?
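For context, the importance weighting in question is the standard correction from prioritized experience replay, w_i = (N * P(i))^(-beta), which is a no-op under uniform priorities. A minimal sketch (the function name and signature are mine, not RLlib's):

```python
def is_weight(priority, total_priority, buffer_size, beta):
    """Importance-sampling weight for prioritized replay:
    P(i) = p_i / sum_j p_j,  w_i = (N * P(i)) ** -beta."""
    p_i = priority / total_priority
    return (buffer_size * p_i) ** -beta

# With uniform priorities the correction is a no-op (weight 1.0):
print(is_weight(1.0, 100.0, 100, beta=0.4))  # -> 1.0
```

So keeping the weighting code around is harmless in the unprioritized case, which may be why it is still wired in.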
And what about n-step rollouts? I looked at the assertion in validate_config, set n_steps=2, and it trained successfully on CartPole.
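For what it's worth, the n-step target itself is straightforward; with two steps it is r_t + gamma*r_{t+1} + gamma^2 * max_a Q(s_{t+2}, a). A hedged illustration of that computation (not RLlib's actual code path):

```python
def n_step_target(rewards, gamma, bootstrap_value):
    # rewards: the n rewards along the rollout segment.
    # bootstrap_value: e.g. max_a Q(s_{t+n}, a) from the target network.
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 2-step example: 1 + 0.9*1 + 0.9**2 * 10 = 10.0
print(n_step_target([1.0, 1.0], 0.9, 10.0))
```

The subtlety for R2D2 is not the return itself but aligning the n-step offsets with the stored recurrent-state sequences, which is presumably what the assertion guards against.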
Hmm, the original version is now released, so I will close the issue. Please open issues (or pull requests!) if you have any new asks.
Describe the problem
The results for R2D2 are quite good: https://openreview.net/forum?id=r1lyTjAqYX
We should add this as a variant of Ape-X DQN that supports recurrent networks. The high-level changes would include: