ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Implement R2D2: Recurrent Experience Replay in Distributed Reinforcement Learning #3148

Closed: ericl closed this issue 3 years ago

ericl commented 5 years ago

Describe the problem

The results for R2D2 are quite good: https://openreview.net/forum?id=r1lyTjAqYX

We should add this as a variant of Ape-X DQN that supports recurrent networks. The high-level changes would include:

alvkao58 commented 5 years ago

Currently working on this (for some reason I can't assign the issue to myself?)

richardliaw commented 5 years ago

Yeah you need to be part of the org, but it's probably ok - if you open a PR people will know

daylen commented 5 years ago

For todo (1), "Wire up rnn in/out to the models of the DQN policy graph", isn't this just specifying "use_lstm": True in the model config?

alvkao58 commented 5 years ago

Currently, the "use_lstm": True option isn't supported by DQN, so a bit of work needs to be done to allow using it.
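
For reference, this is the model-config flag being discussed; a minimal dict-style sketch of how it is normally enabled in RLlib (illustrative only, since DQN did not honor these keys at the time of this comment):

```python
import ray
from ray import tune

# Minimal sketch: "use_lstm" / "lstm_cell_size" / "max_seq_len" are the standard
# RLlib model-config keys for wrapping a model with an LSTM. Whether DQN honors
# them is exactly the gap discussed in this thread, so treat this as illustrative.
config = {
    "env": "CartPole-v1",
    "num_workers": 2,
    "model": {
        "use_lstm": True,       # wrap the default model with an LSTM
        "lstm_cell_size": 256,  # hidden-state size
        "max_seq_len": 20,      # length of the sequences fed to the RNN
    },
}

if __name__ == "__main__":
    ray.init()
    tune.run("DQN", config=config, stop={"training_iteration": 5})
```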

byptedance commented 5 years ago

any progress here?

richardliaw commented 5 years ago

Ah not yet; no one is currently working on this. Open to contributions!

alex-petrenko commented 5 years ago

Looking at the results in the paper, there is a surprising performance gap between the fully recurrent R2D2 and the feed-forward version, considering that most Atari envs are essentially MDPs. They claim in the paper that on these envs the recurrent agent "learns better representations".

Could it be partly because the LSTM agent simply has many more parameters? They never mention whether they keep the number of weights the same or just remove the LSTM layer. They also say that the LSTM agent receives the previous action and reward as input to the RNN layer, but never mention whether the feed-forward agent gets the same treatment.

Finally, quoting the paper: "[RNN] improves performance even on domains that are fully observable and do not obviously require memory (cf. BREAKOUT results in the feed-forward ablation)." But if you look at the full results on page 18, their recurrent agent is actually worse on Breakout than the previous feed-forward SOTA!

I wonder if their claims that LSTM is the main performance factor are actually correct.
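
On the parameter-count question, a quick back-of-the-envelope comparison (with assumed layer sizes, not the paper's exact architecture) shows how much a 512-unit LSTM head adds over a same-width dense head:

```python
# Rough parameter counts for the head sitting on top of the conv trunk.
# Sizes are assumptions for illustration, not the R2D2 paper's exact numbers.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases


def lstm_params(n_in, n_hidden):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)


trunk_out, hidden = 3136, 512  # e.g. flattened conv output -> 512-unit head

print("dense head:", dense_params(trunk_out, hidden))  # ~1.6M parameters
print("LSTM head: ", lstm_params(trunk_out, hidden))   # ~7.5M parameters
```

If the feed-forward ablation simply drops the LSTM without widening anything else, the recurrent head has roughly 4-5x the parameters, so capacity alone could plausibly account for part of the gap.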

jon-chuang commented 5 years ago

Is there a good reason why R2D2 should be a Q-learning algorithm? I am approaching this from the viewpoint of someone who wants the most sample efficiency out of their algorithms. I understand that experience replay does not tend to go well with actor-critic algorithms, because experience gathered by policy-based algorithms tends to get stale quickly.

However, there exist fixes for this, such as soft actor-critic or ACER, though I am not sure how competitive the latter is at present. It was already barely competitive with prioritised DDQN when it first came out. As for the former, I still don't understand the differences between energy-based policies, PGQL, and, for instance, normalised actor-critic, which seem to repackage the same idea in various forms. I have yet to understand their relative advantages and shortcomings.

It is also quite odd that the average experience utilisation of R2D2 is 80%, while Ape-X's ratio is something like 130%. I wish there were more information about the distribution of rollouts consumed: for instance, whether prioritisation causes certain rollouts to be consumed tens or hundreds of times, which would mean that a majority of rollouts are basically tossed out and never seen by the optimisation loop, but also that prioritisation is (perhaps) doing its job.
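
To illustrate the point about rollout consumption, here is a small simulation (made-up buffer size, draw count, and priorities) of how skewed, prioritised sampling can leave many items never drawn while a few are drawn many times:

```python
import numpy as np

rng = np.random.default_rng(0)

n_items = 100_000  # transitions (or sequences) sitting in the replay buffer
n_draws = 130_000  # total samples pulled by the learner (~130% "utilisation")

# Made-up heavy-tailed priorities, raised to a PER-style exponent alpha.
priorities = rng.pareto(1.5, n_items) + 1e-3
probs = priorities ** 0.6
probs /= probs.sum()

counts = np.bincount(rng.choice(n_items, size=n_draws, p=probs), minlength=n_items)

print("fraction never sampled:     ", np.mean(counts == 0))
print("fraction sampled 10+ times: ", np.mean(counts >= 10))
print("share of draws to top 1%:   ",
      np.sort(counts)[::-1][: n_items // 100].sum() / n_draws)
```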

chamorajg commented 3 years ago

Is this issue still open? If so, can I work on it?

sven1977 commented 3 years ago

Hey @chandramoulirajagopalan! Sorry for the late response. I'm writing a short design doc right now and will get this work done this quarter.

chamorajg commented 3 years ago

@sven1977 Will you work on this yourself, or can I also help implement R2D2 once you've drafted the design documentation?

spiralhead commented 3 years ago

Hi!

I'm trying to build my project on top of your R2D2 implementation. I found that the replay buffer is still doing importance weighting, but aren't we supposed to use the unprioritized replay buffer version? By the way, what is the current status of the prioritized version of R2D2? The code seems pretty much ready for it; what am I missing?
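
As a point of reference, prioritisation in the DQN-family configs has historically been controlled by top-level prioritized-replay keys, so a sketch like the following should fall back to uniform sequence replay (key names are the older DQN-style ones and have moved around between RLlib versions, so double-check against the defaults of your installed version):

```python
# Sketch only: older DQN-style top-level keys; newer RLlib versions nest these
# under "replay_buffer_config", so adjust to your installed version.
config = {
    "env": "CartPole-v1",
    "model": {"use_lstm": True},
    # Uniform (unprioritized) replay:
    "prioritized_replay": False,
    # If prioritization is enabled instead, these are the usual PER knobs:
    # "prioritized_replay_alpha": 0.6,
    # "prioritized_replay_beta": 0.4,
}
```

This could then be passed to something like `tune.run("R2D2", config=config)`, as in the earlier DQN sketch.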

spiralhead commented 3 years ago

And what about n-step rollouts? I commented out the assertion in validate_config and set n_steps=2; it trained successfully on CartPole.
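
For context on what changing the n-step setting does, here is a minimal sketch of the n-step Q-learning target (plain Python, illustrative numbers). The R2D2 paper itself uses n-step returns, so values above 1 are expected to work once the config check allows them:

```python
# n-step target: G = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * max_a Q(s_{t+n}, a)
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """rewards: the n rewards r_t .. r_{t+n-1}; bootstrap_q: max_a Q(s_{t+n}, a)."""
    n = len(rewards)
    return sum(gamma ** k * r for k, r in enumerate(rewards)) + gamma ** n * bootstrap_q


# n_steps=2 as in the comment above: two real rewards, then a bootstrap value.
print(n_step_target([1.0, 1.0], bootstrap_q=10.0))  # 1.0 + 0.99*1.0 + 0.99**2 * 10.0
```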

richardliaw commented 3 years ago

Hmm, the original version is now released, so I will close the issue. Please open issues (or pull requests!) if you have any new asks.