Closed by ericl 3 years ago
@joneswong let's move the conversation here since that PR is hard to dig up.
Hi Eric, @adoda has deployed Ray on an Alibaba cluster, so we will be able to run many Ray experiments simultaneously. adoda and his colleagues are attempting to compare Ray with some other RL packages. I think adoda and I can run some experiments for you.
Thanks @joneswong, let me know how I can help.
The noisy network is NOT used correctly if we perform an update (replay) step only every 4 sample steps, because the parameters are then drawn from the noisy network's parameter distribution only after each replay step, whereas according to the paper the noise should be re-sampled at every sample step.
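To illustrate the point, here is a minimal NumPy sketch of a factorised-Gaussian noisy linear layer in the spirit of the NoisyNet paper (this is illustrative code, not RLlib's actual implementation; all names are made up). The key is that `resample_noise()` should be called before every action/sample step, not only once per replay update:

```python
import numpy as np

class NoisyLinear:
    """Toy factorised-Gaussian noisy layer (illustrative sketch, not RLlib's code)."""

    def __init__(self, in_size, out_size, sigma0=0.5, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.w_mu = self.rng.normal(0, 1 / np.sqrt(in_size), (in_size, out_size))
        self.w_sigma = np.full((in_size, out_size), sigma0 / np.sqrt(in_size))
        self.b_mu = np.zeros(out_size)
        self.b_sigma = np.full(out_size, sigma0 / np.sqrt(in_size))
        self.resample_noise()

    @staticmethod
    def _f(x):
        # Noise-scaling function from the factorised-noise formulation.
        return np.sign(x) * np.sqrt(np.abs(x))

    def resample_noise(self):
        # Factorised noise: one noise vector per input, one per output.
        eps_in = self._f(self.rng.normal(size=self.w_mu.shape[0]))
        eps_out = self._f(self.rng.normal(size=self.w_mu.shape[1]))
        self.w_eps = np.outer(eps_in, eps_out)
        self.b_eps = eps_out

    def __call__(self, x):
        w = self.w_mu + self.w_sigma * self.w_eps
        b = self.b_mu + self.b_sigma * self.b_eps
        return x @ w + b

layer = NoisyLinear(4, 2)
obs = np.ones(4)
q1 = layer(obs)
layer.resample_noise()   # per the paper: re-draw noise before the next sample step,
q2 = layer(obs)          # not only once every 4th step at replay time
print(np.allclose(q1, q2))  # False: the Q-values change after resampling
```

If noise is only resampled inside the replay step, the four intermediate sample steps all act with the same perturbed parameters, which weakens the intended exploration behavior.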
Is it ensured that max_priority is used, instead of the dummy np.ones, as the priority of newly inserted samples?
def add(self, obs_t, action, reward, obs_tp1, done, weight):
    """See ReplayBuffer.store_effect"""
    idx = self._next_idx
    super(PrioritizedReplayBuffer, self).add(obs_t, action, reward,
                                             obs_tp1, done, weight)
    if weight is None:
        weight = self._max_priority
    self._it_sum[idx] = weight**self._alpha
    self._it_min[idx] = weight**self._alpha
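For illustration, a toy flat-array stand-in (not RLlib's segment-tree implementation; class and method names are hypothetical) showing the intent of the `weight is None` branch: fresh transitions should enter at the current maximum priority so they are replayed at least once before their TD error is known:

```python
import numpy as np

class ToyPrioritizedBuffer:
    """Toy stand-in for a prioritized replay buffer (illustrative only)."""

    def __init__(self, alpha=0.6):
        self._alpha = alpha
        self._max_priority = 1.0
        self._priorities = []
        self._storage = []

    def add(self, transition, weight=None):
        if weight is None:
            # New samples default to the running max priority, not a
            # dummy 1.0, so they are guaranteed to be replayed soon.
            weight = self._max_priority
        self._storage.append(transition)
        self._priorities.append(weight ** self._alpha)

    def update_priorities(self, idx, td_error):
        self._priorities[idx] = abs(td_error) ** self._alpha
        self._max_priority = max(self._max_priority, abs(td_error))

buf = ToyPrioritizedBuffer()
buf.add(("s0", "a0", 0.0))
buf.update_priorities(0, td_error=5.0)   # raises the running max to 5.0
buf.add(("s1", "a1", 0.0))               # inserted at max priority...
print(buf._priorities[1] == 5.0 ** 0.6)  # ...not at the dummy 1.0
```

Passing a dummy weight of 1.0 instead would silently demote new transitions relative to high-priority older ones.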
batch = SampleBatch({
    "obs": obs,
    "actions": actions,
    "rewards": rewards,
    "new_obs": new_obs,
    "dones": dones,
    "weights": np.ones_like(rewards),
})
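For context, with prioritized replay the `weights` column would normally carry importance-sampling corrections rather than ones. A minimal sketch of the standard computation (symbols follow the usual prioritized-replay formulation; this is not RLlib's exact code):

```python
import numpy as np

def is_weights(priorities, beta=0.4):
    """Importance-sampling weights: w_i = (N * P(i))**-beta, normalized by max."""
    p = np.asarray(priorities, dtype=float)
    probs = p / p.sum()                 # sampling probabilities P(i)
    w = (len(p) * probs) ** (-beta)
    return w / w.max()                  # normalize so the largest weight is 1

w = is_weights([5.0, 1.0, 1.0])
print(w)  # the low-priority samples receive the largest correction (1.0)
```

Replacing these corrections with `np.ones_like(rewards)` biases the updates toward frequently sampled high-priority transitions.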
That's a good catch. The weight argument was added for Ape-X, but I guess for DQN we should set it to max priority instead of ones. Any idea if fixing this affects performance?
The following should disable the use of np.ones for non-Ape-X DQN:
--- a/python/ray/rllib/optimizers/sync_replay_optimizer.py
+++ b/python/ray/rllib/optimizers/sync_replay_optimizer.py
@@ -98,7 +98,7 @@ class SyncReplayOptimizer(PolicyOptimizer):
                         pack_if_needed(row["obs"]),
                         row["actions"], row["rewards"],
                         pack_if_needed(row["new_obs"]), row["dones"],
-                        row["weights"])
+                        None)
         if self.num_steps_sampled >= self.replay_starts:
             self._optimize()
@ericl, we are trying to reproduce the experimental results from the paper "Ray: A Distributed Framework for Emerging AI Applications", but we don't have the benchmark code. We wrote our own approximations of the benchmarks, but they can't reach the performance reported in the paper. Could you help us with the benchmarks?
If I recall correctly, you need multiple GCS shards to scale to that throughput. Cc @atumanov
I updated the results for Dueling DDQN and Distributional DQN here: https://github.com/ray-project/rl-experiments
Both show significant improvement over the basic DQN. I also tried n-step and prioritized DQN, but didn't see any gains even after the most recent fixes.
@ericl Is this properly resolved now? Should this issue be closed?
Also, is there an example of which config to use to reproduce Rainbow DQN from the paper? I need to do that, and the documentation on it is very unclear.
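Not an authoritative answer, but the usual approach was to combine the DQN config flags below. The key names follow RLlib's DQN default config of that era; the values are illustrative guesses (roughly matching the Rainbow paper's Atari settings), not a verified reproduction:

```python
# Hypothetical sketch of a "Rainbow-style" RLlib DQN config.
# Flag names follow RLlib's DQN defaults; values are illustrative only.
rainbow_config = {
    "num_atoms": 51,               # distributional Q-learning (C51)
    "noisy": True,                 # NoisyNet exploration
    "dueling": True,               # dueling network heads
    "double_q": True,              # double Q-learning
    "n_step": 3,                   # multi-step returns
    "prioritized_replay": True,    # prioritized experience replay
    "prioritized_replay_alpha": 0.5,
    "lr": 0.0000625,               # Adam learning rate from the Rainbow paper
    "adam_epsilon": 0.00015,
}
```

As noted above, simply combining all the flags did not reproduce the paper's gains in these experiments, so any such config should be validated against a reference implementation.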
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public slack channel.
Hi again! This issue will be closed because there has been no activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!
Describe the problem
Per https://github.com/ray-project/ray/pull/2737, combining all the Rainbow configs does not yield the expected performance (in fact, DQN or DDQN sometimes performs consistently better).
It would be good to understand why, and to compare against reference implementations such as Dopamine.