ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Reproduce Rainbow results #2884

Closed ericl closed 3 years ago

ericl commented 5 years ago

Describe the problem

Per https://github.com/ray-project/ray/pull/2737, combining all the Rainbow configs does not yield the expected performance (in fact, plain DQN or DDQN sometimes performs consistently better).

It would be good to understand why, and to compare with reference implementations such as Dopamine.

ericl commented 5 years ago

@joneswong let's move the conversation here since that PR is hard to dig up.

joneswong commented 5 years ago

Hi Eric, @adoda has deployed Ray on a cluster at Alibaba, and we will be able to run many Ray experiments simultaneously. adoda and his colleagues are working on comparing Ray with some other RL packages. I think adoda and I can run some experiments for you.

ericl commented 5 years ago

Thanks @joneswong, let me know how I can help.

joneswong commented 5 years ago

The noisy network is NOT used correctly if we do an update (replay) step only every 4 sample steps, because, according to the paper,

"the parameters are drawn from the noisy network parameter distribution after every replay step."

I am not sure whether such a slight difference affects the performance of Rainbow.
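For concreteness, here is a minimal sketch of the distinction (illustrative only, not RLlib code; `NoisyLinear`, `collect_samples`, and `do_replay_update` are made-up names): per the quoted sentence, fresh noise is drawn for each replay (gradient) step.

    import numpy as np

    class NoisyLinear:
        """Toy factorized-Gaussian noisy layer, for illustration only."""

        def __init__(self, in_size, out_size, sigma0=0.5):
            self.mu_w = np.random.uniform(-1, 1, (in_size, out_size)) / np.sqrt(in_size)
            self.sigma_w = np.full((in_size, out_size), sigma0 / np.sqrt(in_size))
            self.resample_noise()

        def resample_noise(self):
            # Per the sentence quoted from the paper, fresh noise is drawn
            # for each replay step.
            self.eps_w = np.random.randn(*self.mu_w.shape)

        def forward(self, x):
            # Noisy weights = learned mean + learned sigma * sampled noise.
            return x @ (self.mu_w + self.sigma_w * self.eps_w)

    # Training-loop skeleton with one replay step per 4 sample steps:
    # for it in range(num_iterations):
    #     collect_samples(4)         # act with the current noise sample
    #     layer.resample_noise()     # draw fresh noise for this replay step
    #     do_replay_update()         # gradient step on a replayed batch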

joneswong commented 5 years ago

Is it ensured that max_priority, rather than the dummy np.ones, is used as the priority of newly inserted samples?

    def add(self, obs_t, action, reward, obs_tp1, done, weight):
        """See ReplayBuffer.store_effect"""

        idx = self._next_idx
        super(PrioritizedReplayBuffer, self).add(obs_t, action, reward,
                                                 obs_tp1, done, weight)
        if weight is None:
            # Only this branch assigns the running max priority.
            weight = self._max_priority
        self._it_sum[idx] = weight**self._alpha
        self._it_min[idx] = weight**self._alpha

Meanwhile, the batch handed to the buffer is built with dummy weights of 1.0, so that branch is never taken:

    batch = SampleBatch({
        "obs": obs,
        "actions": actions,
        "rewards": rewards,
        "new_obs": new_obs,
        "dones": dones,
        "weights": np.ones_like(rewards)
    })

ericl commented 5 years ago

That's a good catch. The weight argument was added for Ape-X, but for DQN I guess we should set it to max priority instead of ones. Any idea whether fixing this affects performance?

The following should disable the use of np.ones for non-Ape-X DQN:

--- a/python/ray/rllib/optimizers/sync_replay_optimizer.py
+++ b/python/ray/rllib/optimizers/sync_replay_optimizer.py
@@ -98,7 +98,7 @@ class SyncReplayOptimizer(PolicyOptimizer):
                         pack_if_needed(row["obs"]),
                         row["actions"], row["rewards"],
                         pack_if_needed(row["new_obs"]), row["dones"],
-                        row["weights"])
+                        None)

         if self.num_steps_sampled >= self.replay_starts:
             self._optimize()
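To make the intended effect of that change concrete, here is a tiny standalone sketch (not the actual RLlib classes; `ToyPrioritizedBuffer` is a made-up stand-in) of the two code paths: passing None falls back to the running max priority, while passing the np.ones weights from the SampleBatch pins every new sample at priority 1.0.

    import numpy as np

    class ToyPrioritizedBuffer:
        """Minimal stand-in for PrioritizedReplayBuffer's priority bookkeeping."""

        def __init__(self, alpha=0.6):
            self._alpha = alpha
            self._max_priority = 1.0
            self._priorities = []

        def add(self, weight=None):
            # Same fallback as in PrioritizedReplayBuffer.add above.
            if weight is None:
                weight = self._max_priority
            self._priorities.append(weight ** self._alpha)

        def update_priority(self, idx, td_error, eps=1e-6):
            # After a learning step, priorities track |TD error|, and the
            # running max is what new samples inherit when weight is None.
            priority = abs(td_error) + eps
            self._priorities[idx] = priority ** self._alpha
            self._max_priority = max(self._max_priority, priority)

    buf = ToyPrioritizedBuffer()
    buf.add()                     # new sample gets the current max priority
    buf.update_priority(0, td_error=5.0)
    buf.add(weight=None)          # inherits priority 5.0 -> replayed soon
    buf.add(weight=1.0)           # old behavior via np.ones: pinned at 1.0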
adoda commented 5 years ago

@ericl, we are trying to reproduce the experimental results from the paper "Ray: A Distributed Framework for Emerging AI Applications", such as:

[image: throughput figure from the paper]

But we don't have the benchmark code. We wrote some approximations of the benchmarks ourselves, and they can't reach the performance reported in the paper. Could you help us with the benchmarks?

ericl commented 5 years ago

If I recall, you need to have multiple GCS shards to scale to that throughput. Cc @atumanov

ericl commented 5 years ago

I updated the results for Dueling DDQN and Distributional DQN here: https://github.com/ray-project/rl-experiments

Both show significant improvement over the basic DQN. I also tried n-step and prioritized DQN, but didn't see any gains even after the most recent fixes.
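For reference, "combining all the Rainbow configs" with RLlib's DQN at the time meant roughly the following (a hedged sketch: the keys are from RLlib's DQN config of that era, but the values here are illustrative, not the tuned settings from rl-experiments or the Rainbow paper):

    import ray
    from ray import tune

    ray.init()
    tune.run(
        "DQN",
        config={
            "env": "BreakoutNoFrameskip-v4",
            "num_atoms": 51,            # distributional (C51) head
            "v_min": -10.0,
            "v_max": 10.0,
            "noisy": True,              # NoisyNet exploration
            "dueling": True,            # dueling architecture
            "double_q": True,           # double Q-learning
            "n_step": 3,                # multi-step returns
            "prioritized_replay": True,
            "learning_starts": 20000,
            "target_network_update_freq": 8000,
        },
    )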

jkterry1 commented 4 years ago

@ericl Is this properly resolved now? Should this be closed?

Also, is there an example of what config to use to reproduce a Rainbow DQN from the paper? I need to do that, and the documentation on it is very unclear.

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 3 years ago

Hi again! This issue is being closed because there has been no activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!