Closed: RylanSchaeffer closed this issue 3 years ago.
the hyperparameters we are using for cartpole (and acrobot) were not tuned for a very long time. we played with them to get something that's reasonably stable, as the intent was to use it as a simple example that can train quickly, rather than aiming to get SOTA.
I would've expected C51 to outperform DQN (at least initially, if not asymptotically) but when I looked at the provided colab notebook, C51 seems to be beaten by DQN most of the time:
[image: image] https://user-images.githubusercontent.com/8942987/89252334-45d83d00-d5ce-11ea-9547-8edb3a9d9c35.png
I ran the notebook myself to get my own results, which largely agreed:
[image: image] https://user-images.githubusercontent.com/8942987/89252294-2ccf8c00-d5ce-11ea-952f-72e473099524.png
I suppose there are two questions:

1. Why is DQN so unstable?
2. Why does DQN outperform C51?
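For readers unfamiliar with the two agents being compared: the core algorithmic difference the questions hinge on is the form of the Bellman target. The sketch below is illustrative pure Python, not Dopamine's actual implementation; the support range and atom count are just example values in the spirit of C51 (Bellemare et al., 2017).

```python
import math

def dqn_target(reward, gamma, next_q_max):
    """DQN regresses Q(s, a) toward a single scalar target."""
    return reward + gamma * next_q_max

def c51_target(reward, gamma, next_probs, support):
    """C51 keeps a categorical distribution over returns on a fixed
    support of atoms, and projects the shifted/scaled distribution
    back onto that support."""
    v_min, v_max = support[0], support[-1]
    delta_z = support[1] - support[0]
    projected = [0.0] * len(support)
    for p, z in zip(next_probs, support):
        # Apply the Bellman operator to each atom, clipped to the support.
        tz = min(max(reward + gamma * z, v_min), v_max)
        b = (tz - v_min) / delta_z  # fractional index into the support
        lower = max(int(math.floor(b)), 0)
        upper = min(int(math.ceil(b)), len(support) - 1)
        if lower == upper:
            projected[lower] += p  # atom lands exactly on the grid
        else:
            # Split the probability mass between the two nearest atoms.
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected

# Example usage with 51 atoms on [-10, 10] and a uniform next-state
# distribution (both values are illustrative, not CartPole-tuned).
support = [-10.0 + 0.4 * i for i in range(51)]
probs = [1.0 / 51] * 51
projected = c51_target(0.5, 0.99, probs, support)
```

The training loss then differs accordingly: DQN minimizes a (Huber) regression loss against the scalar target, while C51 minimizes a cross-entropy loss against the projected distribution.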
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/dopamine/issues/148, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE3CCMI24RSA3IASFTVBLD3R66DQVANCNFSM4PT7JK4A .
Ok, thank you for clarifying! In that case, can I ask for your insights into what fraction of hyperparameter settings show C51 clearly outperforming DQN, and vice versa?
Relatedly, when reading a paper like yours with Lyle and Bellemare, how reliable are the results in Section 5.2? If distributional RL only outperforms classical RL under a very small subset of hyperparameters, how can a reader discern whether the result is genuine or an artifact of not testing a sufficient number/range of hyperparameters?
A statement like "We used the same hyperparameters for all algorithms, except for step sizes, where we chose the step size that gave the best performance for each algorithm." now seems a bit more concerning to me.
for this paper, clare did run hyperparameter sweeps for dqn and c51, but these were run on custom code (dopamine had not yet been launched, and it was atari-only at the time), so she was not using the configs that have been released with dopamine.
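The sweep protocol described in the paper's statement ("same hyperparameters for all algorithms, except step sizes, chosen per algorithm") can be sketched as below. This is a hypothetical illustration, not the actual sweep code; `train_run` is a stand-in you would replace with a real training run, and the step sizes and seed count are example values.

```python
import random

def train_run(algorithm, step_size, seed):
    """Placeholder for a real training run that returns a final
    average return; here it just produces a deterministic fake score."""
    return random.Random(f"{algorithm}-{step_size}-{seed}").uniform(0.0, 200.0)

def sweep(algorithm, step_sizes, seeds):
    """Average final return over seeds for each step size, then pick
    the step size that performed best for this algorithm."""
    means = {
        ss: sum(train_run(algorithm, ss, s) for s in seeds) / len(seeds)
        for ss in step_sizes
    }
    best = max(means, key=means.get)
    return best, means

# Each algorithm is reported at its own best step size, with every
# other hyperparameter held fixed across algorithms.
for algo in ("dqn", "c51"):
    best, means = sweep(algo, [1e-4, 5e-4, 1e-3], range(5))
    print(f"{algo}: best step size {best} (mean return {means[best]:.1f})")
```

Averaging over several seeds before picking the best step size is the key part: it guards against crowning a step size that only looked good on one lucky run.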
our intent with the hyperparameter choices for dopamine was to try, as much as possible, to provide an apples-to-apples comparison of the different algorithms, not to provide SOTA configs for each. however, we do provide the configs that match the hyperparameter choices used in the original papers that introduced those algorithms. the idea was to provide reasonable baselines for each of these algorithms so that researchers could use that as a starting point to develop new algorithms/ideas (SOTA or not).
I think my main question now is how a reader can discern whether a published result is a genuine effect or an artifact of not testing a sufficient number/range of hyperparameters.
ah, that general question :) it's an important question, and i don't think there's a single right answer, but things like the reproducibility challenge and the reproducibility checklist do help in making sure that the results presented are not just cherry-picked results, but actually show the merits (and shortcomings) of a new algorithm in a way that carries through to other results.
Hi @psc-g , sorry to bother you again, but I'm hoping you can help me find a very simple example to play with in which distributional RL is clearly better (in an apples-to-apples comparison sense). Can you point me in the right direction?
I was hoping this C51 + cartpole tutorial would be such an example. I'm just looking for an example I can play with where distributional agents learn faster and asymptote to a higher return per episode.
@psc-g sorry to bother you again, but can you help me find a very simple example to play with in which distributional RL is clearly better (in an apples-to-apples comparison sense)?
Single architecture, single environment, whatever it takes
hi rylan, i don't know off the top of my head; i haven't run these types of experiments, so i don't have a good suggestion for a simple environment that exhibits these characteristics. one thing you could try is other gym environments (such as lunar lander). i'm not sure a priori whether distributional will outperform expectational there, though.
another option is running some atari games for fewer frames (instead of the regular 200 million frames)?
i guess it depends on what your intended use case is...
hi rylan, any luck with this?
None. I ran a few environments (Asterix, Breakout, Pong, Qbert, Seaquest, SpaceInvaders) and found mixed results. I had to abandon the project :(