Closed: RylanSchaeffer closed this issue 3 years ago.
the hyperparameters we are using for cartpole (and acrobot) were not tuned for a very long time. we played with them to get something that's reasonably stable, as the intent was to use it as a simple example that can train quickly, rather than aiming to get SOTA.
I would've expected C51 to outperform DQN (at least initially, if not asymptotically) but when I looked at the provided colab notebook, C51 seems to be beaten by DQN most of the time:
[image: image] https://user-images.githubusercontent.com/8942987/89252334-45d83d00-d5ce-11ea-9547-8edb3a9d9c35.png
I ran the notebook myself to get my own results, which largely agreed:
[image: image] https://user-images.githubusercontent.com/8942987/89252294-2ccf8c00-d5ce-11ea-952f-72e473099524.png
I suppose there are two questions:

1. Why is DQN so unstable?
2. Why does DQN outperform C51?
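For readers unfamiliar with the two agents being compared: the core algorithmic difference the questions hinge on is the form of the Bellman target. The sketch below is illustrative pure Python, not Dopamine's actual implementation; the support range and atom count are just example values in the spirit of C51 (Bellemare et al., 2017).

```python
import math

def dqn_target(reward, gamma, next_q_max):
    """DQN regresses Q(s, a) toward a single scalar target."""
    return reward + gamma * next_q_max

def c51_target(reward, gamma, next_probs, support):
    """C51 keeps a categorical distribution over returns on a fixed
    support of atoms, and projects the shifted/scaled distribution
    back onto that support."""
    v_min, v_max = support[0], support[-1]
    delta_z = support[1] - support[0]
    projected = [0.0] * len(support)
    for p, z in zip(next_probs, support):
        # Apply the Bellman operator to each atom, clipped to the support.
        tz = min(max(reward + gamma * z, v_min), v_max)
        b = (tz - v_min) / delta_z  # fractional index into the support
        lower = max(int(math.floor(b)), 0)
        upper = min(int(math.ceil(b)), len(support) - 1)
        if lower == upper:
            projected[lower] += p  # atom lands exactly on the grid
        else:
            # Split the probability mass between the two nearest atoms.
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected

# Example usage with 51 atoms on [-10, 10] and a uniform next-state
# distribution (both values are illustrative, not CartPole-tuned).
support = [-10.0 + 0.4 * i for i in range(51)]
probs = [1.0 / 51] * 51
projected = c51_target(0.5, 0.99, probs, support)
```

The training loss then differs accordingly: DQN minimizes a (Huber) regression loss against the scalar target, while C51 minimizes a cross-entropy loss against the projected distribution.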
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/dopamine/issues/148, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE3CCMI24RSA3IASFTVBLD3R66DQVANCNFSM4PT7JK4A .
Ok, thank you for clarifying! In that case, can I ask for your insights into what fraction of hyperparameter settings show C51 clearly outperforming DQN, and vice versa?
Relatedly, when reading a paper like yours with Lyle and Bellemare, how reliable are the results in Section 5.2? If distributional RL only outperforms classical RL under a very small subset of hyperparameters, how can a reader discern whether the result is genuine or an artifact of not testing a sufficient number/range of hyperparameters?
A statement like "We used the same hyperparameters for all algorithms, except for step sizes, where we chose the step size that gave the best performance for each algorithm." now seems a bit more concerning to me.
for this paper, clare did run hyperparameter sweeps for dqn and c51, but these were run on custom code (dopamine had not yet been launched, and it was atari-only at the time), so she was not using the configs that have been released with dopamine.
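The sweep protocol described in the paper's statement ("same hyperparameters for all algorithms, except step sizes, chosen per algorithm") can be sketched as below. This is a hypothetical illustration, not the actual sweep code; `train_run` is a stand-in you would replace with a real training run, and the step sizes and seed count are example values.

```python
import random

def train_run(algorithm, step_size, seed):
    """Placeholder for a real training run that returns a final
    average return; here it just produces a deterministic fake score."""
    return random.Random(f"{algorithm}-{step_size}-{seed}").uniform(0.0, 200.0)

def sweep(algorithm, step_sizes, seeds):
    """Average final return over seeds for each step size, then pick
    the step size that performed best for this algorithm."""
    means = {
        ss: sum(train_run(algorithm, ss, s) for s in seeds) / len(seeds)
        for ss in step_sizes
    }
    best = max(means, key=means.get)
    return best, means

# Each algorithm is reported at its own best step size, with every
# other hyperparameter held fixed across algorithms.
for algo in ("dqn", "c51"):
    best, means = sweep(algo, [1e-4, 5e-4, 1e-3], range(5))
    print(f"{algo}: best step size {best} (mean return {means[best]:.1f})")
```

Averaging over several seeds before picking the best step size is the key part: it guards against crowning a step size that only looked good on one lucky run.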
our intent with the hyperparameter choices for dopamine was to try, as much as possible, to provide an apples-to-apples comparison of the different algorithms, not to provide SOTA configs for each. however, we do provide the configs that match the hyperparameter choices used in the original papers that introduced those algorithms. the idea was to provide reasonable baselines for each of these algorithms so that researchers could use that as a starting point to develop new algorithms/ideas (SOTA or not).
I think my main question now is how a reader can discern whether a published result is a genuine effect or an artifact of not testing a sufficient number/range of hyperparameters.
ah, that general question :) it's an important question, and i don't think there's a single right answer, but things like the reproducibility challenge and the reproducibility checklist do help in making sure that the results presented are not just cherry-picked results, but actually show the merits (and shortcomings) of a new algorithm in a way that carries through to other results.
Hi @psc-g , sorry to bother you again, but I'm hoping you can help me find a very simple example to play with in which distributional RL is clearly better (in an apples-to-apples comparison sense). Can you point me in the right direction?
I was hoping this C51 + cartpole tutorial would be such an example. I'm just looking for an example I can play with where distributional agents learn faster and asymptote to a higher return per episode.
@psc-g sorry to bother you again, but can you help me find a very simple example to play with in which distributional RL is clearly better (in an apples-to-apples comparison sense)?
Single architecture, single environment, whatever it takes
hi rylan, i don't know off the top of my head; i haven't run these types of experiments, so i don't have a good suggestion for a simple environment that exhibits these characteristics. one thing you could try is other gym environments (such as lunar lander). i'm not sure a priori whether distributional will outperform expectational there, though.
another option is running some atari games for fewer frames (instead of the regular 200 million frames)?
i guess it depends on what your intended use case is...
hi rylan, any luck with this?
None. I ran a few environments (Asterix, Breakout, Pong, Qbert, Seaquest, SpaceInvaders) and found mixed results. I had to abandon the project :(