ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Provide atari results across all algorithms (as applicable) #2663

Closed. ericl closed this issue 6 years ago.

ericl commented 6 years ago

Describe the problem

We should publish results for at least a few of the standard Atari games on all applicable algorithms, and fix any discrepancies, e.g. https://github.com/ray-project/ray/issues/2654

Results uploaded to this repo: https://github.com/ray-project/rl-experiments

Envs to run: PongNoFrameskip-v4, BreakoutNoFrameskip-v4, BeamRiderNoFrameskip-v4, QbertNoFrameskip-v4, SpaceInvadersNoFrameskip-v4

(Chosen so that all but Pong can run concurrently on a g3.16xl machine).

Some references: https://github.com/btaba/yarlp https://github.com/openai/baselines/issues/176

richardliaw commented 6 years ago

Also relevant reference: https://github.com/hill-a/stable-baselines

ericl commented 6 years ago

Just ran a "30% full speed" IMPALA across a couple of environments. The results are pretty reasonable at 40M frames, with Qbert / SpaceInvaders roughly in line with the results from the A3C paper, and Breakout / BeamRider a bit below. Note that the episode max rewards for Breakout and BeamRider are pretty good, but the means are not quite up there.

I'm guessing we can improve on this with some tuning.

# Runs on a single g3.16xl node
atari-impala:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4 
    run: IMPALA
    config:
        sample_batch_size: 250  # 50 * num_envs_per_worker
        train_batch_size: 500
        num_workers: 12
        num_envs_per_worker: 5

[Image: atari-impala learning curves]

robertnishihara commented 6 years ago

In what format does it make sense to publish the results? A collection of full learning curves (e.g., as CSVs)? Actual visualizations like the one above? Or something else?

ericl commented 6 years ago

If we have a public ray perf dashboard, that would be a good place to put these.

Otherwise, I think posting some summary visualizations on github or the docs would do (for example, just having the tuned example yamls with pointers to this issue). The full learning curve data probably isn't that interesting, but we could also upload that to S3 pretty easily.

luochao1024 commented 6 years ago

Do you have any results for A3C or A3C-LSTM?

ericl commented 6 years ago

I did an initial run with A3C; however, the results were much worse than the IMPALA ones. I didn't try tuning the learning rate as suggested in the A3C paper, though.

luochao1024 commented 6 years ago

A3C is very sensitive to the learning rate, since the effect of gradient staleness grows with the learning rate.

ericl commented 6 years ago

For reference, here are the run and params (with the default lr=0.0001 and grad_clip=40.0). Note that the gradient magnitude effectively scales with lr * batch size (sample_batch_size is 20 here).

This is also on this branch: https://github.com/ray-project/ray/pull/2679

# Runs on a single m4.16xl node
atari-a3c:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4 
    run: A3C
    config:
        num_workers: 11
        sample_batch_size: 20
        optimizer:
            grads_per_step: 1000

[Image: a3c learning curves]

ericl commented 6 years ago

That PR also adds A2C. Since A2C is deterministic, it should be easy to copy hyperparameters from another A2C implementation to compare results (I'm doing some runs right now, but it might take a while).
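
As a concrete starting point, a spec along the following lines could be used for the A2C comparison runs. This is only a hypothetical sketch that mirrors the IMPALA/A3C specs above with the run name swapped; the values are assumptions, not the tuned config from that PR.

# Hypothetical A2C comparison spec; untuned values that mirror the specs above
atari-a2c:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: A2C
    config:
        num_workers: 11
        sample_batch_size: 20
        num_envs_per_worker: 5
        preprocessor_pref: deepmind

Since A2C applies its updates synchronously, its results should be much less sensitive to scheduling noise than A3C's, which is what makes a direct hyperparameter comparison with other A2C implementations meaningful.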

luochao1024 commented 6 years ago

You are using 11 workers for the experiment. I would recommend 16 workers.

ericl commented 6 years ago

One discovery: we're handling EpisodicLifeEnv resets incorrectly. For example, in BeamRider you get three lives; we are treating each life as a separate episode, but the three lives are supposed to count as a single episode.

This largely explains why BeamRider's starting score is about 3x too low.

ericl commented 6 years ago

@luochao1024 this PR reproduces standard Atari results for IMPALA and A2C: https://github.com/ray-project/ray/pull/2700

I'm still having trouble finding the right hyperparams for A3C (vf_explained_var tends to dive below 0 with A3C, whereas it stays close to 1 with A2C / IMPALA), but since it works in A2C it's probably just a matter of tweaking the lr / batch size / grad clipping.

luochao1024 commented 6 years ago

Do you have some hyperparams that work for A3C now?

ericl commented 6 years ago

I don't have the bandwidth to tune A3C right now, but if you want to give it a shot, starting from the A2C hyperparams with some lr adjustment could work.

luochao1024 commented 6 years ago

@ericl Can you give BreakoutNoFrameskip-v4 a try? I tried a grid search over the lr, but I still get some really bad results. Here are the configs I use:

atari-a3c:
    env: BreakoutNoFrameskip-v4
    run: A3C
    config:
        num_workers: 8
        sample_batch_size: 20
        use_pytorch: false
        vf_loss_coeff: 0.5
        entropy_coeff: -0.01
        gamma: 0.99
        grad_clip: 40.0
        lambda: 1.0
        lr:
            grid_search:
                - 0.000005
                - 0.00001
                - 0.00005
                - 0.0001
        observation_filter: NoFilter
        preprocessor_pref: rllib
        num_envs_per_worker: 5
        optimizer:
            grads_per_step: 1000

ericl commented 6 years ago

You'll definitely need to use the deepmind preprocessors, since the rllib ones don't have the right episodic-life wrappers. Perhaps we should remove those. Also, maybe don't use the LSTM, and start from the A2C config.
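
As a rough sketch of that suggestion (untuned values and my own assumptions, not a verified config), you could take the A2C-style settings, switch to the deepmind preprocessor, and disable the LSTM:

# Hypothetical A3C starting point following the suggestion above (untuned values)
atari-a3c:
    env: BreakoutNoFrameskip-v4
    run: A3C
    config:
        num_workers: 8
        sample_batch_size: 20
        preprocessor_pref: deepmind  # deepmind wrappers handle episodic life correctly
        model:
            use_lstm: false  # skip the LSTM for now
        lr:
            grid_search:
                - 0.0001
                - 0.00005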

luochao1024 commented 6 years ago

Now I am running A3C with the following config:

atari-a3c:
    env:
        BreakoutNoFrameskip-v4
    run: A3C
    config:
        num_workers: 5
        sample_batch_size: 20
        preprocessor_pref: deepmind
        lr:
           grid_search:
               - 0.000005
               - 0.00001
               - 0.00005
               - 0.0001
               - 0.0005
               - 0.001
        num_envs_per_worker: 5
        optimizer:
            grads_per_step: 1000

Do you think the configs are reasonable now? I am also running BeamRiderNoFrameskip-v4, QbertNoFrameskip-v4, and SpaceInvadersNoFrameskip-v4 at the same time. I will report back when the training finishes.

ericl commented 6 years ago

One thing to watch out for: num_envs_per_worker reduces your effective unroll length per env (so sample_batch_size 20 / 5 envs = an unroll length of 4). You might consider trying 1 env per worker instead, or setting sample_batch_size=50 for a longer unroll.

Beyond that, the config looks fine. Note that I found an lr schedule to be important for some envs (but it's probably too much to try right now).
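
For illustration, the relevant parts of the spec could be adjusted along these lines. The values here are placeholders, and the [timestep, lr] pair format for lr_schedule is an assumption on my part, not a verified setting.

# Hypothetical adjusted A3C spec (placeholder values, untuned)
atari-a3c:
    env: BreakoutNoFrameskip-v4
    run: A3C
    config:
        num_workers: 5
        num_envs_per_worker: 1   # keeps the per-env unroll equal to sample_batch_size
        sample_batch_size: 20    # or keep 5 envs and raise this to 50 for unrolls of 10
        preprocessor_pref: deepmind
        # decaying lr instead of a fixed lr (placeholder values)
        lr_schedule: [[0, 0.0005], [20000000, 0.000000000001]]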

luochao1024 commented 6 years ago

The results seem normal now with num_workers=5:

BreakoutNoFrameskip-v4: [learning curve image]

SpaceInvadersNoFrameskip-v4: [learning curve image]

QbertNoFrameskip-v4: [learning curve image]

I will try setting num_envs_per_worker=1 later.

ericl commented 6 years ago

Closing this in favor of individual tickets. Main TODOs are the DQN family.