ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io

Possible APEX regression in more recent versions of RLlib #6127

Closed · ericl closed 4 years ago

ericl commented 5 years ago

It looks like Ape-X trains much more slowly now, as of at least 0.7.2. For example, it now takes ~80M timesteps to reach 100 reward on Breakout, which used to be achieved in <10M steps according to the recorded results: https://github.com/ray-project/rl-experiments/tree/master/atari-apex

However, DQN performance seems fine (0.7.5), which suggests the cause is either a hyperparameter change or an issue with how Ape-X rewards are computed, since only rewards from low-epsilon workers should be counted (sketched below). It is also possible that it is some other, trickier issue.
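
To make the low-epsilon point concrete, here is a minimal sketch of the filtering idea (illustrative only, not RLlib's actual metric code; mean_eval_reward and the episodes structure are hypothetical):

def mean_eval_reward(episodes, eps_threshold=0.05):
    """Average episode reward over near-greedy workers only.

    Ape-X assigns each worker a fixed exploration epsilon; high-epsilon
    workers act mostly at random, so mixing their returns into the
    reported metric would understate the learned policy's performance.
    """
    evals = [r for eps, r in episodes if eps <= eps_threshold]
    return sum(evals) / len(evals) if evals else float("nan")

# Hypothetical (worker_epsilon, episode_reward) pairs: the naive mean over
# all four workers is 54.0, while the low-epsilon mean is 102.5.
episodes = [(0.4, 3.0), (0.16, 8.0), (0.01, 95.0), (0.005, 110.0)]
print(mean_eval_reward(episodes))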

ericl commented 4 years ago

I am able to get the old performance with 0.5.3.

I believe the issue may be that we decoupled sampling speed from training. In old versions of RLlib, sampling was throttled so that the ratio of sampled to trained timesteps stayed roughly constant. Hence, performance now looks much worse per sampled timestep after https://github.com/ray-project/ray/pull/3212
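
For illustration, a minimal sketch of that throttling idea, not RLlib's actual implementation (the Sampler/Learner stubs and the 4:1 target ratio are hypothetical):

class Sampler:
    """Stand-in for a pool of fast rollout workers."""
    def sample_batch(self):
        return 200  # pretend the workers collected 200 env steps

class Learner:
    """Stand-in for the (slower) replay learner."""
    def train_batch(self):
        return 32  # pretend we trained on 32 timesteps this round

def run(sampler, learner, total_train_steps, target_ratio=4.0):
    sampled = trained = 0
    while trained < total_train_steps:
        # Throttle: collect new experience only while under the target
        # sample:train ratio, so samplers cannot race ahead of the learner.
        if sampled < target_ratio * max(trained, 1):
            sampled += sampler.sample_batch()
        trained += learner.train_batch()
    return sampled, trained

sampled, trained = run(Sampler(), Learner(), total_train_steps=100_000)
print(f"sample:train ratio = {sampled / trained:.2f}")  # stays near 4.0

Removing the throttle (always sampling) lets fast samplers inflate the sampled-step count, which matches the hypothesis above: the curves look worse per timestep even if wall-clock learning is unchanged.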

ericl commented 4 years ago

Ok, this is entirely a false alarm. The following config reproduces the previously reported performance (a sketch of launching it programmatically appears after the results below):

apex:
    env:
        grid_search:
            - BreakoutNoFrameskip-v4
            - BeamRiderNoFrameskip-v4
            - QbertNoFrameskip-v4
            - SpaceInvadersNoFrameskip-v4
    run: APEX
    config:
        double_q: false
        dueling: false
        num_atoms: 1
        noisy: false
        n_step: 3
        lr: .0001
        adam_epsilon: .00015
        hiddens: [512]
        buffer_size: 1000000
        schedule_max_timesteps: 2000000
        exploration_final_eps: 0.01
        exploration_fraction: .1
        prioritized_replay_alpha: 0.5
        beta_annealing_fraction: 1.0
        final_prioritized_replay_beta: 1.0
#        gpu: true
        num_gpus: 1

        # APEX
        num_workers: 8
        num_envs_per_worker: 8
#        sample_batch_size: 158
        sample_batch_size: 20
        train_batch_size: 512
        target_network_update_freq: 50000
        timesteps_per_iteration: 25000

Breakout: [learning curve image attached]
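
For reference, a config like the YAML above could also be launched programmatically. A minimal sketch, assuming a Ray version contemporary with this thread (~0.7.x); note that later releases renamed some of these keys (e.g. sample_batch_size became rollout_fragment_length):

import ray
from ray import tune

ray.init()
tune.run(
    "APEX",
    config={
        "env": "BreakoutNoFrameskip-v4",
        "double_q": False,
        "dueling": False,
        "num_atoms": 1,
        "noisy": False,
        "n_step": 3,
        "lr": 0.0001,
        "adam_epsilon": 0.00015,
        "hiddens": [512],
        "buffer_size": 1000000,
        "num_gpus": 1,
        "num_workers": 8,
        "num_envs_per_worker": 8,
        "sample_batch_size": 20,
        "train_batch_size": 512,
        "target_network_update_freq": 50000,
        "timesteps_per_iteration": 25000,
        # epsilon/beta schedule keys as in the YAML above
    },
)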