ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[RLlib] PPO speed and performance issues #29623

Open vakker opened 1 year ago

vakker commented 1 year ago

What happened + What you expected to happen

I'm running some benchmarks, and I'm getting varying results on Atari. I'm using the Breakout benchmark from the learning_tests folder and the Pong example from the tuned_examples folder.

I tried all frameworks: tf, tf2, and torch. I expected that they would all pass, or at least run in line with the "On a single GPU, this achieves maximum reward in ~15-20 minutes." comment. They didn't.

You can see the TensorBoard logs here: https://tensorboard.dev/experiment/8JLpS12qQcKtNc9Lm6BZWw/

This is a summary:

| Env | Framework | Time to 10M steps | Comment |
| --- | --- | --- | --- |
| Pong | TF | 1h 41min | High throughput (2x compared to TF2 and Torch), no learning at all |
| Pong | TF2 | 3h 30min | Lowest throughput, best learning |
| Pong | Torch | 3h 2min | Slightly better throughput compared to TF2, bad learning (sample efficiency is half compared to TF2) |
| Breakout | TF | 2h 45min | Same as Pong, doesn't learn, but at least it does it fast |
| Breakout | TF2 | 7h 31min | Lowest throughput, but it can't run on 2 GPUs (which shouldn't actually matter, I think) |
| Breakout | Torch | 4h 41min | Medium throughput, learns same as TF2 |

Versions / Dependencies

| Stuff | Version |
| --- | --- |
| Ubuntu | 20.04.4 |
| Ray | de238dd6 |
| Python | 3.9.7 |
| Nvidia driver | 510.47.03 |
| CUDA | 11.6 |
| GPU | GeForce RTX 2080 Ti |
| CPU | Intel i9-10900X @ 3.70GHz (20 cores) |
| PyTorch | 1.12.1+cu116 |
| TensorFlow | 2.10.0 |

Reproduction script

Pong:

pong-ppo:
    env: PongNoFrameskip-v4
    run: PPO
    stop:
        timesteps_total: 10000000
    config:
        framework:
            grid_search:
                - tf2
                - tf
                - torch
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 20
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 32
        num_cpus_per_worker: 0.5
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        num_gpus: 1
        model:
            dim: 42
            vf_share_layers: true

Breakout (note: as written, this fails for TF2; it needs num_gpus: 1, see the sketch after the config):

ppo-breakoutnoframeskip-v4:
    env: BreakoutNoFrameskip-v4
    run: PPO
    stop:
        timesteps_total: 10000000
    config:
        framework:
            grid_search:
                - tf2
                - tf
                - torch            
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 100
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 30
        num_cpus_per_worker: 0.5
        num_envs_per_worker: 1
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        model:
            vf_share_layers: true
        num_gpus: 2
        min_time_s_per_iteration: 30
        lr: 0.0001
        grad_clip: 100
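
As a possible workaround for the TF2 case noted above, the tf2 trial can be launched on its own with num_gpus dropped to 1. Below is a minimal sketch using Tune's older tune.run API; the filename breakout-ppo.yaml is just a hypothetical name for the experiment file above, and everything else mirrors its settings:

import yaml
from ray import tune

# Load the Breakout experiment shown above; "breakout-ppo.yaml" is a
# hypothetical filename for wherever that YAML is saved.
with open("breakout-ppo.yaml") as f:
    exp = yaml.safe_load(f)["ppo-breakoutnoframeskip-v4"]

# The YAML keeps env at the experiment level, so fold it into the config.
# Pin the framework to tf2 (replacing the grid_search) and drop to a single
# GPU, since the tf2 trial fails with num_gpus: 2.
config = dict(exp["config"], env=exp["env"], framework="tf2", num_gpus=1)

tune.run(exp["run"], config=config, stop=exp["stop"])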

Run:

rllib train file <file>
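
For reference, roughly the same run can also be started from Python instead of the CLI by loading the experiment YAML and handing it to Ray Tune. This is only a sketch against the older tune.run API, and the filename pong-ppo.yaml is hypothetical; the framework grid_search in the YAML is already in Tune's {"grid_search": [...]} form, so all three framework trials get generated:

import yaml
from ray import tune

# Hypothetical filename for the Pong experiment YAML above.
with open("pong-ppo.yaml") as f:
    name, exp = next(iter(yaml.safe_load(f).items()))

# The YAML keeps env at the experiment level; fold it into the config so the
# PPO trainable knows which environment to build.
config = dict(exp["config"], env=exp["env"])

# One trial per framework (tf2, tf, torch) is generated from the grid_search.
tune.run(exp["run"], name=name, config=config, stop=exp["stop"])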

Issue Severity

High: It blocks me from completing my task.

Yak46 commented 1 year ago

Hello,

I tried to run the atari breakout benchmark available in the ray/rllib/tuned_examples folder.

My system:

OS: Ubuntu 20.04.4 LTS
Ray: 1.13
Python: 3.9.10

Settings:

atari-ppo:
    env: BreakoutNoFrameskip-v4
    run: PPO
    config:
        # Works for both torch and tf.
        framework: tf
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 100
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 10
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        model:
            vf_share_layers: true
        num_gpus: 1

I ran a training session using the command rllib train -f "settings.yaml" and compared the resulting reward with the reference results shown here. Unfortunately, my mean reward got stuck at values around 2.

Is it still reasonable to use framework: tf in the configuration settings? I noticed that with framework: tf2, the reward improved as expected. A picture of the reward in both cases is shown below.

[image: breakout_tf_tf2]

vakker commented 1 year ago

@gjoliver What do you think about this issue?

vakker commented 1 year ago

Is there any update on this issue? Is it reproducible or is this something particular to my setup?

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

chrisbitter commented 1 year ago

Commenting so the issue doesn't get closed.

I'm having a similar experience with my custom env. The plot below shows training with Ray 2.1.0 (green), after switching to 2.3.1 (brown), and after switching back to 2.1.0 (red). Everything else is pretty much the same, except for minor adjustments to get gymnasium running in 2.3.1.

Framework: torch

[image]

rewreu commented 1 year ago

I encountered similar problems while using PPO to train Google Football. Specifically, I noticed that the CPU usage was fluctuating between 0% and 100%, while the GPU usage was fluctuating between 0% and 30%. I suspect that there may be some internal context switching within Ray that is impeding performance.

klekkala commented 1 year ago

Is this fixed? I too am getting bad performance on ray 2.3.1 when running the tuned config files. Rewards for SpaceInvaders don't even cross 15 after 5M steps.