ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

[RLlib] PPO speed and performance issues #29623

Open vakker opened 1 year ago

vakker commented 1 year ago

What happened + What you expected to happen

I'm running some benchmarks, and I'm getting varying results on Atari. I'm using the Breakout benchmark from the learning_tests folder and the Pong example from the tuned_examples folder.

I tried all frameworks: tf, tf2, and torch. I expected that they would all pass, or at least run in line with the "On a single GPU, this achieves maximum reward in ~15-20 minutes." comment. They didn't.

You can see the TensorBoard logs here: https://tensorboard.dev/experiment/8JLpS12qQcKtNc9Lm6BZWw/

This is a summary:

| Env | Framework | Time to 10M steps | Comment |
| --- | --- | --- | --- |
| Pong | TF | 1h 41min | High throughput (2x compared to TF2 and Torch), no learning at all |
| Pong | TF2 | 3h 30min | Lowest throughput, best learning |
| Pong | Torch | 3h 2min | Slightly better throughput compared to TF2, bad learning (sample efficiency is half compared to TF2) |
| Breakout | TF | 2h 45min | Same as Pong, doesn't learn, but at least it does it fast |
| Breakout | TF2 | 7h 31min | Lowest throughput, but it can't run on 2 GPUs (which shouldn't actually matter, I think) |
| Breakout | Torch | 4h 41min | Medium throughput, learns same as TF2 |

Versions / Dependencies

| Stuff | Version |
| --- | --- |
| Ubuntu | 20.04.4 |
| Ray | de238dd6 |
| Python | 3.9.7 |
| Nvidia driver | 510.47.03 |
| CUDA | 11.6 |
| GPU | GeForce RTX 2080 Ti |
| CPU | Intel i9-10900X @ 3.70GHz (20 cores) |
| PyTorch | 1.12.1+cu116 |
| TensorFlow | 2.10.0 |

Reproduction script

Pong:

pong-ppo:
    env: PongNoFrameskip-v4
    run: PPO
    stop:
        timesteps_total: 10000000
    config:
        framework:
            grid_search:
                - tf2
                - tf
                - torch
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 20
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 32
        num_cpus_per_worker: 0.5
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        num_gpus: 1
        model:
            dim: 42
            vf_share_layers: true

Breakout (note: as written, this fails for TF2; it needs num_gpus: 1, see the sketch after the config):

ppo-breakoutnoframeskip-v4:
    env: BreakoutNoFrameskip-v4
    run: PPO
    stop:
        timesteps_total: 10000000
    config:
        framework:
            grid_search:
                - tf2
                - tf
                - torch            
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 100
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 30
        num_cpus_per_worker: 0.5
        num_envs_per_worker: 1
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        model:
            vf_share_layers: true
        num_gpus: 2
        min_time_s_per_iteration: 30
        lr: 0.0001
        grad_clip: 100
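
As a possible workaround for the TF2 case noted above, the tf2 trial can be launched on its own with num_gpus dropped to 1. Below is a minimal sketch using Tune's older tune.run API; the filename breakout-ppo.yaml is just a hypothetical name for the experiment file above, and everything else mirrors its settings:

import yaml
from ray import tune

# Load the Breakout experiment shown above; "breakout-ppo.yaml" is a
# hypothetical filename for wherever that YAML is saved.
with open("breakout-ppo.yaml") as f:
    exp = yaml.safe_load(f)["ppo-breakoutnoframeskip-v4"]

# The YAML keeps env at the experiment level, so fold it into the config.
# Pin the framework to tf2 (replacing the grid_search) and drop to a single
# GPU, since the tf2 trial fails with num_gpus: 2.
config = dict(exp["config"], env=exp["env"], framework="tf2", num_gpus=1)

tune.run(exp["run"], config=config, stop=exp["stop"])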

Run:

rllib train file <file>
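
For reference, roughly the same run can also be started from Python instead of the CLI by loading the experiment YAML and handing it to Ray Tune. This is only a sketch against the older tune.run API, and the filename pong-ppo.yaml is hypothetical; the framework grid_search in the YAML is already in Tune's {"grid_search": [...]} form, so all three framework trials get generated:

import yaml
from ray import tune

# Hypothetical filename for the Pong experiment YAML above.
with open("pong-ppo.yaml") as f:
    name, exp = next(iter(yaml.safe_load(f).items()))

# The YAML keeps env at the experiment level; fold it into the config so the
# PPO trainable knows which environment to build.
config = dict(exp["config"], env=exp["env"])

# One trial per framework (tf2, tf, torch) is generated from the grid_search.
tune.run(exp["run"], name=name, config=config, stop=exp["stop"])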

Issue Severity

High: It blocks me from completing my task.

Yak46 commented 1 year ago

Hello,

I tried to run the atari breakout benchmark available in the ray/rllib/tuned_examples folder.

My system:

OS: Ubuntu 20.04.4 LTS
Ray: 1.13
Python: 3.9.10

Settings:

atari-ppo:
    env: BreakoutNoFrameskip-v4
    run: PPO
    config:
        # Works for both torch and tf.
        framework: tf
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        rollout_fragment_length: 100
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 10
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        model:
            vf_share_layers: true
        num_gpus: 1

I ran a training session using the command rllib train -f "settings.yaml" and compared the resulting reward with the reference results shown here. Unfortunately, my mean reward got stuck at values around 2.

Is it still reasonable to use framework: tf in the configuration settings? I noticed that with framework: tf2, the reward improved as expected. A picture of the reward in both cases is shown below.

[image: breakout_tf_tf2]

vakker commented 1 year ago

@gjoliver What do you think about this issue?

vakker commented 1 year ago

Is there any update on this issue? Is it reproducible or is this something particular to my setup?

stale[bot] commented 1 year ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

chrisbitter commented 1 year ago

Commenting so the issue doesn't get closed.

I'm having a similar experience with my custom env. The plot below shows training with Ray 2.1.0 (green), after switching to 2.3.1 (brown), and after switching back to 2.1.0 (red). Everything else is pretty much the same, except for minor adjustments to get gymnasium running in 2.3.1.

Framework: torch

[image]

rewreu commented 1 year ago

I encountered similar problems while using PPO to train Google Football. Specifically, I noticed that the CPU usage was fluctuating between 0% and 100%, while the GPU usage was fluctuating between 0% and 30%. I suspect that there may be some internal context switching within Ray that is impeding performance.

klekkala commented 1 year ago

Is this fixed? I too am getting bad performance on ray 2.3.1 when running the tuned config files. Rewards for SpaceInvaders don't even cross 15 after 5M steps.