ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] GPU cannot be enabled #42388

Open Blanchard-Zhu opened 8 months ago

Blanchard-Zhu commented 8 months ago

Training is slow, and the trial appears to stay pending, possibly because the GPU is not being used. Could this be caused by excessive memory requirements?

Here is the code.

    # Assumes Legalization, train_benchmarks, and test_benchmarks are defined elsewhere in the project.
    from ray import train, tune
    from ray.rllib.algorithms.ppo import PPOConfig

    config = (
        PPOConfig()
        .environment(Legalization, env_config={"designs": train_benchmarks})
        .framework("torch")
        .rollouts(num_rollout_workers=1)
        .resources(num_gpus=1)
        .training(model={"conv_filters": [[32, [50, 5], 2], [64, [50, 5], 3], [128, [50, 5], 3], [1024, [250, 25], 1]]})
        .evaluation(
            evaluation_interval=1,
            evaluation_config=PPOConfig.overrides(
                env_config={"designs": test_benchmarks},
            ),
        )
    )

    stop_config = {
        "training_iteration": 1000,
        "episode_reward_mean": -1,
    }

    tuner = tune.Tuner(
        "PPO",
        param_space=config,
        run_config=train.RunConfig(stop=stop_config),
    )
    results = tuner.fit()
    best_result = results.get_best_result(metric="episode_reward_mean", mode="max")
    best_checkpoint = best_result.checkpoint

The following is the log.

    Trial status: 1 PENDING
    Current time: 2024-01-14 08:33:15. Total running time: 42s
    Logical resource usage: 2.0/16 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:G)

Versions / Dependencies

ray: 2.9.0
torch: 2.1.2

Reproduction script

    # Assumes Legalization, train_benchmarks, and test_benchmarks are defined elsewhere in the project.
    import ray
    from ray import train, tune
    from ray.rllib.algorithms.ppo import PPOConfig

    ray.init(num_gpus=1)

    config = (
        PPOConfig()
        .environment(Legalization, env_config={"designs": train_benchmarks})
        .framework("torch")
        .rollouts(num_rollout_workers=1)
        .resources(num_gpus=1)
        .training(model={"conv_filters": [[32, [50, 5], 2], [64, [50, 5], 3], [128, [50, 5], 3], [1024, [250, 25], 1]]})
        .evaluation(
            evaluation_interval=1,
            evaluation_config=PPOConfig.overrides(
                env_config={"designs": test_benchmarks},
            ),
        )
    )

    stop_config = {
        "training_iteration": 1000,
        "episode_reward_mean": -1,
    }

    tuner = tune.Tuner(
        "PPO",
        param_space=config,
        run_config=train.RunConfig(stop=stop_config),
    )

    results = tuner.fit()
    best_result = results.get_best_result(metric="episode_reward_mean", mode="max")
    best_checkpoint = best_result.checkpoint

Issue Severity

High: It blocks me from completing my task.

sven1977 commented 7 months ago

Hey @zhubenchao, thanks for raising this issue. In your output, does the "PENDING" ever turn into "RUNNING"? If not, the job has not even started and is possibly blocked by something else; maybe the GPU is not available or is already held by another process. If there were no GPU on your machine at all, though, you would see a Ray error, so that does not seem to be the case.
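
A minimal sketch (not from the original thread) of how one could check, from the same environment as the reproduction script, whether the GPU is visible to both PyTorch and Ray before launching the trial:

    import ray
    import torch

    # Check whether PyTorch can see a CUDA device at all.
    print("torch.cuda.is_available():", torch.cuda.is_available())

    # Check what resources Ray registers for the cluster and what is still free.
    ray.init()
    print("Cluster resources:  ", ray.cluster_resources())
    print("Available resources:", ray.available_resources())
    ray.shutdown()

If the GPU shows up in ray.cluster_resources() but not in ray.available_resources(), something else in the cluster is already holding it; nvidia-smi can show which process is using the device.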