ray-project / ray


[Bug] [RLLIB] APEX-DDPG crashes if using multiple gpus #22179

Open · Mathieu-Prouveur opened this issue 2 years ago

Mathieu-Prouveur commented 2 years ago

Search before asking

Ray Component

RLlib

What happened + What you expected to happen

I tried out the torch implementation of APEX-DDPG with multiple GPUs, but it crashes with the following error:

  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/execution/concurrency_ops.py", line 135, in base_iterator
    raise RuntimeError("Dequeue `check()` returned False! "
RuntimeError: Dequeue `check()` returned False! Exiting with Exception from Dequeue iterator.

Versions / Dependencies

ray 1.10.0, torch 1.10.2

Reproduction script

I ran the following config (the same as the one in the tuned examples, but with torch and 2 GPUs) using rllib train -f xxx.yaml:

mountaincarcontinuous-apex-ddpg:
    env: MountainCarContinuous-v0
    run: APEX_DDPG
    stop:
        episode_reward_mean: 90
    config:
        # Works for both torch and tf.
        framework: torch
        clip_rewards: False
        num_gpus: 2
        num_workers: 16
        exploration_config:
            ou_base_scale: 1.0
        n_step: 3
        target_network_update_freq: 50000
        tau: 1.0
        evaluation_interval: 5
        evaluation_num_episodes: 10
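
For completeness, here is a roughly equivalent Python reproduction, a sketch of what the rllib train command does with this YAML. The tune.run call below is my approximation, not the exact script I ran:

    # Approximate Python equivalent of the YAML above (sketch, run on ray 1.10.0).
    import ray
    from ray import tune

    ray.init()

    tune.run(
        "APEX_DDPG",
        stop={"episode_reward_mean": 90},
        config={
            "env": "MountainCarContinuous-v0",
            "framework": "torch",
            "clip_rewards": False,
            "num_gpus": 2,
            "num_workers": 16,
            "exploration_config": {"ou_base_scale": 1.0},
            "n_step": 3,
            "target_network_update_freq": 50000,
            "tau": 1.0,
            "evaluation_interval": 5,
            # Deprecated in 1.10 but still accepted (see the warning in the log below).
            "evaluation_num_episodes": 10,
        },
    )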

Anything else

The error below occurs on every run. I can run the IMPALA multi-GPU config just fine (https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/impala/atari-impala-multi-gpu.yaml), so this shouldn't be hardware related. Note that the final RuntimeError ("Dequeue `check()` returned False") is only the symptom that Tune surfaces; the learner thread dies first with an AssertionError on `assert len(self.devices) == 1` in torch_policy.py, as shown in the log.

Here is the full log:

(test) mathieu@zeta:~/mathieu$ rllib train -f test/configs/test.yaml
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:40,266    WARNING deprecation.py:46 -- DeprecationWarning: `evaluation_num_episodes` has been deprecated. Use ``evaluation_duration` and `evaluation_duration_unit=episodes`` instead. This will raise an error in the future!
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:40,266    INFO simple_q.py:154 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:40,266    INFO trainer.py:792 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=983076) 2022-02-07 17:26:42,106      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983071) 2022-02-07 17:26:42,061      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=982990) 2022-02-07 17:26:42,038      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983059) 2022-02-07 17:26:42,086      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983011) 2022-02-07 17:26:42,096      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983022) 2022-02-07 17:26:42,102      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983037) 2022-02-07 17:26:42,070      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983018) 2022-02-07 17:26:42,123      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(ApexDDPGTrainer pid=983058) /home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/ddpg/ddpg_torch_model.py:103: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_numpy.cpp:189.)
(ApexDDPGTrainer pid=983058)   torch.from_numpy(self.action_space.low).float())
(RolloutWorker pid=983043) 2022-02-07 17:26:42,136      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983074) 2022-02-07 17:26:42,185      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983025) 2022-02-07 17:26:42,185      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983052) 2022-02-07 17:26:42,144      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983049) 2022-02-07 17:26:42,158      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983027) 2022-02-07 17:26:42,158      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983068) 2022-02-07 17:26:42,251      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983056) 2022-02-07 17:26:42,258      WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
== Status ==
Current time: 2022-02-07 17:26:47 (running for 00:00:08.82)
Memory usage on this node: 43.3/125.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 17.0/64 CPUs, 2.0/2 GPUs, 0.0/56.78 GiB heap, 0.0/28.32 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg
Number of trials: 1/1 (1 RUNNING)
+------------------------------------------------+----------+------------------------+
| Trial name                                     | status   | loc                    |
|------------------------------------------------+----------+------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 | RUNNING  | 192.168.251.123:983058 |
+------------------------------------------------+----------+------------------------+

(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:47,500    WARNING deprecation.py:46 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:47,526    WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(MultiAgentReplayBuffer pid=982996) 2022-02-07 17:26:47,686     INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
(MultiAgentReplayBuffer pid=983030) 2022-02-07 17:26:47,687     INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
(MultiAgentReplayBuffer pid=983062) 2022-02-07 17:26:47,697     INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
(MultiAgentReplayBuffer pid=982987) 2022-02-07 17:26:47,709     INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
== Status ==
Current time: 2022-02-07 17:26:48 (running for 00:00:09.82)
Memory usage on this node: 43.4/125.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 17.0/64 CPUs, 2.0/2 GPUs, 0.0/56.78 GiB heap, 0.0/28.32 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg
Number of trials: 1/1 (1 RUNNING)
+------------------------------------------------+----------+------------------------+
| Trial name                                     | status   | loc                    |
|------------------------------------------------+----------+------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 | RUNNING  | 192.168.251.123:983058 |
+------------------------------------------------+----------+------------------------+

(ApexDDPGTrainer pid=983058) Exception in thread Thread-1:
(ApexDDPGTrainer pid=983058) Traceback (most recent call last):
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(ApexDDPGTrainer pid=983058)     self.run()
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/dqn/learner_thread.py", line 42, in run
(ApexDDPGTrainer pid=983058)     self.step()
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/dqn/learner_thread.py", line 58, in step
(ApexDDPGTrainer pid=983058)     replay)
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 859, in learn_on_batch
(ApexDDPGTrainer pid=983058)     info_out[pid] = policy.learn_on_batch(batch)
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
(ApexDDPGTrainer pid=983058)     return func(self, *a, **k)
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 434, in learn_on_batch
(ApexDDPGTrainer pid=983058)     grads, fetches = self.compute_gradients(postprocessed_batch)
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/policy/policy_template.py", line 335, in compute_gradients
(ApexDDPGTrainer pid=983058)     return parent_cls.compute_gradients(self, batch)
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
(ApexDDPGTrainer pid=983058)     return func(self, *a, **k)
(ApexDDPGTrainer pid=983058)   File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 589, in compute_gradients
(ApexDDPGTrainer pid=983058)     assert len(self.devices) == 1
(ApexDDPGTrainer pid=983058) AssertionError
(ApexDDPGTrainer pid=983058)
2022-02-07 17:26:51,152 ERROR trial_runner.py:927 -- Trial APEX_DDPG_MountainCarContinuous-v0_b9b87_00000: Error processing event.
Traceback (most recent call last):
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 893, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 707, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/worker.py", line 1733, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ApexDDPGTrainer.train() (pid=983058, ip=192.168.251.123, repr=ApexDDPGTrainer)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/trainable.py", line 315, in train
    result = self.step()
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 982, in step
    raise e
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 963, in step
    step_attempt_results = self.step_attempt()
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1042, in step_attempt
    step_results = self._exec_plan_or_training_iteration_fn()
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1962, in _exec_plan_or_training_iteration_fn
    results = next(self.train_exec_impl)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 1075, in build_union
    item = next(it)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/execution/concurrency_ops.py", line 135, in base_iterator
    raise RuntimeError("Dequeue `check()` returned False! "
RuntimeError: Dequeue `check()` returned False! Exiting with Exception from Dequeue iterator.
== Status ==
Current time: 2022-02-07 17:26:51 (running for 00:00:12.41)
Memory usage on this node: 43.6/125.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/2 GPUs, 0.0/56.78 GiB heap, 0.0/28.32 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg
Number of trials: 1/1 (1 ERROR)
+------------------------------------------------+----------+------------------------+
| Trial name                                     | status   | loc                    |
|------------------------------------------------+----------+------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 | ERROR    | 192.168.251.123:983058 |
+------------------------------------------------+----------+------------------------+
Number of errored trials: 1
+------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Trial name                                     |   # failures | error file                                                                                                                              |
|------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 |            1 | /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg/APEX_DDPG_MountainCarContinuous-v0_b9b87_00000_0_2022-02-07_17-26-38/error.txt |
+------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "/home/mathieu/.opt/miniconda3/envs/test/bin/rllib", line 8, in <module>
    sys.exit(cli())
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/scripts.py", line 36, in cli
    train.run(options, train_parser)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/train.py", line 267, in run
    concurrent=True)
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/tune.py", line 739, in run_experiments
    callbacks=callbacks).trials
  File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/tune.py", line 630, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [APEX_DDPG_MountainCarContinuous-v0_b9b87_00000])

Are you willing to submit a PR?

sven1977 commented 2 years ago

Hey @Mathieu-Prouveur, yeah, I think our APEX algos (DQN and DDPG) do NOT support multi-GPU yet. Let me take a look ...
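
A possible interim workaround (an untested assumption on my side, based on the `assert len(self.devices) == 1` in torch_policy.py from the traceback above) is to keep the learner on a single GPU and scale sampling through the rollout workers instead:

    # Workaround sketch (assumption, not an official fix): run the APEX-DDPG torch
    # learner on a single GPU so the single-device assertion in torch_policy.py is
    # not triggered. Rollout collection still parallelizes across num_workers.
    import ray
    from ray import tune

    ray.init()

    tune.run(
        "APEX_DDPG",
        stop={"episode_reward_mean": 90},
        config={
            "env": "MountainCarContinuous-v0",
            "framework": "torch",
            "clip_rewards": False,
            "num_gpus": 1,      # single learner GPU instead of 2
            "num_workers": 16,  # sampling throughput is unaffected
        },
    )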