Search before asking
[X] I searched the issues and found no similar issues.
Ray Component
RLlib
What happened + What you expected to happen
I tried out the torch implementation of APEX-DDPG with multiple GPUs, but it crashes with the following error. The full log below shows the underlying cause: the learner thread hits the assertion len(self.devices) == 1 in torch_policy.py's compute_gradients, and the training iterator then exits with this dequeue failure:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/execution/concurrency_ops.py", line 135, in base_iterator
raise RuntimeError("Dequeue `check()` returned False! "
RuntimeError: Dequeue `check()` returned False! Exiting with Exception from Dequeue iterator.
Versions / Dependencies
ray 1.10.0
torch 1.10.2
Reproduction script
I ran the following config (the same as the one in the tuned examples, but with torch and 2 GPUs) using rllib train -f xxx.yaml. A Python sketch of the same run and the full console output follow the YAML:
mountaincarcontinuous-apex-ddpg:
    env: MountainCarContinuous-v0
    run: APEX_DDPG
    stop:
        episode_reward_mean: 90
    config:
        # Works for both torch and tf.
        framework: torch
        clip_rewards: False
        num_gpus: 2
        num_workers: 16
        exploration_config:
            ou_base_scale: 1.0
        n_step: 3
        target_network_update_freq: 50000
        tau: 1.0
        evaluation_interval: 5
        evaluation_num_episodes: 10
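For convenience, here is a minimal Python sketch of the same run via the Tune API instead of the CLI. This is an untested equivalent that simply mirrors the YAML above (the registered trainer name "APEX_DDPG" and tune.run as in ray 1.10); treat it as an assumption, not a verified script.

# Minimal Python sketch mirroring the YAML above (assumption: ray 1.10 API; untested).
import ray
from ray import tune

ray.init()
tune.run(
    "APEX_DDPG",                      # registered trainer name, same as "run:" above
    stop={"episode_reward_mean": 90},
    config={
        "env": "MountainCarContinuous-v0",
        "framework": "torch",
        "clip_rewards": False,
        "num_gpus": 2,
        "num_workers": 16,
        "exploration_config": {"ou_base_scale": 1.0},
        "n_step": 3,
        "target_network_update_freq": 50000,
        "tau": 1.0,
        "evaluation_interval": 5,
        "evaluation_num_episodes": 10,
    },
)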
(test) mathieu@zeta:~/mathieu$ rllib train -f test/configs/test.yaml
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:40,266 WARNING deprecation.py:46 -- DeprecationWarning: `evaluation_num_episodes` has been deprecated. Use ``evaluation_duration` and `evaluation_duration_unit=episodes`` instead. This will raise an error in the future!
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:40,266 INFO simple_q.py:154 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:40,266 INFO trainer.py:792 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(RolloutWorker pid=983076) 2022-02-07 17:26:42,106 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983071) 2022-02-07 17:26:42,061 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=982990) 2022-02-07 17:26:42,038 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983059) 2022-02-07 17:26:42,086 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983011) 2022-02-07 17:26:42,096 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983022) 2022-02-07 17:26:42,102 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983037) 2022-02-07 17:26:42,070 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983018) 2022-02-07 17:26:42,123 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(ApexDDPGTrainer pid=983058) /home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/ddpg/ddpg_torch_model.py:103: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:189.)
(ApexDDPGTrainer pid=983058) torch.from_numpy(self.action_space.low).float())
(RolloutWorker pid=983043) 2022-02-07 17:26:42,136 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983074) 2022-02-07 17:26:42,185 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983025) 2022-02-07 17:26:42,185 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983052) 2022-02-07 17:26:42,144 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983049) 2022-02-07 17:26:42,158 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983027) 2022-02-07 17:26:42,158 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983068) 2022-02-07 17:26:42,251 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(RolloutWorker pid=983056) 2022-02-07 17:26:42,258 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
== Status ==
Current time: 2022-02-07 17:26:47 (running for 00:00:08.82)
Memory usage on this node: 43.3/125.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 17.0/64 CPUs, 2.0/2 GPUs, 0.0/56.78 GiB heap, 0.0/28.32 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg
Number of trials: 1/1 (1 RUNNING)
+------------------------------------------------+----------+------------------------+
| Trial name | status | loc |
|------------------------------------------------+----------+------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 | RUNNING | 192.168.251.123:983058 |
+------------------------------------------------+----------+------------------------+
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:47,500 WARNING deprecation.py:46 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(ApexDDPGTrainer pid=983058) 2022-02-07 17:26:47,526 WARNING deprecation.py:46 -- DeprecationWarning: `rllib.env.remote_vector_env.RemoteVectorEnv` has been deprecated. Use `ray.rllib.env.remote_base_env.RemoteBaseEnv` instead. This will raise an error in the future!
(MultiAgentReplayBuffer pid=982996) 2022-02-07 17:26:47,686 INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
(MultiAgentReplayBuffer pid=983030) 2022-02-07 17:26:47,687 INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
(MultiAgentReplayBuffer pid=983062) 2022-02-07 17:26:47,697 INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
(MultiAgentReplayBuffer pid=982987) 2022-02-07 17:26:47,709 INFO replay_buffer.py:41 -- Estimated max memory usage for replay buffer is 0.0305 GB (500000.0 batches of size 1, 61 bytes each), available system memory is 134.933086208 GB
== Status ==
Current time: 2022-02-07 17:26:48 (running for 00:00:09.82)
Memory usage on this node: 43.4/125.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 17.0/64 CPUs, 2.0/2 GPUs, 0.0/56.78 GiB heap, 0.0/28.32 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg
Number of trials: 1/1 (1 RUNNING)
+------------------------------------------------+----------+------------------------+
| Trial name | status | loc |
|------------------------------------------------+----------+------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 | RUNNING | 192.168.251.123:983058 |
+------------------------------------------------+----------+------------------------+
(ApexDDPGTrainer pid=983058) Exception in thread Thread-1:
(ApexDDPGTrainer pid=983058) Traceback (most recent call last):
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/threading.py", line 926, in _bootstrap_inner
(ApexDDPGTrainer pid=983058) self.run()
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/dqn/learner_thread.py", line 42, in run
(ApexDDPGTrainer pid=983058) self.step()
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/dqn/learner_thread.py", line 58, in step
(ApexDDPGTrainer pid=983058) replay)
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 859, in learn_on_batch
(ApexDDPGTrainer pid=983058) info_out[pid] = policy.learn_on_batch(batch)
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
(ApexDDPGTrainer pid=983058) return func(self, *a, **k)
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 434, in learn_on_batch
(ApexDDPGTrainer pid=983058) grads, fetches = self.compute_gradients(postprocessed_batch)
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/policy/policy_template.py", line 335, in compute_gradients
(ApexDDPGTrainer pid=983058) return parent_cls.compute_gradients(self, batch)
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
(ApexDDPGTrainer pid=983058) return func(self, *a, **k)
(ApexDDPGTrainer pid=983058) File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 589, in compute_gradients
(ApexDDPGTrainer pid=983058) assert len(self.devices) == 1
(ApexDDPGTrainer pid=983058) AssertionError
(ApexDDPGTrainer pid=983058)
2022-02-07 17:26:51,152 ERROR trial_runner.py:927 -- Trial APEX_DDPG_MountainCarContinuous-v0_b9b87_00000: Error processing event.
Traceback (most recent call last):
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 893, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 707, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/worker.py", line 1733, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::ApexDDPGTrainer.train() (pid=983058, ip=192.168.251.123, repr=ApexDDPGTrainer)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/trainable.py", line 315, in train
result = self.step()
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 982, in step
raise e
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 963, in step
step_attempt_results = self.step_attempt()
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1042, in step_attempt
step_results = self._exec_plan_or_training_iteration_fn()
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 1962, in _exec_plan_or_training_iteration_fn
results = next(self.train_exec_impl)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
return next(self.built_iterator)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 1075, in build_union
item = next(it)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
return next(self.built_iterator)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
for item in it:
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/execution/concurrency_ops.py", line 135, in base_iterator
raise RuntimeError("Dequeue `check()` returned False! "
RuntimeError: Dequeue `check()` returned False! Exiting with Exception from Dequeue iterator.
== Status ==
Current time: 2022-02-07 17:26:51 (running for 00:00:12.41)
Memory usage on this node: 43.6/125.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/64 CPUs, 0/2 GPUs, 0.0/56.78 GiB heap, 0.0/28.32 GiB objects (0.0/1.0 accelerator_type:G)
Result logdir: /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg
Number of trials: 1/1 (1 ERROR)
+------------------------------------------------+----------+------------------------+
| Trial name | status | loc |
|------------------------------------------------+----------+------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 | ERROR | 192.168.251.123:983058 |
+------------------------------------------------+----------+------------------------+
Number of errored trials: 1
+------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------|
| APEX_DDPG_MountainCarContinuous-v0_b9b87_00000 | 1 | /home/mathieu/ray_results/mountaincarcontinuous-apex-ddpg/APEX_DDPG_MountainCarContinuous-v0_b9b87_00000_0_2022-02-07_17-26-38/error.txt |
+------------------------------------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
File "/home/mathieu/.opt/miniconda3/envs/test/bin/rllib", line 8, in <module>
sys.exit(cli())
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/scripts.py", line 36, in cli
train.run(options, train_parser)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/rllib/train.py", line 267, in run
concurrent=True)
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/tune.py", line 739, in run_experiments
callbacks=callbacks).trials
File "/home/mathieu/.opt/miniconda3/envs/test/lib/python3.7/site-packages/ray/tune/tune.py", line 630, in run
raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [APEX_DDPG_MountainCarContinuous-v0_b9b87_00000])
Anything else
The error above is systematic. I can run the Impala multi-GPU config just fine (https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/impala/atari-impala-multi-gpu.yaml), so this shouldn't be hardware related.
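As a quick diagnostic (a hypothetical, untested sketch, not something I ran), one can inspect how many torch devices the policy is built with when num_gpus > 1; the failing assertion suggests the threaded APEX learner path only supports a single device:

# Hypothetical diagnostic sketch (not part of the original run): check the device
# list of the local torch policy when num_gpus > 1.
import ray
from ray.rllib.agents.ddpg import ApexDDPGTrainer

ray.init()
trainer = ApexDDPGTrainer(
    env="MountainCarContinuous-v0",
    config={"framework": "torch", "num_gpus": 2, "num_workers": 2},
)
# TorchPolicy stores its device list in `devices`; the learner thread asserts
# len(self.devices) == 1 before compute_gradients, which is exactly what fails.
print(trainer.get_policy().devices)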
Are you willing to submit a PR?