Search before asking
[ ] I searched the issues and found no similar issues.
Ray Component
RLlib
Issue Severity
High: It blocks me to complete my task.
What happened + What you expected to happen
I am trying to run the tuned Dreamer example with `rllib train -f dreamer.yaml`, using that exact tuned config file. I get an absurd CUDA out-of-memory error: it is trying to allocate 286.10 GiB.
Traceback (most recent call last):
File "/opt/venv/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 999, in get_next_executor_event
future_result = ray.get(ready_future)
File "/opt/venv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/opt/venv/lib/python3.9/site-packages/ray/worker.py", line 1925, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::DREAMERTrainer.train() (pid=2421055, ip=172.17.80.4, repr=DREAMERTrainer)
File "/opt/venv/lib/python3.9/site-packages/ray/rllib/agents/dreamer/dreamer_torch_policy.py", line 165, in dreamer_loss
policy.stats_dict = compute_dreamer_loss(
File "/opt/venv/lib/python3.9/site-packages/ray/rllib/agents/dreamer/dreamer_torch_policy.py", line 68, in compute_dreamer_loss
image_loss = -torch.mean(image_pred.log_prob(obs))
File "/opt/venv/lib/python3.9/site-packages/torch/distributions/independent.py", line 91, in log_prob
log_prob = self.base_dist.log_prob(value)
File "/opt/venv/lib/python3.9/site-packages/torch/distributions/normal.py", line 77, in log_prob
return -((value - self.loc) ** 2) / (2 * var) - log_scale - math.log(math.sqrt(2 * math.pi))
RuntimeError: CUDA out of memory. Tried to allocate 286.10 GiB (GPU 0; 23.70 GiB total capacity; 1.51 GiB already allocated; 19.56 GiB free; 1.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
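For context on why the requested allocation is so implausibly large: 286.10 GiB of float32 values is roughly 7.7e10 elements, far more than any single Dreamer batch should contain. One plausible explanation (an assumption on my part, not confirmed against the actual Dreamer tensor shapes) is that `obs` and `image_pred`'s parameters have mismatched batch shapes, so the `(value - self.loc)` subtraction inside `Normal.log_prob` broadcasts them into an enormous intermediate tensor. A small CPU-only sketch of that failure mode:

```python
import torch
from torch.distributions import Normal

# Hypothetical shapes for illustration only (not the real Dreamer ones):
# a predicted distribution with a singleton dim and observations with a
# large dim in the same position broadcast against each other.
loc = torch.zeros(4, 1, 8)        # e.g. predicted means
value = torch.zeros(1, 5000, 8)   # e.g. observations with an extra axis
dist = Normal(loc, torch.ones_like(loc))

# log_prob broadcasts value against loc, so the result is far bigger
# than either input: (4, 1, 8) x (1, 5000, 8) -> (4, 5000, 8).
lp = dist.log_prob(value)
print(lp.shape)  # torch.Size([4, 5000, 8])
```

With realistic image-sized tensors the same silent broadcast can easily reach hundreds of GiB, which would match the error above better than genuine memory pressure (only ~1.5 GiB was actually allocated).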
Versions / Dependencies
Ray: 2.0.0.dev0
Python: 3.9.5
Ubuntu: 18.04
Reproduction script
See above.
Anything else
No response
Are you willing to submit a PR?