ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib][Bug] RLlib Dreamer tuned example requests an unreasonable amount of GPU memory #23479

Open zplizzi opened 2 years ago

zplizzi commented 2 years ago

Search before asking

Ray Component

RLlib

Issue Severity

High: It blocks me from completing my task.

What happened + What you expected to happen

I am trying to run the tuned Dreamer example with `rllib train -f dreamer.yaml`, using that exact file. I get an absurd CUDA out-of-memory error: it tries to allocate 286.10 GiB??

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/ray/tune/ray_trial_executor.py", line 999, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/opt/venv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/ray/worker.py", line 1925, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::DREAMERTrainer.train() (pid=2421055, ip=172.17.80.4, repr=DREAMERTrainer)
  File "/opt/venv/lib/python3.9/site-packages/ray/rllib/agents/dreamer/dreamer_torch_policy.py", line 165, in dreamer_loss
    policy.stats_dict = compute_dreamer_loss(
  File "/opt/venv/lib/python3.9/site-packages/ray/rllib/agents/dreamer/dreamer_torch_policy.py", line 68, in compute_dreamer_loss
    image_loss = -torch.mean(image_pred.log_prob(obs))
  File "/opt/venv/lib/python3.9/site-packages/torch/distributions/independent.py", line 91, in log_prob
    log_prob = self.base_dist.log_prob(value)
  File "/opt/venv/lib/python3.9/site-packages/torch/distributions/normal.py", line 77, in log_prob
    return -((value - self.loc) ** 2) / (2 * var) - log_scale - math.log(math.sqrt(2 * math.pi))
RuntimeError: CUDA out of memory. Tried to allocate 286.10 GiB (GPU 0; 23.70 GiB total capacity; 1.51 GiB already allocated; 19.56 GiB free; 1.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
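
One common way a `log_prob` call ends up requesting an allocation wildly out of proportion to the batch (not confirmed to be what is happening here) is shape broadcasting between `value` and `self.loc` in the `(value - self.loc) ** 2` term shown in the traceback. A minimal PyTorch sketch of that failure mode, with made-up shapes:

```python
import torch
from torch.distributions import Independent, Normal

# Hypothetical shapes, for illustration only: a Dreamer-style image model with
# a (batch, time) batch shape and a (64, 64, 3) image event shape.
loc = torch.zeros(8, 8, 64, 64, 3)
image_pred = Independent(Normal(loc, torch.ones_like(loc)),
                         reinterpreted_batch_ndims=3)

# Correctly shaped observations: log_prob intermediates match the input size.
obs_ok = torch.zeros(8, 8, 64, 64, 3)
print(image_pred.log_prob(obs_ok).shape)   # torch.Size([8, 8])

# An observation tensor with an extra mismatched dimension does not raise;
# Normal.log_prob computes (value - self.loc) ** 2, and broadcasting silently
# inflates that intermediate by the mismatched factor. On GPU with real batch
# sizes, the same effect shows up as one enormous CUDA allocation inside
# log_prob.
obs_bad = torch.zeros(8, 1, 8, 64, 64, 3)
print((obs_bad - loc).shape)               # torch.Size([8, 8, 8, 64, 64, 3])
print(image_pred.log_prob(obs_bad).shape)  # torch.Size([8, 8, 8])
```

If that is the mechanism here, the fix would be to make `obs` match `image_pred`'s batch and event shape before calling `log_prob`.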

Versions / Dependencies

Ray: 2.0.0.dev0
Python: 3.9.5
Ubuntu: 18.04

Reproduction script

See above: `rllib train -f dreamer.yaml` with the tuned config file. A rough Python-level equivalent is sketched below.
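
Running the tuned example from a script or a debugger is roughly equivalent to the sketch below (an approximation, not the exact CLI code path), assuming the same dreamer.yaml is saved locally. The config contents are not repeated here, only loaded from the file.

```python
# Rough Python-level equivalent of `rllib train -f dreamer.yaml` (a sketch,
# not the exact CLI code path). Assumes dreamer.yaml is in the working dir.
import yaml

import ray
from ray import tune

with open("dreamer.yaml") as f:
    # The tuned-example YAML maps an experiment name to its spec
    # ({"run": ..., "env": ..., "config": {...}}).
    experiments = yaml.safe_load(f)

ray.init()
tune.run_experiments(experiments)
```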

Anything else

No response

Are you willing to submit a PR?

krfricke commented 2 years ago

cc @avnishn @sven1977