hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/

torch.cuda.OutOfMemoryError: CUDA out of memory. #419

Closed · TYang92677626 closed this issue 3 months ago

TYang92677626 commented 4 months ago

I am running inference on 4 GPUs (Quadro RTX 6000, 24 GB each) and keep getting CUDA out-of-memory errors. With both 1 GPU and 4 GPUs, the run fails while reportedly only a few dozen MiB short. My batch_size is 1 and the other parameters have been reduced accordingly, yet it still reports insufficient GPU memory. Using multiple GPUs did not solve the problem either. Does Open-Sora require that a single GPU has enough memory on its own? Has anyone run into the same problem, and how did you solve it?

(env_py310) [opensora@localhost scripts]$ nvidia-smi
Wed May 29 15:18:16 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 6000                Off |   00000000:1A:00.0 Off |                  Off |
| 32%   33C    P0             64W /  260W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Quadro RTX 6000                Off |   00000000:1B:00.0 Off |                  Off |
| 34%   37C    P0             68W /  260W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Quadro RTX 6000                Off |   00000000:88:00.0 Off |                  Off |
| 35%   32C    P0             66W /  260W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Quadro RTX 6000                Off |   00000000:89:00.0 Off |                    0 |
| 33%   31C    P2             48W /  260W |       1MiB /  23040MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Below is the command used to run inference:

(env_py310) [opensora@localhost Open-Sora]$ CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 scripts/inference.py configs/opensora-v1-1/inference/sample.py --prompt "A beautiful sunset over the city" --num-frames 16 --image-size 240 426
Part of the log is as follows:
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/soft/aconda_soft/Open-Sora/scripts/inference.py", line 189, in <module>
[rank3]:     main()
[rank3]:   File "/home/soft/aconda_soft/Open-Sora/scripts/inference.py", line 163, in main
[rank3]:     samples = scheduler.sample(
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/opensora/schedulers/iddpm/__init__.py", line 68, in sample
[rank3]:     model_args = text_encoder.encode(prompts)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 192, in encode
[rank3]:     caption_embs, emb_masks = self.t5.get_text_embeddings(text)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 130, in get_text_embeddings
[rank3]:     text_encoder_embs = self.model(
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1975, in forward
[rank3]:     encoder_outputs = self.encoder(
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1110, in forward
[rank3]:     layer_outputs = layer_module(
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 694, in forward
[rank3]:     self_attention_outputs = self.layer[0](
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 601, in forward
[rank3]:     attention_output = self.SelfAttention(
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 523, in forward
[rank3]:     key_states = project(
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 497, in project
[rank3]:     hidden_states = shape(proj_layer(hidden_states))
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
[rank3]:     return F.linear(input, self.weight, self.bias)
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU  has a total capacity of 22.15 GiB of which 2.62 MiB is free. Including non-PyTorch memory, this process has 22.14 GiB memory in use. Of the allocated memory 21.90 GiB is allocated by PyTorch, and 48.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/soft/aconda_soft/Open-Sora/scripts/inference.py", line 189, in <module>
[rank0]:     main()
[rank0]:   File "/home/soft/aconda_soft/Open-Sora/scripts/inference.py", line 163, in main
[rank0]:     samples = scheduler.sample(
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/opensora/schedulers/iddpm/__init__.py", line 68, in sample
[rank0]:     model_args = text_encoder.encode(prompts)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 192, in encode
[rank0]:     caption_embs, emb_masks = self.t5.get_text_embeddings(text)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 130, in get_text_embeddings
[rank0]:     text_encoder_embs = self.model(
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1975, in forward
[rank0]:     encoder_outputs = self.encoder(
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 1110, in forward
[rank0]:     layer_outputs = layer_module(
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 694, in forward
[rank0]:     self_attention_outputs = self.layer[0](
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 601, in forward
[rank0]:     attention_output = self.SelfAttention(
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 526, in forward
[rank0]:     value_states = project(
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py", line 497, in project
[rank0]:     hidden_states = shape(proj_layer(hidden_states))
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
[rank0]:     return F.linear(input, self.weight, self.bias)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 
W0529 15:14:36.836000 139722452449088 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 163929 closing signal SIGTERM
W0529 15:14:36.837000 139722452449088 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 163930 closing signal SIGTERM
E0529 15:14:42.369000 139722452449088 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 163928) of binary: /home/proot/anaconda3/envs/env_py310/bin/python
Traceback (most recent call last):
  File "/home/proot/anaconda3/envs/env_py310/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/proot/anaconda3/envs/env_py310/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/inference.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-29_15:14:36
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 163931)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-29_15:14:36
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 163928)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
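
Note that the traceback above fails inside the T5 text encoder, before diffusion sampling even begins, and only ~48 MiB is reserved-but-unallocated, so the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the error message is unlikely to recover more than a few tens of MiB. A minimal, hypothetical helper (not part of the Open-Sora repository) for confirming how much memory is actually free on each visible card before launching torchrun might look like this:

import torch

# Hypothetical check, not part of Open-Sora: print free/total memory per visible GPU.
# torch.cuda.mem_get_info wraps cudaMemGetInfo, so it also accounts for memory
# held by other processes on the same device.
def report_gpu_memory():
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)
        print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")

if __name__ == "__main__":
    report_gpu_memory()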
tonney007 commented 4 months ago

Same here. Is there a stated minimum GPU memory requirement?

JThh commented 3 months ago

You may try using a single card. I benchmarked on a slice of an A100 80GB (about 40192 MiB), and it took ~20.18 GB to run the exact same command as yours, i.e. torchrun --nproc_per_node=1 scripts/inference.py configs/opensora-v1-1/inference/sample.py --prompt "A beautiful sunset over the city" --num-frames 16 --image-size 240 426.

The benchmark script is at https://github.com/JThh/Open-Sora/blob/benchmem/scripts/inference.py.
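
For reference, a minimal sketch of the kind of peak-memory measurement such a benchmark can perform (hypothetical; the actual script is in the branch linked above):

import torch

# Hypothetical sketch, not the linked benchmark script: report peak CUDA memory
# allocated by PyTorch around an arbitrary callable on the current device.
def measure_peak_memory(run_inference):
    torch.cuda.reset_peak_memory_stats()       # clear previously recorded peaks
    run_inference()                            # e.g. the sampling call in scripts/inference.py
    peak = torch.cuda.max_memory_allocated()   # bytes allocated by PyTorch at the peak
    print(f"Peak allocated: {peak / 2**30:.2f} GiB")
    return peak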

TYang92677626 commented 3 months ago

So, does that significantly lower the memory requirement?

I don't quite understand what you mean.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 7 days with no activity.

zhengzangw commented 3 months ago
[Screenshot (2024-06-22): memory requirement table for Open-Sora 1.2]

The above shows the memory requirements for Open-Sora 1.2. Open-Sora 1.1 does not support sequence parallelism, so adding GPUs does not reduce the per-GPU memory footprint. Please try Open-Sora 1.2.
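
A toy illustration of why extra GPUs do not help without sequence parallelism: every rank replicates the model and the full activations, so per-GPU memory stays flat as GPUs are added. The GiB figures below are placeholders for illustration, not measured Open-Sora requirements.

# Toy arithmetic only; the GiB figures are assumed, not measured.
def per_gpu_memory_gib(model_gib, activation_gib, world_size, sequence_parallel):
    if sequence_parallel:
        # Activations are sharded across ranks along the sequence dimension.
        return model_gib + activation_gib / world_size
    # Without sequence parallelism, every rank holds the whole workload.
    return model_gib + activation_gib

print(per_gpu_memory_gib(12.0, 10.0, world_size=4, sequence_parallel=False))  # 22.0
print(per_gpu_memory_gib(12.0, 10.0, world_size=4, sequence_parallel=True))   # 14.5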

GallonDeng commented 1 month ago

Using 4x RTX 4090, it always runs out of memory: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 scripts/inference.py configs/opensora-v1-2/inference/sample.py --num-frames 4s --resolution 480p --aspect-ratio 9:16 --prompt "a beautiful waterfall"