raclettes closed this issue 3 years ago
By default, torch has no CUDA memory allocated:
>>> print(torch.cuda.memory_summary(device=None, abbreviated=False))
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|===========================================================================|
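The same counters that memory_summary() tabulates can also be read individually. A minimal sketch (not deep-daze code) that falls back gracefully when torch or a GPU is unavailable:

```python
def cuda_mem_stats():
    """Return current/peak caching-allocator stats in bytes, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    dev = torch.cuda.current_device()
    return {
        "allocated": torch.cuda.memory_allocated(dev),       # "Allocated memory" row
        "reserved": torch.cuda.memory_reserved(dev),         # "GPU reserved memory" row
        "peak_allocated": torch.cuda.max_memory_allocated(dev),
    }

print(cuda_mem_stats())
```

Reserved memory is what the caching allocator has grabbed from the driver, so it is always at least as large as the currently allocated amount.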
Running it just after https://github.com/lucidrains/deep-daze/blob/main/deep_daze/deep_daze.py#L168 produces the following output:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 360168 KB | 1374 MB | 12678 MB | 12327 MB |
| from large pool | 347904 KB | 1362 MB | 12629 MB | 12290 MB |
| from small pool | 12264 KB | 13 MB | 49 MB | 37 MB |
|---------------------------------------------------------------------------|
| Active memory | 360168 KB | 1374 MB | 12678 MB | 12327 MB |
| from large pool | 347904 KB | 1362 MB | 12629 MB | 12290 MB |
| from small pool | 12264 KB | 13 MB | 49 MB | 37 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 1396 MB | 1396 MB | 1396 MB | 0 B |
| from large pool | 1382 MB | 1382 MB | 1382 MB | 0 B |
| from small pool | 14 MB | 14 MB | 14 MB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 20760 KB | 25791 KB | 275962 KB | 255202 KB |
| from large pool | 18688 KB | 23808 KB | 224128 KB | 205440 KB |
| from small pool | 2072 KB | 2139 KB | 51834 KB | 49762 KB |
|---------------------------------------------------------------------------|
| Allocations | 351 | 359 | 725 | 374 |
| from large pool | 88 | 92 | 137 | 49 |
| from small pool | 263 | 272 | 588 | 325 |
|---------------------------------------------------------------------------|
| Active allocs | 351 | 359 | 725 | 374 |
| from large pool | 88 | 92 | 137 | 49 |
| from small pool | 263 | 272 | 588 | 325 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 25 | 25 | 25 | 0 |
| from large pool | 18 | 18 | 18 | 0 |
| from small pool | 7 | 7 | 7 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 11 | 12 | 171 | 160 |
| from large pool | 6 | 6 | 15 | 9 |
| from small pool | 5 | 7 | 156 | 151 |
|===========================================================================|
For reference, I have a GeForce RTX 2060.
There's a similar issue happening in: https://github.com/lucidrains/deep-daze/issues/80#issuecomment-798844142
But yeah, you don't have enough VRAM. Most consumer GPUs don't - so don't feel bad. With less than 8 GiB of VRAM it's pretty tough to do, but you might be able to if you set image_width to 256 or lower. There are a lot of people with this issue today, so please check the link for information on how to solve it. I've typed too much for now ha.
Edit: as usual (unfortunately), the best free way to run this program is with the Google Colab notebooks. If you're not opposed to that, you can use it for free (seriously) and you're basically guaranteed a GPU with 16 GB of VRAM. You can find them on the front page of this project ("README.md").
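Why does shrinking image_width help so much? Under the (assumed) rough model that per-image activation memory in a SIREN-style network scales with the pixel count, cost grows quadratically with width, so dropping from 512 to 256 cuts activation memory by about 4x. The baseline_width of 512 here is an assumption for illustration:

```python
# Rough, assumption-laden model: per-image activation memory scales
# with the pixel count (width * width), so relative cost is quadratic.
def relative_activation_cost(image_width, baseline_width=512):
    """Activation memory relative to the baseline width (dimensionless ratio)."""
    return (image_width / baseline_width) ** 2

print(relative_activation_cost(256))  # -> 0.25, i.e. roughly a quarter of the memory
```

This ignores the model weights and CLIP itself, which are fixed costs, so real savings are smaller than the ratio suggests.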
@discordstars
@afiaka87 Oh alright, thanks for the quick response. I'll give it a go with a smaller image width; I already tried a smaller batch size.
For sure, no problem. The most important bit on that page is @NotNANtoN's benchmarks for the 256 image_width while varying batch size (GPU usage on the right). bs is the batch_size; grad_acc stands for --gradient_accumulate_every (so grad_acc 1 means --gradient_accumulate_every=1). It defaults to 4, but you don't need it as much with higher batch sizes.
bs 8, num_layers 48: 5.3 GB
bs 16, num_layers 48: 5.46 GB - 2.0 it/s
bs 32, num_layers 48: 5.92 GB - 1.67 it/s
bs 8, num_layers 44: 5 GB - 2.39 it/s
bs 32, num_layers 44, grad_acc 1: 5.62 GB - 4.83 it/s
bs 96, num_layers 44, grad_acc 1: 7.51 GB - 2.77 it/s
bs 32, num_layers 66, grad_acc 1: 7.09 GB - 3.7 it/s
Keep in mind, your OS (Windows, Linux?) is going to be using some GPU VRAM as well - anywhere from 500 MB to 2 GB in my experience.
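Putting the benchmark numbers and the OS overhead together, here's a small sketch (the table is copied from the benchmarks above; the default 0.5 GB overhead is the low end of the empirical range mentioned here) to sanity-check whether a configuration should fit on a given card:

```python
# Measured VRAM usage in GB at image_width 256, keyed by
# (batch_size, num_layers), copied from the benchmarks above.
BENCHMARK_GB = {
    (8, 48): 5.30, (16, 48): 5.46, (32, 48): 5.92,
    (8, 44): 5.00, (32, 44): 5.62, (96, 44): 7.51, (32, 66): 7.09,
}

def fits(batch_size, num_layers, vram_gb, os_overhead_gb=0.5):
    """True if the benchmarked config fits after subtracting OS overhead."""
    needed = BENCHMARK_GB[(batch_size, num_layers)]
    return needed <= vram_gb - os_overhead_gb

# An RTX 2060 has 6 GB: bs 8 / 48 layers fits only if the OS stays lean.
print(fits(8, 48, 6.0))       # True with 0.5 GB of overhead
print(fits(8, 48, 6.0, 2.0))  # False with 2 GB of overhead
```

So on a 6 GB card the margin is thin: whether a run succeeds can come down to how much VRAM the desktop environment is holding.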
@discordstars Thanks for filing an issue btw! We always appreciate it even if we're too busy to get around to helping everyone.
If you're new to github, make sure you mash that "Close Issue" button if you feel your question's been answered. Do let me know if you manage to get it working on there. It's useful for future users to know if it's even possible.
Not new, but thanks for the reminder.
I'll give it a go with smaller image sizes and batch sizes and update the issue before I close it :)
Edit: and oops, I must have entirely skimmed over the links in the README. I'll do that after too (for the sake of actually getting decent output)
Not new, but thanks for the reminder.
My bad. I try to make as few assumptions as possible about people on here. Hope it didn't come across as patronizing.
@afiaka87 Absolutely not, no worries 😆 just making a remark.
I was able to run with --image-width 256 with the 6 GiB of VRAM. I haven't tried other resolutions but this is working. ~2.84 it/s.
I encounter this error upon running:
I attempted clearing the CUDA cache, but the same error occurred.
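For what it's worth, a common pattern for this (a sketch, not deep-daze's own code): torch.cuda.empty_cache() only hands *cached* blocks back to the driver, so tensors still referenced by Python stay allocated. Dropping references and collecting garbage first is what actually frees anything:

```python
import gc

def release_cuda_cache():
    """Best-effort release of cached CUDA memory; returns True if a cache was cleared."""
    try:
        import torch
    except ImportError:
        return False  # nothing to do without torch
    gc.collect()  # drop unreachable tensors so their blocks become reclaimable
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached, unused blocks to the driver
        return True
    return False

print(release_cuda_cache())
```

If the model and optimizer themselves don't fit, though, no amount of cache clearing helps - the fix is a smaller image_width, batch_size, or num_layers as discussed above.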