styler00dollar / VSGAN-tensorrt-docker

Using VapourSynth with super resolution and interpolation models and speeding them up with TensorRT.
BSD 3-Clause "New" or "Revised" License

CUDA out of Memory #5

Closed mmkzer0 closed 2 years ago

mmkzer0 commented 2 years ago

System Specs: Ryzen 9 5900HX, NVIDIA 3070 Mobile, Arch Linux (EndeavourOS) on kernel 5.17.2

Whenever I try to run a model that relies on CUDA, for example cugan, the program exits with

Error: Failed to retrieve frame 0 with error: CUDA out of memory. Tried to allocate 148.00 MiB (GPU 0; 7.80 GiB total capacity; 5.53 GiB already allocated; 68.56 MiB free; 5.69 GiB reserved in total by PyTorch)

and stops after outputting 4 frames.

However, TensorRT works fine for models that support it (like RealESRGAN).

Edit: Running nvidia-smi while the command is executed reveals that vspipe is allocating GPU memory, but <2 GiB of VRAM, far from the 8 GiB my card has.
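
For reference, PyTorch's own memory counters can be queried directly inside the process running the model (a minimal sketch; nvidia-smi and these numbers rarely match, since the caching allocator reserves more than live tensors occupy):

```python
import torch

# PyTorch's view of GPU memory: "reserved" is roughly what nvidia-smi sees
# for the process, while "allocated" is what live tensors actually occupy.
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MiB")
```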

styler00dollar commented 2 years ago

Different models and implementations have different requirements. In its current state, cugan does have tiling added, so you can just adjust the tile size. I did hear from one person that 8 GB of VRAM was not enough for higher resolutions though, so your GPU might simply not be sufficient and you would have to accept that. I was mainly testing with 16 GB of VRAM.
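
To illustrate the idea behind tiling (a minimal sketch, not the repo's actual API; `model`, the tile size, and the scale factor are placeholders):

```python
import torch

def upscale_tiled(model, frame, tile=256, scale=2):
    """Upscale a (1, C, H, W) GPU tensor tile by tile to cap peak VRAM.

    Smaller `tile` lowers memory pressure at the cost of more forward passes.
    (Real tiled upscalers also overlap tiles to hide seams; omitted here.)
    """
    _, c, h, w = frame.shape
    out = torch.empty(1, c, h * scale, w * scale, device=frame.device)
    with torch.inference_mode():
        for y in range(0, h, tile):
            for x in range(0, w, tile):
                patch = frame[:, :, y:y + tile, x:x + tile]
                out[:, :, y * scale:(y + patch.shape[2]) * scale,
                       x * scale:(x + patch.shape[3]) * scale] = model(patch)
    return out
```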

mmkzer0 commented 2 years ago

Thank you for the explanation. I will try adjusting the tiling size and report back on my findings.

If I may ask: if the amount of VRAM my GPU has is the problem with CUDA models, why does TensorRT work?

styler00dollar commented 2 years ago

The architectures cugan, esrgan and compact are fundamentally different. cugan in particular is known to eat VRAM. Not only the backend plays a role in VRAM usage; what the model actually does is also important. Some architectures require more resources than others.
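
As a quick way to compare architectures, peak allocation can be measured per model (a minimal sketch; `model` and `frame` stand in for any loaded network and input tensor):

```python
import torch

def peak_vram_mib(model, frame):
    """Run one forward pass and report PyTorch's peak allocation in MiB."""
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        model(frame)
    return torch.cuda.max_memory_allocated() / 2**20
```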

TensorRT also does certain things under the hood to try to avoid running out of memory. The TensorRT backends that use the Python APIs have sizes hardcoded or use very small models, which should fit in 8 GB of VRAM, while with the C++ API the tool trtexec tries to avoid problems and can do that to some degree during engine creation. I have 3 different TensorRT APIs in my code, so that's the best summary I can give.
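
For context, a minimal sketch of building an engine with the TensorRT Python API while capping the builder's scratch memory; the ONNX path, the shapes, and the input tensor name "input" are assumptions, not this repo's actual settings:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # hypothetical model path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# Cap the builder's workspace so engine creation itself stays within VRAM.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)  # 2 GiB

# Bounding the input shape lets TensorRT plan memory up front.
profile = builder.create_optimization_profile()
profile.set_shape("input",  # tensor name is an assumption
                  (1, 3, 256, 256), (1, 3, 720, 1280), (1, 3, 1080, 1920))
config.add_optimization_profile(profile)

engine = builder.build_serialized_network(network, config)
```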

styler00dollar commented 2 years ago

I guess I can close this issue, since this is just a hardware limitation.