rlaphoenix / VSGAN

PyTorch-based Super-Resolution and Restoration Image Processing Module for VapourSynth
https://vsgan.phoeniix.dev
MIT License

Using multiple GPUs #24

Open superyu1337 opened 2 years ago

superyu1337 commented 2 years ago

Hey there, I have two GPUs in my system that I'd like to use for the upscaling process. Is there any way to get VSGAN to run on both GPUs? Currently only my primary GPU is being used by VSGAN, which leaves a lot of performance on the table.

rlaphoenix commented 2 years ago

There isn't currently a built-in way, no, but you could interleave the work and send every 2nd frame to your 2nd GPU.

E.g., near the end of the script, use SelectEvery to split the video into 2 clip variables, where variable 1 (clip_a) is frames 0, 2, 4, 6 ..., and variable 2 (clip_b) is frames 1, 3, 5, 7 .... Then pass them into 2 separate VSGAN instances, each with a different GPU specified as the device. Finally, take clip_a and clip_b and weave them back together (e.g., with Interleave).

I can't say that this will actually improve performance, but why not give it a shot.
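
Something along these lines (just a rough sketch of the idea; clip stands for whatever RGB clip you already feed into VSGAN, and model for your model path, both placeholders here):

import vapoursynth as vs
from vsgan import ESRGAN
core = vs.core

# Rough sketch: split even/odd frames across two GPUs, then weave them back together.
clip_a = core.std.SelectEvery(clip=clip, cycle=2, offsets=0)  # frames 0, 2, 4, ...
clip_b = core.std.SelectEvery(clip=clip, cycle=2, offsets=1)  # frames 1, 3, 5, ...

clip_a = ESRGAN(clip_a, device="cuda:0").load(model).apply().clip
clip_b = ESRGAN(clip_b, device="cuda:1").load(model).apply().clip

clip = core.std.Interleave(clips=[clip_a, clip_b])  # restores the original frame order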

superyu1337 commented 2 years ago

Hmm, I'll try that out. Thanks!

rlaphoenix commented 2 years ago

> Hmm, I'll try that out. Thanks!

Do let me know how it goes, because if it works and shows genuine benefit, then I may add this as an option directly within VSGAN when multiple devices are supplied.

superyu1337 commented 2 years ago

I just tried it, and I got an FPS increase from 7 up to 12. I'm upscaling some 480p anime with animevideov3 on a GTX 1060 and a GTX 970.

Edit: Accidentally closed this issue lol.

rlaphoenix commented 2 years ago

I don't currently have 2 GPUs in my system, so I was wondering if you could send me the important bits of the script that you can confirm works, just so I have something to base the direct implementation on.

superyu1337 commented 2 years ago

Sorry for the late answer; here is my script.

import vapoursynth as vs
from vsgan import ESRGAN

core = vs.core

c = core.lsmas.LWLibavSource(source="./input.mkv")

# Convert to 16-bit RGB for VSGAN
c = c.fmtc.resample(css="444")
c = c.fmtc.matrix(mat="601", col_fam=vs.RGB)
c = c.fmtc.bitdepth(bits=16)

# Split into even and odd frames
ca = core.std.SelectEvery(clip=c, cycle=2, offsets=0)
cb = core.std.SelectEvery(clip=c, cycle=2, offsets=1)

# Upscale each half on its own GPU
ca = ESRGAN(ca, device="cuda:0").load(r'/home/janek/realesrgan-models/realesr-animevideov3.pth').apply().clip
cb = ESRGAN(cb, device="cuda:1").load(r'/home/janek/realesrgan-models/realesr-animevideov3.pth').apply().clip

# Weave the halves back together in the original frame order
c = core.std.Interleave(clips=[ca, cb])

# Convert back to 10-bit 4:2:0 YUV for output
c = c.fmtc.matrix(mat="601", col_fam=vs.YUV, bits=16)
c = c.fmtc.resample(css="420")
c = c.fmtc.bitdepth(bits=10)

c.set_output()

rlaphoenix commented 2 years ago

Yep, that's what I expected. Since that genuinely did help performance, I'll implement it directly now.

One thing I realized, though, is that for at least ESRGAN (and maybe others) I could do this a lot more easily by just splitting the workload across separately initialized torch devices based on the frame number during apply(). I'll implement this now, but I don't know how soon it will come to a stable release, as I have to fix the doc builds before I can release; they are currently broken because VapourSynth can't be installed on Read the Docs' end.
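
The idea is roughly the following (a minimal sketch of the round-robin selection only, not VSGAN's actual implementation; the pick_device helper and the devices list are illustrative):

import torch

# Devices supplied by the user, in order.
devices = [torch.device("cuda:0"), torch.device("cuda:1")]

def pick_device(frame_n: int) -> torch.device:
    # Even frames go to the first device, odd frames to the second, and so on.
    return devices[frame_n % len(devices)]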

rlaphoenix commented 2 years ago

Take a look at the multi-gpu branch.

If you could test that out for me, it would be highly appreciated. If you could also check that the performance difference between single and multi-GPU is what you'd expect, that would be amazing. There's currently no build for it on PyPI (pip install), but you can follow this guide to install from source (just make sure you install from the right branch): Installing from Source Code

Since the docs don't currently build, here's a brief explanation of the changes for now.

The only change you need to make from typical VSGAN usage is to change ESRGAN(clip, device="bla") to e.g. ESRGAN(clip, "cuda:0", "cuda:1"), or whatever torch device specifiers you wish to use. You could even do "cpu", "cuda:0" if you were crazy enough.
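
Applied to your earlier script, that would look roughly like this (assuming the multi-device call described above, on the multi-gpu branch, with the same clip c and model path as before):

from vsgan import ESRGAN

# One ESRGAN instance given multiple torch devices; the branch splits frames across them.
c = ESRGAN(c, "cuda:0", "cuda:1") \
    .load(r'/home/janek/realesrgan-models/realesr-animevideov3.pth') \
    .apply() \
    .clip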

superyu1337 commented 2 years ago

Installed the branch by modifying the AUR package. I got an error that I had already encountered yesterday when using a high enough resolution as input (I'm assuming it's some kind of out-of-memory error):

terminate called after throwing an instance of 'std::out_of_range'

rlaphoenix commented 2 years ago

I'm not sure what that error is from, or whether it's from here exactly, but I did have a few mistakes propagating from old commits from months ago that never made it into a stable release. I fixed those and updated the multi-gpu branch with the changes as well. Can you try again now?

If you get that same error again, as much information on it as possible would be great: where it was thrown/displayed, and any further stack information you have.

superyu1337 commented 2 years ago

Tried the update, still getting an error.

➜  upscaling-shenanigans vspipe --y4m upscale-multigpu-branch.vpy  - | mpv - 
Deprecated option --y4m specified, use -c y4m instead
[file] Reading from stdin...
Warning: /usr/lib/python3.10/site-packages/vsgan/utilities.py:36: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/utils/tensor_new.cpp:1105.)
  torch.frombuffer(

terminate called after throwing an instance of 'std::out_of_range'
  what():  map::at

Exiting... (Quit)
[1]    2323090 IOT instruction (core dumped)  vspipe --y4m upscale-multigpu-branch.vpy - | 
       2323091 exit 4                         mpv -

The error only seems to happen if I use more than one device; whether it's "cuda:0" and "cuda:1" or "cuda:1" and "cpu" doesn't matter. The error also doesn't seem to be related to the size of the input in this case; trying it with 144p as input still throws the error.

rlaphoenix commented 2 years ago

The terminate error seems quite strange and is a bit hard to debug; however, the warning about the buffer I do know about. Can you try again as of https://github.com/rlaphoenix/VSGAN/commit/cacc016c9c63cfe8cdb91f8a73ec5906e72cd0bf and see if that helps? Who knows, maybe it fixes the terminate call too.

Also, I've added a 2nd GPU for testing and may have fixed/improved things since I asked you to retry. On my end I have it working, but the current method I'm trying doesn't actually seem to improve performance. It's likely something related to threading: the GPUs aren't at full load, and their load graphs look like spikes, seemingly because it keeps switching between the GPUs one at a time.

I do have an alternative method based on what I originally mentioned to you, which you have tested and confirmed working. I have it implemented, but it's just an overall less ideal method, as it feels like a lazy way of doing it. Sadly, I just don't know why the more direct method isn't using the GPUs at full load. I'll be trying some things out.

rlaphoenix commented 2 years ago

I've cleaned up both the master and multi-gpu branches as of today's commits, with 2 history rewrites to fix 2 blatant issues much earlier in the commit tree, just so a lot of the earlier commits actually work.

I've rebased the multi-gpu branch and fully tested both branches on both EGVSR and ESRGAN, on RGB24, RGBS, and RGB48. I also tested both single and multiple devices on my Windows system, and it worked fine, with no terminate errors or the like. This bug might be Linux-specific, or there might be a scenario I'm not spotting. Maybe the PyTorch version? VapourSynth version? I'm not sure. I currently use PyTorch 1.11.0+cu113 with VapourSynth R59 on Python 3.10.5. If anything differs on your end, maybe give matching that a shot?

The terminate error output looks like some form of C/C++ trace rather than Python, which makes me think it could also be a VapourSynth crash or something at that level.
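
To compare environments, a quick diagnostic along these lines (run in the same Python environment the script uses; nothing VSGAN-specific) prints the relevant versions:

import sys
import torch
import vapoursynth as vs

# Print the versions relevant to the comparison above.
print("Python:", sys.version)
print("PyTorch:", torch.__version__, "CUDA:", torch.version.cuda)
print("VapourSynth:", vs.core.version())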

superyu1337 commented 2 years ago

I don't seem to get that crash anymore. However, performance is really not as good as you said. Upscaling every Nth frame seems to be the fastest.

rlaphoenix commented 2 years ago

> Upscaling every Nth frame seems to be the fastest.

Yeah, but that is effectively what I'm doing here, just at a lower level and without the need to split the clips. It's possible some sort of threading optimization is happening with the SelectEvery+Interleave method in comparison.

With the current method, using both my 2080 Ti and 1080 Ti gets me just over half the performance of my single 2080 Ti. Yet the method we first spoke about, which you tried and had working, works flawlessly and gets me just slightly under the combined performance of the two cards solo, basically the right speed. In Task Manager, only that method puts my GPUs under full load.

However, that method only keeps working if the threads are fast and the GPUs are constantly being fed frames, almost to the point of being overloaded with data. By default on my system core.num_threads is 12, and the good FPS lasts about 20 encoded frames before it slows down to about the speed of my 2080 Ti solo. If I set it to 50, I get to about 100 frames before the same happens, and if I set it to 100, I got to 357 before it happened.

It genuinely seems like we just need to start every single frame ASAP so that the GPUs stay busy, or something along those lines.
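
For reference, the thread count being talked about here is the one set near the top of the .vpy script; raising it looks like this (50 is just the example value mentioned above):

import vapoursynth as vs

core = vs.core

# Number of worker threads VapourSynth uses for concurrent frame processing
# (defaults to the logical CPU count).
core.num_threads = 50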

When my GPUs are like this:

[screenshot: both GPUs at sustained full load]

I got ~1.10 FPS, about the combined speed of my 2 cards.

After some amount of time, depending on core.num_threads, the load eventually becomes spiky:

[screenshot: GPU load dropping to intermittent spikes]

When this happens, my FPS drops to just under the speed of my 2080 Ti alone.

It's as if once the threads get fully consumed, they don't really recover and start running one at a time. I've noticed this with more scripts/systems than just VSGAN too, e.g. deinterlacers.

superyu1337 commented 2 years ago

I believe the issue could be the difference in computing performance between the GPUs, which causes the faster one to idle. The same issue was present when adding my CPU to the mix (Ryzen 9 5950X): both GPUs would just be waiting for it instead of already handling the next frames. Which I guess makes sense, since VapourSynth wants to output frames in sequential order.
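
If that imbalance is the bottleneck, the SelectEvery/Interleave approach from earlier could be split unevenly so the faster GPU gets more frames. A sketch under that assumption (the 2:1 ratio is only an example, not a measured value; c, ESRGAN, and the model path are as in the earlier script):

# Split into three interleaved streams; the faster GPU takes two of them.
c0 = core.std.SelectEvery(clip=c, cycle=3, offsets=0)  # frames 0, 3, 6, ...
c1 = core.std.SelectEvery(clip=c, cycle=3, offsets=1)  # frames 1, 4, 7, ...
c2 = core.std.SelectEvery(clip=c, cycle=3, offsets=2)  # frames 2, 5, 8, ...

model = r'/home/janek/realesrgan-models/realesr-animevideov3.pth'
c0 = ESRGAN(c0, device="cuda:0").load(model).apply().clip  # faster GPU
c1 = ESRGAN(c1, device="cuda:0").load(model).apply().clip  # faster GPU
c2 = ESRGAN(c2, device="cuda:1").load(model).apply().clip  # slower GPU

# Interleaving the three streams restores the original frame order.
c = core.std.Interleave(clips=[c0, c1, c2])

Note that this presumably loads the model twice on the faster device, so it costs some extra VRAM there.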