[Open] nagadomi opened this issue 1 year ago
MiDaS has the problem described in https://github.com/pytorch/pytorch/issues/8637 and cannot be used with nn.DataParallel.
That problem was fixed by https://github.com/nagadomi/MiDaS_iw3/commit/22193f41f05c8099489aa2124c86ba1c951e93d7 , but there still seems to be a register_forward_hook problem on multiple GPUs.
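For background, this is the generic pattern that conflicts with nn.DataParallel: MiDaS-style backbones capture intermediate features with forward hooks that write into a shared dict, while DataParallel runs one model replica per GPU in parallel threads. A minimal sketch of that failure mode (illustration only, not iw3's actual code; the backbone/activations names are made up):

```python
import torch
import torch.nn as nn

activations = {}  # one dict shared by every GPU replica

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output  # replicas on different GPUs all write here
    return hook

backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
backbone[0].register_forward_hook(save_activation("layer1"))

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(backbone.cuda(), device_ids=[0, 1])
    out = model(torch.randn(8, 3, 32, 32).cuda())
    # The replicas run in parallel threads, so the saved feature map is from
    # whichever replica wrote last and may live on cuda:0 or cuda:1.
    print(activations["layer1"].device if "layer1" in activations else "hook did not fire")
```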
The register_forward_hook problem was fixed by https://github.com/nagadomi/MiDaS_iw3/commit/0da1ad010b603b179079ada153af708c3c7021c0 and https://github.com/nagadomi/ZoeDepth_iw3/commit/55bacaf72c90ba3fbf68ad9e5681cbb6af42c07c
iw3 now works with multiple GPUs.
Updating steps

For git:
# update source code
git pull
# update MiDaS and ZoeDepth
python -m iw3.download_models

For windows_package, run update.bat.

Examples

CLI:
python -m iw3 -i ./tmp/test.mp4 -o ./tmp/ --gpu 0 1 --zoed-batch-size 8
GUI:
- Choose All CUDA Device in the device panel
- Increase Depth Batch Size (should be a multiple of the number of GPUs)

I tested only the 2-GPU case with the Linux CLI.
@elecimage Could you check whether it works?
Oh, thank you. I'll test it soon.
Yes, it works, but it is slower than using 1 GPU.
@elecimage First, I do not have a Windows multi-GPU environment, so I may not be able to solve this problem. In a Linux 2-GPU (Tesla T4 x2) environment, it is possible to achieve roughly 2x FPS.
Here are some possible causes and questions: what is your Depth Batch Size setting? If the batch size is too small, it may be slower due to multi-GPU overhead.
With 2 GPUs it is roughly 1.2x slower than with 1.
I'm using two 2080ti.
I've tried changing the Depth Batch Size several times, but it doesn't make much difference.
OK, I will try to create a Windows VM on cloud and check the behavior.
Maybe fixed by https://github.com/nagadomi/nunif/commit/2b7cbf9625d2fd719e331dfc1c6b813ece20b428. @elecimage Would you please update and try again? On the Virtual Machine I tried, it was about 1.5x faster with 2x GPU.
Oh, thank you. I'll test it soon.
I'm still having problems. It doesn't speed up, and may even be slower. I tested with 1 GPU: it loads all the VRAM and gets about 2 FPS. With 2 GPUs, I also get about 2 FPS and only half of each GPU's VRAM is loaded. The speed is about the same or slightly slower when using 2 GPUs.
When using multiple GPUs, the batch is divided across the GPUs, so for the same batch size setting, each GPU's VRAM usage will be roughly 1/(number of GPUs).
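A minimal sketch of that splitting behavior (generic PyTorch with an assumed 2-GPU setup, not iw3's actual code): nn.DataParallel scatters the input along the batch dimension, so a Depth Batch Size of 8 becomes 4 per GPU.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        # Each replica only sees its slice of the batch.
        print(f"device={x.device}, per-GPU batch={x.shape[0]}")
        return self.conv(x)

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(TinyNet().cuda(), device_ids=[0, 1])
    x = torch.randn(8, 3, 64, 64).cuda()  # batch size 8
    y = model(x)  # prints per-GPU batch=4 on cuda:0 and cuda:1
```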
In my test above, I tried the following settings. With a 720x720 video: 1 GPU = 2.5 FPS, 2 GPUs = 3.7 FPS.
Depth Model: ZoeD_N
Device: All CUDA Device
Depth Resolution: Default
Depth Batch Size: 8 or 16
Stereo Batch Size: 64
Low VRAM: False
TTA: False
FP16: True
The GPUs are Tesla T4 x2; the T4 is the same architecture generation as the RTX 2080 Ti and should have slightly lower performance. The OS is Windows Server 2022 with the latest NVIDIA driver installed.
For reference: on Linux with a single RTX 3070 Ti, 8 FPS; on Linux with 2x Tesla T4, 5 FPS.
Recent changes:
The issue of FPS not improving with multiple GPUs may be caused by the Windows NVIDIA driver mode (TCC/WDDM, which seems to differ between the Tesla driver and the GeForce driver), so it may not be possible to improve it.
Multi GPU does not work on my PC.
System: Ubuntu 20.04
GPU: RTX 3080 x2
Driver: 560 (not the open kernel module)
[nvidia-smi screenshots: 1 GPU / 2 GPU]
And I get the same result on Windows. I'm pretty sure it has identified all the cards.
ZoeD_Any_N and ZoeD_Any_K do not support multi-GPU. Which model did you try? Try ZoeD_N first.
Using ZoeD_N:
All CUDA: 3.8 FPS
Single GPU: 5.3 FPS
If I manually specify the number of worker threads, it results in an out-of-memory error.
Low VRAM mode, 16 threads: All CUDA 4.01 FPS, single GPU 4.01 FPS.
Using Any_V2_N_S (doesn't seem to support multi GPU): All CUDA 4.85 FPS, single GPU 5.0 FPS.
Comparing All CUDA vs single GPU, none of them seem to improve. A driver/CUDA version problem?
Any_S: 4.91 vs 4.91
Any_B: 4.81 vs 4.81
Any_V2_N_B: 4.77 vs 4.85
ZoeD_K: 4.06 vs 4.2
Multi-GPU DataParallel seems to be working (first screenshot of nvidia-smi). It may just be slow.
Try increasing Depth Batch Size instead of Worker Threads. DataParallel distributes the batch size across multiple GPUs.
Also, you can monitor GPU usage with the following nvidia-smi commands:
watch -n 1 nvidia-smi
or
nvidia-smi -lms
It doesn't work.
Turn off Low VRAM, and decrease Worker Threads or set it to 0. Low VRAM limits the batch size to 1.
The max batch size is 4 (it runs out of VRAM if higher). Multi GPU: 3.84 FPS, single GPU: 5.30 FPS. Does multi GPU reduce performance?
Try setting Stereo Processing Width to auto.
Also, if batch-size=4 works for a single GPU, batch-size=8 should work for multi-GPU (batch-size=4 x 2 GPUs).
With batch-size = 8: multi GPU 4.2 FPS, single GPU 5.9 FPS.
Try closing the application once and then trying again (to avoid out-of-memory errors). Also, the DepthAnything model (Any_B) uses less VRAM, so you can try larger batch sizes.
The multi-GPU feature only supports the depth estimation models, so if there are other bottlenecks, they will not be improved. Try a low-resolution video as well.
Also, when processing multiple videos, the following method is effective: allow multiple iw3 GUI instances to be launched, with each instance running on a different GPU. A scripted CLI version of the same idea is sketched below.
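Sketch of that idea scripted from the CLI instead of the GUI (my own illustration, not an official recipe; the input/output paths are placeholders, while -i/-o/--gpu/--depth-model/--yes are the options already used in this thread): one iw3 process per GPU, each handling a different video.

```python
import subprocess

jobs = [
    ("0", "./tmp/video1.mp4"),  # (gpu id, input video), placeholder paths
    ("1", "./tmp/video2.mp4"),
]
procs = [
    subprocess.Popen([
        "python", "-m", "iw3.cli",
        "-i", path, "-o", "./tmp/out/",
        "--gpu", gpu,
        "--depth-model", "Any_B",
        "--yes",
    ])
    for gpu, path in jobs
]
for p in procs:
    p.wait()  # wait until both conversions finish
```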
Any_B, low-resolution video, batch_size = 16: single GPU 38 FPS, multi GPU 38 FPS.
batch_size = 32: multi GPU 38 FPS, single GPU 38 FPS. The CPU load pattern seems to be different.
I tried All CUDA in a Tesla T4 x 2 Linux environment.
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [04:02<00:00, 7.36it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [05:42<00:00, 5.21it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [05:47<00:00, 5.14it/s]
multi gpu fps: 7.36 single gpu fps: 5.14
With Depth Anything (Any_B), the difference is even smaller.
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model Any_B --zoed-batch-size 32 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:00<00:00, 14.83it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 32 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:27<00:00, 12.14it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 16 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:26<00:00, 12.18it/s]
multi gpu fps: 14.83 single gpu fps: 12.18
I have an idea about another multi-GPU strategy. I plan to test that. (GPU round robin on thread pool)
Maybe it’s because Nvidia has cut some features from gaming graphics cards compared to professional cards. Anyway, I’m looking forward to your new multi-GPU strategy.
I made this change. Recommended settings: Worker Threads = 2x to 4x the number of GPUs, and a small Batch Size.
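For reference, a minimal sketch of the round-robin idea as described above (an illustration under my own assumptions, not the actual nunif implementation; RoundRobinRunner, model_factory and workers_per_gpu are made-up names): a thread pool runs a few workers per GPU, and each submitted batch goes to the next GPU in turn, so small batches are enough to keep every GPU busy.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import threading

import torch

class RoundRobinRunner:
    def __init__(self, model_factory, device_ids, workers_per_gpu=2):
        # One model replica per GPU; replicas are only read at inference time.
        self.models = {d: model_factory().to(f"cuda:{d}").eval() for d in device_ids}
        self.devices = cycle(device_ids)
        self.lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=workers_per_gpu * len(device_ids))

    def _next_device(self):
        with self.lock:  # cycle() is not thread-safe by itself
            return next(self.devices)

    def _infer(self, batch):
        d = self._next_device()
        with torch.inference_mode():
            return self.models[d](batch.to(f"cuda:{d}")).cpu()

    def submit(self, batch):
        # Returns a Future, so CPU-side work can overlap with GPU inference.
        return self.pool.submit(self._infer, batch)
```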
T4 x2 + Linux + 8 cores (when tested above, it was 2 cores...)
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:29<00:00, 19.97it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [00:57<00:00, 30.90it/s]
Old code for comparison
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:45<00:00, 16.87it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:22<00:00, 21.69it/s]
Single GPU performance is also improved.
On T4 x2 + Windows Server: multi GPU 22 FPS, single GPU 18 FPS. Very little difference.
It seems not to work on my PC.
Any_B:
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.47it/s]
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:08<00:00, 32.55it/s]
ZoeD_N:
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [02:49<00:00, 13.14it/s]
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.96it/s]
Maybe the CPU or I/O is the bottleneck, and single-GPU performance is already high relative to it. A single RTX 3080 is about 2x faster than a single T4.
Is the single-GPU performance of --gpu 0 and --gpu 1 the same?
I changed part of the https://github.com/nagadomi/nunif/issues/59#issuecomment-2322922142 change so that it is only enabled when the --cuda-stream option is specified. #213
GPU 0 performs a little differently from GPU 1, and multi GPU still does not work. Changing the SSD did not help.
Any_B:
--zoed-batch-size 4 --max-workers 8 --yes
CPU: 13600KF
SSD: RD20
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:31<00:00, 24.32it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:37<00:00, 22.85it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:14<00:00, 30.10it/s]
CPU load state: [screenshot]
With --cuda-stream added:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [01:20<00:00, 27.60it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.78it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [01:03<00:00, 35.27it/s]
CPU load state: [screenshot]
After changing the SSD to an Optane 900P, without --cuda-stream:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [01:32<00:00, 24.23it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [01:28<00:00, 25.22it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.61it/s]
With --cuda-stream added:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [01:16<00:00, 29.19it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [01:07<00:00, 33.22it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.47it/s]
CPU load state: [screenshot]
ZoeD_N:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [02:20<00:00, 15.92it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [02:16<00:00, 16.36it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [03:28<00:00, 10.70it/s]
With --cuda-stream added:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.91it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.95it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.95it/s]
Disabling the efficiency cores on the 13600K did not help either.
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes --cuda-stream
fz.mkv: 100%|████████████████▉| 2230/2232 [03:13<00:00, 11.55it/s]
I think the multi-GPU feature is working, but it is simply not efficient. Python threads are difficult to parallelize properly because of the Global Interpreter Lock. multiprocessing might solve the problem, but I am still hesitant to do it because it needs a lot of changes.
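For what it's worth, a rough sketch of what a multiprocessing variant could look like (my own assumption, not a change proposed in this thread; gpu_worker and the dummy nn.Identity model are placeholders): one worker process per GPU, each with its own Python interpreter, fed batches through queues so the GIL no longer serializes the GPUs.

```python
import torch
import torch.multiprocessing as mp

def gpu_worker(device_id, in_queue, out_queue):
    device = f"cuda:{device_id}"
    model = torch.nn.Identity().to(device)  # stand-in for loading a depth model
    while True:
        item = in_queue.get()
        if item is None:  # poison pill -> shut down
            break
        idx, batch = item
        with torch.inference_mode():
            out_queue.put((idx, model(batch.to(device)).cpu()))

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # required for CUDA in child processes
    num_gpus = torch.cuda.device_count()
    assert num_gpus >= 1, "this sketch assumes at least one CUDA device"
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=gpu_worker, args=(d, in_q, out_q)) for d in range(num_gpus)]
    for p in procs:
        p.start()
    for i in range(8):  # enqueue dummy batches
        in_q.put((i, torch.randn(4, 3, 64, 64)))
    results = [out_q.get() for _ in range(8)]
    for _ in procs:
        in_q.put(None)
    for p in procs:
        p.join()
```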
OK, I see. Thank you for your patient answers.
from https://github.com/nagadomi/nunif/discussions/28#discussioncomment-7247459