[Open] nagadomi opened this issue 1 year ago
MiDaS has the problem described in https://github.com/pytorch/pytorch/issues/8637 and cannot be used with nn.DataParallel.
That problem was fixed by https://github.com/nagadomi/MiDaS_iw3/commit/22193f41f05c8099489aa2124c86ba1c951e93d7 , but there still seems to be a register_forward_hook problem on multiple GPUs.
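For background, this is the generic pattern that conflicts with nn.DataParallel: MiDaS-style backbones capture intermediate features with forward hooks that write into a shared dict, while DataParallel runs one model replica per GPU in parallel threads. A minimal sketch of that failure mode (illustration only, not iw3's actual code; the backbone/activations names are made up):

```python
import torch
import torch.nn as nn

activations = {}  # one dict shared by every GPU replica

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output  # replicas on different GPUs all write here
    return hook

backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
backbone[0].register_forward_hook(save_activation("layer1"))

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(backbone.cuda(), device_ids=[0, 1])
    out = model(torch.randn(8, 3, 32, 32).cuda())
    # The replicas run in parallel threads, so the saved feature map is from
    # whichever replica wrote last and may live on cuda:0 or cuda:1.
    print(activations["layer1"].device if "layer1" in activations else "hook did not fire")
```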
The register_forward_hook problem was fixed by https://github.com/nagadomi/MiDaS_iw3/commit/0da1ad010b603b179079ada153af708c3c7021c0 and https://github.com/nagadomi/ZoeDepth_iw3/commit/55bacaf72c90ba3fbf68ad9e5681cbb6af42c07c
iw3 now works with multiple GPUs.
Updating steps

For git:
# update source code
git pull
# update MiDaS and ZoeDepth
python -m iw3.download_models

For windows_package, run update.bat.

Examples

CLI:
python -m iw3 -i ./tmp/test.mp4 -o ./tmp/ --gpu 0 1 --zoed-batch-size 8
GUI:
- Choose All CUDA Device in the device panel
- Increase Depth Batch Size (should be a multiple of the number of GPUs)

I tested only the 2-GPU case with the Linux CLI.
@elecimage Could you check whether it works?
Oh, thank you. I'll test it soon.
Yes, it works, but it is slower than using 1 GPU.
@elecimage First, I do not have a Windows multi-GPU environment, so I may not be able to solve this problem. In a Linux 2-GPU (Tesla T4 x2) environment, it is possible to achieve roughly 2x FPS.
Here are some possible causes and questions: what is your Depth Batch Size setting? If the batch size is too small, it may be slower due to multi-GPU overhead.
With 2 GPUs it is roughly 1.2x slower than with 1.
I'm using two 2080ti.
I've tried changing the Depth Batch Size several times, but it doesn't make much difference.
OK, I will try to create a Windows VM on cloud and check the behavior.
Maybe fixed by https://github.com/nagadomi/nunif/commit/2b7cbf9625d2fd719e331dfc1c6b813ece20b428. @elecimage Would you please update and try again? On the Virtual Machine I tried, it was about 1.5x faster with 2x GPU.
Oh, thank you. I'll test it soon.
I'm still having problems. It doesn't speed up, and may even be slower. I tested with 1 GPU: it loads all the VRAM and gets about 2 FPS. With 2 GPUs, I also get about 2 FPS and only half of each GPU's VRAM is loaded. The speed is about the same or slightly slower when using 2 GPUs.
When using multiple GPUs, the batch is divided across the GPUs, so for the same batch size setting, each GPU's VRAM usage will be roughly 1/(number of GPUs).
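A minimal sketch of that splitting behavior (generic PyTorch with an assumed 2-GPU setup, not iw3's actual code): nn.DataParallel scatters the input along the batch dimension, so a Depth Batch Size of 8 becomes 4 per GPU.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        # Each replica only sees its slice of the batch.
        print(f"device={x.device}, per-GPU batch={x.shape[0]}")
        return self.conv(x)

if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(TinyNet().cuda(), device_ids=[0, 1])
    x = torch.randn(8, 3, 64, 64).cuda()  # batch size 8
    y = model(x)  # prints per-GPU batch=4 on cuda:0 and cuda:1
```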
In my test above, I tried the following settings. With a 720x720 video: 1 GPU = 2.5 FPS, 2 GPUs = 3.7 FPS.
Depth Model: ZoeD_N
Device: All CUDA Device
Depth Resolution: Default
Depth Batch Size: 8 or 16
Stereo Batch Size: 64
Low VRAM: False
TTA: False
FP16: True
The GPUs are Tesla T4 x2; the T4 is the same architecture generation as the RTX 2080 Ti and should have slightly lower performance. The OS is Windows Server 2022 with the latest NVIDIA driver installed.
For reference: on Linux with a single RTX 3070 Ti, 8 FPS; on Linux with 2x Tesla T4, 5 FPS.
Recent changes:
The issue of FPS not improving with multiple GPUs may be caused by the Windows NVIDIA driver mode (TCC/WDDM, which seems to differ between the Tesla driver and the GeForce driver), so it may not be possible to improve it.
Multi GPU does not work on my PC.
System: Ubuntu 20.04
GPU: RTX 3080 x2
Driver: 560 (not the open kernel module)
[nvidia-smi screenshots: 1 GPU / 2 GPU]
And I get the same result on Windows. I'm pretty sure it has identified all the cards.
ZoeD_Any_N and ZoeD_Any_K do not support multi-GPU. Which model did you try? Try ZoeD_N first.
Using ZoeD_N:
All CUDA: 3.8 FPS
Single GPU: 5.3 FPS
If I manually specify the number of worker threads, it results in an out-of-memory error.
Low VRAM mode, 16 threads: All CUDA 4.01 FPS, single GPU 4.01 FPS.
Using Any_V2_N_S (doesn't seem to support multi GPU): All CUDA 4.85 FPS, single GPU 5.0 FPS.
Comparing All CUDA vs single GPU, none of them seem to improve. A driver/CUDA version problem?
Any_S: 4.91 vs 4.91
Any_B: 4.81 vs 4.81
Any_V2_N_B: 4.77 vs 4.85
ZoeD_K: 4.06 vs 4.2
Multi-GPU DataParallel seems to be working (first screenshot of nvidia-smi). It may just be slow.
Try increasing Depth Batch Size instead of Worker Threads. DataParallel distributes the batch size across multiple GPUs.
Also, you can monitor GPU usage with the following nvidia-smi commands:
watch -n 1 nvidia-smi
or
nvidia-smi -lms
It doesn't work.
Turn off Low VRAM, and decrease Worker Threads or set it to 0. Low VRAM limits the batch size to 1.
The max batch size is 4 (it runs out of VRAM if higher). Multi GPU: 3.84 FPS, single GPU: 5.30 FPS. Does multi GPU reduce performance?
Try setting Stereo Processing Width to auto.
Also, if batch-size=4 works for a single GPU, batch-size=8 should work for multi-GPU (batch-size=4 x 2 GPUs).
With batch-size = 8: multi GPU 4.2 FPS, single GPU 5.9 FPS.
Try closing the application once and then trying again (to avoid out-of-memory errors). Also, the DepthAnything model (Any_B) uses less VRAM, so you can try larger batch sizes.
The multi-GPU feature only supports the depth estimation models, so if there are other bottlenecks, they will not be improved. Try a low-resolution video as well.
Also, when processing multiple videos, the following method is effective: allow multiple iw3 GUI instances to be launched, with each instance running on a different GPU. A scripted CLI version of the same idea is sketched below.
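Sketch of that idea scripted from the CLI instead of the GUI (my own illustration, not an official recipe; the input/output paths are placeholders, while -i/-o/--gpu/--depth-model/--yes are the options already used in this thread): one iw3 process per GPU, each handling a different video.

```python
import subprocess

jobs = [
    ("0", "./tmp/video1.mp4"),  # (gpu id, input video), placeholder paths
    ("1", "./tmp/video2.mp4"),
]
procs = [
    subprocess.Popen([
        "python", "-m", "iw3.cli",
        "-i", path, "-o", "./tmp/out/",
        "--gpu", gpu,
        "--depth-model", "Any_B",
        "--yes",
    ])
    for gpu, path in jobs
]
for p in procs:
    p.wait()  # wait until both conversions finish
```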
Any_B, low-resolution video, batch_size = 16: single GPU 38 FPS, multi GPU 38 FPS.
batch_size = 32: multi GPU 38 FPS, single GPU 38 FPS. The CPU load pattern seems to be different.
I tried All CUDA in a Tesla T4 x 2 Linux environment.
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [04:02<00:00, 7.36it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [05:42<00:00, 5.21it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [05:47<00:00, 5.14it/s]
multi gpu fps: 7.36 single gpu fps: 5.14
With Depth Anything (Any_B), the difference is even smaller.
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 1 --depth-model Any_B --zoed-batch-size 32 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:00<00:00, 14.83it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 32 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:27<00:00, 12.14it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/ --gpu 0 --depth-model Any_B --zoed-batch-size 16 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [02:26<00:00, 12.18it/s]
multi gpu fps: 14.83 single gpu fps: 12.18
I have an idea about another multi-GPU strategy. I plan to test that. (GPU round robin on thread pool)
Maybe it’s because Nvidia has cut some features from gaming graphics cards compared to professional cards. Anyway, I’m looking forward to your new multi-GPU strategy.
I made this change. Recommended settings: Worker Threads = 2x to 4x the number of GPUs, and a small Batch Size.
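For reference, a minimal sketch of the round-robin idea as described above (an illustration under my own assumptions, not the actual nunif implementation; RoundRobinRunner, model_factory and workers_per_gpu are made-up names): a thread pool runs a few workers per GPU, and each submitted batch goes to the next GPU in turn, so small batches are enough to keep every GPU busy.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import threading

import torch

class RoundRobinRunner:
    def __init__(self, model_factory, device_ids, workers_per_gpu=2):
        # One model replica per GPU; replicas are only read at inference time.
        self.models = {d: model_factory().to(f"cuda:{d}").eval() for d in device_ids}
        self.devices = cycle(device_ids)
        self.lock = threading.Lock()
        self.pool = ThreadPoolExecutor(max_workers=workers_per_gpu * len(device_ids))

    def _next_device(self):
        with self.lock:  # cycle() is not thread-safe by itself
            return next(self.devices)

    def _infer(self, batch):
        d = self._next_device()
        with torch.inference_mode():
            return self.models[d](batch.to(f"cuda:{d}")).cpu()

    def submit(self, batch):
        # Returns a Future, so CPU-side work can overlap with GPU inference.
        return self.pool.submit(self._infer, batch)
```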
T4 x2 + Linux + 8 cores (when tested above, it was 2 cores...)
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:29<00:00, 19.97it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [00:57<00:00, 30.90it/s]
Old code for comparison
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:45<00:00, 16.87it/s]
% python -m iw3.cli -i ./tmp/1080p.mp4 -o ./tmp/out --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
1080p.mp4: 100%|█████████████████| 1786/1786 [01:22<00:00, 21.69it/s]
Single GPU performance is also improved.
On T4 x2 + Windows Server: multi GPU 22 FPS, single GPU 18 FPS. Very little difference.
It seems not to work on my PC.
Any_B:
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.47it/s]
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model Any_B --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [01:08<00:00, 32.55it/s]
ZoeD_N:
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [02:49<00:00, 13.14it/s]
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes
fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.96it/s]
Maybe the CPU or I/O is the bottleneck, and single-GPU performance is already high relative to it. A single RTX 3080 is about 2x faster than a single T4.
Is the single-GPU performance of --gpu 0 and --gpu 1 the same?
I changed part of the https://github.com/nagadomi/nunif/issues/59#issuecomment-2322922142 change so that it is only enabled when the --cuda-stream option is specified. #213
GPU 0 performs a little differently from GPU 1, and multi GPU still does not work. Changing the SSD did not help.
Any_B:
--zoed-batch-size 4 --max-workers 8 --yes
CPU: 13600KF
SSD: RD20
GPU0:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:31<00:00, 24.32it/s]
GPU1:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:37<00:00, 22.85it/s]
Multi Gpu:
fz.mkv: 100%|████████████████▉| 2230/2232 [01:14<00:00, 30.10it/s]
CPU load state: [screenshot]
With --cuda-stream added:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [01:20<00:00, 27.60it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.78it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [01:03<00:00, 35.27it/s]
CPU load state: [screenshot]
After changing the SSD to an Optane 900P, without --cuda-stream:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [01:32<00:00, 24.23it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [01:28<00:00, 25.22it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [01:15<00:00, 29.61it/s]
With --cuda-stream added:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [01:16<00:00, 29.19it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [01:07<00:00, 33.22it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [01:04<00:00, 34.47it/s]
CPU load state: [screenshot]
ZoeD_N:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [02:20<00:00, 15.92it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [02:16<00:00, 16.36it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [03:28<00:00, 10.70it/s]
With --cuda-stream added:
GPU0: fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.91it/s]
GPU1: fz.mkv: 100%|████████████████▉| 2230/2232 [02:52<00:00, 12.95it/s]
Multi GPU: fz.mkv: 100%|████████████████▉| 2230/2232 [03:23<00:00, 10.95it/s]
Disabling the efficiency cores on the 13600K did not help either.
python -m iw3.cli -i /home/ohjoij/视频/fz.mkv -o /home/ohjoij/视频/test.mkv --gpu 0 1 --depth-model ZoeD_N --zoed-batch-size 4 --max-workers 8 --yes --cuda-stream
fz.mkv: 100%|████████████████▉| 2230/2232 [03:13<00:00, 11.55it/s]
I think the multi-GPU feature is working, but it is simply not efficient. Python threads are difficult to parallelize properly because of the Global Interpreter Lock. multiprocessing might solve the problem, but I am still hesitant to do it because it needs a lot of changes.
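For what it's worth, a rough sketch of what a multiprocessing variant could look like (my own assumption, not a change proposed in this thread; gpu_worker and the dummy nn.Identity model are placeholders): one worker process per GPU, each with its own Python interpreter, fed batches through queues so the GIL no longer serializes the GPUs.

```python
import torch
import torch.multiprocessing as mp

def gpu_worker(device_id, in_queue, out_queue):
    device = f"cuda:{device_id}"
    model = torch.nn.Identity().to(device)  # stand-in for loading a depth model
    while True:
        item = in_queue.get()
        if item is None:  # poison pill -> shut down
            break
        idx, batch = item
        with torch.inference_mode():
            out_queue.put((idx, model(batch.to(device)).cpu()))

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # required for CUDA in child processes
    num_gpus = torch.cuda.device_count()
    assert num_gpus >= 1, "this sketch assumes at least one CUDA device"
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=gpu_worker, args=(d, in_q, out_q)) for d in range(num_gpus)]
    for p in procs:
        p.start()
    for i in range(8):  # enqueue dummy batches
        in_q.put((i, torch.randn(4, 3, 64, 64)))
    results = [out_q.get() for _ in range(8)]
    for _ in procs:
        in_q.put(None)
    for p in procs:
        p.join()
```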
OK, I see. Thank you for your patient answers.
from https://github.com/nagadomi/nunif/discussions/28#discussioncomment-7247459