parkchamchi / DepthViewer

Unity program that creates 3D scenes for VR using the MiDaS model
MIT License

GPU acceleration/performance improvement? #5

Closed: ThreeDeeJay closed this issue 1 year ago

ThreeDeeJay commented 1 year ago

Hi, I'm getting really nice results with the large models, but performance is terrible (<1 FPS), and my RTX 2080 Ti is barely getting any use. Would GPU acceleration give a significant performance boost? If so, could you please be more specific about how to set up DepthViewer with CUDA/cuDNN? There are so many options that I'm not sure exactly what to install, how, or where to get it from.

parkchamchi commented 1 year ago

Yes, GPU acceleration would boost the performance; I get 13+ fps for dpt_hybrid_384. Prepare CUDA/cuDNN, then in the program open Options -> Model Settings. Select dpt_hybrid_384 (or any other model) and toggle Use GPU for OnnxRuntime. Leave the CUDA dropdown as-is (the others are not implemented). Click the Load button and it will be set if no problem arises.

ThreeDeeJay commented 1 year ago

> Prepare the CUDA/cuDNN

I think this is what I'm missing. I installed the official CUDA installer and ran all the pip install scripts here (screenshot).

But whenever I check Use GPU for OnnxRuntime and click Load, it says Model is not set! (screenshot)

parkchamchi commented 1 year ago

You don't have to install OnnxRuntime via pip, since the dll files are included in the build. I think the issue here is that cuDNN has not been installed. Are there any cudnn*.dll-like files under the CUDA bin folder, or in any directory reachable from PATH? Also, the console output (activated by the backtick [`] key) would be helpful.
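For reference, that PATH check can be scripted. A minimal stdlib sketch (the helper name is mine, not part of DepthViewer):

```python
import fnmatch
import os

def find_in_path(pattern):
    """Return all files in PATH directories whose name matches `pattern` (case-insensitive)."""
    hits = []
    for d in os.environ.get("PATH", "").split(os.pathsep):
        if not os.path.isdir(d):
            continue
        try:
            names = os.listdir(d)
        except OSError:
            continue  # skip unreadable PATH entries
        for name in names:
            if fnmatch.fnmatch(name.lower(), pattern.lower()):
                hits.append(os.path.join(d, name))
    return hits

# If this prints an empty list, cuDNN is not reachable from PATH.
print(find_in_path("cudnn*.dll"))
```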

ThreeDeeJay commented 1 year ago

Ohh, now we're getting somewhere, but not quite there yet. I installed CUDA 11.7, then extracted the cuDNN bin DLL files into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin (screenshot).

And now I'm able to load dpt_hybrid_384 with Use GPU for OnnxRuntime enabled and set to CUDA (screenshot).

However, as soon as I load a file, the whole program crashes. Is there a way to enable logging to check what went wrong?

parkchamchi commented 1 year ago

The log file is Player.log, under C:\Users\<USERNAME>\AppData\LocalLow\parkchamchi\DepthViewer. Please check whether any significant output exists under the line Loading model:

It appears that a fatal error occurred loading onnxruntime_providers_cuda.dll or onnxruntime_providers_shared.dll. My guess is that the cuDNN version does not match the version of CUDA. My setup is CUDA v11.7 and cuDNN v8.2.4. If the problem persists, try walking the dependencies of both DepthViewer.exe and the dll files in ./DepthViewer_Data/Plugins/x86_64.
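The version-matching rule can be made explicit. The sketch below is only an assumption drawn from this thread (that the bundled onnxruntime_providers_cuda.dll was built against CUDA 11.x with cuDNN 8.x); the helper names are mine:

```python
def parse_version(v):
    """'11.7' -> (11, 7); '8.2.4' -> (8, 2, 4)."""
    return tuple(int(p) for p in v.split("."))

def ort_cuda_compatible(cuda, cudnn):
    # Assumption from this thread: the shipped CUDA provider dll expects
    # CUDA 11.x and cuDNN 8.x; other major versions fail to load.
    return parse_version(cuda)[0] == 11 and parse_version(cudnn)[0] == 8

print(ort_cuda_compatible("11.7", "8.2.4"))  # → True (the known-good pairing)
print(ort_cuda_compatible("12.2", "8.9"))    # → False
```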

ThreeDeeJay commented 1 year ago

That was it! I tried cuDNN 8.4.2 and now I'm getting around 10 FPS on the first run, and higher on subsequent plays! (screenshots) By the way, this program also works with NVIDIA 3D Vision, which is what I used to capture full-res 1080p cross-eyed 3D screenshots 👌

Anyhow, I guess the only thing left to sort out is a possible bottleneck. Neither the CPU nor the GPU is over 1/3 usage, but a significant number of frames is still being dropped. Any idea why this is happening? 🤔

parkchamchi commented 1 year ago

Running the same model with a Python OnnxRuntime script (with no 3D visualization whatsoever) gives me ~25 fps, so I guess there is a significant bottleneck here. My guess is that it's a GPU-CPU transfer bottleneck, since the code using ORT fetches the RenderTexture (GPU) into a Texture2D (CPU), converts it to a float array, then converts that to a tensor (GPU).

Testing the built-in model (MiDaS v2.1 small 256): the default Barracuda one, which does not have such CPU-GPU overhead, gets ~500 fps, while the OnnxRuntime one gets ~150 fps, with significant oscillation. What is strange is that the small model's overhead is fairly insignificant, unlike the large one's.

The only reason OnnxRuntime is used at all is that Unity's ML framework Barracuda 3.0.0 wouldn't accept MiDaS v3+ models. I'd update when Barracuda 4.0 comes out and (hopefully) supports the newer models.
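For comparing numbers like these, a tiny throughput harness is enough. This is an illustrative sketch (the dummy `step` stands in for one inference plus the GPU-CPU copies described above, not the actual DepthViewer code):

```python
import time

def measure_fps(step, frames=100):
    """Call `step()` `frames` times and return the achieved frames per second."""
    t0 = time.perf_counter()
    for _ in range(frames):
        step()
    elapsed = time.perf_counter() - t0
    return frames / elapsed

# Stand-in workload; a real benchmark would run the ORT session here.
fps = measure_fps(lambda: sum(range(1000)))
print(f"{fps:.1f} fps")
```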

CubicReg commented 1 year ago

Hi, I have the same Model is not set! message when loading another model. I installed CUDA 12.2 and copied the cuDNN 8.9 DLLs into CUDA's bin directory. The error message I get is:

```
Loading model: ./onnx\dpt_beit_large_512.onnx
OnnxRuntimeDepthModel(): using the provider CUDA
Using gpuid=0
LoadModel(): Got exception: Microsoft.ML.OnnxRuntime.OnnxRuntimeException: [ErrorCode:RuntimeException] D:\a\_work\1\s\onnxruntime\core\session\provider_bridge_ort.cc:1069 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "V:\Games_VR\DepthViewer\Build\DepthViewer_Data\Plugins\x86_64\onnxruntime_providers_cuda.dll"
```

ThreeDeeJay commented 1 year ago

@CubicReg Guess you'll also want to stick to CUDA v11.7 and cuDNN v8.2.4 lol. @parkchamchi So, any idea which model gives the most accurate results while still running in realtime? And which one do you personally use? The default one has good performance but wiggles a lot, and I think it's also a lot jaggier.

On a side note, maybe it'd be nice to save the last used model, since I have to switch from the built-in one every time I run the app 😅

parkchamchi commented 1 year ago

@CubicReg The current ORT version does not support CUDA 12.x.

@ThreeDeeJay I use dpt_hybrid_384 since it's accurate and robust. I agree that a model-preloading option would be convenient; I'd add it later.

ThreeDeeJay commented 1 year ago

Alright, guess I'll close this issue since I technically did get GPU acceleration working and better performance, though I'd love to stay updated on a possible bottleneck fix/workaround, since I want dpt_beit_large_512 doing its magic at full speed.

In the meantime, I made a guide for people new to this on setting up DepthViewer with GPU acceleration and better models, so feel free to adapt it into the readme. I have a feeling a lot of people try this app, see bad accuracy/performance, then just quit and miss out on its full potential:

:fast_forward: = Skip unless you want more accurate results, though possibly worse performance

Some music videos with dpt_beit_large_512: (screenshots omitted)

CubicReg commented 1 year ago

I still have the same Model is not set! error when loading another model after installing the files from your links (https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_516.01_windows.exe and https://developer.nvidia.com/compute/machine-learning/cudnn/secure/8.2.4/11.4_20210831/cudnn-11.4-windows-x64-v8.2.4.15.zip).

The models load fine when I don’t activate the Use GPU for OnnxRuntime option.

The full log when loading a model:

```
Loading model: ./onnx\dpt_swin2_large_384.onnx
OnnxRuntimeDepthModel(): using the provider CUDA
Using gpuid=0
LoadModel(): Got exception: Microsoft.ML.OnnxRuntime.OnnxRuntimeException: [ErrorCode:RuntimeException] D:\a\_work\1\s\onnxruntime\core\session\provider_bridge_ort.cc:1069 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "V:\Games_VR\DepthViewer\Build\DepthViewer_Data\Plugins\x86_64\onnxruntime_providers_cuda.dll"

  at Microsoft.ML.OnnxRuntime.NativeApiStatus.VerifySuccess (System.IntPtr nativeStatus) [0x0002c] in <d7591e6396b14c7b9ff6a962184da8b3>:0 
  at Microsoft.ML.OnnxRuntime.SessionOptions.AppendExecutionProvider_CUDA (System.Int32 deviceId) [0x0000d] in <d7591e6396b14c7b9ff6a962184da8b3>:0 
  at Microsoft.ML.OnnxRuntime.SessionOptions.MakeSessionOptionWithCudaProvider (System.Int32 deviceId) [0x0000d] in <d7591e6396b14c7b9ff6a962184da8b3>:0 
  at OnnxRuntimeDepthModel..ctor (System.String onnxpath, System.String modelType, System.String provider, System.Int32 gpuid, System.String settings) [0x00150] in <638d9a2582cc46beaff79da68ac7e852>:0 
  at DepthModelBehavior.GetDepthModel (System.String onnxpath, System.String modelType, System.Boolean useOnnxRuntime) [0x0003b] in <638d9a2582cc46beaff79da68ac7e852>:0 
  at MainBehavior.LoadModel (System.String onnxpath, System.Boolean useOnnxRuntime) [0x0003d] in <638d9a2582cc46beaff79da68ac7e852>:0 
Failed to load: dpt_swin2_large_384
```

I have the latest NVIDIA drivers on a 4090.

parkchamchi commented 1 year ago

That's weird, can you type nvcc --version on cmd to check if it is on the PATH?

CubicReg commented 1 year ago

> That's weird, can you type nvcc --version on cmd to check if it is on the PATH?

With nvcc --version on cmd I first got 'nvcc' is not recognized as an internal or external command, operable program or batch file. I checked the PATH; in System Variables I had both CUDA_PATH and CUDA_PATH_V11_7 set to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7

I had installed the setup with only the CUDA runtimes.

So I re-installed the CUDA setup with the runtimes and the compiler options under the development section. Still the same error.

Then I re-installed it again with the runtimes, the compiler, and the tools under the development section. Now the nvcc --version command works:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_19:00:59_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
```

But I still get Model is not set! when loading a model with Use GPU for OnnxRuntime. The log in LocalLow\parkchamchi\DepthViewer\Player.log shows the same error message as before, too.

parkchamchi commented 1 year ago

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin should be under PATH, since that directory is where the CUDA/cuDNN dll files (and nvcc.exe) are located. I don't know what you mean by "development section", but since nvcc --version works, I assume it is under PATH. If the cuDNN dll files are in that directory, it should work. But as it doesn't...

If none of these work, I don't know what the problem is; in that case I'd recommend the python-zeromq method, which runs the ML model on the Python side rather than the C#/Unity side.
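These PATH checks can also be automated. A stdlib-only sketch; the only assumption is the default CUDA 11.7 install path mentioned in this thread:

```python
import os
import shutil

# Assumed default install location (from this thread); adjust for your setup.
CUDA_BIN = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin"

def diagnose():
    """Report whether nvcc and the CUDA bin directory are reachable from PATH."""
    return {
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "cuda_bin_in_path": any(
            os.path.normcase(p.rstrip("\\/")) == os.path.normcase(CUDA_BIN)
            for p in os.environ.get("PATH", "").split(os.pathsep)
        ),
    }

print(diagnose())
```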

CubicReg commented 1 year ago

> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin should be under PATH

Right I do have C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\bin in the PATH environment variable.

> I don't know what you mean by "development section"

I was talking about the options in the setup (screenshot of the installer options).

I checked every dependency DLL of DepthViewer_Data\Plugins\x86_64\onnxruntime_providers_cuda.dll; each one is present and has the status "Loading PE file xxxx.dll successful" in Dependency Walker.

I will think about the Python method.

Edit: actually, after unfolding several levels of dependencies in Dependency Walker, I see ext-ms-win-oobe-query-l1-1-0.dll missing, but from what I read on Stack Overflow it can be ignored.

CubicReg commented 1 year ago

I got the models loading via GPU CUDA now, I can’t explain what I did other than restarting the computer. As far as I’m concerned this issue can be closed.

parkchamchi commented 1 year ago

> I got the models loading via GPU CUDA now, I can’t explain what I did other than restarting the computer. As far as I’m concerned this issue can be closed.

Glad to hear that, I think restarting the computer may have affected the CUDA setup.

ThreeDeeJay commented 7 months ago

> Testing the built-in model (MiDaS v2.1 small 256), the default Barracuda one, which does not have such CPU-GPU overhead, has ~500fps; while the OnnxRuntime one has ~150fps, with significant oscillation. But what is weird is that the small model has somewhat insignificant overhead unlike the large one. The only reason OnnxRuntime is used is that Unity's ML framework Barracuda 3.0.0 wouldn't accept MiDaS v3+ models. I'd update it when Barracuda 4.0 comes out and (hopefully) supports the newer models.

@parkchamchi By the way, did you check out performance with Sentis? Apparently it's the successor of Barracuda 3.0:

https://forum.unity.com/threads/unity-sentis.1454530/
https://blog.unity.com/engine-platform/introducing-unity-muse-and-unity-sentis-ai
https://docs.unity3d.com/Packages/com.unity.sentis@1.3/manual/index.html
https://docs.unity3d.com/Packages/com.unity.sentis@1.1/manual/upgrade-guide.html

I still haven't figured out how to get at least 24 FPS video with the high-quality models, not even with CUDA 😔 (screenshots) I know there's the option to pre-generate the whole .depthviewer file, but that's also affected by the performance bottleneck, and I'm not sure if there's a way to force ffpymq to load them 🤔

On a side note, I ran some benchmarks here to compare the output of multiple models. There are more, but I got errors since they're not implemented.

parkchamchi commented 7 months ago

> Sentis

Thanks for letting me know, I'll try it later. Maybe it can relieve the bottleneck.

P.S. The table is great, thank you.

ThreeDeeJay commented 6 months ago

@parkchamchi Thanks, I can confirm Depth Anything in ONNX format now works directly via the Unity app. 👌 However, performance-wise there doesn't seem to be a noticeable change (still low FPS with massive drops on heavy models, and low GPU/CPU usage). Here are some tests I ran: mostly Sentis, except the _ONNX rows, which have onnxruntime checked in the options.

| Model | FPS | CPU | GPU | RAM | Build |
|---|---|---|---|---|---|
| Built-in MiDAS | 100% | 17% | 31% | 3900MB | v0.10.0-beta 1 |
| dpt_hybrid_384+CUDA | 10 | 13% | 54% | 5000MB | v0.10.0-beta 1 |
| depth_anything_vits14+CUDA | 15 | 14% | 47% | 4800MB | v0.10.0-beta 1 |
| depth_anything_vitb14+CUDA | 8 | 9% | 70% | 5000MB | v0.10.0-beta 1 |
| depth_anything_vitl14+CUDA | 5* | 10% | 85% | 8300MB | v0.10.0-beta 1 |
| depth_anything_vitl14_ONNX | 0.2 | 32% | 0% | 7000MB | v0.10.0-beta 1 |
| depth_anything_vitl14_ONNX+CUDA | 8* | 15% | 50% | 7000MB | v0.10.0-beta 1 |
| depth_anything_vitl14_ffpymq+CUDA | 2 | 26% | 70% | 5300MB | v0.10.0-beta 1 |
| dpt_beit_large_512_ONNX+CUDA | 8 | 15% | 50% | 7000MB | v0.9.1 |
| dpt_beit_large_512_ONNX+CUDA | 8 | 15% | 50% | 7000MB | v0.10.0-beta 1 |

*Constant spikes

Did you notice similar performance on BEiT/Depth Anything? I wonder if run_video.py implements optimizations for video performance. Also, what versions of Python, CUDA, cuDNN, and Torch do you use? Someone brought up xformers, which apparently isn't available on CUDA 11.7, so I wonder if implementing that and allowing a CUDA upgrade would help squeeze out a few extra frames 🤔

parkchamchi commented 6 months ago

That run_video.py file seems to behave identically to our script, inferring per frame. (See under while raw_video.isOpened():.)
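In outline, that per-frame structure looks like the sketch below (a stdlib mock stands in for cv2.VideoCapture so the shape is runnable here; the real scripts use OpenCV):

```python
class MockCapture:
    """Minimal stand-in for cv2.VideoCapture with a fixed number of frames."""
    def __init__(self, num_frames):
        self.remaining = num_frames
    def isOpened(self):
        return self.remaining > 0
    def read(self):
        self.remaining -= 1
        return True, "frame-data"

def process_video(raw_video, infer):
    """One depth inference per frame -- the same loop shape as run_video.py."""
    depths = []
    while raw_video.isOpened():
        ok, frame = raw_video.read()
        if not ok:
            break
        depths.append(infer(frame))
    return depths

depths = process_video(MockCapture(3), infer=len)
print(len(depths))  # → 3
```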

> and what version of Python, CUDA, CUDNN and Torch do you use?

python 3.9.6, cuda v11.7, cudnn v8.3.1, torch 2.0.1+cu117

Higher versions of Python/Torch should also work with the existing scripts.