@andrewkchan Thank you for your fantastic work in making the MPS backend finally work! I did a quick test of this PR and found it works almost flawlessly on my laptop. My test setup is a customized M2 Max (30 of the 38 GPU cores).
CMake and XCode build:
cmake -G Xcode -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ../
./opensplat ~/Data/banana -n 100 3.76s user 1.75s system 100% cpu 5.465 total
CMake and make build:
cmake -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ../
./opensplat ~/Data/banana -n 100 4.05s user 1.78s system 101% cpu 5.741 total
This PR may be the first to make it possible to natively run a 3DGS workload on Apple silicon. I'm super excited about this new feature and expect it will have an impact in this area soon 🚀
Great work! Very excited to start using this :)
Agree! Very exciting @andrewkchan, thanks for the PR. I can't wait to run this on my M1 🙏
I will test/review this sometime today or tomorrow.
Quick question, I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j); did you find some issues with the CPU implementation while working on this?
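For reference, here is a minimal sketch (illustrative only, not the actual rasterizer code) of the row-major indexing convention that line implies:

```cpp
#include <cstddef>
#include <vector>

// Row-major image indexing: i walks rows in [0, height), j walks columns in
// [0, width), so the flat offset is i * width + j. The alternative i * height + j
// happens to visit the same offsets only when width == height, which is why a
// square default image can hide a swapped index.
// img must hold at least width * height elements.
void touchEveryPixel(std::vector<float>& img, int width, int height) {
    for (int i = 0; i < height; ++i) {        // row
        for (int j = 0; j < width; ++j) {     // column
            std::size_t pixIdx = static_cast<std::size_t>(i) * width + j;
            img[pixIdx] = 1.0f;
        }
    }
}
```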
Currently getting this error while trying to run:
...
init_gsplat_metal_context: load function rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: error: load pipeline error: Error Domain=AGXMetal13_3 Code=3 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}
Also noticed that the definition for rasterize_forward_kernel_cpso might be missing?
The error might be related to your macOS version (13.3 vs 14.4). I was able to get through the Metal compile with the latest 14.4.1. Maybe we should consider tweaking CMake like https://github.com/ggerganov/llama.cpp/pull/6370 did.
Also, it seems like there is a memory leak problem on Metal, and I'm trying to resolve it now.
Yep, I'm on 13.2. I'll try a few things, see if I can get it to compile.
Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.
Quick question, I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j), did you find some issues with the CPU implementation while working on this?
Hmm, I don't remember changing this code. Looks like it was from the experimental commit that I based my changes on: https://github.com/pierotofy/OpenSplat/pull/76/commits/472a45a10c2b44d0ea6b8c19de863422b1839f55
Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.
Yeah, I had ported over the ND rasterize function instead of the rasterize by accident but then decided to just use that. It's possible this was causing the slight numerical differences in unit tests. Happy to port over the rasterize_forward_kernel if needed.
Also, it seems like there is a memory leaking problem on metal and I'm trying to resolve it now.
Curious what problem you are running into, since, as noted in the OP, I'm intentionally leaking some resources forever.
The memory consumption keeps growing as the training iteration count increases; e.g. 2000 iters will accumulate 12+ GB of GPU memory.
I added emptyCache for the MPS device, but it seems like there is no visible improvement.
```cpp
// Added include (alongside the existing HIP/CUDA cases):
#elif defined(USE_MPS)
#include <torch/mps.h>

// Empty the caching allocator on whichever GPU backend is active:
if (device != torch::kCPU){
#ifdef USE_HIP
    c10::hip::HIPCachingAllocator::emptyCache();
#elif defined(USE_CUDA)
    c10::cuda::CUDACachingAllocator::emptyCache();
#elif defined(USE_MPS)
    at::detail::getMPSHooks().emptyCache();
#endif
}
```
Also, adding or enforcing autoreleasepool across all methods in gsplat_metal.mm hasn't helped so far. The root cause is probably linked with the deep copy of the MPS tensor.
Happy to port over the rasterize_forward_kernel if needed.
No need, but could certainly be done as part of another PR.
I'm still trying to compile this on 13.2; I've isolated the problem to a call to simd_shuffle_and_fill_down, but it's strange since my machine should have Metal 3 and that function has been available since Metal 2.4. Investigating.
What is the expected memory usage for something like the banana example? It's not great if there is a leak. But I'm not able to find anything using the XCode "Leaks" tool except for two allocations of 128 byte objects. And I thought that memory usage is generally expected to increase over training because the number of gaussians will increase with scheduled splits.
We can use this table as a baseline https://github.com/pierotofy/OpenSplat/issues/3#issuecomment-2002379362
For 2000 iters, the combined memory consumption is around 5.5 GB (CPU: 4.1 GB, GPU: 1.4 GB) with the CUDA version.
Here's what I'm seeing around 1900 iters on the banana (8.7GB):
I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.
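For anyone trying to reproduce this, here is a minimal sketch of that instrumentation, using the getMPSHooks() calls mentioned above. The header path and the idea of logging every N training steps are assumptions, not part of the PR:

```cpp
#ifdef USE_MPS
#include <ATen/detail/MPSHooksInterface.h>  // assumed header for at::detail::getMPSHooks()
#include <iostream>

// Log what the MPS allocator reports, in bytes. Calling this periodically
// (e.g. every few hundred steps) shows whether emptyCache() actually
// shrinks the footprint over training.
void logMpsMemory(int step) {
    const auto& hooks = at::detail::getMPSHooks();
    std::cout << "step " << step
              << " current allocated: " << hooks.getCurrentAllocatedMemory()
              << " driver allocated: " << hooks.getDriverAllocatedMemory()
              << std::endl;
}
#endif
```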
I ended up upgrading to 14.4 and it now runs 🥳
I think there might be something off with the rasterize forward pass, however. This is the result of the Metal renderer after 100 iters on the banana dataset (you can even do just 10 iters):
./opensplat -n 100 --val-render render ./banana
Step 100: 0.326634
Compared to the CPU run:
./opensplat -n 100 --val-render render ./banana --cpu
Step 100: 0.188744
Looks like a width/height mismatch. I had these issues when writing the CPU rasterizer; I recommend using the simple_trainer app with different --width and --height parameters to test (by default they are equal), e.g. ./simple_trainer --width 256 --height 128 --render render
Nice catch! You are totally right. Fixed, and the loss is much lower after 100 iters now: 0.15966 vs 0.286877 for me.
Here's what I'm seeing around 1900 iters on the banana (8.7GB):
I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.
I think the combined memory size we get from the MPS backend is likely within the correct range. With emptyCache(), I was able to achieve a slightly lower memory footprint (7.8 GB vs 8.7 GB). The latest PyTorch doesn't provide a corresponding C++ API (such as c10::mps::MPSCachingAllocator::emptyCache()) to explicitly release the MPS cache. The current workaround (at::detail::getMPSHooks().emptyCache()) was discovered from the corresponding Python API, and I'm still uncertain whether it's truly effective.
Never mind, the memory footprint bumped to 9.1 GB (at 2k iters) after changing the iter num (3k to 30k):
./opensplat ~/Data/banana -n 30000 228.13s user 12.47s system 81% cpu 4:53.68 total
Nice! This is looking pretty amazing and can be merged. Thanks for everyone's help. 👍
Memory improvements, as well as a possible port of the rasterize_forward_kernel_cpso function, can be done as part of a separate contribution. About the latter, I noticed that the CPU rasterizer tends to converge a bit faster on the banana dataset (with downscale: 2, loss: ~0.15 after 100 iters vs 0.17); not sure if it's just random, but it might be worth investigating.
Great! I'll enable the MPS backend CI build via another PR. From my latest test, the combined memory footprint reaches 30.6 GB at step 8800:
./opensplat ~/Data/banana -n 30000 2923.88s user 104.37s system 73% cpu 1:08:21.08 total
This PR adds support for GPU acceleration via the MPS backend on MacOS per https://github.com/pierotofy/OpenSplat/issues/60.
It ports the gsplat PyTorch ops with fused kernels for gaussian projection, rasterization, etc. to Metal Performance Shaders. Here's the speedup on my M3 Pro with MacOS Sonoma 14.3: wall clock goes from 5 minutes to 5 seconds!
Some implementation notes:
For future reference: it was very useful to generate an XCode project from the CMakeLists.txt for debugging this, since XCode provides some nice GPU tools.
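To give a rough idea of what porting the fused ops to the MPS backend involves, here is a minimal dispatch sketch. It is illustrative only: rasterizeDispatch and the commented-out wrapper names are hypothetical, not the PR's actual signatures; only the tensor device check and torch::zeros call are standard libtorch API.

```cpp
#include <torch/torch.h>

// Hypothetical sketch: route a rasterization call to a Metal-backed fused
// kernel when the inputs live on the MPS device, otherwise fall back to a
// CPU reference path. Names below are illustrative, not the PR's API.
torch::Tensor rasterizeDispatch(const torch::Tensor& xys,
                                const torch::Tensor& colors,
                                int height, int width) {
    (void)colors;  // unused in this sketch
    if (xys.is_mps()) {
        // return RasterizeGaussiansMPS(xys, colors, height, width);  // hypothetical MPS wrapper
    }
    // return RasterizeGaussiansCPU(xys, colors, height, width);      // hypothetical CPU wrapper
    // Placeholder so the sketch compiles on its own:
    return torch::zeros({height, width, 3});
}
```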