@andrewkchan Thank you for your fantastic work in making the MPS backend finally work! I did a quick test of this PR and found it works almost flawlessly on my laptop. My test setup is a customized M2 Max (30 of the 38 GPU cores).
CMake and XCode build:
cmake -G Xcode -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ../
./opensplat ~/Data/banana -n 100 3.76s user 1.75s system 100% cpu 5.465 total
CMake and make build:
cmake -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ../
./opensplat ~/Data/banana -n 100 4.05s user 1.78s system 101% cpu 5.741 total
This PR may be the first to make it possible to natively run a 3DGS workload on Apple silicon. I'm super excited about this new feature and expect it will have an impact in this area soon 🚀
Great work! Very excited to start using this :)
Agree! Very exciting @andrewkchan, thanks for the PR. I can't wait to run this on my M1 🙏
I will test/review this sometime today or tomorrow.
Quick question, I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j); did you find some issues with the CPU implementation while working on this?
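For reference, here is a minimal sketch (illustrative only, not the actual rasterizer code) of the row-major indexing convention that line implies:

```cpp
#include <cstddef>
#include <vector>

// Row-major image indexing: i walks rows in [0, height), j walks columns in
// [0, width), so the flat offset is i * width + j. The alternative i * height + j
// happens to visit the same offsets only when width == height, which is why a
// square default image can hide a swapped index.
// img must hold at least width * height elements.
void touchEveryPixel(std::vector<float>& img, int width, int height) {
    for (int i = 0; i < height; ++i) {        // row
        for (int j = 0; j < width; ++j) {     // column
            std::size_t pixIdx = static_cast<std::size_t>(i) * width + j;
            img[pixIdx] = 1.0f;
        }
    }
}
```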
Currently getting this error while trying to run:
...
init_gsplat_metal_context: load function rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: error: load pipeline error: Error Domain=AGXMetal13_3 Code=3 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}
Also noticed that the definition for rasterize_forward_kernel_cpso might be missing?
The error might be related to your macOS version (13.3 vs 14.4). I was able to get through the Metal compile with the latest 14.4.1. Maybe we should consider tweaking CMake like https://github.com/ggerganov/llama.cpp/pull/6370 did.
Also, it seems like there is a memory leak problem on Metal, and I'm trying to resolve it now.
Yep, I'm on 13.2. I'll try a few things, see if I can get it to compile.
Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.
Quick question, I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j), did you find some issues with the CPU implementation while working on this?
Hmm, I don't remember changing this code. Looks like it was from the experimental commit that I based my changes on: https://github.com/pierotofy/OpenSplat/pull/76/commits/472a45a10c2b44d0ea6b8c19de863422b1839f55
Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.
Yeah, I had ported over the ND rasterize function instead of the rasterize by accident but then decided to just use that. It's possible this was causing the slight numerical differences in unit tests. Happy to port over the rasterize_forward_kernel if needed.
Also, it seems like there is a memory leaking problem on metal and I'm trying to resolve it now.
Curious what problem you are running into, since, as noted in the OP, I'm intentionally leaking some resources forever.
The memory consumption keeps growing as the training iteration count increases; e.g. 2000 iters will accumulate 12+ GB of GPU memory.
I added emptyCache for the MPS device, but it seems like there is no visible improvement.
```cpp
// Added include (alongside the existing HIP/CUDA cases):
#elif defined(USE_MPS)
#include <torch/mps.h>

// Empty the caching allocator on whichever GPU backend is active:
if (device != torch::kCPU){
#ifdef USE_HIP
    c10::hip::HIPCachingAllocator::emptyCache();
#elif defined(USE_CUDA)
    c10::cuda::CUDACachingAllocator::emptyCache();
#elif defined(USE_MPS)
    at::detail::getMPSHooks().emptyCache();
#endif
}
```
Also, adding or enforcing autoreleasepool across all methods in gsplat_metal.mm hasn't helped so far. The root cause is probably linked with the deep copy of the MPS tensor.
Happy to port over the rasterize_forward_kernel if needed.
No need, but could certainly be done as part of another PR.
I'm still trying to compile this on 13.2; I've isolated the problem to a call to simd_shuffle_and_fill_down, but it's strange since my machine should have Metal 3 and that function has been available since Metal 2.4. Investigating.
What is the expected memory usage for something like the banana example? It's not great if there is a leak. But I'm not able to find anything using the XCode "Leaks" tool except for two allocations of 128 byte objects. And I thought that memory usage is generally expected to increase over training because the number of gaussians will increase with scheduled splits.
We can use this table as a baseline https://github.com/pierotofy/OpenSplat/issues/3#issuecomment-2002379362
For 2000 iters, the combined memory consumption is around 5.5 GB (CPU: 4.1 GB, GPU: 1.4 GB) with the CUDA version.
Here's what I'm seeing around 1900 iters on the banana (8.7GB):
I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.
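For anyone trying to reproduce this, here is a minimal sketch of that instrumentation, using the getMPSHooks() calls mentioned above. The header path and the idea of logging every N training steps are assumptions, not part of the PR:

```cpp
#ifdef USE_MPS
#include <ATen/detail/MPSHooksInterface.h>  // assumed header for at::detail::getMPSHooks()
#include <iostream>

// Log what the MPS allocator reports, in bytes. Calling this periodically
// (e.g. every few hundred steps) shows whether emptyCache() actually
// shrinks the footprint over training.
void logMpsMemory(int step) {
    const auto& hooks = at::detail::getMPSHooks();
    std::cout << "step " << step
              << " current allocated: " << hooks.getCurrentAllocatedMemory()
              << " driver allocated: " << hooks.getDriverAllocatedMemory()
              << std::endl;
}
#endif
```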
I ended up upgrading to 14.4 and it now runs 🥳
I think there might be something off with the rasterize forward pass, however. This is the result of the Metal renderer after 100 iters on the banana dataset (you can even do just 10 iters):
./opensplat -n 100 --val-render render ./banana
Step 100: 0.326634
Compared to the CPU run:
./opensplat -n 100 --val-render render ./banana --cpu
Step 100: 0.188744
Looks like a width/height mismatch. I had these issues when writing the CPU rasterizer; I recommend using the simple_trainer app with different --width and --height parameters to test (by default they are equal), e.g. ./simple_trainer --width 256 --height 128 --render render
Nice catch! You are totally right. Fixed, and the loss is much lower after 100 iters now: 0.15966 vs 0.286877 for me.
Here's what I'm seeing around 1900 iters on the banana (8.7GB):
I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.
I think the combined memory size we get from the MPS backend is likely within the correct range. With emptyCache(), I was able to achieve a slightly lower memory footprint (7.8 GB vs 8.7 GB). The latest PyTorch doesn't provide a corresponding C++ API (such as c10::mps::MPSCachingAllocator::emptyCache()) to explicitly release the MPS cache. The current workaround (at::detail::getMPSHooks().emptyCache()) was discovered from the corresponding Python API, and I'm still uncertain whether it's truly effective.
Never mind, the memory footprint bumped to 9.1 GB (at 2k iters) after changing the iter num (3k to 30k):
./opensplat ~/Data/banana -n 30000 228.13s user 12.47s system 81% cpu 4:53.68 total
Nice! This is looking pretty amazing and can be merged. Thanks for everyone's help. 👍
Memory improvements, as well as a possible port of the rasterize_forward_kernel_cpso function, can be done as part of a separate contribution. About the latter, I noticed that the CPU rasterizer tends to converge a bit faster on the banana dataset (with downscale: 2, loss: ~0.15 after 100 iters vs 0.17); not sure if it's just random, but it might be worth investigating.
Great! I'll enable the MPS backend CI build via another PR. From my latest test, the combined memory footprint reaches 30.6 GB at step 8800:
./opensplat ~/Data/banana -n 30000 2923.88s user 104.37s system 73% cpu 1:08:21.08 total
This PR adds support for GPU acceleration via the MPS backend on MacOS per https://github.com/pierotofy/OpenSplat/issues/60.
It ports the gsplat PyTorch ops with fused kernels for gaussian projection, rasterization, etc. to Metal Performance Shaders. Here's the speedup on my M3 Pro with MacOS Sonoma 14.3: wall clock goes from 5 minutes to 5 seconds!
Some implementation notes:
For future reference: it was very useful to generate an XCode project from the CMakeLists.txt for debugging this, since XCode provides some nice GPU tools.
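To give a rough idea of what porting the fused ops to the MPS backend involves, here is a minimal dispatch sketch. It is illustrative only: rasterizeDispatch and the commented-out wrapper names are hypothetical, not the PR's actual signatures; only the tensor device check and torch::zeros call are standard libtorch API.

```cpp
#include <torch/torch.h>

// Hypothetical sketch: route a rasterization call to a Metal-backed fused
// kernel when the inputs live on the MPS device, otherwise fall back to a
// CPU reference path. Names below are illustrative, not the PR's API.
torch::Tensor rasterizeDispatch(const torch::Tensor& xys,
                                const torch::Tensor& colors,
                                int height, int width) {
    (void)colors;  // unused in this sketch
    if (xys.is_mps()) {
        // return RasterizeGaussiansMPS(xys, colors, height, width);  // hypothetical MPS wrapper
    }
    // return RasterizeGaussiansCPU(xys, colors, height, width);      // hypothetical CPU wrapper
    // Placeholder so the sketch compiles on its own:
    return torch::zeros({height, width, 3});
}
```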