wanmeihuali / taichi_3d_gaussian_splatting

An unofficial implementation of the paper "3D Gaussian Splatting for Real-Time Radiance Field Rendering" using Taichi Lang.
Apache License 2.0
637 stars, 58 forks

Metal support for training/inference #147

Open almondai opened 9 months ago

almondai commented 9 months ago

Is there a timeline for adding a Metal backend to the Taichi training code? I have other GPUs, but Apple Silicon Macs have a lot of unified memory and are very energy efficient. I think they would be a good long-term platform for experimenting with Gaussian splats.

wanmeihuali commented 9 months ago

@almondai the latest Taichi version (see their website for instructions on compiling from source) does run on Metal, but it's not fully optimized yet, causing low frame rates during inference and slower training. While memory isn't a bottleneck for 3D Gaussian splats (usually under 6 GB), the Metal backend on Apple Silicon may need backend-specific optimizations for better performance, e.g. tuning tile and shared-memory sizes.

If you're keen, diving into these optimizations could be a fascinating challenge, though quite time-consuming.
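For anyone trying this: a minimal initialization sketch that prefers Metal and falls back to CPU, assuming a Taichi build with Metal support (the `ti.metal`/`ti.cpu` arch constants are real Taichi API; the fallback order is just a suggestion):

```python
import taichi as ti

# Prefer the Metal backend on macOS; fall back to CPU if it is
# unavailable in this Taichi build.
try:
    ti.init(arch=ti.metal)
except Exception:
    ti.init(arch=ti.cpu)
```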

almondai commented 9 months ago

@wanmeihuali thanks, I will check out the Taichi repo and hopefully share some numbers on training runtimes. I'm also curious about your code's lower memory usage (<8 GB), given that the original authors' repo suggests at least 24 GB of VRAM for the highest fidelity.

wanmeihuali commented 9 months ago

Hi @almondai, I don't think the official implementation needs that much VRAM. In any case, GPU memory usage depends heavily on image resolution and the number of Gaussian points. For the current truck scene running on an AWS SageMaker T4 (16 GB VRAM), GPU memory utilization is around 21% (3.4 GB).
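As a rough illustration (my numbers, not from this repo): assuming the standard 3DGS parameterization of about 59 float32 values per Gaussian (3 position, 4 rotation quaternion, 3 scale, 1 opacity, 48 degree-3 SH color coefficients), the point cloud itself stays small; most of the remaining usage is resolution-dependent activations and gradients.

```python
def gaussian_points_gb(n_points: int, floats_per_point: int = 59) -> float:
    """Back-of-envelope float32 storage for a 3D Gaussian point cloud, in GiB."""
    return n_points * floats_per_point * 4 / 2**30

# Even a few million points only need about a gigabyte of parameter storage.
print(round(gaussian_points_gb(5_000_000), 2))  # → 1.1
```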

almondai commented 8 months ago

I was able to run it on Apple Silicon/Metal after compiling Taichi from source (v1.7.0) to link against the Metal 3 API. I then trained the truck dataset on an M2 Ultra with 60 GPU cores. It was pretty slow, and I also noticed errors in the output:

Here is a screenshot of the Truck scene trained on Metal (and rendered via Metal)

[Screenshot: Truck scene trained and rendered on Metal (macOS), 2023-11-21]

So I ran the same training parameters on a CUDA/NVIDIA backend; here are screenshots of the same scene trained on CUDA (rendered on Linux):

[Screenshots: same scene trained and rendered on CUDA (Linux), 2023-11-21]

And here is the same parquet file (trained on CUDA) rendered on a Mac with the Metal backend:

[Screenshots: the CUDA-trained model rendered on a Mac with the Metal backend, 2023-11-21]

It seems to me the forward pass on Metal is mostly correct. However, the backward pass is not working correctly, even though training completed all 30,000 iterations without crashing.
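One way to isolate a backward-pass bug like this is to compare the kernel's analytic gradients against central finite differences on a tiny input, per backend. A backend-agnostic sketch (the function names here are illustrative, not from the repo):

```python
def finite_diff_grad(f, x, eps=1e-5):
    """Central-difference gradient of scalar function f at point x (a list)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((f(xp) - f(xm)) / (2 * eps))
    return g

def max_grad_error(f, grad_f, x):
    """Largest per-coordinate gap between analytic and numeric gradients."""
    numeric = finite_diff_grad(f, x)
    analytic = grad_f(x)
    return max(abs(a - n) for a, n in zip(analytic, numeric))
```

Running the same check with the loss evaluated on the Metal backend versus CUDA would show whether specific gradient terms (e.g. for rotation or SH parameters) diverge only on Metal.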