scier / MetalSplatter

Render Gaussian Splats using Metal on Apple platforms (iOS/iPhone/iPad, macOS, and visionOS)
MIT License
286 stars 35 forks source link

Any progress or hint on implementing GPU sorting? #39

Open Yang-Xijie opened 1 month ago

Yang-Xijie commented 1 month ago

I find the following sentence in README in commit 8876629deb61fba978b91540b201eb3a42b81d4a:

https://github.com/scier/MetalSplatter/blob/8876629deb61fba978b91540b201eb3a42b81d4a/README.md?plain=1#L23

which was removed in commit 221348301cae5e14388410b9ed080c3d61cbfbbc:

https://github.com/scier/MetalSplatter/blob/221348301cae5e14388410b9ed080c3d61cbfbbc/README.md?plain=1#L13

However, it seems sorting still happens on the CPU side:

https://github.com/scier/MetalSplatter/blob/ec6d0f2cdee958c9617925eda474362869cbf3de/MetalSplatter/Sources/SplatRenderer.swift#L608


The performance of CPU sort is N*log(N). Here are some performance report (on M2 Pro MacBook Pro, release mode, random numbers):

〇 UInt32 1000000: 0.071 s 10000000: 0.759 s 〇 Float32 1000000 0.084 s 10000000 0.934 s

impeding real-time rendering of 3D gaussians...


It seems that radix sort on GPU for Metal is missing. (https://developer.apple.com/forums/thread/105886)

I did find an implementation on Apple Silicon, but it is difficult for me to understand. (https://github.com/ShoYamanishi/AppleNumericalComputing/tree/main/05_radix_sort#54-metal-radix-sort-implementations)

Will there be someone implementing the sorting on GPU to improve the performance? Or is there some library we can directly adopt to accelerate?

Yang-Xijie commented 1 month ago

Not sure if https://github.com/ml-explore/mlx/blob/b05bcfd27f5f1293401b74dce02e38c8fd7ef66a/mlx/backend/metal/kernels/sort.metal could be directly used...

https://github.com/ml-explore/mlx/blob/b05bcfd27f5f1293401b74dce02e38c8fd7ef66a/mlx/backend/metal/kernels/sort.h

// Based on GPU merge sort algorithm at
// https://github.com/NVIDIA/cccl/tree/main/cub/cub
kemchenj commented 1 month ago

@Yang-Xijie I have some trouble using MLX sort/argSort. So I would say don't.

And actually MetalPerformanceShaderGraph provide some ML operators like sort and argSort that we could use here. They have better performance and stability compared to MLX.

Here is a simple implementation based on MPSGraph https://gist.github.com/kemchenj/26e1dad40e5b89de2828bad36c81302f. Feel free to use it if you got interest.