Open Yang-Xijie opened 1 month ago
Not sure if https://github.com/ml-explore/mlx/blob/b05bcfd27f5f1293401b74dce02e38c8fd7ef66a/mlx/backend/metal/kernels/sort.metal could be directly used...
// Based on GPU merge sort algorithm at
// https://github.com/NVIDIA/cccl/tree/main/cub/cub
@Yang-Xijie I have some trouble using MLX sort/argSort. So I would say don't.
And actually MetalPerformanceShaderGraph
provide some ML operators like sort and argSort that we could use here. They have better performance and stability compared to MLX.
Here is a simple implementation based on MPSGraph
https://gist.github.com/kemchenj/26e1dad40e5b89de2828bad36c81302f. Feel free to use it if you got interest.
I find the following sentence in README in commit 8876629deb61fba978b91540b201eb3a42b81d4a:
https://github.com/scier/MetalSplatter/blob/8876629deb61fba978b91540b201eb3a42b81d4a/README.md?plain=1#L23
which was removed in commit 221348301cae5e14388410b9ed080c3d61cbfbbc:
https://github.com/scier/MetalSplatter/blob/221348301cae5e14388410b9ed080c3d61cbfbbc/README.md?plain=1#L13
However, it seems sorting still happens on the CPU side:
https://github.com/scier/MetalSplatter/blob/ec6d0f2cdee958c9617925eda474362869cbf3de/MetalSplatter/Sources/SplatRenderer.swift#L608
The performance of CPU sort is N*log(N). Here are some performance report (on M2 Pro MacBook Pro, release mode, random numbers):
〇 UInt32 1000000: 0.071 s 10000000: 0.759 s 〇 Float32 1000000 0.084 s 10000000 0.934 s
impeding real-time rendering of 3D gaussians...
It seems that radix sort on GPU for Metal is missing. (https://developer.apple.com/forums/thread/105886)
I did find an implementation on Apple Silicon, but it is difficult for me to understand. (https://github.com/ShoYamanishi/AppleNumericalComputing/tree/main/05_radix_sort#54-metal-radix-sort-implementations)
Will there be someone implementing the sorting on GPU to improve the performance? Or is there some library we can directly adopt to accelerate?