nerfstudio-project / gsplat

CUDA accelerated rasterization of gaussian splatting
https://docs.gsplat.studio/
Apache License 2.0

nd rasterizer is 10x slower than rasterizer #68

Open zubair-irshad opened 8 months ago

zubair-irshad commented 8 months ago

Hi, great work! The nd rasterizer is around 10x slower than the sh rasterizer. To be precise, my model inference time with sh rasterization is 0.008 s, which gives me >100 FPS as described in the original gaussian splatting paper, but just adding nd rasterization increases it to 0.075 s, i.e. 13 FPS.

Is there a way to make it better? Any intuition would be greatly appreciated. With nd rasterization, it looks like we lose the benefits i.e. speed of gaussian splatting. Thank you again for the awesome work!

vye16 commented 8 months ago

Hi! What N are you using? In rasterization, each pixel requires an N-d array of workspace memory. For RGB, that array fits in register memory, and its size can be specified statically at compile time. We wrote the N-d rasterizer for the case where the necessary workspace exceeds available register memory and must live in global memory. This means we can't make the same kinds of optimizations as in the RGB rasterizer. If this is the case for you, then you can either stick with the global memory version, or you can rasterize in batches with the current optimized RGB rasterizer (channels 0-3, 3-6, etc.). We're considering adding an in-between version of the rasterizer for MAX_REGISTER_CHANNELS=16 with similar optimizations to the RGB rasterizer.
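The batching idea above (channels 0-3, 3-6, etc.) can be sketched as follows. This is a minimal illustration, not gsplat code: `rasterize_rgb` is a hypothetical callable standing in for whatever optimized 3-channel rasterizer is available, and `fake_mean_rasterizer` is a toy stand-in that only exists so the sketch runs on its own. Zero-padding the last batch when N is not a multiple of 3 is also an assumption of this sketch.

```python
from typing import Callable, List, Tuple

Rows = List[List[float]]  # one list of channel values per Gaussian (or per pixel)


def channel_batches(n_channels: int, batch_size: int = 3) -> List[Tuple[int, int]]:
    """Half-open channel ranges: (0, 3), (3, 6), ... The last one may be shorter."""
    return [(s, min(s + batch_size, n_channels))
            for s in range(0, n_channels, batch_size)]


def rasterize_in_batches(features: Rows,
                         rasterize_rgb: Callable[[Rows], Rows],
                         batch_size: int = 3) -> Rows:
    """Render N-channel features by calling a fixed-width rasterizer once per batch."""
    n_channels = len(features[0])
    image: Rows = []
    for start, end in channel_batches(n_channels, batch_size):
        pad = batch_size - (end - start)
        # Zero-pad the last batch so every call sees exactly `batch_size` channels.
        chunk = [list(row[start:end]) + [0.0] * pad for row in features]
        out = rasterize_rgb(chunk)
        # Drop the padded channels again before concatenating.
        out = [px[: end - start] for px in out]
        image = out if not image else [acc + px for acc, px in zip(image, out)]
    return image


def fake_mean_rasterizer(colors: Rows) -> Rows:
    """Toy stand-in (hypothetical): a single pixel holding the mean color."""
    n = len(colors)
    return [[sum(row[c] for row in colors) / n for c in range(len(colors[0]))]]
```

Note that every batch repeats the full per-pixel traversal of the Gaussians, so the cost grows with the number of batches; only the per-call workspace stays small.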

zubair-irshad commented 8 months ago

Thank you for the great intuition and detailed response. My channel size is currently 29, but I am considering increasing the feature size to 128 or even 256, and my worry is that it will be slower than 13 FPS. I will try the batched RGB rasterizer as you suggested, in a for loop, and see if it gives a higher FPS. Thank you!

zubair-irshad commented 8 months ago

@vye16 Reporting back what I found. Implementing a for loop to rasterize the channels in batches (i.e. 0-3, 3-6, etc.) instead of using ND rasterization is slightly worse in performance rather than an improvement. My guess is that this is due to the for loop, which has to run 10 times for the channel size I am trying, i.e. 30. Any other intuition on how to improve performance is greatly appreciated, thank you!

Just to provide more specifics: the per-iteration time for a 640 by 480 image is ~74-76 ms for nd rasterization with N=30, 82-85 ms for the batched version (the one I described above), and 16 ms for sh-only (i.e. 3-channel) rendering. The same results translate to FPS numbers during inference, i.e. 13 FPS for nd rasterization with N=30 vs >100 FPS for sh rasterization only.
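For reference, the conversion between these timings and frame rates is just `1000 / ms`. The small sketch below uses only numbers quoted in this thread; the helper names are made up for illustration.

```python
import math


def fps_from_ms(ms_per_frame: float) -> float:
    """Frames per second for a given per-frame time in milliseconds."""
    return 1000.0 / ms_per_frame


def num_batches(n_channels: int, batch_size: int = 3) -> int:
    """Number of 3-channel passes needed to cover n_channels."""
    return math.ceil(n_channels / batch_size)


# Timings quoted in the thread (milliseconds per frame).
for label, ms in [("nd rasterizer, N=30", 75.0),
                  ("sh rasterizer, inference", 8.0),
                  ("sh rasterizer, per iteration", 16.0)]:
    print(f"{label}: {ms} ms -> {fps_from_ms(ms):.1f} FPS")

print("3-channel passes for N=30:", num_batches(30))
```

By this arithmetic, 75 ms corresponds to about 13 FPS and 8 ms to 125 FPS, while the 16 ms per-iteration figure would correspond to about 62 FPS. The >100 FPS number therefore presumably refers to the 0.008 s inference time from the first post rather than the 16 ms per-iteration time.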

zubair-irshad commented 8 months ago

Update: with the batched implementation, FPS increased to 25, though it is still well below the >100 FPS of the original rasterizer implementation.

zubair-irshad commented 8 months ago

@vye16 @maturk Any plans on supporting larger register channel counts, i.e. MAX_REGISTER_CHANNELS>3, perhaps 16 or 32, to achieve the same level of optimization that the native sh rasterizer gives? I am happy to create a PR. However, just increasing this number gives errors elsewhere, for instance AT_ERROR("v_colors must have dimensions (N, 3)"); Should I change anything else in the CUDA code to achieve this?

I am wondering if there are any downsides to specifying a MAX_REGISTER_CHANNELS of 128, 256 or 512. Would it affect memory usage? I think GPUs with more memory could support this? Any intuition is greatly appreciated.

vye16 commented 7 months ago

Hi Zubair, sorry for the late response. Currently the color rasterization represents color as float3 (a CUDA vectorized type). We can make a version that accepts N-d colors up to ~32 channels that could fit in shared memory during rasterization. Unfortunately 128, 256, or 512 channels would be too big to fit in shared memory in one pass, but it is possible to rasterize them in batches of channels (0-32, etc.) that fit in shared memory. This is unlikely to reach similar performance, but would be better than the current ND rasterizer. We're not working on it in the near term, but I'm happy to guide you if you'd like to make a PR.
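A rough back-of-the-envelope check of the shared memory argument above is sketched below. It rests on assumptions that are not stated in this thread: a 16x16 thread block (256 threads), one float32 of workspace per channel per thread, and a 48 KiB shared memory budget per block, which is a common default limit on NVIDIA GPUs but varies by architecture and configuration. The real kernel also needs shared memory for other data, so these numbers are optimistic.

```python
import math

BYTES_PER_FLOAT32 = 4
THREADS_PER_BLOCK = 16 * 16      # assumed tile size, one thread per pixel
SHARED_BUDGET_BYTES = 48 * 1024  # assumed per-block budget (48 KiB)


def workspace_bytes(n_channels: int, threads: int = THREADS_PER_BLOCK) -> int:
    """Shared memory needed if every thread keeps n_channels float32 values."""
    return n_channels * BYTES_PER_FLOAT32 * threads


def fits_in_shared(n_channels: int, budget: int = SHARED_BUDGET_BYTES) -> bool:
    """True if the per-block workspace fits in the assumed budget."""
    return workspace_bytes(n_channels) <= budget


def passes_needed(n_channels: int, channels_per_pass: int = 32) -> int:
    """Number of channel batches when rasterizing channels_per_pass at a time."""
    return math.ceil(n_channels / channels_per_pass)


for c in (3, 32, 128, 256, 512):
    kib = workspace_bytes(c) / 1024
    print(f"C={c}: {kib:.0f} KiB, fits={fits_in_shared(c)}, "
          f"passes of 32: {passes_needed(c)}")
```

Under these assumptions, 32 channels need 32 KiB per block and fit, while 128 channels already need 128 KiB and do not, which is consistent with splitting 128, 256, or 512 channels into 4, 8, or 16 passes of 32.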

zubair-irshad commented 7 months ago

Thanks @vye16! I am happy to work on it and make a PR. Any pointers on where to start and which parts to change first would be appreciated, thanks a lot!

SeanGuo063 commented 6 months ago

Any update on this issue? I am also working on rendering high-dimensional features, and want to know how to speed up the nd rasterizer.

kerrj commented 4 months ago

#130 works towards this issue, let me know if you try it out! @zubair-irshad

zubair-irshad commented 4 months ago

This is great, I will check it asap. Thanks @kerrj.