m-schuetz / SimLOD

MIT License

Faster inner loop in render.cu #8

Closed adamtassier closed 2 months ago

adamtassier commented 2 months ago

Hi, I've reimplemented a large part of your other 'Compute Rasterizer' paper in Vulkan. While doing this I found a slightly faster way to process the points in the inner loop of the compute shader. For my main rasterization pass this gets me from 9.98ms to 9.54ms. Might be really GPU dependent but anyway:

Instead of doing this loop, where each thread has its own 'pointIndex':

for(
  int pointIndex = block.thread_rank(); 
  pointIndex < numElements; 
  pointIndex += block.num_threads()
){

We can scalarize this loop to:

for(
  int pointIndex = 0;
  pointIndex < numElements;
  pointIndex += block.num_threads()
){
  // pointIndex is now uniform across the block; each thread derives its own clamped index
  int threadIndex = min(pointIndex + block.thread_rank(), numElements - 1);

Of course, in the final iteration the threads past the end of the range will do some useless work: the min clamps all of them to the last point, so that point gets processed and written out more than once. You can add another branch on pointIndex + block.thread_rank() < numElements instead of the min, but I found this to be slightly slower (9.59ms) than just processing the last point multiple times.
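To make the coverage of the three variants concrete, here is a minimal CPU-side sketch that simulates one thread block and counts how often each point is visited. It is not code from render.cu: the function names `visitsStrided`, `visitsClamped` and `visitsGuarded` are made up for illustration, `blockSize` stands in for block.num_threads() and `rank` for block.thread_rank(). It says nothing about performance, only about which points each variant touches.

```cpp
#include <algorithm>
#include <vector>

// Original form: every simulated thread owns its own strided pointIndex.
std::vector<int> visitsStrided(int numElements, int blockSize) {
    std::vector<int> visits(numElements, 0);
    for (int rank = 0; rank < blockSize; ++rank) {
        for (int pointIndex = rank; pointIndex < numElements; pointIndex += blockSize) {
            visits[pointIndex]++;
        }
    }
    return visits;
}

// Scalarized form with min(): the loop counter is the same for all threads,
// and each thread clamps its own index to the last valid point.
std::vector<int> visitsClamped(int numElements, int blockSize) {
    std::vector<int> visits(numElements, 0);
    for (int pointIndex = 0; pointIndex < numElements; pointIndex += blockSize) {
        for (int rank = 0; rank < blockSize; ++rank) {
            int threadIndex = std::min(pointIndex + rank, numElements - 1);
            visits[threadIndex]++;  // the last point may be visited several times
        }
    }
    return visits;
}

// Scalarized form with a branch instead of min(): out-of-range threads skip the work.
std::vector<int> visitsGuarded(int numElements, int blockSize) {
    std::vector<int> visits(numElements, 0);
    for (int pointIndex = 0; pointIndex < numElements; pointIndex += blockSize) {
        for (int rank = 0; rank < blockSize; ++rank) {
            int threadIndex = pointIndex + rank;
            if (threadIndex < numElements) {
                visits[threadIndex]++;
            }
        }
    }
    return visits;
}
```

With 10 points and a block of 4 threads, the strided and guarded variants visit every point exactly once, while the clamped variant visits the last point three times (its own thread plus the two threads that ran past the end). So the min version is only safe if processing the same point repeatedly is harmless, which I'd assume holds when the per-point write is an idempotent operation such as an atomic min on a depth value.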

Hope it works for you as well and thanks for all the great papers and code.

Tested on a crappy Nvidia T1200 GPU.