Hi, thanks for sharing this excellent work.
I have read the code carefully, and I see that in the CUDA implementation of the render_ray_kernel function, a single ray is computed by a warp of threads, with each thread handling a single channel, as I understand it. I wonder about the benefits of this approach: as far as I can tell, much of the spherical harmonics (SH) computation is repeated across the threads, and the only parallelism gained is in summing the multiple coefficients. Is this the best way you found to accelerate the SH representation, and if so, why is it better than the alternatives?
Thank you.