Do you have any details on how you fuse kernels together?
If I am not mistaken, Nvidia's project does it by hand.
Do you do it automatically? Are there any limitations?
I also wrote the kernels by hand. The main difference is:
tiny-cuda-nn: Kernels are instantiated at compile time (switch statements dispatch to template instantiations) --> full control and slightly faster
quick-mlp: Kernels are assembled at runtime by setting preprocessor macros and template parameters based on the network config, then compiled with NVRTC --> more flexible
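To make the first point concrete, here is a minimal sketch of the switch-to-template dispatch pattern described above. The names (`forward_pass`, `dispatch_forward`) are illustrative, not the actual tiny-cuda-nn API, and the kernel body is stubbed out; the point is how a runtime config value is mapped onto a fixed set of compile-time instantiations.

```cpp
#include <stdexcept>

// Hypothetical illustration of compile-time kernel specialization.
// In a real fused kernel, WIDTH would size shared-memory tiles and
// let the compiler fully unroll the per-layer loops.
template <int WIDTH>
int forward_pass(int batch) {
    return batch * WIDTH;  // stand-in for launching the specialized kernel
}

int dispatch_forward(int width, int batch) {
    // The switch maps a runtime network width onto one of the
    // pre-instantiated templates; unsupported widths fail loudly.
    switch (width) {
        case 32:  return forward_pass<32>(batch);
        case 64:  return forward_pass<64>(batch);
        case 128: return forward_pass<128>(batch);
        default:  throw std::runtime_error("unsupported network width");
    }
}
```

The trade-off mentioned above follows directly: every supported configuration must be enumerated in the switch at compile time, which gives the compiler full knowledge but limits flexibility.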
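The runtime approach can be sketched the same way. This is a hypothetical illustration, not quick-mlp's actual code: the kernel source is assembled as a string, with preprocessor macros playing the role that template parameters play in the compile-time variant. The real pipeline would hand this string to NVRTC (`nvrtcCreateProgram` / `nvrtcCompileProgram`); that step is omitted here so the sketch stays self-contained.

```cpp
#include <string>

// Hypothetical sketch: build specialized kernel source from the network
// config by prepending #define lines. Macro and function names are made up.
std::string assemble_kernel_source(int hidden_width, const std::string& activation) {
    std::string src;
    src += "#define HIDDEN_WIDTH " + std::to_string(hidden_width) + "\n";
    src += "#define ACTIVATION " + activation + "\n";
    // Fused kernel body; the macros specialize it at NVRTC compile time,
    // much as template parameters would in the static approach.
    src += "extern \"C\" __global__ void fused_mlp(const float* in, float* out) {\n"
           "    /* layers unrolled for HIDDEN_WIDTH, using ACTIVATION */\n"
           "}\n";
    return src;
}
```

Because the source is generated on the fly, any configuration can be supported without enumerating it in advance, at the cost of a runtime compilation step.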