The app/interpolate no longer uses the `.in()` directive. A new app should be chosen to guide the reader to a useful example: https://github.com/halide/Halide/blob/c0192ffa71bbebfbdcb6eddcdf060169f5022ea2/src/Func.h#L1313-L1316

While we are at `.in()` (again with the FAQ efforts in mind), I'd also like to hear about the technique of copying memory into an SM's shared memory for improved performance. There is a trick somewhere in the apps that uses `.in().in()` to achieve this. I think it needs extensive elaboration: https://github.com/halide/Halide/blob/c0192ffa71bbebfbdcb6eddcdf060169f5022ea2/apps/stencil_chain/stencil_chain_generator.cpp#L86-L101
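To make the question below concrete, here is roughly the schedule shape I have in mind. This is only a sketch of my understanding of the double-wrapper pattern, with a made-up producer/consumer pair (I'm calling them `in_func` and `blur`) and arbitrary tile sizes; it is not the actual code from the app or from Func.h:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Toy producer/consumer pair standing in for the real pipeline.
    Func in_func("in_func"), blur("blur");
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

    in_func(x, y) = cast<float>(x + y);
    blur(x, y) = (in_func(x - 1, y) + in_func(x, y) + in_func(x + 1, y)) / 3.0f;

    // Consumer: one GPU block per 32x8 tile, one thread per output pixel.
    blur.compute_root()
        .gpu_tile(x, y, xo, yo, xi, yi, 32, 8);

    // First wrapper: computed once per GPU block, loaded cooperatively by
    // the block's threads. This is the stage I assume ends up in shared
    // memory. (Wrappers made by .in() are scheduled via the implicit
    // vars _0, _1.)
    in_func.in(blur)
        .compute_at(blur, xo)
        .gpu_threads(_0, _1);

    // Second wrapper (the .in().in() trick): computed per thread and fully
    // unrolled. This is the stage I assume is meant to land in registers.
    in_func.in(blur).in(blur)
        .compute_at(blur, xi)
        .unroll(_0)
        .unroll(_1);

    // Lower for a CUDA target just to inspect the schedule; no physical
    // GPU is needed to emit the statement file.
    blur.compile_to_lowered_stmt("blur.stmt", {}, Text, Target("host-cuda"));
    return 0;
}
```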
I'm slowly getting the hang of what `.in()` does, but this I don't get. It seems that the first block is meant to copy the data into a block's shared memory, and the second one (the one embedded in the code here) is meant to load it into registers? Maybe I'm not familiar enough with how CUDA works, but how can a function be loaded into registers? Does every value go into a register? How do you know that this is what happens in this case? Doesn't there need to be a `.store_in(MemoryType::Register)` then? Same for the loading into shared memory: doesn't it need a `.store_in(MemoryType::GPUShared)`?
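In other words, this is what I would naively have expected to need to write (again just my guess, not anything taken from the app or the docs):

```cpp
// Reusing in_func, blur, xo, and xi from the sketch above: would the
// memory placement need to be made explicit like this?
in_func.in(blur)
    .store_in(MemoryType::GPUShared)  // explicit shared-memory staging?
    .compute_at(blur, xo)
    .gpu_threads(_0, _1);

in_func.in(blur).in(blur)
    .store_in(MemoryType::Register)   // explicit register staging?
    .compute_at(blur, xi)
    .unroll(_0)
    .unroll(_1);
```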