Open bobcao3 opened 2 years ago
While prototyping this change, I discovered that using alloca is a bit hard to implement on LLVM backends, where the runtime is compiled from a cpp file and the loop body is then invoked through a function pointer. This produces a scoping problem, and I'm not sure how we would fix it yet. On backends with src2src codegen or SPIR-V codegen, this should be very straightforward.
Thread-local memory pointers are really only a thing on CUDA, iirc. Hardware-wise, all thread-local memory lives in registers, as the first level of cache will be the shared memory. From the code, it seems only atomic reduction uses TLS, and it does not seem to need the capability to use direct memory offsets instead of allocas. We know, and have defined, the TLS prologue to always be outside of the loop, so adding alloca statements to the TLS prologue should suffice. Is there anything we are missing here? @strongoier @yuanming-hu @k-ye
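To make the two options concrete, here is a rough C++ sketch of a sum reduction done both ways. It is not taken from the codebase: every name (`PrologueFn`, `run_with_tls_buffer`, `run_with_local`, the 64-byte buffer size) is a hypothetical stand-in. The first variant mimics a runtime that calls prologue/body/epilogue through function pointers and shares state via a byte buffer with numeric offsets. The second keeps the accumulator as a plain local, which is roughly what an alloca emitted in the prologue would look like if all three stages shared one scope.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical signatures for a runtime that drives a loop through
// function pointers. These are illustrative, not the real runtime API.
using PrologueFn = void (*)(char *tls);
using BodyFn = void (*)(int i, char *tls);
using EpilogueFn = void (*)(char *tls, int *global_sum);

// Numeric-pointer TLS: the accumulator lives at byte offset 0 of an
// opaque buffer, so every stage must load and store through the pointer.
void prologue(char *tls) {
  int zero = 0;
  std::memcpy(tls, &zero, sizeof(int));
}

void body(int i, char *tls) {
  int acc;
  std::memcpy(&acc, tls, sizeof(int));
  acc += i;
  std::memcpy(tls, &acc, sizeof(int));
}

void epilogue(char *tls, int *global_sum) {
  int acc;
  std::memcpy(&acc, tls, sizeof(int));
  *global_sum += acc;  // a real backend would use an atomic add here
}

int run_with_tls_buffer(int begin, int end) {
  assert(begin <= end);
  char tls[64];  // assumed buffer size, for illustration only
  int global_sum = 0;
  PrologueFn p = prologue;
  BodyFn b = body;
  EpilogueFn e = epilogue;
  p(tls);
  for (int i = begin; i < end; ++i) b(i, tls);  // body via function pointer
  e(tls, &global_sum);
  return global_sum;
}

// Alloca-style: the accumulator is a local declared before the loop
// (the "prologue"), used in the body, and flushed once afterwards.
int run_with_local(int begin, int end) {
  assert(begin <= end);
  int acc = 0;  // plays the role of an alloca in the TLS prologue
  for (int i = begin; i < end; ++i) acc += i;
  int global_sum = 0;
  global_sum += acc;  // epilogue: single flush to the global result
  return global_sum;
}
```

The second form only works when the prologue, body, and epilogue end up in the same function scope, which is exactly what the function-pointer runtime on the LLVM backends gets in the way of.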
If we truly don't need a numeric-pointer-based TLS, we should remove it asap. This should make it a lot easier to implement faster reduction on all backends, and with the new warp SIMT instructions, we might be able to bake warp-level reduction directly into the code transform stage (no need for a per-backend implementation).
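For reference, the kind of warp-level reduction meant here is the usual shuffle-down tree. Below is a minimal CPU emulation in C++, assuming a 32-lane warp; `Warp`, `shuffle_down`, `warp_reduce_sum`, and `iota_warp` are made-up names for illustration, and `shuffle_down` only imitates the behaviour of a hardware shuffle such as CUDA's `__shfl_down_sync`. It is a sketch of the pattern, not a proposal for the actual transform.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;  // assumed warp width
using Warp = std::array<int, kWarpSize>;

// Emulated shuffle-down: lane i receives the value held by lane i + offset.
// Lanes whose source would be out of range keep their own value.
Warp shuffle_down(const Warp &v, int offset) {
  assert(offset > 0 && offset < kWarpSize);
  Warp out{};
  for (int lane = 0; lane < kWarpSize; ++lane) {
    int src = lane + offset;
    out[lane] = (src < kWarpSize) ? v[src] : v[lane];
  }
  return out;
}

// Tree reduction: halve the offset each step; after log2(32) = 5 steps
// lane 0 holds the sum of all lanes. Other lanes hold partial garbage.
int warp_reduce_sum(Warp v) {
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    Warp shifted = shuffle_down(v, offset);
    for (int lane = 0; lane < kWarpSize; ++lane) v[lane] += shifted[lane];
  }
  return v[0];
}

// Helper for trying it out: lane i holds start + i.
Warp iota_warp(int start) {
  Warp w{};
  for (int lane = 0; lane < kWarpSize; ++lane) w[lane] = start + lane;
  return w;
}
```

With something like this expressed in the IR, each warp would need only one atomic add from lane 0 in the epilogue instead of one per thread.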