taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0
25.52k stars 2.29k forks

[lang] Use Allocas for TLS instead of a memory pointer #4635

Open bobcao3 opened 2 years ago

bobcao3 commented 2 years ago

Thread-local memory pointers are really only a thing on CUDA, IIRC. Hardware-wise, all thread-local memory lives in registers, with shared memory serving as the first level of cache. From the code it seems only atomic reductions use TLS, and they do not appear to need direct memory offsets instead of allocas. We know and defined the TLS prologue to always be outside the loop, so adding alloca statements into the TLS prologue should suffice. Is there anything we are missing here? @strongoier @yuanming-hu @k-ye

If we truly don't need numeric-pointer-based TLS, we should remove it ASAP. That would make it much easier to implement faster reductions on all backends, and with the new warp SIMT instructions we might be able to bake warp-level reduction directly into the code transform stage (no per-backend implementation needed).

bobcao3 commented 2 years ago

While prototyping this change, I discovered that using allocas is a bit hard to implement on LLVM backends where the runtime is compiled from a C++ file and the loop body is invoked through a function pointer. This creates a scoping problem, and I'm not sure how to fix it yet. On backends with src2src codegen or SPIR-V codegen, this should be very straightforward.