taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0
25.52k stars 2.29k forks

[lang] Use Allocas for TLS instead of a memory pointer #4635

Open bobcao3 opened 2 years ago

bobcao3 commented 2 years ago

Thread-local memory pointers are really only a thing on CUDA, IIRC. Hardware-wise, all thread-local memory lives in registers, with shared memory serving as the first level of cache. From the code it seems only atomic reductions use TLS, and they do not appear to need direct memory offsets instead of allocas. We know and defined the TLS prologue to always be outside the loop, so adding alloca statements into the TLS prologue should suffice. Is there anything we are missing here? @strongoier @yuanming-hu @k-ye

If we truly don't need numeric-pointer-based TLS, we should remove it ASAP. That would make it much easier to implement faster reductions on all backends, and with the new warp SIMT instructions we might be able to bake warp-level reduction directly into the code transform stage (no per-backend implementation needed).

bobcao3 commented 2 years ago

While prototyping this change, I discovered that using allocas is a bit hard to implement on LLVM backends where the runtime is compiled from a C++ file and the loop body is invoked through a function pointer. This creates a scoping problem, and I'm not sure how to fix it yet. On backends with src2src codegen or SPIR-V codegen, this should be very straightforward.