bobcao3 opened 3 years ago
Another question regarding the memory model of CHI IR: during the execution of an offloaded task, if a load is carried out multiple times with the same arguments (same address), and the kernel containing that load never issues a store to that address, is it guaranteed to return the same result?
OpenGL & NSight Graphics:
CUDA & NSight Compute:
(If you want to get profiling to work on OpenGL, add a glFinish call to your GUI platform's redraw callback so that NSight or similar tools can grab onto a "frame". You don't need to do this for NSight Compute / CUDA.)
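For concreteness, here is a minimal sketch of what that looks like, assuming a GLFW-based redraw loop (the redraw function and its placement are illustrative, not Taichi's actual GUI code):

```cpp
// Minimal sketch (assumes GLFW + OpenGL; not Taichi's actual GUI code).
// The extra glFinish() forces the queued GL commands to complete each
// iteration, giving NSight Graphics a clean frame boundary to hook into.
#include <GLFW/glfw3.h>

void redraw(GLFWwindow *window) {
  // ... issue the draw calls for the current frame ...
  glFinish();               // block until all queued GL commands have finished
  glfwSwapBuffers(window);  // present the frame
  glfwPollEvents();
}
```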
After some profiling, the first striking thing is that the CUDA backend emits an enormous number of memory operations, while the OpenGL backend barely touches memory at all. This is very visible in the drastically different VRAM and L2 SOL metrics. It may also explain why moving code out of ifs improves performance on CUDA: I suspect these memory operations are all for temporary variables.
There are quite a lot of allocas generated in the IR, and the OpenGL backend simply maps them to variables. Could this be the culprit?
Found this: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp
Is there a way to verify whether this pass is run or not? Do we need to add it explicitly via the function_pass_manager?
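One way I can think of to check this (just a sketch using stock LLVM debugging flags, nothing Taichi-specific): the legacy pass manager honors -debug-pass=Structure and -print-after-all, and these can be injected programmatically before the codegen pipeline is built. The helper name below is hypothetical:

```cpp
// Sketch: enable LLVM's own pass-debugging output so the codegen pipeline
// (including nvptx-lower-alloca, if it is scheduled) is printed to stderr.
// These are stock LLVM cl::opt flags, not Taichi options; call this once
// before the NVPTX TargetMachine builds its pass pipeline.
#include "llvm/Support/CommandLine.h"

void enable_pass_debug_output() {
  const char *args[] = {
      "taichi",                 // dummy program name
      "-debug-pass=Structure",  // legacy PM: print the pass pipeline
      "-print-after-all",       // dump IR after every pass
  };
  llvm::cl::ParseCommandLineOptions(3, args);
}
```

Alternatively, one could simply search the emitted PTX string for ld.local / st.local to see whether any allocas actually ended up in local space.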
With some profiling in NSight Compute, I found that all the loads and stores that seem to be related to allocas use LD.E (load/store to generic memory) instead of the LD.local that the alloca-lowering pass should produce. Maybe it's an LLVM configuration problem or something. @yuanming-hu Any ideas?
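If the pass really isn't scheduled, forcing it through a legacy FunctionPassManager might look roughly like the sketch below. Caveats: createNVPTXLowerAllocaPass() is declared in lib/Target/NVPTX/NVPTX.h, which is an internal (non-installed) LLVM header, and as far as I can tell NVPTXTargetMachine already schedules this pass itself when the codegen opt level is above None, so this may be redundant; treat it as an illustration only (the helper name is made up):

```cpp
// Rough sketch: explicitly run the NVPTX alloca-lowering pass on every
// function before PTX emission. Assumes LLVM's NVPTX target is linked in
// and that lib/Target/NVPTX/NVPTX.h (an internal header) is on the include
// path so createNVPTXLowerAllocaPass() is visible. In-tree, this pass is
// normally followed by address-space inference, which is what actually
// turns the loads into ld.local.
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "NVPTX.h"  // declares llvm::createNVPTXLowerAllocaPass()

void lower_allocas_to_local(llvm::Module &module) {
  llvm::legacy::FunctionPassManager fpm(&module);
  fpm.add(llvm::createNVPTXLowerAllocaPass());
  fpm.doInitialization();
  for (llvm::Function &func : module)
    if (!func.isDeclaration())
      fpm.run(func);
  fpm.doFinalization();
}
```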
Thanks for all the feedback - TBH I'm not sure if the NVPTXLowerAlloca.cpp pass is included in the NVPTX codegen. At this point we don't have too much customization there:
https://github.com/taichi-dev/taichi/blob/a4279bd8e1aa6b7d0d269ff909773d333fab5daa/taichi/backends/cuda/jit_cuda.cpp#L164-L314
Maybe you can take a close look there :-)
I strongly agree that providing more information to the LD instruction may result in better performance :-) Maybe @xumingkuan can investigate this together with you.
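To make "more information" concrete: at the IR level this mostly means letting the backend see that the pointer is in NVPTX's local address space (5), so the load can lower to ld.local rather than a generic ld (LD.E). A hypothetical IRBuilder snippet (the helper name and the i32 slot type are just for illustration; this mimics the effect NVPTXLowerAlloca plus address-space inference are meant to achieve, it is not Taichi's codegen):

```cpp
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

// Illustration only: load through an addrspace(5) pointer so the NVPTX
// backend can emit ld.local instead of a generic ld.
// Assumes `slot` is an i32 alloca.
llvm::Value *load_from_local(llvm::IRBuilder<> &builder,
                             llvm::AllocaInst *slot) {
  constexpr unsigned kNVPTXLocalAS = 5;  // NVPTX local address space
  llvm::Type *i32 = builder.getInt32Ty();
  llvm::Value *local_ptr = builder.CreateAddrSpaceCast(
      slot, llvm::PointerType::get(i32, kNVPTXLocalAS));
  return builder.CreateLoad(i32, local_ptr);
}
```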
Another question regarding the memory model of CHI IR: during the execution of an offloaded task, if a load is carried out multiple times with the same arguments (same address), and the kernel containing that load never issues a store to that address, is it guaranteed to return the same result?
Yes, and ideally the "identical load elimination" (in "cfg (control-flow graph) optimization") can help eliminate the redundant loads.
Offloaded tasks are executed one by one on the backends.
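For anyone curious what that class of optimization looks like outside of CHI IR, here is a small LLVM-level analogue (not Taichi code; assumes a typed-pointer LLVM, e.g. LLVM 10): EarlyCSE folds the second of two identical loads when there is no intervening store.

```cpp
#include "llvm/AsmParser/Parser.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/Scalar.h"

int main() {
  // Two loads from the same address with no store in between.
  const char *ir = R"(
    define i32 @f(i32* %p) {
      %a = load i32, i32* %p
      %b = load i32, i32* %p
      %s = add i32 %a, %b
      ret i32 %s
    }
  )";
  llvm::LLVMContext ctx;
  llvm::SMDiagnostic err;
  auto m = llvm::parseAssemblyString(ir, err, ctx);
  if (!m)
    return 1;

  llvm::legacy::FunctionPassManager fpm(m.get());
  fpm.add(llvm::createEarlyCSEPass());  // includes redundant-load elimination
  fpm.doInitialization();
  fpm.run(*m->getFunction("f"));  // %b is replaced by %a
  fpm.doFinalization();
  m->print(llvm::outs(), nullptr);
}
```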
Structural Observations:
These performance numbers are NOT valid any more!
Performance measurements:
Cornell Box, 5000 samples per frame
SDF renderer, 5000 samples per frame
MPM128 4x Quality
FEM128 N = 24
Something interesting & weird:
Doing this process in every type of block, including ifs, improved cornell_box+CUDA to 1065. But doing so in other programs, or in the same program with OpenGL, the improvement is not seen.
This might be explained by SIMT divergence: if most threads are diverging, the bodies of the if statements are executed anyway.
However, this explanation does not make sense for the OpenGL vs CUDA case. Is the actual parallel execution model different between these backends? Could it be that the CUDA backend is launching too many threads per workgroup (thread block, in CUDA terms)?