taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

[perf] Weird performance observations on CUDA (abnormally high amount of global memory access) #2324

Open bobcao3 opened 3 years ago

bobcao3 commented 3 years ago

Structural Observations:

  1. Many programs contain loops, and graphics programs certainly contain a lot of them. Very often a loop walks the same data structure and performs the same computations, just on different data.
  2. If the operands of a statement with no side effects are all not from the current scope, they can only come from enclosing scopes (assuming the IR is well formed and we have access to those operands).
  3. Moving a statement that does not depend on anything inside a loop one layer up removes redundant computation; the more iterations the loop has, the more redundancy is removed (see the sketch after this list).
  4. Doing so may or may not increase register pressure. If all operands are still in use after the moved statement, register pressure inside the loop increases up to the moved statement's position. If the operands are not used later, register pressure stays the same or even decreases.
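
A minimal sketch of observation 3, hoisting a loop-invariant computation one layer up (field names and shapes are illustrative, not from the issue):

```python
import taichi as ti

ti.init(arch=ti.cuda)

n, m = 512, 64
x = ti.field(ti.f32, shape=n)
y = ti.field(ti.f32, shape=(n, m))

@ti.kernel
def before():
    for i in range(n):              # parallel outer loop
        for j in range(m):
            # ti.sin(x[i]) depends on nothing inside the j loop,
            # yet is recomputed m times per i
            y[i, j] = ti.sin(x[i]) * j

@ti.kernel
def after():
    for i in range(n):
        s = ti.sin(x[i])            # moved one layer up: computed once per i
        for j in range(m):
            y[i, j] = s * j         # `s` now lives across the inner loop (observation 4)
```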

These performance numbers are NOT valid any more!

Performance measurements:

(test) (backend): (without) -> (with, including nested ifs) / (with, excluding nested ifs)

Cornell Box, 5000 samples per frame

SDF renderer, 5000 samples per frame

MPM128 4x Quality

FEM128 N = 24

Something interesting & weird:

Applying this process to every type of block, including ifs, improved cornell_box + CUDA to 1065. But doing the same in other programs, or in the same program on OpenGL, shows no such improvement.

This might be explained by SIMT divergence: if most threads diverge, the if branches are executed anyway.

However, this explanation does not make sense for the OpenGL vs. CUDA case. Is the actual parallel execution model different between these systems? Could it be that the CUDA backend is launching too many threads per workgroup (thread block, in CUDA terms)?

bobcao3 commented 3 years ago

Another question regarding the memory model of CHI IR: during the execution of an offload, if a load is carried out multiple times with the same arguments (the same address), and the kernel the load belongs to issues no store to that address, is it guaranteed to return the same result?
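
To make the question concrete, here is a sketch (illustrative fields, not from the issue) where the same address is loaded twice with no intervening store:

```python
import taichi as ti

ti.init(arch=ti.cuda)

x = ti.field(ti.f32, shape=16)
y = ti.field(ti.f32, shape=16)

@ti.kernel
def repeated_loads():
    for i in x:
        a = x[i] + 1.0   # first load of x[i]
        b = x[i] * 2.0   # second load of the same address; no store to x[i] anywhere
        y[i] = a + b     # may the two loads be assumed to return the same value?
```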

bobcao3 commented 3 years ago

OpenGL & NSight Graphics: [screenshot]

CUDA & NSight Compute: [screenshot]

(If you want to get profiling to work on OpenGL, add a glFinish call to your GUI platform's redraw call so that NSight or similar tools can latch onto a "frame". You don't need to do this for NSight Compute / CUDA.)
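
In code, the tip amounts to something like this (a sketch assuming a PyOpenGL-based redraw path; `draw_scene` and `swap_buffers` are hypothetical placeholders):

```python
from OpenGL.GL import glFinish

def redraw(window):
    draw_scene()            # placeholder: issue this frame's GL draw calls
    glFinish()              # block until the GPU drains, so NSight can delimit a "frame"
    window.swap_buffers()   # placeholder: platform-specific buffer swap
```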

After some profiling, the first striking thing is that the CUDA backend emits an enormous number of memory operations, while the OpenGL backend barely touches memory at all. This is clearly visible in the very different VRAM and L2 SOL metrics. It may explain why moving statements even out of ifs improves performance on CUDA: I suspect these memory operations are all temporary variables.

bobcao3 commented 3 years ago

Related https://github.com/taichi-dev/taichi/issues/1610

bobcao3 commented 3 years ago

There are quite a lot of allocas generated in the IR, and the OpenGL backend simply maps them to variables. Might this be the culprit?

bobcao3 commented 3 years ago

Found this: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXLowerAlloca.cpp

Is there a way to verify whether this pass is run or not? Do we need to add it via the function_pass_manager?

With some profiling in NSight Compute, I found that all the loads and stores that seem to be related to allocas use LD.E (load/store to generic memory) instead of LD.local, as the alloca-lowering pass would suggest. Maybe an LLVM configuration problem or something. @yuanming-hu Any ideas?
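
One way to check from the Python side is to dump the generated PTX and look at the load/store flavors (this assumes the `print_kernel_nvptx` config flag is available in this build; `ld.local`/`st.local` vs. generic `ld`/`st` tells whether the allocas were lowered to local memory):

```python
import taichi as ti

# Assumption: print_kernel_nvptx is a supported CompileConfig flag in this build.
ti.init(arch=ti.cuda, print_kernel_nvptx=True)

x = ti.field(ti.f32, shape=16)

@ti.kernel
def poke():
    for i in x:
        t = x[i] * 2.0   # goes through an alloca in the unoptimized IR
        x[i] = t + 1.0

poke()  # inspect the dumped PTX for ld.local/st.local vs. generic ld/st
```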

yuanming-hu commented 3 years ago

Thanks for all the feedback - TBH I'm not sure if the NVPTXLowerAlloca.cpp pass is included in the NVPTX codegen. At this point we don't have too much customization there: https://github.com/taichi-dev/taichi/blob/a4279bd8e1aa6b7d0d269ff909773d333fab5daa/taichi/backends/cuda/jit_cuda.cpp#L164-L314

Maybe you can take a close look there :-)

I strongly agree that providing more information to the LD instruction may result in better performance :-) Maybe @xumingkuan can help investigate this.

xumingkuan commented 3 years ago

> Another question regarding the memory model of CHI IR: during the execution of an offload, if a load is carried out multiple times with the same arguments (the same address), and the kernel the load belongs to issues no store to that address, is it guaranteed to return the same result?

Yes, and ideally the "identical load elimination" pass (part of the CFG, i.e. control-flow graph, optimization) can help eliminate the redundant loads.

Offloaded tasks are executed one by one on the backends.
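
For example (a sketch): each top-level loop in a kernel becomes its own offloaded task, so the second loop below only starts after the first finishes and therefore sees all of its stores.

```python
import taichi as ti

ti.init(arch=ti.cuda)

x = ti.field(ti.f32, shape=16)
y = ti.field(ti.f32, shape=16)

@ti.kernel
def two_offloads():
    for i in x:              # offloaded task 1
        x[i] = i
    for i in x:              # offloaded task 2: runs after task 1 completes
        y[i] = x[i] * 2.0
```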