valassi closed this issue 3 years ago
I tried to understand whether there was any other analysis to look into this. I could only find hints about how to do it with the old tools nvprof and nvvp (I guess the profile files should be .nvvp?):
nvprof -o pippo.prof -a branch ./gcheck.exe -p 65536 128 1
nvvp
This is a screenshot from nvvp on that profile. It just says there are no issues with divergent branches, without any more details. I guess it uses the same metrics as the "stalled barrier" statistic? Anyway, I think there really are no issues.
Note that this is also relevant to vectorisation, #71 and #72.
The fact that we get almost the full factor 4 from AVX2 is a sign that we have no divergence on the CPU.
We should keep this open to reevaluate when we add a selection cut.
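As a quick back-of-envelope check of the "factor 4" expectation mentioned above (illustrative arithmetic only, not project code):

```python
# AVX2 registers are 256 bits wide, so each holds 4 double-precision
# (64-bit) values. With no divergence (all SIMD lanes doing useful work),
# the ideal vectorisation speedup for doubles is therefore 4x, which is
# why "almost the full factor 4" suggests no divergence on the CPU.

avx2_bits = 256
double_bits = 64
ideal_speedup = avx2_bits // double_bits
print(ideal_speedup)  # 4
```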
After a few months I have come back to this issue with two improvements:
(1) NEW TESTS AND METRICS
The code is in PR #202 and #203
The main metric is sm__sass_average_branch_targets_threads_uniform.pct https://github.com/madgraph5/madgraph4gpu/blob/2510b365c1470f56ea784db2726367fd52f045b1/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/throughput12.sh#L157 This metric should be 100% for uniform execution (i.e. no divergence) and below 100% in the presence of divergence.
Unfortunately, it is difficult to translate a percentage of non-uniformity into a throughput degradation. In the example below, I get a 96% uniformity, but the throughput degradation is around 20-30%, not just 4%!
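The mismatch between "96% uniform" and a 20-30% slowdown can be understood with a toy cost model (all numbers here are hypothetical, chosen only to illustrate the mechanism): in SIMT execution a warp that diverges into two paths executes them serially, so the cost of the divergent region is the sum of both path costs, regardless of how few branches are involved.

```python
# Toy warp-serialization model (hypothetical costs, not measured values):
# a warp that diverges runs both code paths one after the other, so the
# divergent region costs the SUM of the two paths instead of just one.

common = 300        # kernel work executed uniformly by the whole warp
fast_path = 50      # e.g. an optimized opzxxx-like helper
slow_path = 100     # e.g. a generic oxxxxx-like helper

uniform_cost = common + fast_path                # no divergence
divergent_cost = common + fast_path + slow_path  # both paths serialized

degradation = 1 - uniform_cost / divergent_cost
print(f"throughput degradation ~ {degradation:.0%}")  # ~ 22%
```

A large slowdown is thus compatible with only a handful of divergent branches, if those branches guard expensive code.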
The test is this https://github.com/madgraph5/madgraph4gpu/blob/2510b365c1470f56ea784db2726367fd52f045b1/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc#L118 Essentially, in half of the threads in a warp I use the default optimized opzxxx, and in the other half I use the non-optimized oxxxxx.
The actual "4%" seems to be computed in the following way: the code contains a number of "branches" in total, each of which can be either uniform (taken the same way by all threads in a warp) or divergent (taken by some threads in the warp but not all). In my test of the current eemumu cuda, WE HAVE NO DIVERGENCE: there are 53 branches, and all 53 are taken in a uniform way. I guess these 53 include function calls and other possible decision points (or maybe we actually have many ifs...). If I introduce a very silly/simple divergence as above, the number of branches goes from 53 to 109, and the profiler reports 4 non-uniform branches and 105 uniform branches. The 105/109 is 96.33%. Not really helpful for translating to throughputs, but that's it. WE SHOULD AIM TO STAY AT 100% UNIFORM BRANCH EXECUTION.
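The percentage reported by the profiler can be reproduced directly from the two branch counters (values taken from the ncu output reported below):

```python
# Reproducing ncu's uniformity percentage from its raw counters:
# sm__sass_average_branch_targets_threads_uniform.pct is simply
# uniform branch targets / total branch targets.

total_branches = 109     # smsp__sass_branch_targets.sum (divergent test)
uniform_branches = 105   # smsp__sass_branch_targets_threads_uniform.sum

pct = 100 * uniform_branches / total_branches
print(f"{pct:.2f}%")  # 96.33%
```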
This is with the artificial divergence https://github.com/madgraph5/madgraph4gpu/commit/b51bee6b03316b898dc34775f82b519dfab539ca
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.711994e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.745683 sec
2,603,540,638 cycles # 2.655 GHz
3,537,849,260 instructions # 1.36 insn per cycle
1.049477458 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 109 4.18/usecond
: smsp__sass_branch_targets_threads_uniform.sum 105 4.03/usecond
: smsp__sass_branch_targets_threads_divergent.sum 4 153.37/msecond
: smsp__warps_launched.sum 1
=========================================================================
This is without divergence (I also include ggttgg) https://github.com/madgraph5/madgraph4gpu/commit/aaa28b7f81206639e9aba486ef1a76d23fc0d775
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.425099e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.741551 sec
2,589,547,187 cycles # 2.655 GHz
3,537,039,425 instructions # 1.37 insn per cycle
1.044156654 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_uniform.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.454874e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.602111 sec
2,193,960,041 cycles # 2.654 GHz
2,948,877,241 instructions # 1.34 insn per cycle
0.885704400 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_uniform.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
=========================================================================
Note that in the tests above I use -p 1 32 1, which launches only one warp (32 threads) in total.
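As a sanity check of the launch configuration (assuming, consistently with the note above, that -p takes the number of blocks, threads per block, and iterations):

```python
# With -p 1 32 1 we launch 1 block of 32 threads, i.e. exactly one warp,
# matching smsp__warps_launched.sum = 1 in the ncu output above.

import math

WARP_SIZE = 32
blocks, threads_per_block, iterations = 1, 32, 1

warps = blocks * math.ceil(threads_per_block / WARP_SIZE)
print(warps)  # 1
```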
These tests above are using ncu with the command line interface.
(2) COMMENTS ON THE OLD TESTS IN THIS THREAD
Concerning my previous comments on stalled barriers: this metric does not seem very useful for measuring thread divergence. At least in my simple oxxxxx/opzxxx test, the ncu metrics about stalled barriers were not helpful.
I did a few more tests with ncu using the GUI, which is also interesting.
ALL IN ALL, THIS SHOWS THAT EVEN A MINIMAL DIVERGENCE CAUSES BIG BIG ISSUES...
Notice that the stalled barrier metric that I had mentioned before, instead, does not seem to have any relevance: in this example it is zero both for the divergent and for the uniform test.
Finally, THREAD DIVERGENCE IS INDICATED IN THE FINAL NCU SECTION, SOURCE COUNTERS. This is also the section that complains about uncoalesced memory accesses.
In the default non-divergent code version, I am told 100% branch efficiency, and no uncoalesced memory accesses are reported.
(3) ABOUT NVVP
About my previous comments on the older nvvp tool: I will not reproduce those tests here. I showed that using ncu, either in command-line or GUI mode, is enough to check whether there is thread divergence.
Finally, a few useful links about branch efficiency:
I think that this can be closed
Eventually, if we do start having branch divergence (hopefully not), I think it should be possible to correlate the throughput degradation to the specific divergent branches (as discussed also in the three links above).
Closing as completed...
PS, one last comment: I also checked the utilization of the ADU pipeline (address divergence unit) https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-decoder
However, in my example there does not seem to be a big difference (if anything, the ADU is slightly busier with no divergence?).
Just a note as a reminder, following up on 'SIMD/SIMT' issues. After investigating SoA/AoS data access and showing that we have no uncoalesced memory access for momenta (issue #16), I was wondering how best to check in the profiler whether we have issues with divergent branches, i.e. threads in our warps which go out of 'lockstep'.
The only reference I found in the profiler doc is here https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#statistical-sampler
If I understand correctly, this means that we should see "Stalled Barrier" in the warp statistics. This seems to be always at zero.
I would say that we have no issues with branch divergence. Not surprising really, as all threads are doing exactly the same operations...