valassi closed this issue 3 years ago
I tried to understand whether there was any other analysis to look into this. I could only find hints about how to do it with the old tools nvprof and nvvp (I guess the profile files should be .nvvp?):
nvprof -o pippo.prof -a branch ./gcheck.exe -p 65536 128 1
nvvp
This is a screenshot from nvvp on that profile. It just says there are no issues with divergent branches, without any more details. I guess it uses the same metrics as the "stalled barrier" statistic? Anyway, I think there really are no issues.
Note that this is also relevant to vectorisation, #71 and #72.
The fact that we get almost the full factor 4 from AVX2 is a sign that we have no divergence on the CPU.
We should keep this open to reevaluate when we add a selection cut.
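As a quick back-of-envelope check of the "factor 4" expectation mentioned above (illustrative arithmetic only, not project code):

```python
# AVX2 registers are 256 bits wide, so each holds 4 double-precision
# (64-bit) values. With no divergence (all SIMD lanes doing useful work),
# the ideal vectorisation speedup for doubles is therefore 4x, which is
# why "almost the full factor 4" suggests no divergence on the CPU.

avx2_bits = 256
double_bits = 64
ideal_speedup = avx2_bits // double_bits
print(ideal_speedup)  # 4
```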
After a few months I have come back to this issue with two improvements:
(1) NEW TESTS AND METRICS
The code is in PR #202 and #203
The main metric is sm__sass_average_branch_targets_threads_uniform.pct https://github.com/madgraph5/madgraph4gpu/blob/2510b365c1470f56ea784db2726367fd52f045b1/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/throughput12.sh#L157 This metric should be 100% for uniform execution (i.e. no divergence) and below 100% in the presence of divergence.
Unfortunately, it is difficult to translate a percentage of non-uniformity into a throughput degradation. In the example below, I get a 96% uniformity, but the throughput degradation is around 20-30%, not just 4%!
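The mismatch between "96% uniform" and a 20-30% slowdown can be understood with a toy cost model (all numbers here are hypothetical, chosen only to illustrate the mechanism): in SIMT execution a warp that diverges into two paths executes them serially, so the cost of the divergent region is the sum of both path costs, regardless of how few branches are involved.

```python
# Toy warp-serialization model (hypothetical costs, not measured values):
# a warp that diverges runs both code paths one after the other, so the
# divergent region costs the SUM of the two paths instead of just one.

common = 300        # kernel work executed uniformly by the whole warp
fast_path = 50      # e.g. an optimized opzxxx-like helper
slow_path = 100     # e.g. a generic oxxxxx-like helper

uniform_cost = common + fast_path                # no divergence
divergent_cost = common + fast_path + slow_path  # both paths serialized

degradation = 1 - uniform_cost / divergent_cost
print(f"throughput degradation ~ {degradation:.0%}")  # ~ 22%
```

A large slowdown is thus compatible with only a handful of divergent branches, if those branches guard expensive code.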
The test is this https://github.com/madgraph5/madgraph4gpu/blob/2510b365c1470f56ea784db2726367fd52f045b1/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc#L118 Essentially, in half of the threads in a warp I use the default optimized opzxxx, and in the other half I use the non-optimized oxxxxx.
The actual "4%" seems to be computed in the following way: the code contains a number of "branches" in total, each of which can be either uniform (taken the same way by all threads in a warp) or divergent (taken by some threads in the warp but not all). In my test of the current eemumu cuda, WE HAVE NO DIVERGENCE: there are 53 branches, and all 53 are taken in a uniform way. I guess these 53 include function calls and other possible decision points (or maybe we actually have many ifs...). If I introduce a very silly/simple divergence as above, the number of branches goes from 53 to 109, and the profiler reports 4 non-uniform branches and 105 uniform branches. The 105/109 is 96.33%. Not really helpful for translating to throughputs, but that's it. WE SHOULD AIM TO STAY AT 100% UNIFORM BRANCH EXECUTION.
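The percentage reported by the profiler can be reproduced directly from the two branch counters (values taken from the ncu output reported below):

```python
# Reproducing ncu's uniformity percentage from its raw counters:
# sm__sass_average_branch_targets_threads_uniform.pct is simply
# uniform branch targets / total branch targets.

total_branches = 109     # smsp__sass_branch_targets.sum (divergent test)
uniform_branches = 105   # smsp__sass_branch_targets_threads_uniform.sum

pct = 100 * uniform_branches / total_branches
print(f"{pct:.2f}%")  # 96.33%
```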
This is with the artificial divergence https://github.com/madgraph5/madgraph4gpu/commit/b51bee6b03316b898dc34775f82b519dfab539ca
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.711994e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.745683 sec
2,603,540,638 cycles # 2.655 GHz
3,537,849,260 instructions # 1.36 insn per cycle
1.049477458 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
: smsp__sass_branch_targets.sum 109 4.18/usecond
: smsp__sass_branch_targets_threads_uniform.sum 105 4.03/usecond
: smsp__sass_branch_targets_threads_divergent.sum 4 153.37/msecond
: smsp__warps_launched.sum 1
=========================================================================
This is without divergence (I also include ggttgg) https://github.com/madgraph5/madgraph4gpu/commit/aaa28b7f81206639e9aba486ef1a76d23fc0d775
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.425099e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.741551 sec
2,589,547,187 cycles # 2.655 GHz
3,537,039,425 instructions # 1.37 insn per cycle
1.044156654 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_uniform.sum 53 2.89/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.454874e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.602111 sec
2,193,960,041 cycles # 2.654 GHz
2,948,877,241 instructions # 1.34 insn per cycle
0.885704400 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
: smsp__sass_branch_targets.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_uniform.sum 17,683 1.52/usecond
: smsp__sass_branch_targets_threads_divergent.sum 0 0/second
: smsp__warps_launched.sum 1
=========================================================================
Note that in the tests above I use -p 1 32 1, which launches only one warp (32 threads) in total.
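As a sanity check of the launch configuration (assuming, consistently with the note above, that -p takes the number of blocks, threads per block, and iterations):

```python
# With -p 1 32 1 we launch 1 block of 32 threads, i.e. exactly one warp,
# matching smsp__warps_launched.sum = 1 in the ncu output above.

import math

WARP_SIZE = 32
blocks, threads_per_block, iterations = 1, 32, 1

warps = blocks * math.ceil(threads_per_block / WARP_SIZE)
print(warps)  # 1
```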
These tests above are using ncu with the command line interface.
(2) COMMENTS ON THE OLD TESTS IN THIS THREAD
Concerning my previous comments on stalled barriers: this metric does not seem very useful for measuring thread divergence. At least in my simple oxxxxx/opzxxx test, the ncu metrics about stalled barriers were not helpful.
I did a few more tests with ncu using the GUI, which is also interesting.
ALL IN ALL, THIS SHOWS THAT EVEN A MINIMAL DIVERGENCE CAUSES BIG BIG ISSUES...
Notice that the stalled barrier metric that I had mentioned before, instead, does not seem to have any relevance: in this example it is zero both for the divergent and for the uniform test.
Finally, THREAD DIVERGENCE IS INDICATED IN THE FINAL NCU SECTION, SOURCE COUNTERS. This is also the section that complains about uncoalesced memory accesses.
In the default non-divergent code version, I am told 100% branch efficiency, and no uncoalesced memory accesses are reported.
(3) ABOUT NVVP
About my previous comments on the older nvvp tool: I will not reproduce those tests here. I showed that using ncu, either in command-line or GUI mode, is enough to check whether there is thread divergence.
Finally, a few useful links about branch efficiency:
I think that this can be closed
Eventually, if we do start having branch divergence (hopefully not), I think it should be possible to correlate the throughput degradation to the specific divergent branches (as discussed also in the three links above).
Closing as completed...
PS, one last comment: I also checked the utilization of the ADU pipeline (address divergence unit) https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-decoder
However, in my example there does not seem to be a big difference (if anything, the ADU is slightly busier with no divergence?).
Just a note as a reminder, following up on 'SIMD/SIMT' issues. After investigating SoA/AoS data access and showing that we have no uncoalesced memory access for momenta (issue #16), I was wondering how best to check in the profiler whether we have issues with divergent branches, i.e. threads in our warps which go out of 'lockstep'.
The only reference I found in the profiler doc is here https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#statistical-sampler
If I understand correctly, this means that we should see "Stalled Barrier" in the warp statistics. This seems to be always at zero.
I would say that we have no issues with branch divergence. Not surprising really, as all threads are doing exactly the same operations...