what is happening here? (performance around some multiplications)

smithdh commented 2 years ago

Hi,

While testing some changes to the complex number class using the alpaka port I noticed the change in performance noted below, and thought I'd document/and or have some brainstorming about what is happening. This is with the gg->ttgg process.

Variant 1

ALPAKA_FN_ACC friend complex<FPType> operator*(const FPType &lhs, complex<FPType> rhs) {
const FPType s1 = lhs*rhs.r_;
const FPType s3 = lhs*rhs.i_;
rhs.r_ = s1;
rhs.i_ = s3;
return rhs;
}

EvtsPerSec[MatrixElems] (3) = ( 2.599119e+05 ) sec^-1

Variant 2

ALPAKA_FN_ACC friend complex<FPType> operator*(const FPType &lhs, complex<FPType> rhs) {
const FPType s1 = lhs*rhs.r_;
const FPType s3 = lhs*(rhs.r_+rhs.i_);
rhs.r_ = s1;
rhs.i_ = s3-s1;
return rhs;
}

EvtsPerSec[MatrixElems] (3) = ( 3.616479e+05 ) sec^-1

Variant 3

ALPAKA_FN_ACC friend complex<FPType> operator*(const FPType &lhs, complex<FPType> rhs) {
const FPType s1 = lhs*rhs.r_;
const FPType s3 = lhs*rhs.i_ + 1;
rhs.r_ = s1;
rhs.i_ = s3-1;
return rhs;
}

EvtsPerSec[MatrixElems] (3) = ( 3.669847e+05 ) sec^-1

So in terms of performance these go (3),(2),(1) with (3) being about 40% faster than (1). (The odd form of (2) was due to me initially simplifying the complex*complex calculation and writing the function in that form). I add some cubin disassemblies below:

smithdh commented 2 years ago

Cubin dissamblies (unusually, this was building objects to cubin rather than ptx (which then uses jit the first time you run), using "-gencode arch=compute_70,code=sm_70"). Approach was to write a small function to call the operator in isolation to allow easier isolation of the cubin.

e.g.

  template< typename T_Acc>
  ALPAKA_FN_ACC
  cxtype_sv dhstest( T_Acc const &acc, const fptype_sv s, cxtype_sv &cxio) {
    return s * cxio;
  }
#endi

Variant 1

                Function : _ZN5gProc7dhstestIN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEN8alsimple7complexIdEERKT_dRS8_
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;       /* 0x000000fffffff389 */
                                                                       /* 0x000fe200000e00ff */
        /*0010*/                   LD.E.128.SYS R12, [R10] ;           /* 0x000000000a0c7980 */
                                                                       /* 0x000ea4000010ed00 */
        /*0020*/                   DMUL R14, R8, R14 ;                 /* 0x0000000e080e7228 */
                                                                       /* 0x004fc80000000000 */
        /*0030*/                   DMUL R12, R8, R12 ;                 /* 0x0000000c080c7228 */
                                                                       /* 0x000e120000000000 */
        /*0040*/                   ST.E.128.SYS [R4], R12 ;            /* 0x0000000004007385 */
                                                                       /* 0x0011e4000010ed0c */
        /*0050*/                   RET.ABS.NODEC R20 0x0 ;             /* 0x0000000014007950 */
                                                                       /* 0x001fea0003e00000 */
        /*0060*/                   BRA 0x60;                           /* 0xfffffff000007947 */
                                                                       /* 0x000fc0000383ffff */
        /*0070*/                   NOP;                                /* 0x0000000000007918 */
                                                                       /* 0x000fc00000000000 */

2.

                Function : _ZN5gProc7dhstestIN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEN8alsimple7complexIdEERKT_dRS8_
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;       /* 0x000000fffffff389 */
                                                                       /* 0x000fe200000e00ff */
        /*0010*/                   LD.E.128.SYS R12, [R10] ;           /* 0x000000000a0c7980 */
                                                                       /* 0x000ea4000010ed00 */
        /*0020*/                   DADD R14, R12, R14 ;                /* 0x000000000c0e7229 */
                                                                       /* 0x004fc8000000000e */
        /*0030*/                   DMUL R12, R8, R12 ;                 /* 0x0000000c080c7228 */
                                                                       /* 0x000e100000000000 */
        /*0040*/                   DFMA R14, R8, R14, -R12 ;           /* 0x0000000e080e722b */
                                                                       /* 0x001e12000000080c */
        /*0050*/                   ST.E.128.SYS [R4], R12 ;            /* 0x0000000004007385 */
                                                                       /* 0x0011e4000010ed0c */
        /*0060*/                   RET.ABS.NODEC R20 0x0 ;             /* 0x0000000014007950 */
                                                                       /* 0x001fea0003e00000 */
        /*0070*/                   BRA 0x70;                           /* 0xfffffff000007947 */
                                                                       /* 0x000fc0000383ffff */

3.

                Function : _ZN5gProc7dhstestIN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEN8alsimple7complexIdEERKT_dRS8_
        .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;       /* 0x000000fffffff389 */
                                                                       /* 0x000fe200000e00ff */
        /*0010*/                   LD.E.128.SYS R12, [R10] ;           /* 0x000000000a0c7980 */
                                                                       /* 0x000ea4000010ed00 */
        /*0020*/                   DFMA R14, R8, R14, 1 ;              /* 0x3ff00000080e742b */
                                                                       /* 0x004e08000000000e */
        /*0030*/                   DMUL R12, R8, R12 ;                 /* 0x0000000c080c7228 */
                                                                       /* 0x000fc80000000000 */
        /*0040*/                   DADD R14, R14, -1 ;                 /* 0xbff000000e0e7429 */
                                                                       /* 0x001e120000000000 */
        /*0050*/                   ST.E.128.SYS [R4], R12 ;            /* 0x0000000004007385 */
                                                                       /* 0x0011e4000010ed0c */
        /*0060*/                   RET.ABS.NODEC R20 0x0 ;             /* 0x0000000014007950 */
                                                                       /* 0x001fea0003e00000 */
        /*0070*/                   BRA 0x70;                           /* 0xfffffff000007947 */
                                                                       /* 0x000fc0000383ffff */

So the first observation is that (3) is evidently a fused multiply-add and a plain multiply and is complied in that way, and infact (2)is also compiled with fma and multiply. And (1) is two plain multiplies. I guess the performance change has something to do with the availability of the units to process those instructions.

smithdh commented 2 years ago

But generally how can I spot this? Can one spot this by eye (given experience) or should one rely on a tool?

valassi commented 2 years ago

Hi @smithdh thanks a lot, this is asuper interesting! (As we discussed over coffee yesterday too).

One question to be sure: this is all only about cuda on the nvidia gpu, right? (For curiosity it would be interesting to see if we get anything vaguely similar on Intel CPUs, where the availability of FMA units may also be a bottleneck).

Another question for my ignorance: how do you produce those cubin disassemblies? What is the exact command you must type?

That said, on the specific issue, it's hard to understand how one can spot that and it's even harder to understand how we should write the code.

About how to spot it: frankly, I would not know a better suggestion that what you did, I think you did it great here! It's also quite nice that this cubin disassembly is relatively understandable even for someone who has never seen a cubin disassembly before... Thanks a lot for having insisted in getting to the bottom of this question.
About what to do: if this was in auto-generated code, I would probably advice to stick to the simpler code even if slower, as it becomes very unpredictable to know what performance will look like, and your variant 1 is that with the smallest number of arithmetic opertaions. However, this is in a custom library (for complex numbers), so maybe it makes sense to keep the counter-intuitive code change that improves performance? I would suggest testing the two options for different processes however, ggtt, ggttg, ggttgg, ggttggg etc... (but the fact that you get an improvement for the complex process ggttgg is very good)

smithdh commented 2 years ago

Hi Andrea,

The target was our usual nvidia V100 - using alpaka but I believe one can think of it as ending up as cuda compiled with nvcc. (I mean Alpaka is header only, cupla does have some compiled elements; but nvcc is still producing the code). The command to display the assembly from an object file was:

cuobjdump --dump-sass alpCPPProcess.o

then I searched through the output to find my method name; I think there may be a way to only display a given function..

I believe this assembly is called "cubin" and should be what is run on the device. I think with our usual commands we build with PTX only, which looks a bit like assembly (and can also be dumped with cuobjdump --dump-ptx ) but I believe it can have further optimisation done before it becomes cubin (possibly by a JIT at runtime, I don't know if that has different optimisation levels), so it seemed better to look at the cubin. I had to change the nvcc option

-arch=compute_70

to

-gencode arch=compute_70,code=sm_70

I noticed the final resulting binary seemed to run without problem, although I only used this "gencode" option to run cuobjdump --dump-sass on the .o file.

smithdh commented 2 years ago

Andrea has asked:

: this is all only about cuda on the nvidia gpu, right?

Ah perhaps you are wondering if this is something we can gain on with plain CUDA (no alpaka) too. I don't know. My plain CUDA of goldenX4 for this process (which is using thrust for the complex numbers) is

EvtsPerSec[MatrixElems] (3) = ( 3.908857e+05

which is still faster than what I've reached with alpaka (when if I try thrust with alpaka, which gave 2.94e+05 ). It could be something can be gained with changes to cxsmpl (from plain cuda on master) in a similar way, but I can't confirm it.

valassi commented 2 years ago

Andrea has asked:

: this is all only about cuda on the nvidia gpu, right?

Ah perhaps you are wondering if this is something we can gain on with plain CUDA (no alpaka) too. I don't know.

Sorry I was not clear: no, I am wondering if something similar can be seen in C++ on Intel CPUs (no GPUs, no Nvidia, no Alpaka, no CUDA). But you also have a point, maybe rewriting the expressions in a similar way would also help in direct CUDA (no alpaka) on nvidia GPUs. I opned #409 about that

smithdh commented 2 years ago

Thanks. Good idea.

smithdh commented 2 years ago

Actually I suppose I haven't proved it is connected to FMA but there seems to be so little else different between those cubin codes.

smithdh commented 2 years ago

Hi,

I'm almost certainly wrong, it's really nothing directly to do with the multiplications (fused or plain): Looking at the difference in the list of functions in my object file (alpCPPProcess.o) between (1) and (3)

***************
*** 1,5 ****
--- 1,7 ----
        Function : _ZN5gProc7dhstestIN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEN8alsimple7complexIdEERKT_dRS8_
        Function : __cuda_sm20_div_f64_slowpath_v2
+       Function : _ZN6MG5_sm6FFV1_2IN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEvRKT_PKN8alsimple7complexIdEESD_SB_ddPSB_
+       Function : _ZN6MG5_sm6FFV1_1IN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEvRKT_PKN8alsimple7complexIdEESD_SB_ddPSB_
        Function : __cuda_sm20_dsqrt_rn_f64_mediumpath_v1
        Function : _ZN5gProc23calculate_wavefunctionsIN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEvRKT_iPKdPd
        Function : _ZNK5gProc8sigmaKinclIN6alpaka12AccGpuCudaRtISt17integral_constantImLm3EEjEEEEvRKT_PKdPd

i.e. in the faster code FFV1_1 and FFV1_2 are not inlined, whereas in the slower variant they appear to be inlined. If I force no-inling in those two functions by adding attribute((noinline)) to their declaration in HelAmps_sm.h, then the timing for the two method variants show above are:

EvtsPerSec[MatrixElems] () = ( 3.707419e+05 ) sec^-1
EvtsPerSec[MatrixElems] () = ( 3.670428e+05 ) sec^-1

so actually they are quite similar, possibly with the simplest form in (1) being best. It was rather unexpected (to me) that the difference between (1)+(3) would change the inlining decision of those FFV functions, but good to have learnt that it does. This is not answer the question of what is better in the case two 2 functions are not-inlined, and if it applies only for those two functions, or all of them, and if this would necessarily be the case on other devices etc etc.

smithdh commented 2 years ago

Will close this issue now. If there are other problems or I learn interesting facts about the inlining effects I'll open another issue. Thanks.

madgraph5 / madgraph4gpu

what is happening here? (performance around some multiplications) #408