LonelyCat124 opened this issue 1 year ago
Tracer Advection + VTune: Problem size:
This seems to be primarily memory bound, with cross-NUMA accesses making up a significant fraction of memory accesses. I'll show results here with both proc-bind settings. I disabled the checksum and IO routine at the end of the simulation as it massively warps the results.
export OMP_PLACES=cores
export OMP_PROC_BIND=close/spread
close proc bind:
A couple of things I note for this:
- NUMA remote accesses are very high (66%).
- Parallel region time is good (99.4%), mostly with high physical core utilisation.
- I compiled this example with -xCORE-AVX512, but vectorisation is still only 256-bit; I'm not sure why.
- CPI is 7.843, which is pretty bad since in theory I think we could be looking at <1.
spread proc bind:
It looks like either way the NUMA behaviour is not good, which may be why we're only seeing 256-bit vectorisation: perhaps the compiler doesn't see wider vectors as worthwhile. I will try a memory analysis with close proc bind and see what we get.
Looking at the memory analysis, it seems that loops which access the mydomain array have higher average latency (in cycles) than other loops, though it's not clear why that might be. The worst offender is:
!$omp parallel default(shared), private(idx_12,idx_13,jk)
!$omp do schedule(auto)
do jk = 2, jpk - 1, 1
  do idx_12 = LBOUND(zwx, 2), UBOUND(zwx, 2), 1
    do idx_13 = LBOUND(zwx, 1), UBOUND(zwx, 1), 1
      zwx(idx_13,idx_12,jk) = tmask(idx_13,idx_12,jk) * (mydomain(idx_13,idx_12,jk - 1) - mydomain(idx_13,idx_12,jk))
    enddo
  enddo
enddo
!$omp end do
!$omp end parallel
The reported average latency for this loop is 1288 cycles (the best loop averages 167). The next worst averages 1030:
!$omp parallel default(shared), private(ji,jj,jk,ztra)
!$omp do schedule(auto)
do jk = 1, jpk - 1, 1
  do jj = 2, jpj - 1, 1
    do ji = 2, jpi - 1, 1
      ztra = -zbtr * (zwx(ji,jj,jk) - zwx(ji,jj,jk + 1))
      mydomain(ji,jj,jk) = ztra
    enddo
  enddo
enddo
!$omp end do
!$omp end parallel
Both loops have relatively low arithmetic intensity, but are by no means the lowest.
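For a rough sense of scale (a back-of-the-envelope estimate of mine, assuming 8-byte reals and no cache reuse between jk planes), the worst loop above does about:
2 flops per point (one subtract, one multiply)
4 accesses x 8 bytes = 32 bytes per point (tmask, two mydomain values, the zwx store)
arithmetic intensity ~ 2 / 32 ~ 0.06 flops/byte
i.e. far more data moved than arithmetic done, which is consistent with the memory-bound behaviour above.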
I'm still not clear why we have 63% remote accesses: the arrays are sensibly initialised, so the only option is if schedule(auto) does something unusual. I will try changing to a static schedule and see how that affects the memory access patterns.
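For concreteness, the change for the worst loop above would only touch the schedule clause; a sketch of what I mean (not generator output):
!$omp parallel default(shared), private(idx_12,idx_13,jk)
!$omp do schedule(static)
do jk = 2, jpk - 1, 1
  do idx_12 = LBOUND(zwx, 2), UBOUND(zwx, 2), 1
    do idx_13 = LBOUND(zwx, 1), UBOUND(zwx, 1), 1
      ! Same body as before; with a static schedule each thread gets the same
      ! contiguous block of jk iterations every time-step, so pages placed by
      ! first touch should stay local to the thread that keeps using them.
      zwx(idx_13,idx_12,jk) = tmask(idx_13,idx_12,jk) * (mydomain(idx_13,idx_12,jk - 1) - mydomain(idx_13,idx_12,jk))
    enddo
  enddo
enddo
!$omp end do
!$omp end parallel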
Switching from schedule(auto) to schedule(static) completely changes the memory analysis. Cross-NUMA memory accesses drop to almost 0:
The LLC miss rate increases slightly, but the worst loop's average latency drops by almost half, to 759 cycles. I'm going to rerun the base analysis now to see how the CPI changes from this too.
CPI is down to 5.005 with this change; I'll see how the actual runtime changed.
Runtime with static:
Thread 0
Initialisation 1 0.31572E+00 0.31572E+00 0.31572E+00 0.00E+00
Time-stepping 1 0.90016E+02 0.90016E+02 0.18003E+00 0.00E+00
Runtime with auto:
Thread 0
Initialisation 1 0.32153E+00 0.32153E+00 0.32153E+00 0.00E+00
Time-stepping 1 0.13415E+03 0.13415E+03 0.26830E+00 0.00E+00
Looks like swapping from auto to static with the Intel compiler is roughly a 40% runtime improvement for this case (time-stepping 90.0 s vs 134.2 s) @sergisiso
I'm going to see if I can generate a roofline next.
Tracer advection roofline: this matches exactly what VTune shows. The tracer advection benchmark is DRAM limited; the main performance improvement would be finding a way to improve cache efficiency or raise the arithmetic intensity, but given the nature of the problem that seems unlikely.
The largest nodes are the loops containing:
z0u = SIGN(0.5d0, pun(ji,jj,jk))
and
z0w = SIGN(0.5d0, pwn(ji,jj,jk + 1))
The second of these also shows some gap between its achieved performance and the roofline. That second loop is the one with peel and remainder loops; I expect it is hard for PSyclone to do much about that for now, though.
Which version of PSyclone are you using: master or the latest release? What script do you use for generation?
I thought in master we generate no schedule clause by default instead of auto (which we did for a short period).
To change it you can use omp_loop_trans.omp_schedule = "none" or "static". By "none" we mean that we don't want a clause at all (which is implementation defined, and different compilers do different things depending on the architecture).
I'm on master (downloaded yesterday), with the default make tra_adv_omp_cpu script.
From looking in PSyclone, OMPLoopTrans has this init signature:
def __init__(self, omp_directive="do", omp_schedule="auto"):
so I guess the default here is auto, even though for the directive it's set to none.
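For reference, overriding that default in a transformation script looks something like the following (a minimal sketch using only the attribute quoted above; the rest of the generation script is assumed unchanged):
from psyclone.transformations import OMPLoopTrans

# Defaults to omp_schedule="auto" as per the __init__ shown above.
omp_loop_trans = OMPLoopTrans()
# Override before applying: "static" forces a static schedule, while "none"
# omits the clause entirely and leaves the choice to the OpenMP runtime.
omp_loop_trans.omp_schedule = "static"
# ...then apply it to each target loop as the existing script already does.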
NemoLite2D has very different characteristics as far as I can tell.
Problem size: 4096x4096, 1k timesteps.
I tried to remove a lot of the serial writes and similar, but there still appears to be a lot present.
I need to dive down a bit more into this at some point, but:
From looking at the threading, I'm curious to see how the taskloop version compares. I think most of the imbalance is due to the NUMA accesses in the heavier kernels, but the imbalance per loop per timestep is probably low (~3 ms); I wonder whether tasking has lower imbalance, and at what compute cost (see the sketch below).
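What I have in mind by the taskloop version is roughly the following shape (an illustrative sketch only, not PSyclone-generated code; some_kernel and the loop bounds are placeholders):
!$omp parallel default(shared), private(ji, jj)
!$omp single
!$omp taskloop grainsize(4) private(ji)
do jj = 2, jpj - 1
  do ji = 2, jpi - 1
    ! Placeholder for one of the heavier kernel bodies; grainsize controls how
    ! many jj iterations go into each task, trading balance against overhead.
    call some_kernel(ji, jj)
  end do
end do
!$omp end taskloop
!$omp end single
!$omp end parallel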
OK - I set up MAQAO on Scafellpike, and I'm testing it with NemoLite2D at the moment.
Command used: maqao.intel64 oneview -R1 --config=nemolite_ov_bsub.lua -xp=ov_nemolite
scripts to run:
lsf file
#BSUB -J test
#BSUB -e test.err
#BSUB -o test.out
#BSUB -x
#BSUB -n <number_nodes>
#BSUB -R "span[ptile=<number_processes_per_node>]"
#BSUB -W 01:00
export OMP_NUM_THREADS=32
export OMP_PLACES=cores
export OMP_PROC_BIND=close
<mpi_command> <run_command>
MAQAO lua file (minus comments):
experiment_name = nil
comments = nil
executable = "./nemolite2d.exe"
external_libraries = {}
run_command = "<executable>"
batch_script = "maqao.lsf"
batch_command = "bsub < <batch_script>"
number_processes = 1
number_nodes = 1
number_processes_per_node = 1
mpi_command = "mpirun -n <number_processes>"
filter = {
type = "number",
value = 15,
}
-- !! Uncomment the table you want to use !!
-- If the profiling should begin after a given percentage of the application time.
-- A first run of the application is automatically performed to time the application.
--profile_start = {
-- unit = "p", -- 'p' for 'percentage'
-- value = 30, -- delay in percentage of the total application time
--}
-- If the profiling should begin when the application is started
--profile_start = {
-- unit = "none", -- Specify that no delay is needed
--}
-- If the profiling should be triggered from the application using source probes
-- Use help command $maqao oneview --help-guided-profile
-- to learn how to add probes to the source code
--profile_start = {
-- unit = "probe", -- Uses LPROF probes
--}
-- If the profiling should begin after a given time (in second)
--profile_start = {
-- unit = "s", -- 's' for 'seconds'
-- value = 0, -- delay in seconds
--}
dataset = ""
dataset_handler= "link"
The results seem to match VTune's suggestions reasonably well - vectorisation is low:
Given this is essentially due to boundary-condition checks in loops, it seems hard to remove for now. Short of that, performance is pretty reasonable.
One loop that doesn't obviously look non-vectorisable is the continuity code loop:
rtmp1 = (sshn_u(ji,jj) + hu(ji,jj)) * un(ji,jj)
rtmp2 = (sshn_u(ji - 1,jj) + hu(ji - 1,jj)) * un(ji - 1,jj)
rtmp3 = (sshn_v(ji,jj) + hv(ji,jj)) * vn(ji,jj)
rtmp4 = (sshn_v(ji,jj - 1) + hv(ji,jj - 1)) * vn(ji,jj - 1)
ssha(ji,jj) = sshn(ji,jj) + (rtmp2 - rtmp1 + rtmp4 - rtmp3) * rdt / e12t(ji,jj)
Most of the accesses here are reasonably strided, so I'd think it could be vectorised fairly efficiently; however, something is preventing the compiler from working that out. Optimisation report:
remark #15329: vectorization support: non-unit strided store was emulated for the variable <at (149:7)>, stride is unknown to compiler [ psy.f90(149,7) ]
remark #15328: vectorization support: non-unit strided load was emulated for the variable <at (149:7)>, stride is unknown to compiler [ psy.f90(149,7) ]
(this load remark is repeated 14 times in the report, once for each array load in the loop)
I might try enabling -ipo and see if that affects it at all. Edit: -ipo doesn't help; not sure why.
Edit 2: This comes from not telling the compiler that these arrays are contiguous. @sergisiso Should these arrays be contiguous / is it reasonable to make that assumption in PSyclone/for NemoLite? Or is this something we can detect?
If I declare that, then we hit unaligned access issues, which I think require compiler-specific directives to solve :/. But the optimisation report goes from estimating a 1.01x speedup from vectorising that loop to 2.53x.
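For illustration, the sort of declaration change I mean (a sketch only, not the actual NEMOLite2D/PSyclone-generated source; the argument list and kinds are invented for the example):
subroutine continuity_sketch(ji, jj, rdt, ssha, sshn, sshn_u, sshn_v, hu, hv, un, vn, e12t)
  ! CONTIGUOUS on an assumed-shape dummy tells the compiler the innermost
  ! stride is 1, so it can emit unit-stride vector loads/stores instead of the
  ! emulated "unknown stride" ones in the report above. If a non-contiguous
  ! array section is ever passed, the compiler has to make a temporary copy.
  integer, intent(in) :: ji, jj
  real(kind=8), intent(in) :: rdt
  real(kind=8), dimension(:,:), intent(in), contiguous :: sshn, sshn_u, sshn_v, hu, hv, un, vn, e12t
  real(kind=8), dimension(:,:), intent(inout), contiguous :: ssha
  real(kind=8) :: rtmp1, rtmp2, rtmp3, rtmp4

  rtmp1 = (sshn_u(ji,jj) + hu(ji,jj)) * un(ji,jj)
  rtmp2 = (sshn_u(ji-1,jj) + hu(ji-1,jj)) * un(ji-1,jj)
  rtmp3 = (sshn_v(ji,jj) + hv(ji,jj)) * vn(ji,jj)
  rtmp4 = (sshn_v(ji,jj-1) + hv(ji,jj-1)) * vn(ji,jj-1)
  ssha(ji,jj) = sshn(ji,jj) + (rtmp2 - rtmp1 + rtmp4 - rtmp3) * rdt / e12t(ji,jj)
end subroutine continuity_sketch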
Tracer Advection: the setup is roughly the same.
I didn't recompile with -qopt-zmm-usage=high for this, which is one of its primary suggestions (mostly because I'd have to remove the whole folder to rebuild it). That might help.
The other issue is again unaligned memory accesses (or the compiler not knowing whether memory is aligned), at least according to the optimisation report.
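For reference, the sort of compile line that would test that suggestion (a sketch only; the file name is a placeholder and the benchmark's real build invocation will differ):
ifort -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopt-report=5 -qopenmp -c psy.f90
-qopt-zmm-usage=high asks the Intel compiler to prefer 512-bit zmm registers on AVX-512 targets, and -qopt-report=5 writes the vectorisation decisions to an .optrpt file so the before/after can be compared.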
Also ran a scalability run with it and NemoLite2D:
- r0: I don't know what ran - an error on my side, I assume
- r1: 1 rank, 32 OMP threads
- r2: 32 ranks, 1 OMP thread
- r3: 2 ranks, 16 OMP threads
- r4: 2 ranks, 32 OMP threads (2 nodes)
- r5: 4 ranks, 16 OMP threads (2 nodes)
NemoLite clearly scales much better with MPI only than with OpenMP.
I do note, however, that this must have been with a very tiny example, so I will try to change it.
The same-sized example as the VTune results shows slightly better overall scaling with OpenMP:
I think r0 is meant to be the base case? I'm not sure.
This comes from not telling the compiler that these arrays are contiguous
Is this an artefact of the way we access the arrays from within a field object in NEMOLite2D? Now that I come to think about it, we did quite an in-depth look at the performance of NEMOLite2D in a previous paper. I'll see if I can find it.
Also, which version of the OMP-enabled build are you running with (as I think there are different options)?
Is this an artefact of the way we access the arrays from within a field object in NEMOLite2D? Now that I come to think about it, we did quite an in-depth look at the performance of NEMOLite2D in a previous paper. I'll see if I can find it.
Yes I believe so, though I think it might just be any time you access an array inside a function, since in Fortran the compiler doesn't know whether it has been passed in as fullarray or as array_section(1:3, 4:5), for example.
I think I'm just doing make nemolite2d_omp and make nemolite2d_hybrid.
That paper is here: https://gmd.copernicus.org/articles/11/3447/2018/gmd-11-3447-2018.pdf
I figured I'd make an issue to detail results with the different tools we're using as part of the PA workshop.
We'll probably look at both NemoLite (incl. distributed memory) and tracer advection.
The first tool is VTune, and I'll post (and update) things as I get results. @vkatkinson if you also have results to share, it would be good to place them here too.