LonelyCat124 opened this issue 1 year ago
Tracer Advection + VTune: Problem size:
This seems to be primarily memory bound, with cross-NUMA accesses making up a significant fraction of memory accesses. I'll show results here with both proc-bind settings. I disabled the checksum and IO routine at the end of the simulation as it massively warps the results.
export OMP_PLACES=cores
export OMP_PROC_BIND=close/spread
close proc bind:
A couple of things I note for this:
- NUMA remote accesses are very high (66%).
- Parallel region time is good (99.4%), mostly with high physical core utilisation.
- I compiled this example with -xCORE-AVX512, but vectorisation is still only 256-bit; I'm not sure why.
- CPI is 7.843, which is pretty bad since in theory I think we could be looking at <1.
spread proc bind:
It looks like either way the NUMA behaviour is not good, which may be why we're only seeing 256-bit vectorisation: perhaps the compiler doesn't see wider vectors as worthwhile. I will try a memory analysis with close proc bind and see what we get.
Looking at the memory analysis, it seems that loops which access the mydomain array have higher average latency (in cycles) than other loops, though it's not clear why that might be. The worst offender is:
!$omp parallel default(shared), private(idx_12,idx_13,jk)
!$omp do schedule(auto)
do jk = 2, jpk - 1, 1
  do idx_12 = LBOUND(zwx, 2), UBOUND(zwx, 2), 1
    do idx_13 = LBOUND(zwx, 1), UBOUND(zwx, 1), 1
      zwx(idx_13,idx_12,jk) = tmask(idx_13,idx_12,jk) * (mydomain(idx_13,idx_12,jk - 1) - mydomain(idx_13,idx_12,jk))
    enddo
  enddo
enddo
!$omp end do
!$omp end parallel
The reported average latency for this loop is 1288 cycles (the best loop averages 167). The next worst averages 1030:
!$omp parallel default(shared), private(ji,jj,jk,ztra)
!$omp do schedule(auto)
do jk = 1, jpk - 1, 1
  do jj = 2, jpj - 1, 1
    do ji = 2, jpi - 1, 1
      ztra = -zbtr * (zwx(ji,jj,jk) - zwx(ji,jj,jk + 1))
      mydomain(ji,jj,jk) = ztra
    enddo
  enddo
enddo
!$omp end do
!$omp end parallel
Both loops have relatively low arithmetic intensity, but are by no means the lowest.
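For a rough sense of scale (a back-of-the-envelope estimate of mine, assuming 8-byte reals and no cache reuse between jk planes), the worst loop above does about:
2 flops per point (one subtract, one multiply)
4 accesses x 8 bytes = 32 bytes per point (tmask, two mydomain values, the zwx store)
arithmetic intensity ~ 2 / 32 ~ 0.06 flops/byte
i.e. far more data moved than arithmetic done, which is consistent with the memory-bound behaviour above.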
I'm still not clear why we have 63% remote accesses: the arrays are sensibly initialised, so the only option is if schedule(auto) does something unusual. I will try changing to a static schedule and see how that affects the memory access patterns.
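For concreteness, the change for the worst loop above would only touch the schedule clause; a sketch of what I mean (not generator output):
!$omp parallel default(shared), private(idx_12,idx_13,jk)
!$omp do schedule(static)
do jk = 2, jpk - 1, 1
  do idx_12 = LBOUND(zwx, 2), UBOUND(zwx, 2), 1
    do idx_13 = LBOUND(zwx, 1), UBOUND(zwx, 1), 1
      ! Same body as before; with a static schedule each thread gets the same
      ! contiguous block of jk iterations every time-step, so pages placed by
      ! first touch should stay local to the thread that keeps using them.
      zwx(idx_13,idx_12,jk) = tmask(idx_13,idx_12,jk) * (mydomain(idx_13,idx_12,jk - 1) - mydomain(idx_13,idx_12,jk))
    enddo
  enddo
enddo
!$omp end do
!$omp end parallel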
Switching from schedule(auto) to schedule(static) completely changes the memory analysis. Cross-NUMA memory accesses drop to almost 0:
The LLC miss rate increases slightly, but the worst loop's average latency drops by almost half, to 759 cycles. I'm going to rerun the base analysis now to see how the CPI changes from this too.
CPI is down to 5.005 with this change; I'll see how the actual runtime changed.
Runtime with static:
Thread 0
Initialisation 1 0.31572E+00 0.31572E+00 0.31572E+00 0.00E+00
Time-stepping 1 0.90016E+02 0.90016E+02 0.18003E+00 0.00E+00
Runtime with auto:
Thread 0
Initialisation 1 0.32153E+00 0.32153E+00 0.32153E+00 0.00E+00
Time-stepping 1 0.13415E+03 0.13415E+03 0.26830E+00 0.00E+00
Looks like swapping from auto to static with the Intel compiler is roughly a 40% runtime improvement for this case (time-stepping 90.0 s vs 134.2 s) @sergisiso
I'm going to see if I can generate a roofline next.
Tracer advection roofline: this matches exactly what VTune shows. The tracer advection benchmark is DRAM limited; the main performance improvement would be finding a way to improve cache efficiency or raise the arithmetic intensity, but given the nature of the problem that seems unlikely.
The largest nodes are the loops containing:
z0u = SIGN(0.5d0, pun(ji,jj,jk))
and
z0w = SIGN(0.5d0, pwn(ji,jj,jk + 1))
The second of these also shows some gap between its achieved performance and the roofline. That second loop is the one with peel and remainder loops; I expect it is hard for PSyclone to do much about that for now, though.
Which version of PSyclone are you using: master or the latest release? What script do you use for generation?
I thought in master we generate no schedule clause by default instead of auto (which we did for a short period).
To change it you can use omp_loop_trans.omp_schedule = "none" or "static". By "none" we mean that we don't want a clause at all (which is implementation defined, and different compilers do different things depending on the architecture).
I'm on master (downloaded yesterday), with the default make tra_adv_omp_cpu script.
From looking in PSyclone, OMPLoopTrans has this init signature:
def __init__(self, omp_directive="do", omp_schedule="auto"):
so I guess the default here is auto, even though for the directive it's set to none.
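For reference, overriding that default in a transformation script looks something like the following (a minimal sketch using only the attribute quoted above; the rest of the generation script is assumed unchanged):
from psyclone.transformations import OMPLoopTrans

# Defaults to omp_schedule="auto" as per the __init__ shown above.
omp_loop_trans = OMPLoopTrans()
# Override before applying: "static" forces a static schedule, while "none"
# omits the clause entirely and leaves the choice to the OpenMP runtime.
omp_loop_trans.omp_schedule = "static"
# ...then apply it to each target loop as the existing script already does.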
NemoLite2D has very different characteristics as far as I can tell.
Problem size: 4096x4096, 1k timesteps.
I tried to remove a lot of the serial writes and similar, but there still appears to be a lot present.
I need to dive down a bit more into this at some point, but:
From looking at the threading, I'm curious to see how the taskloop version compares. I think most of the imbalance is due to the NUMA accesses in the heavier kernels, but the imbalance per loop per timestep is probably low (~3 ms); I wonder whether tasking has lower imbalance, and at what compute cost (see the sketch below).
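What I have in mind by the taskloop version is roughly the following shape (an illustrative sketch only, not PSyclone-generated code; some_kernel and the loop bounds are placeholders):
!$omp parallel default(shared), private(ji, jj)
!$omp single
!$omp taskloop grainsize(4) private(ji)
do jj = 2, jpj - 1
  do ji = 2, jpi - 1
    ! Placeholder for one of the heavier kernel bodies; grainsize controls how
    ! many jj iterations go into each task, trading balance against overhead.
    call some_kernel(ji, jj)
  end do
end do
!$omp end taskloop
!$omp end single
!$omp end parallel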
OK - I set up MAQAO on Scafellpike, and I'm testing it with NemoLite2D at the moment.
Command used: maqao.intel64 oneview -R1 --config=nemolite_ov_bsub.lua -xp=ov_nemolite
scripts to run:
lsf file
#BSUB -J test
#BSUB -e test.err
#BSUB -o test.out
#BSUB -x
#BSUB -n <number_nodes>
#BSUB -R "span[ptile=<number_processes_per_node>]"
#BSUB -W 01:00
export OMP_NUM_THREADS=32
export OMP_PLACES=cores
export OMP_PROC_BIND=close
<mpi_command> <run_command>
MAQAO lua file (minus comments):
experiment_name = nil
comments = nil
executable = "./nemolite2d.exe"
external_libraries = {}
run_command = "<executable>"
batch_script = "maqao.lsf"
batch_command = "bsub < <batch_script>"
number_processes = 1
number_nodes = 1
number_processes_per_node = 1
mpi_command = "mpirun -n <number_processes>"
filter = {
type = "number",
value = 15,
}
-- !! Uncomment the table you want to use !!
-- If the profiling should begin after a given percentage of the application time.
-- A first run of the application is automatically performed to time the application.
--profile_start = {
-- unit = "p", -- 'p' for 'percentage'
-- value = 30, -- delay in percentage of the total application time
--}
-- If the profiling should begin when the application is started
--profile_start = {
-- unit = "none", -- Specify that no delay is needed
--}
-- If the profiling should be triggered from the application using source probes
-- Use help command $maqao oneview --help-guided-profile
-- to learn how to add probes to the source code
--profile_start = {
-- unit = "probe", -- Uses LPROF probes
--}
-- If the profiling should begin after a given time (in second)
--profile_start = {
-- unit = "s", -- 's' for 'seconds'
-- value = 0, -- delay in seconds
--}
dataset = ""
dataset_handler= "link"
The results seem to match VTune's suggestions reasonably well - vectorisation is low:
Given this is essentially due to boundary-condition checks in loops, it seems hard to remove for now. Short of that, performance is pretty reasonable.
One loop that doesn't obviously look non-vectorisable is the continuity code loop:
rtmp1 = (sshn_u(ji,jj) + hu(ji,jj)) * un(ji,jj)
rtmp2 = (sshn_u(ji - 1,jj) + hu(ji - 1,jj)) * un(ji - 1,jj)
rtmp3 = (sshn_v(ji,jj) + hv(ji,jj)) * vn(ji,jj)
rtmp4 = (sshn_v(ji,jj - 1) + hv(ji,jj - 1)) * vn(ji,jj - 1)
ssha(ji,jj) = sshn(ji,jj) + (rtmp2 - rtmp1 + rtmp4 - rtmp3) * rdt / e12t(ji,jj)
Most of the accesses here are reasonably strided, so I'd think it could be vectorised fairly efficiently; however, something is preventing the compiler from working that out. Optimisation report:
remark #15329: vectorization support: non-unit strided store was emulated for the variable <at (149:7)>, stride is unknown to compiler [ psy.f90(149,7) ]
remark #15328: vectorization support: non-unit strided load was emulated for the variable <at (149:7)>, stride is unknown to compiler [ psy.f90(149,7) ]
(this load remark is repeated 14 times in the report, once for each array load in the loop)
I might try enabling -ipo and see if that affects it at all. Edit: -ipo doesn't help; not sure why.
Edit 2: This comes from not telling the compiler that these arrays are contiguous. @sergisiso Should these arrays be contiguous / is it reasonable to make that assumption in PSyclone/for NemoLite? Or is this something we can detect?
If I declare that, then we hit unaligned access issues, which I think require compiler-specific directives to solve :/. But the optimisation report goes from estimating a 1.01x speedup from vectorising that loop to 2.53x.
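For illustration, the sort of declaration change I mean (a sketch only, not the actual NEMOLite2D/PSyclone-generated source; the argument list and kinds are invented for the example):
subroutine continuity_sketch(ji, jj, rdt, ssha, sshn, sshn_u, sshn_v, hu, hv, un, vn, e12t)
  ! CONTIGUOUS on an assumed-shape dummy tells the compiler the innermost
  ! stride is 1, so it can emit unit-stride vector loads/stores instead of the
  ! emulated "unknown stride" ones in the report above. If a non-contiguous
  ! array section is ever passed, the compiler has to make a temporary copy.
  integer, intent(in) :: ji, jj
  real(kind=8), intent(in) :: rdt
  real(kind=8), dimension(:,:), intent(in), contiguous :: sshn, sshn_u, sshn_v, hu, hv, un, vn, e12t
  real(kind=8), dimension(:,:), intent(inout), contiguous :: ssha
  real(kind=8) :: rtmp1, rtmp2, rtmp3, rtmp4

  rtmp1 = (sshn_u(ji,jj) + hu(ji,jj)) * un(ji,jj)
  rtmp2 = (sshn_u(ji-1,jj) + hu(ji-1,jj)) * un(ji-1,jj)
  rtmp3 = (sshn_v(ji,jj) + hv(ji,jj)) * vn(ji,jj)
  rtmp4 = (sshn_v(ji,jj-1) + hv(ji,jj-1)) * vn(ji,jj-1)
  ssha(ji,jj) = sshn(ji,jj) + (rtmp2 - rtmp1 + rtmp4 - rtmp3) * rdt / e12t(ji,jj)
end subroutine continuity_sketch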
Tracer Advection: the setup is roughly the same.
I didn't recompile with -qopt-zmm-usage=high for this, which is one of its primary suggestions (mostly because I'd have to remove the whole folder to rebuild it). That might help.
The other issue is again unaligned memory accesses (or the compiler not knowing whether memory is aligned), at least according to the optimisation report.
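For reference, the sort of compile line that would test that suggestion (a sketch only; the file name is a placeholder and the benchmark's real build invocation will differ):
ifort -O3 -xCORE-AVX512 -qopt-zmm-usage=high -qopt-report=5 -qopenmp -c psy.f90
-qopt-zmm-usage=high asks the Intel compiler to prefer 512-bit zmm registers on AVX-512 targets, and -qopt-report=5 writes the vectorisation decisions to an .optrpt file so the before/after can be compared.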
Also ran a scalability run with it and NemoLite2D:
- r0: I don't know what ran - an error on my side, I assume
- r1: 1 rank, 32 OMP threads
- r2: 32 ranks, 1 OMP thread
- r3: 2 ranks, 16 OMP threads
- r4: 2 ranks, 32 OMP threads (2 nodes)
- r5: 4 ranks, 16 OMP threads (2 nodes)
NemoLite clearly scales much better with MPI only than with OpenMP.
I do note, however, that this must have been with a very tiny example, so I will try to change it.
The same-sized example as the VTune results shows slightly better overall scaling with OpenMP:
I think r0 is meant to be the base case? I'm not sure.
This comes from not telling the compiler that these arrays are contiguous
Is this an artefact of the way we access the arrays from within a field object in NEMOLite2D? Now that I come to think about it, we did quite an in-depth look at the performance of NEMOLite2D in a previous paper. I'll see if I can find it.
Also, which version of the OMP-enabled build are you running with (as I think there are different options)?
Is this an artefact of the way we access the arrays from within a field object in NEMOLite2D? Now that I come to think about it, we did quite an in-depth look at the performance of NEMOLite2D in a previous paper. I'll see if I can find it.
Yes I believe so, though I think it might just be any time you access an array inside a function, since in Fortran the compiler doesn't know whether it has been passed in as fullarray or as array_section(1:3, 4:5), for example.
I think I'm just doing make nemolite2d_omp and make nemolite2d_hybrid.
That paper is here: https://gmd.copernicus.org/articles/11/3447/2018/gmd-11-3447-2018.pdf
I figured I'd make an issue to detail results with the different tools we're using as part of the PA workshop.
We'll probably look at both NemoLite (incl. distributed memory) and tracer advection.
The first tool is VTune, and I'll post (and update) things as I get results. @vkatkinson if you also have results to share, it would be good to place them here too.