stfc / PSycloneBench

Various benchmarks used to inform PSyclone optimisations
BSD 3-Clause "New" or "Revised" License

Confused about results on SFP due to vectorisation #98

Open LonelyCat124 opened 7 months ago

LonelyCat124 commented 7 months ago

I've benchmarked the OpenMP task, OpenMP loop and serial versions of the code now on SFP. All results are with -O3 -g -xCORE_AVX512 -fno-omit-frame-pointer -no-inline-min-size -no-inline-max-per-compile -no-inline-factor -qopt-report=5 -qopt-report-phase=loop,vec

The full-node results are mostly uninteresting/expected (OpenMP loop scales better than OpenMP task, but node performance is similar enough, and with MPI the story changes more), but I get very different results at low/serial thread counts. Results for 1 thread, 2048x2048, 100 iterations:

Parallel option    Runtime (s)
OpenMP loop        62.3520
OpenMP task        31.711
Serial             ~60

The checksums are identical (so I assume correct).

My original conclusion was that one of the "additional" transformations (InlineTrans, ChunkLoopTrans) applied in the OpenMP task version must be responsible, so I applied both of these transformations to the serial version, but the runtime didn't improve.

I then checked both versions with VTune, and what appears to happen is that the compiler decides to vectorise all (or most) of the loops in the OpenMP task version (VTune reports 99.6% of FP operations as "Packed"), whereas the other versions show only 0.9% packed FP operations.

Looking at the opt report for the task version, this is reflected there: it complains about all the unaligned accesses, but then determines that vectorising these loops is both possible and performant (although it only expects ~1.3x for the two momentum loops, and more for the other loops).

The compiler output for the (chunked) momentum loops instead says: loop was not vectorized: loop with multiple exits cannot be vectorized unless it meets search loop idiom criteria [ psy.f90(148,11) ]
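
For reference, that diagnostic is what ifort emits when a loop has an early exit; a minimal, made-up example (nothing to do with this benchmark) of the kind of loop the message normally refers to:

      ! Illustration only (not benchmark code): a loop with an early EXIT has
      ! "multiple exits"; the vectoriser rejects it unless it matches the
      ! compiler's "search loop" idiom (find the first element meeting a test).
      INTEGER FUNCTION first_above(a, n, threshold)
        INTEGER, INTENT(IN) :: n
        REAL(KIND=8), INTENT(IN) :: a(n), threshold
        INTEGER :: i
        first_above = -1
        DO i = 1, n
          IF (a(i) > threshold) THEN
            first_above = i
            EXIT
          END IF
        END DO
      END FUNCTION first_above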

This is strange, because the code in the vectorising, task-based version is identical apart from a giant task directive inserted after the j_el_inner computation in this code:

      DO j_out_var = ua%internal%ystart, ua%internal%ystop, 32
        j_el_inner = MIN(j_out_var + (32 - 1), ua%internal%ystop)
        ! TASK DIRECTIVE GOES HERE
        DO j = j_out_var, j_el_inner, 1
          DO i = ua%internal%xstart, ua%internal%xstop, 1

Essentially we're failing to vectorise code that the compiler sometimes believes it is able to vectorise, depending on the surrounding statements. It would be interesting to see whether an OpenMP loop directive also frees the compiler up to make this choice, but alas I cannot test that yet.
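
For concreteness, the shape with the task construct in place is roughly the following. This is illustrative only: the clause lists are invented (the field names in the depend clauses are guesses), and the directive PSyclone actually generates is much larger, hence "giant":

      DO j_out_var = ua%internal%ystart, ua%internal%ystop, 32
        j_el_inner = MIN(j_out_var + (32 - 1), ua%internal%ystop)
        ! Illustrative clause lists only - not what PSyclone emits.
        !$omp task private(i, j) firstprivate(j_out_var, j_el_inner) &
        !$omp&     shared(ua, un, vn) &
        !$omp&     depend(in: un, vn) depend(inout: ua)
        DO j = j_out_var, j_el_inner, 1
          DO i = ua%internal%xstart, ua%internal%xstop, 1
            ! ... momentum kernel body as in the original loop nest ...
          END DO
        END DO
        !$omp end task
      END DO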

@arporter @sergisiso any ideas?

LonelyCat124 commented 7 months ago

I'll also try running the manual OpenMP and manual serial versions with this newer compiler and see how those compare.

arporter commented 7 months ago

That is weird. I'd have put money on it being the inlining transformation as, previously, we've found that to be essential to recover performance (compared to an ancient, 'original' version of the code before it was split into kernels). Presumably the compiler is doing a better job now if inlining doesn't help?

LonelyCat124 commented 7 months ago

I tested the manual OpenMP version (Fortran): on a single thread it gets similar performance (slightly quicker, ~27-33s, though there was quite a bit of variation). On 32 threads, however, it's slower than the PSyclone-generated version (~3.5 seconds vs 3.2 seconds with the PSyclone-generated code).

I guess this is due to the compiler somehow knowing to vectorise things if there's a pragma in between (presumably the compiler does something different during code generation before vectorisation), whereas the manual version uses !dir$ vector always.
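
For reference, a self-contained sketch (placeholder subroutine, array names and kernel body, not the actual benchmark code) of how that hint sits on the inner loop in the manual version:

      SUBROUTINE smooth(ua, un, xstart, xstop, ystart, ystop)
        ! Placeholder kernel: shows directive placement only.
        REAL(KIND=8), DIMENSION(:, :), INTENT(INOUT) :: ua
        REAL(KIND=8), DIMENSION(:, :), INTENT(IN)    :: un
        INTEGER, INTENT(IN) :: xstart, xstop, ystart, ystop
        INTEGER :: i, j

        DO j = ystart, ystop
          ! Intel-specific hint: vectorise the i loop even if the cost model
          ! thinks it isn't worthwhile (the compiler must still prove it safe).
          ! A portable alternative is "!$omp simd", which additionally asserts
          ! that the iterations are independent.
          !dir$ vector always
          DO i = xstart, xstop
            ua(i, j) = ua(i, j) + 0.5d0 * un(i, j)
          END DO
        END DO
      END SUBROUTINE smooth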

I'm curious as to what limits the performance of the versions that vectorise to being similar to (or worse than) the non-vectorised versions; perhaps clock-speed reduction or load imbalance is worsened by vectorisation, or something like that.

arporter commented 7 months ago

I would also have expected us to be at the memory-bandwidth limit and thus vectorisation wouldn't help very much. What problem size are you running?

LonelyCat124 commented 7 months ago

I'm running the 2048x2048 problem that was used in the benchmarks Sergi showed me. I guess in serial we won't max out the memory bandwidth, so vectorisation is useful there, which might also be why the non-vectorised version ends up performing better at 32 threads.

This also means that, when scaling with OpenMP + MPI at 8 threads per MPI rank (which seems to be about the best of 8/16/32 for both the tasking and non-tasking versions), the tasking version scales to multiple nodes better, but I think this is all just an artefact of the other behaviour. Filling 2 nodes I get:

OpenMP loop: 0.014492 s per timestep (averaged across ranks)
OpenMP task: 0.009157 s per timestep

The min and max rank times are also faster with tasking.

I don't remember how to profile OpenMP imbalance (especially when running with MPI) - the performance gains at those rank counts could be down to better load balance inside ranks.

sergisiso commented 7 months ago

I guess in serial we won't max out the memory bandwidth, so vectorisation is useful there, which might also be why the non-vectorised version ends up performing better at 32 threads.

My impression is that all of them max out memory bandwidth. Can you do the runs with -no-vec and see what the effect of vectorisation alone is for each version?

-qopt-report-phase=loop,vec

Is there a report-phase for inlining? In the past I used report level 5, which prints a line saying whether or not each call was inlined.

LonelyCat124 commented 7 months ago

So I added -no-vec. The tasking version (single thread) is now closer to 60 seconds (~58s) vs ~65s for the loop-parallelised version with -no-vec. On 32 threads, the task version is ~3.5s (faster than the results I have with vectorisation, which were closer to 3.8s) vs ~3.15s with the loop-parallelised version.

The inlining report details suggest calls are being inlined for the loop-parallelised version: [inlining report screenshot]

LonelyCat124 commented 7 months ago

So the summary seems to be: vectorisation is good when the number of OpenMP threads is low, while -no-vec is better when scaling OpenMP to a full node. I'm not quite sure I can explain why 8 OpenMP threads + 4 MPI ranks means that vectorisation is good, however.

The best MPI or hybrid performance comes from MPI-only with vectorisation (I used the tasking version, compiled as before but run MPI-only with 1 OpenMP thread): this was 0.008299 s/step (mean across ranks) on 2 nodes. With 8 MPI ranks and 8 OpenMP threads per rank, the tasking version with vectorisation could only achieve 0.009157 s/step on 2 nodes; without vectorisation it was 0.014 s/step (and going to 4x16 was slower still). The PSyclone-generated, OpenMP-loop-parallelised hybrid version was also at ~0.014 s/step at 8x8.

LonelyCat124 commented 7 months ago

Adding inlining to the OpenMP loop code with vectorisation results in a runtime of 25.6s for a single thread, which is the fastest we have. I'm not sure why it is so much faster, since it should be almost the same code as the task version, but I'm unsure what the task version does under the hood compared to the loop-parallelised version. I guess the overhead of dependency computation may be part of it.

arporter commented 7 months ago

Adding inlining to the OpenMP loop code

As in, PSyclone doing it or asking the compiler to do it?

LonelyCat124 commented 7 months ago

PSyclone doing it.