kokkos / mdspan

Reference implementation of mdspan targeting C++23

Vectorization issue Intel compiler #308

Open EnricoDeg opened 10 months ago

EnricoDeg commented 10 months ago

Hello, I'm having a problem vectorizing this very simple loop, where dolic_c and tke_Lmix are mdspan objects:

for (int jc = start_index; jc <= end_index; jc++)
    tke_Lmix(blockNo, dolic_c(blockNo, jc), jc) = 0.0;

I'm using the Intel compiler, and the optimization report gives the following output:

   remark #15389: vectorization support: reference dolic_c.__members.__members[blockNo*dolic_c.__members.__members.__members.__members.__members.__members.__members[1]+jc] has unaligned access   [ /home/k/k202136/YAOP/src/backends/CPU/cpu_kernels.hpp(68,38) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe>, part of index is read from memory   [ /home/k/k202136/YAOP/src/backends/CPU/cpu_kernels.hpp(68,21) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.049
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15463: unmasked indexed (or scatter) stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 11
   remark #15477: vector cost: 25.750
   remark #15478: estimated potential speedup: 0.420
   remark #15488: --- end vector cost summary ---
   remark #25439: unrolled with remainder by 2

I think this is simple enough that it should be handled by the compiler. I also don't understand the "irregularly indexed store" issue mentioned in the report.

crtrott commented 10 months ago

So it looks like you are using a 2D view to index into a 3D view: that is where it goes off the rails. If we break it down into something simpler (not using mdspan), you get this:

int* idx = ....;
int N = ...;
int block_offset_idx = ...;
int block_offset_a = ...;

for (int jc = start; jc <= end; jc++)
    a[block_offset_a + N * idx[block_offset_idx + jc] + jc] = 0.0;

So idx[block_offset_idx + jc] is read from memory, and that is what they consider an irregular index access. It is not clear that this loop can be vectorized with a non-scatter store. The compiler actually identified everything correctly: it sees a unit-stride load for dolic_c(blockNo, jc) and then correctly says that you have an indexed (scatter) store into tke_Lmix. Using scatter stores here is (depending on the architecture) more expensive than simply executing this loop in scalar. That is what the compiler thinks would happen, so it will not vectorize the code for you unless you tell it to do so even in cases where the compiler estimates that it would slow your code down (with -vec-threshold0).

EnricoDeg commented 10 months ago

Okay, thanks for the explanation. But in principle the loop is over jc, and the iteration is over a contiguous part of memory, so shouldn't it vectorize anyway? And shouldn't the vectorization be better than scalar execution?

EnricoDeg commented 10 months ago

Using the option you suggested, the loop is vectorized, but with

estimated potential speedup: 0.490

which is bad. However, when I try something like this:

for (int jc = start_index; jc <= end_index; jc++) {
    tke_Lmix(blockNo, 0, jc) = 0.0;
    tke_Lmix(blockNo, dolic_c(blockNo, jc), jc) = 0.0;
}

I get the following output from the compiler reports:

   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (66:21) and tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (67:21)
   remark #15346: vector dependence: assumed OUTPUT dependence between tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (67:21) and tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (66:21)
   remark #25439: unrolled with remainder by 2

Do you have any other suggestions?

crtrott commented 10 months ago

> Okay, thanks for the explanation. But in principle the loop is over jc, and the iteration is over a contiguous part of memory, so shouldn't it vectorize anyway? And shouldn't the vectorization be better than scalar execution?

It is not over a contiguous part of memory, because you don't just change the last index in every iteration; you also change the middle index of tke_Lmix in every iteration.

Consider this:

blockNo = 0;

dolic_c(0, 0) = 0;
dolic_c(0, 1) = 1;

// then:

&tke_Lmix(0, dolic_c(0, jc), jc) == tke_Lmix.data() + jc * tke_Lmix.extent(2) + jc;

So if you increment jc by 1, your memory location jumps by tke_Lmix.extent(2) + 1.

Adding a second statement in there also adds a potential OUTPUT dependence, because dolic_c(blockNo, jc) could be zero. In that case your second statement overwrites the first one within an iteration, and if the loop gets vectorized there is a potential ordering issue. The compiler is not clever enough to realize that you are writing 0 in both cases, so it wouldn't matter which store wins.

EnricoDeg commented 10 months ago

Okay, this makes sense. Thanks for the explanation!