FreddieWitherden opened this issue 2 years ago
Consider the following SGEMM kernel, which computes C = A*B where A is pre-tiled and B and C are row-major. For the most part the kernel does everything right: SIMD-8 with float4, with all loads and stores being 128b. Out of our 128 registers, 64 are accumulators.
#pragma OPENCL EXTENSION cl_intel_subgroups : enable

#define block_read4f(p) as_float4(intel_sub_group_block_read4((__global uint *) p))
#define block_write4f(p, v) intel_sub_group_block_write4((__global uint *) p, as_uint4(v))

__attribute__((intel_reqd_sub_group_size(8)))
__kernel void
mm(__global const float* restrict a,
   __global const float* restrict b,
   __global float* restrict c,
   int m, int n, int k,
   int lda, int ldb, int ldc)
{
    // Each thread does sixteen rows
    int g_row = 16*get_global_id(1);

    // Each sub-group of eight threads does 32 columns
    int g_col = 32*get_global_id(0) / 8;

    // Pre-displace our arguments
    a += g_row*lda;
    b += g_col;
    c += g_row*ldc + g_col;

    float4 a_sub[16], b_sub[4], c_acc[16], temp;

    for (int i = 0; i < 16; i++)
        c_acc[i] = 0;

    for (int tk = 0; tk < k / 4; tk++, a += 32)
    {
        temp = block_read4f(a);

        #pragma unroll
        for (int i = 0; i < 8; i++)
            a_sub[i] = intel_sub_group_shuffle(temp, i);

        temp = block_read4f(a + 8*lda);

        #pragma unroll
        for (int i = 0; i < 8; i++)
            a_sub[i + 8] = intel_sub_group_shuffle(temp, i);

        #pragma unroll
        for (int i = 0; i < 4; i++, b += ldb)
            b_sub[i] = block_read4f(b);

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s0*b_sub[0];

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s1*b_sub[1];

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s2*b_sub[2];

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s3*b_sub[3];
    }

    // Write out the result
    #pragma unroll
    for (int i = 0; i < 16; i++, c += ldc)
        block_write4f(c, c_acc[i]);
}
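For reference, the indexing above implies a launch of eight work-items per 32 columns along dimension 0 and one work-item per 16 rows along dimension 1. A minimal host-side sketch, assuming m and n are exact multiples of the tile sizes; launch_mm, queue, and mm_kernel are made-up names, and the local size shown is only illustrative (the tuned value is not stated in this thread):

#include <CL/cl.h>

// Hypothetical launch wrapper; requires m % 16 == 0 and n % 32 == 0 so the
// 16x32 tiles cover C exactly, with the queue and kernel created elsewhere.
void launch_mm(cl_command_queue queue, cl_kernel mm_kernel, int m, int n)
{
    size_t gws[2] = {(size_t) n / 4, (size_t) m / 16}; // 8*(n/32) lanes, m/16 row blocks
    size_t lws[2] = {64, 1};                           // illustrative; must be a multiple of 8
    clEnqueueNDRangeKernel(queue, mm_kernel, 2, NULL, gws, lws, 0, NULL, NULL);
}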
With the right local work group size the kernel performs well (~73% of theoretical peak); however, on my 96 EU TGL system performance is held back by the main loop. This consists of 6 loads (two from A and four from B), 6 pointer additions, 1 counter increment, 1 comparison, and 256 MADs, for 270 instructions in total; ideally, then, the loop should come out to fewer than 280 instructions. However, as can be seen from the attached assembly, compiled on a 64-bit Linux system, we are at ~325 instructions (label L1712 to label L4648), with much of the excess being ALU operations.
I am not sure what is tripping the compiler up; however, a few things stand out:

- We are using send.dc1 for the loads, which takes a 64-bit address. However, even though the documentation suggests we can do quadword addition, we are issuing a 32-bit addc, mov, add chain (see the sketch at the end of this post). I suspect this is also responsible for a bunch of other unnecessary mov's being issued.
- I would expect all of the pointer addition instructions to be (1|M0), but some are (8|M0).
- I am unsure where the csel's and the predications on the send instructions are coming from. Even if they are needed, I do not know why the flags need to be continually recomputed for each load.

Of course, all of this assumes that no funny games are required due to bank conflicts.

sgemm-16x4.asm.txt
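For what it's worth, the addc, mov, add chain in the first bullet amounts to the following (an illustrative OpenCL C sketch of 64-bit addition emulated in 32-bit halves; bump64 is a made-up name, and the real lowering happens inside IGC):

// Emulating ptr += 32 without native 64-bit integer adds
ulong bump64(ulong ptr)
{
    uint lo = (uint) ptr, hi = (uint)(ptr >> 32);
    uint new_lo = lo + 32u;            // addc: 32-bit add on the low half...
    uint carry  = (uint)(new_lo < lo); // ...with the carry out captured
    uint new_hi = hi + carry;          // add: fold the carry into the high half
    return ((ulong) new_hi << 32) | new_lo;
}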
Or due to register file fragmentation. Send requires a contiguous block of registers for its parameters (or two blocks, when it is split in two with the ext send message, though if I remember correctly using that consumes one extra register).
I think everything is fine there, at least for the loads from A and B. Or rather, I cannot think of a reason (beyond bank conflicts) why we can't assign, say, r0-r3 for A loading and r4-r19 for B loading. The accumulators can then go in any order we like, as the cost of defragmenting them at the end should be a rounding error for large values of k. (Indeed, the compiler does this in the current code, leading to the massive mov chain from label L4664; I do not know if this is a genius move to avoid bank conflicts in the main loop by using an odd arrangement of registers, or a simple missed trick.)
On TGL, there are two issues in that main loop body. First, TGL doesn't support native 64-bit integer instructions; L197 in your dumped assembly is one such emulated 64-bit integer operation. In addition, due to the layout difference between the send instruction and the emulated 64-bit integer, L209-L210 pack/unpack data into the proper layout. The other issue is that we have to generate extra code (like L208, L221-L223) to work around an HW issue specific to TGL. Overall, those two HW-related issues account for the additional instructions on TGL.
What is the trick used when we write to C to avoid the 64-bit integer instructions? I see it uses a slightly different instruction that does not have the a64 prefix. Can the read version of this instruction not be used for loading A and B?
Another source code trick is not to modify the input arguments a, b, and c, and instead to use a + offset for each block read/write. On TGL, this would likely get rid of some of the 64-bit address calculation.

Also, using uint for offsets/ids sometimes helps to get rid of 64-bit ALU operations.
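To make the suggestion concrete, here is a minimal sketch of the transformation for the A loads, using the names from the kernels in this thread (the B loads follow the same pattern with boff and ldb):

// Before: the pointer itself is advanced, so every bump is 64-bit arithmetic
for (int tk = 0; tk < k / 4; tk++, a += 32)
    temp = block_read4f(a);

// After: the base pointer stays fixed and a 32-bit uint offset is advanced,
// so the 64-bit pointer only appears in the final a + aoff
uint aoff = 0;
for (int tk = 0; tk < k / 4; tk++, aoff += 32)
    temp = block_read4f(a + aoff);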
Works a treat!
#pragma OPENCL EXTENSION cl_intel_subgroups : enable

#define block_read4f(p) as_float4(intel_sub_group_block_read4((__global uint *) p))
#define block_write4f(p, v) intel_sub_group_block_write4((__global uint *) p, as_uint4(v))

__attribute__((intel_reqd_sub_group_size(8)))
__kernel void
mm(__global const float* restrict a,
   __global const float* restrict b,
   __global float* restrict c,
   int m, int n, int k,
   int lda, int ldb, int ldc)
{
    // Each thread does sixteen rows
    uint g_row = 16*get_global_id(1);

    // Each sub-group of eight threads does 32 columns
    uint g_col = 32*get_global_id(0) / 8;

    // If we have no work to do then return
    if (g_col >= n || g_row >= m)
        return;

    // Displacements
    uint aoff = g_row*lda;
    uint boff = g_col;
    uint coff = g_row*ldc + g_col;

    float4 a_sub[16], b_sub[4], c_acc[16], temp;

    for (int i = 0; i < 16; i++)
        c_acc[i] = 0;

    for (int tk = 0; tk < k / 4; tk++, aoff += 32)
    {
        temp = block_read4f(a + aoff);

        #pragma unroll
        for (int i = 0; i < 8; i++)
            a_sub[i] = intel_sub_group_shuffle(temp, i);

        temp = block_read4f(a + aoff + 8*lda);

        #pragma unroll
        for (int i = 0; i < 8; i++)
            a_sub[i + 8] = intel_sub_group_shuffle(temp, i);

        #pragma unroll
        for (int i = 0; i < 4; i++, boff += ldb)
            b_sub[i] = block_read4f(b + boff);

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s0*b_sub[0];

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s1*b_sub[1];

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s2*b_sub[2];

        #pragma unroll
        for (int i = 0; i < 16; i++)
            c_acc[i] += a_sub[i].s3*b_sub[3];
    }

    // Write out the result
    #pragma unroll
    for (int i = 0; i < 16; i++, coff += ldc)
        block_write4f(c + coff, c_acc[i]);
}
With the assembly attached. The inner loop is now down to ~301 instructions. I'm unsure how many of them relate to hardware errata and whether anything smarter can be done there (I do not think such defects are public?).

We also still have some code duplication in the preamble: two different ways of zeroing 64 registers, one using immediates and the other register-to-register copies.

I ran that assembly through my old script that checks for register usage (liveness/fragmentation). It was written for GEN9 and needed some workarounds for the new IGC syntax, so I'm not sure whether it worked correctly for the current IGC/GEN12 output, but at least according to it there seems to be 20% free space in the register file.

I can believe that: we are at 94/128. On paper a 96-accumulator version is possible, specifically by loading in A piece-by-piece, whence we are at 123 registers. This does not give us much slack for the loop counter and other arguments, but by stashing them into specific lanes I am pretty sure I could make it work in assembly. (A variant with 12 A-load registers and 4 B-load registers is also possible if the register pressure really is too much.) We also have either two or eight general-purpose accumulator registers (the source code is unclear for TGL) which could be used if we're desperate.