stanford-cs149 / cs149gpt


Question about the impact of loop-reordering on matrix tiling #1

Open rmshin opened 9 months ago

rmshin commented 9 months ago

Hello, after completing the various parts of the cs149gpt assignment, I observed some interesting performance numbers for my solutions vs. the reference implementations that I'd like to understand better. If anybody could help provide more insight into why my code produces the metrics I'm seeing, that would be highly appreciated!

For the NAIVE ATTENTION implementation expected in part 1, I wrote fairly standard nested-loop code, the only exception being that I also applied loop re-ordering to ensure sequential memory access for the matrices manipulated within the innermost loops.

// loop over Batch Size
for (int b = 0; b < B; b++)
{
    // loop over Heads
    for (int h = 0; h < H; h++)
    {
        // calculate softmax(QK_t)
        for (int i = 0; i < N; i++)
        {
            float rowSum = 0.0;
            // calculate exp(QK_t)
            for (int j = 0; j < N; j++)
            {
                float val = 0.0;
                // sum dot product QK
                for (int k = 0; k < d; k++)
                {
                    float qVal = fourDimRead(Q, b, h, i, k, H, N, d);
                    float kVal = fourDimRead(K, b, h, j, k, H, N, d);
                    val += qVal * kVal;
                }
                val = std::exp(val);
                twoDimWrite(QK_t, i, j, N, val);
                rowSum += val;
            }
            // divide by rowSum
            for (int j = 0; j < N; j++)
            {
                float val = twoDimRead(QK_t, i, j, N) / rowSum;
                twoDimWrite(QK_t, i, j, N, val);
            }
        }
        // QK_t @ V and store in O
        for (int i = 0; i < N; i++)
        {
            for (int k = 0; k < N; k++)
            {
                float qkVal = twoDimRead(QK_t, i, k, N);
                // loops re-ordered so the inner loop sweeps the column index j, accessing both O & V sequentially along their rows
                for (int j = 0; j < d; j++)
                {
                    float oVal = fourDimRead(O, b, h, i, j, H, N, d);
                    float vVal = fourDimRead(V, b, h, k, j, H, N, d);
                    oVal += qkVal * vVal;
                    fourDimWrite(O, b, h, i, j, H, N, d, oVal);
                }
            }
        }
    }
}
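
For contrast, here is a sketch of the textbook i-j-k ordering for the same QK_t @ V product (my reconstruction for illustration, not the actual reference code). With k innermost, QK_t is still read sequentially, but each iteration jumps a full row of V (a stride of d floats), so for typical d it touches a new cache line on every step:

// textbook i-j-k ordering (illustrative sketch only)
for (int i = 0; i < N; i++)
{
    for (int j = 0; j < d; j++)
    {
        float oVal = 0.0;
        for (int k = 0; k < N; k++)
        {
            // V is accessed with stride d here -> poor spatial locality
            oVal += twoDimRead(QK_t, i, k, N) * fourDimRead(V, b, h, k, j, H, N, d);
        }
        fourDimWrite(O, b, h, i, j, H, N, d, oVal);
    }
}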

With the re-ordering in place, my version produces metrics that are significantly faster (~2x) than the reference:

Running Part 1 Test: Naive Unfused Attention                                                                                                                                                               

-----RUNNING REFERENCE IMPLEMENTATION-----                                                                                                                                                                        

STAGE:2024-01-02 04:19:25 1419:1419 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:19:25 1419:1419 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:19:25 1419:1419 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  0.20405173301696777 

-------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                           Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                    aten::empty         0.02%      34.000us         0.02%      34.000us      11.333us       5.00 Mb       5.00 Mb             3  
    REFERENCE - NAIVE ATTENTION        99.19%     202.432ms        99.98%     204.032ms     204.032ms       4.50 Mb      -1.00 Mb             1  
                    aten::zeros         0.01%      28.000us         0.48%     975.000us     487.500us       4.50 Mb           0 b             2  
                    aten::clone         0.02%      31.000us         0.28%     572.000us     286.000us       1.00 Mb           0 b             2  
                model_inference         0.02%      43.000us       100.00%     204.075ms     204.075ms     512.00 Kb      -4.00 Mb             1  
                  aten::flatten         0.02%      39.000us         0.20%     399.000us      79.800us     512.00 Kb           0 b             5  
               aten::empty_like         0.00%       6.000us         0.00%      10.000us      10.000us     512.00 Kb           0 b             1  
            aten::empty_strided         0.01%      15.000us         0.01%      15.000us      15.000us     512.00 Kb     512.00 Kb             1  
                    aten::zero_         0.01%      27.000us         0.45%     917.000us     458.500us           0 b           0 b             2  
                    aten::fill_         0.44%     890.000us         0.44%     890.000us     445.000us           0 b           0 b             2  
-------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 204.075ms

REFERENCE - NAIVE ATTENTION statistics
cpu time:  204.032ms
mem usage:  4718592 bytes

-----RUNNING STUDENT IMPLEMENTATION-----

STAGE:2024-01-02 04:36:45 2049:2049 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:36:45 2049:2049 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:36:45 2049:2049 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  0.09789061546325684 

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                  aten::empty         0.01%      10.000us         0.01%      10.000us       3.333us       5.00 Mb       5.00 Mb             3  
    STUDENT - NAIVE ATTENTION        98.83%      96.772ms        99.95%      97.867ms      97.867ms       4.50 Mb      -1.00 Mb             1  
                  aten::zeros         0.02%      18.000us         0.48%     473.000us     236.500us       4.50 Mb           0 b             2  
                  aten::clone         0.02%      18.000us         0.61%     601.000us     300.500us       1.00 Mb           0 b             2  
              model_inference         0.05%      49.000us       100.00%      97.916ms      97.916ms     512.00 Kb      -4.00 Mb             1  
                aten::flatten         0.02%      17.000us         0.28%     271.000us      54.200us     512.00 Kb           0 b             5  
             aten::empty_like         0.00%       3.000us         0.00%       4.000us       4.000us     512.00 Kb           0 b             1  
          aten::empty_strided         0.01%       8.000us         0.01%       8.000us       8.000us     512.00 Kb     512.00 Kb             1  
                  aten::zero_         0.01%       8.000us         0.46%     446.000us     223.000us           0 b           0 b             2  
                  aten::fill_         0.45%     438.000us         0.45%     438.000us     219.000us           0 b           0 b             2  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 97.916ms

STUDENT - NAIVE ATTENTION statistics
cpu time:  97.867ms
mem usage:  4718592 bytes

This speed-up carries through for larger values of N (e.g. -N 4096):

Running Part 1 Test: Naive Unfused Attention                                                                                                                                                                      

-----RUNNING REFERENCE IMPLEMENTATION-----                                                                                                                                                                        

STAGE:2024-01-02 04:30:05 1483:1483 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:30:08 1483:1483 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:30:08 1483:1483 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  3.351040840148926 

-------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                           Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                    aten::empty         0.00%      44.000us         0.00%      44.000us      14.667us      68.00 Mb      68.00 Mb             3  
    REFERENCE - NAIVE ATTENTION        99.57%        3.337s        99.85%        3.346s        3.346s      66.00 Mb      -4.00 Mb             1  
                    aten::zeros         0.00%      35.000us         0.24%       8.061ms       4.030ms      66.00 Mb           0 b             2  
                    aten::clone         0.00%      50.000us         0.04%       1.292ms     646.000us       4.00 Mb           0 b             2  
                model_inference         0.15%       4.930ms       100.00%        3.351s        3.351s       2.00 Mb     -64.00 Mb             1  
                  aten::flatten         0.00%      50.000us         0.02%     796.000us     159.200us       2.00 Mb           0 b             5  
               aten::empty_like         0.00%       8.000us         0.00%      14.000us      14.000us       2.00 Mb           0 b             1  
            aten::empty_strided         0.00%      40.000us         0.00%      40.000us      40.000us       2.00 Mb       2.00 Mb             1  
                    aten::zero_         0.00%      28.000us         0.24%       7.988ms       3.994ms           0 b           0 b             2  
                    aten::fill_         0.24%       7.960ms         0.24%       7.960ms       3.980ms           0 b           0 b             2  
-------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 3.351s

REFERENCE - NAIVE ATTENTION statistics
cpu time:  3346.141ms
mem usage:  69206016 bytes

-----RUNNING STUDENT IMPLEMENTATION-----

STAGE:2024-01-02 04:38:41 2106:2106 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:38:43 2106:2106 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:38:43 2106:2106 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  1.6160633563995361 

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
    STUDENT - NAIVE ATTENTION        99.21%        1.603s        99.68%        1.611s        1.611s      66.00 Mb      -4.00 Mb             1  
                  aten::zeros         0.00%      21.000us         0.40%       6.544ms       3.272ms      66.00 Mb           0 b             2  
                  aten::empty         0.00%      44.000us         0.00%      44.000us      14.667us      66.00 Mb      66.00 Mb             3  
                  aten::clone         0.00%      27.000us         0.07%       1.090ms     545.000us       4.00 Mb           0 b             2  
              model_inference         0.32%       5.132ms       100.00%        1.616s        1.616s       2.00 Mb     -64.00 Mb             1  
                aten::flatten         0.00%      24.000us         0.03%     507.000us     101.400us       2.00 Mb           0 b             5  
             aten::empty_like         0.00%       4.000us         0.00%       5.000us       5.000us       2.00 Mb       2.00 Mb             1  
          aten::empty_strided         0.00%      42.000us         0.00%      42.000us      42.000us       2.00 Mb       2.00 Mb             1  
                  aten::zero_         0.00%       9.000us         0.40%       6.480ms       3.240ms           0 b           0 b             2  
                  aten::fill_         0.40%       6.471ms         0.40%       6.471ms       3.236ms           0 b           0 b             2  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.616s

STUDENT - NAIVE ATTENTION statistics
cpu time:  1610.961ms
mem usage:  69206016 bytes

Now, my understanding is that a speedup is expected: re-ordering the inner loops makes better use of the CPU cache, since the innermost loop advances along contiguous elements of each row (the column index varies fastest) and therefore gets far more cache hits. What's interesting, however, is what happens when I then run the blocked version of the same code for part 2:

const int TILE_SIZE = 64;
for (int b = 0; b < B; b++)
{
    for (int h = 0; h < H; h++)
    {
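        // blocked matmul: accumulate Q @ K^T tile by tile into QK_t (relies on QK_t starting zeroed)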
        for (int ti = 0; ti < N; ti += TILE_SIZE)
        {
            for (int tj = 0; tj < N; tj += TILE_SIZE)
            {
                for (int tk = 0; tk < d; tk += TILE_SIZE)
                {
                    for (int i = ti; i < std::min(ti + TILE_SIZE, N); i++)
                    {
                        for (int j = tj; j < std::min(tj + TILE_SIZE, N); j++)
                        {
                            float val = twoDimRead(QK_t, i, j, N);
                            for (int k = tk; k < std::min(tk + TILE_SIZE, d); k++)
                            {
                                float qVal = fourDimRead(Q, b, h, i, k, H, N, d);
                                float kVal = fourDimRead(K, b, h, j, k, H, N, d);
                                val += qVal * kVal;
                            }
                            twoDimWrite(QK_t, i, j, N, val);
                        }
                    }
                }
            }
        }

        // softmax(QK_t)
        for (int i = 0; i < N; i++)
        {
            float rowSum = 0.0;
            for (int j = 0; j < N; j++)
            {
                float val = std::exp(twoDimRead(QK_t, i, j, N));
                twoDimWrite(QK_t, i, j, N, val);
                rowSum += val;
            }
            for (int j = 0; j < N; j++)
            {
                float val = twoDimRead(QK_t, i, j, N) / rowSum;
                twoDimWrite(QK_t, i, j, N, val);
            }
        }

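        // blocked matmul: O = softmax(QK_t) @ V, using the same tiling with re-ordered inner loops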
        for (int ti = 0; ti < N; ti += TILE_SIZE)
        {
            for (int tk = 0; tk < N; tk += TILE_SIZE)
            {
                for (int tj = 0; tj < d; tj += TILE_SIZE)
                {
                    for (int i = ti; i < std::min(ti + TILE_SIZE, N); i++)
                    {
                        for (int k = tk; k < std::min(tk + TILE_SIZE, N); k++)
                        {
                            float val = twoDimRead(QK_t, i, k, N);
                            for (int j = tj; j < std::min(tj + TILE_SIZE, d); j++)
                            {
                                float oVal = fourDimRead(O, b, h, i, j, H, N, d);
                                float vVal = fourDimRead(V, b, h, k, j, H, N, d);
                                oVal += val * vVal;
                                fourDimWrite(O, b, h, i, j, H, N, d, oVal);
                            }
                        }
                    }
                }
            }
        }
    }
}

I see the following performance metrics:

Running Part 2 Test: Unfused Attention with Blocked Matmul                                                                                                                                                        

-----RUNNING REFERENCE IMPLEMENTATION-----                                                                                                                                                                        

STAGE:2024-01-02 04:40:05 2162:2162 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:40:05 2162:2162 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:40:05 2162:2162 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  0.1801304817199707 

------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                     aten::empty         0.02%      34.000us         0.02%      34.000us      11.333us       5.00 Mb       5.00 Mb             3  
    REFERENCE - BLOCKED MATMUL + UNFUSED SOFTMAX        98.57%     177.574ms        99.97%     180.110ms     180.110ms       4.50 Mb      -1.00 Mb             1  
                                     aten::zeros         0.44%     791.000us         0.53%     955.000us     477.500us       4.50 Mb           0 b             2  
                                     aten::clone         0.02%      35.000us         0.42%     763.000us     381.500us       1.00 Mb           0 b             2  
                                 model_inference         0.03%      46.000us       100.00%     180.156ms     180.156ms     512.00 Kb      -4.00 Mb             1  
                                   aten::flatten         0.02%      42.000us         0.25%     443.000us      88.600us     512.00 Kb           0 b             5  
                                aten::empty_like         0.00%       6.000us         0.01%      10.000us      10.000us     512.00 Kb           0 b             1  
                             aten::empty_strided         0.01%      15.000us         0.01%      15.000us      15.000us     512.00 Kb     512.00 Kb             1  
                                     aten::zero_         0.02%      29.000us         0.50%     893.000us     446.500us           0 b           0 b             2  
                                     aten::fill_         0.48%     864.000us         0.48%     864.000us     432.000us           0 b           0 b             2  
------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 180.156ms

REFERENCE - BLOCKED MATMUL + UNFUSED SOFTMAX statistics
cpu time:  180.11ms
mem usage:  4718592 bytes

-----RUNNING STUDENT IMPLEMENTATION-----

STAGE:2024-01-02 04:40:12 2162:2162 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:40:12 2162:2162 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:40:12 2162:2162 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  0.09045290946960449 

----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::empty         0.01%      10.000us         0.01%      10.000us       3.333us       5.00 Mb       5.00 Mb             3  
    STUDENT - BLOCKED MATMUL + UNFUSED SOFTMAX        98.81%      89.405ms        99.95%      90.431ms      90.431ms       4.50 Mb      -1.00 Mb             1  
                                   aten::zeros         0.02%      17.000us         0.53%     477.000us     238.500us       4.50 Mb           0 b             2  
                                   aten::clone         0.02%      14.000us         0.58%     524.000us     262.000us       1.00 Mb           0 b             2  
                               model_inference         0.05%      47.000us       100.00%      90.478ms      90.478ms     512.00 Kb      -4.00 Mb             1  
                                 aten::flatten         0.02%      19.000us         0.30%     273.000us      54.600us     512.00 Kb           0 b             5  
                              aten::empty_like         0.00%       3.000us         0.00%       4.000us       4.000us     512.00 Kb           0 b             1  
                           aten::empty_strided         0.01%       5.000us         0.01%       5.000us       5.000us     512.00 Kb     512.00 Kb             1  
                                   aten::zero_         0.01%       8.000us         0.50%     451.000us     225.500us           0 b           0 b             2  
                                   aten::fill_         0.49%     443.000us         0.49%     443.000us     221.500us           0 b           0 b             2  
----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 90.478ms

STUDENT - BLOCKED MATMUL + UNFUSED SOFTMAX statistics
cpu time:  90.431ms
mem usage:  4718592 bytes

And for larger -N 4096:

Running Part 2 Test: Unfused Attention with Blocked Matmul                                                                                                                                                        

-----RUNNING REFERENCE IMPLEMENTATION-----                                                                                                                                                                        

STAGE:2024-01-02 04:42:51 2275:2275 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:42:54 2275:2275 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:42:54 2275:2275 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  2.871680498123169 

------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                     aten::empty         0.00%      57.000us         0.00%      57.000us      19.000us      68.00 Mb      68.00 Mb             3  
    REFERENCE - BLOCKED MATMUL + UNFUSED SOFTMAX        99.54%        2.858s        99.84%        2.867s        2.867s      66.00 Mb      -4.00 Mb             1  
                                     aten::zeros         0.00%      42.000us         0.25%       7.315ms       3.658ms      66.00 Mb           0 b             2  
                                     aten::clone         0.00%      45.000us         0.04%       1.214ms     607.000us       4.00 Mb           0 b             2  
                                 model_inference         0.16%       4.681ms       100.00%        2.872s        2.872s       2.00 Mb     -64.00 Mb             1  
                                   aten::flatten         0.00%      50.000us         0.03%     762.000us     152.400us       2.00 Mb           0 b             5  
                                aten::empty_like         0.00%       8.000us         0.00%      13.000us      13.000us       2.00 Mb           0 b             1  
                             aten::empty_strided         0.00%      37.000us         0.00%      37.000us      37.000us       2.00 Mb       2.00 Mb             1  
                                     aten::zero_         0.00%      36.000us         0.25%       7.221ms       3.611ms           0 b           0 b             2  
                                     aten::fill_         0.25%       7.185ms         0.25%       7.185ms       3.592ms           0 b           0 b             2  
------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.872s

REFERENCE - BLOCKED MATMUL + UNFUSED SOFTMAX statistics
cpu time:  2867.031ms
mem usage:  69206016 bytes

-----RUNNING STUDENT IMPLEMENTATION-----

STAGE:2024-01-02 04:43:12 2275:2275 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-01-02 04:43:14 2275:2275 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-01-02 04:43:14 2275:2275 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
manual attention == pytorch attention True
Manual Execution Time:  1.500117540359497 

----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                   aten::empty         0.00%      64.000us         0.00%      64.000us      21.333us      68.00 Mb      68.00 Mb             3  
    STUDENT - BLOCKED MATMUL + UNFUSED SOFTMAX        99.09%        1.486s        99.65%        1.495s        1.495s      66.00 Mb      -4.00 Mb             1  
                                   aten::zeros         0.00%      23.000us         0.47%       7.070ms       3.535ms      66.00 Mb           0 b             2  
                                   aten::clone         0.00%      29.000us         0.09%       1.325ms     662.500us       4.00 Mb           0 b             2  
                               model_inference         0.35%       5.259ms       100.00%        1.500s        1.500s       2.00 Mb     -64.00 Mb             1  
                                 aten::flatten         0.00%      25.000us         0.05%     698.000us     139.600us       2.00 Mb           0 b             5  
                              aten::empty_like         0.00%       5.000us         0.00%      13.000us      13.000us       2.00 Mb           0 b             1  
                           aten::empty_strided         0.00%      40.000us         0.00%      40.000us      40.000us       2.00 Mb       2.00 Mb             1  
                                   aten::zero_         0.00%      11.000us         0.47%       6.991ms       3.495ms           0 b           0 b             2  
                                   aten::fill_         0.47%       6.980ms         0.47%       6.980ms       3.490ms           0 b           0 b             2  
----------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.500s

STUDENT - BLOCKED MATMUL + UNFUSED SOFTMAX statistics
cpu time:  1494.889ms
mem usage:  69206016 bytes

There are a couple of things to note here that I find confusing.

First, the reference solution itself shows a much smaller speedup between part 1 and part 2 in my execution environment than the README.md indicates (12~15% vs. >30%). I assume this has something to do with differences in the underlying hardware, as I ran the test scripts on a rented cloud machine with an AMD EPYC 7302P 16-core processor, though I'm not sure that's actually the cause. Would there be any other explanations for why the reference solution doesn't show performance improvements similar to those indicated in the repo README?
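For what it's worth, one way to sanity-check the hardware hypothesis would be to compare the cache hierarchies of the two machines. A minimal glibc/Linux-only sketch (sysconf may return 0 or -1 on systems where a cache level can't be queried):

#include <cstdio>
#include <unistd.h>

int main()
{
    // cache sizes bound how much headroom tiling has on a given machine
    printf("L1d:  %ld KB\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
    printf("L2:   %ld KB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
    printf("L3:   %ld KB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / 1024);
    printf("line: %ld B\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}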

Second, the blocked version of my solution (with loop re-ordering) shows even less speedup over the reference: 6~8%, small enough that it seems almost negligible for both smaller and larger values of N. That stands in contrast to the loop re-ordering itself, which, while extremely simple to implement, led to a >2x speedup over the reference. Implementing cache-aware blocked matmul seems to provide almost no additional benefit once loop re-ordering is in place. Is this to be expected?

Even though loop re-ordering addresses the high cache-miss rate of row-wise matmul, I'd have expected that, given the small size of L1 caches, blocking would still reduce the total number of accesses to main memory and therefore yield measurably faster processing times. From my experiments and observations so far, it seems matrix tiling is not worth the extra effort if the inner loops are already re-ordered to be cache-friendly.
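To try to isolate this from the attention harness, here is a standalone micro-benchmark sketch (N and TILE are arbitrary choices of mine, and the matrices are square rather than N x d, so it only approximates the assignment's shapes). It compares an i-k-j re-ordered matmul against a tiled version of the same loop nest:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// i-k-j order: the innermost loop walks B and C row-wise (contiguous memory)
static void matmulReordered(const std::vector<float> &A, const std::vector<float> &B,
                            std::vector<float> &C, int N)
{
    for (int i = 0; i < N; i++)
    {
        for (int k = 0; k < N; k++)
        {
            float a = A[i * N + k];
            for (int j = 0; j < N; j++)
                C[i * N + j] += a * B[k * N + j];
        }
    }
}

// the same i-k-j order, additionally blocked into T x T tiles
static void matmulTiled(const std::vector<float> &A, const std::vector<float> &B,
                        std::vector<float> &C, int N, int T)
{
    for (int ti = 0; ti < N; ti += T)
        for (int tk = 0; tk < N; tk += T)
            for (int tj = 0; tj < N; tj += T)
                for (int i = ti; i < std::min(ti + T, N); i++)
                    for (int k = tk; k < std::min(tk + T, N); k++)
                    {
                        float a = A[i * N + k];
                        for (int j = tj; j < std::min(tj + T, N); j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}

int main()
{
    const int N = 1024, TILE = 64;
    std::vector<float> A(N * N, 1.0f), B(N * N, 0.5f), C(N * N);

    auto bench = [&](const char *name, auto &&fn) {
        std::fill(C.begin(), C.end(), 0.0f);
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        printf("%s: %8.1f ms (C[0] = %g)\n", name,
               std::chrono::duration<double, std::milli>(t1 - t0).count(), C[0]);
    };

    bench("reordered", [&] { matmulReordered(A, B, C, N); });
    bench("tiled    ", [&] { matmulTiled(A, B, C, N, TILE); });
    return 0;
}

On a machine with large caches and an aggressive hardware prefetcher (which sequential access patterns are very friendly to), I'd expect the two versions to land within a few percent of each other, which would be consistent with the numbers above.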

Any further guidance on this topic would be very helpful!

Malfurionzz commented 6 months ago

Sorry to disturb. I suppose issues here may be ignored, so I'm asking @rmshin for help directly. When working on cs149gpt, I found some undefined symbols in module_ref.so, which makes the reference implementations unavailable. I believe it's related to the PyTorch version, but the README didn't cover that. I would appreciate it if anyone could provide some info about the setup (maybe a PyTorch version that works). P.S. it works when all of the ms-related code is commented out.

undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
echo "_ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE" | c++filt
# output
at::_ops::zeros::call(c10::ArrayRef<c10::SymInt>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)
rmshin commented 6 months ago

hey @Malfurionzz, I don't know if this will help, but I ran all the assignments with the cuda:12.3.1-devel-ubuntu22.04 Docker image and simply installed the relevant libraries without pinning versions:

apt install python3-pip
pip3 install torch ninja tiktoken
git clone https://github.com/rmshin/cs149gpt.git && cd cs149gpt

Hope that helps, happy to pair on this at some point if you can't get it working on your end :)

Malfurionzz commented 6 months ago

Very nice of you. I tried the default torch version with the Docker image, but it didn't help. However, I luckily found that torch==2.1.2 works (in both the Docker and physical environments). Hope it helps anyone who needs it.

# ubuntu 22.04
# pip install torch==2.1.2
BoL0150 commented 6 months ago

> Very nice of you. I tried the default torch version with the Docker image, but it didn't help. However, I luckily found that torch==2.1.2 works (in both the Docker and physical environments). Hope it helps anyone who needs it.
>
> # ubuntu 22.04
> # pip install torch==2.1.2

Your method solved my problem, thank you very much for your help!