
Ensure LHS remains in L1 cache in 2 successive UKernel calls #15659

Open RoboTux opened 10 months ago

RoboTux commented 10 months ago

Request description

A Linux perf report for BERT Large F32 batch-1 on Arm Neoverse-V1 shows 2 instructions accounting for about 60% of the cycles in the mmt4d UKernel. A detailed profile of a reduced version of the 384x1024 by 1024x4096 matmul shows that loads of the LHS end up as L1 cache misses, even when the LHS is the same as in the previous UKernel call (which it is across iterations of the inner loop in iree_uk_mmt4d_using_tile_func()).

The cause likely lies in the size of the UKernel workload: a single UKernel call loads 64KiB of data for the LHS and RHS alone (8x1024 F32 values for the LHS and as many for the RHS), while the Arm Neoverse-V1 has a 64KiB L1 data cache. One solution I can think of is to split the reduction into several partial reductions and pack the data so that the data for the i-th partial reduction is sequential in memory. The inner loop mentioned above would then only do sequential accesses, and a new loop would be inserted around it to iterate over all the partial reductions.
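
To make the idea concrete, here is a rough sketch of the traversal change I have in mind. Everything below is hypothetical illustration, not IREE's actual code: `tile_func` stands in for the mmt4d tile function, shapes assume 8x8 F32 tiles (M0 = N0 = 8), and the stride arithmetic is simplified.

```c
// Hypothetical illustration, not IREE code. tile_func stands in for the mmt4d
// tile function; shapes assume 8x8 F32 tiles (M0 = N0 = 8); strides are in floats.
typedef void (*tile_func_t)(float* out, const float* lhs, const float* rhs, int k);

// Current traversal (simplified): every call walks the full K dimension, so each
// call streams 8*K floats of LHS plus 8*K floats of RHS (64KiB total for K=1024).
static void current_traversal(tile_func_t tile_func, float* out,
                              const float* lhs, const float* rhs, int N, int K) {
  for (int n = 0; n < N; ++n) {
    tile_func(out + n * 8 * 8, lhs, rhs + n * 8 * K, K);
  }
}

// Proposed traversal: split K into num_splits partial reductions and pack the data
// so each partial reduction is a contiguous slice; the LHS slice stays L1-resident
// across the whole n loop.
static void split_k_traversal(tile_func_t tile_func, float* out,
                              const float* lhs, const float* rhs,
                              int N, int K, int num_splits) {
  int k_split = K / num_splits;  // assume it divides evenly
  for (int ks = 0; ks < num_splits; ++ks) {
    for (int n = 0; n < N; ++n) {
      tile_func(out + n * 8 * 8,
                lhs + ks * 8 * k_split,            // 8 x k_split slice of LHS
                rhs + (ks * N + n) * 8 * k_split,  // RHS packed per (split, n)
                k_split);
    }
  }
}
```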

What component(s) does this issue relate to?

Compiler, Runtime

Additional context

No response

stellaraccident commented 10 months ago

Adding Benoit because obviously :)

Also asking Julian because I think they are looking at some related things and this is a well scoped task.

bjacob commented 10 months ago

good topic! Experimentation welcome.

RoboTux commented 10 months ago

good topic! Experimentation welcome.

* I used to maintain a self-contained whole-matmul benchmark based on ukernels here: https://github.com/openxla/iree/blob/main/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark.c . If you're interested in experimenting or just proving a theory in a self-contained, hackable environment, this is the place to be.

Thanks.

* I'm open to anything that empirically helps wall-clock-time performance.

* `perf annotate` style instruction-level profiling is notoriously inaccurate on Arm CPUs. I have rarely found it interpretable when `perf` claimed this kind of blame on a few instructions.

Yes, that is my experience as well, which is why I took the assembly generated for the 384x1024 by 1024x4096 matmul and ran the resulting code in a CPU simulator for the Arm Neoverse-V1. That confirmed to me that cache misses are at play despite the sequential memory layout, even after several calls to the UKernel with the same LHS operand.

* The mmt4d ukernel operates on tiled matrix data (what we call "data tiling"), allowing it to traverse only a single contiguous data stream of LHS data, and another one of RHS data. These long stretches of contiguous accesses tend to be well predicted and prefetched. If data is cycling beyond L1 cache size, it should be able to come back in from L2 in time for its next use, and performance shouldn't suffer. Except perhaps if we run into a bottleneck on L1->L2 victim eviction on in-order cores? The issue description only says Arm Neoverse-V1; I'm curious whether that is an in-order (like A510) or out-of-order (like A710 or X2) core.

Arm Neoverse-V1 is an out-of-order core aimed at HPC and AI/ML workloads. Sequential access can indeed be prefetched very well, but it can take some time to prime the prefetcher. For the RHS this is not a problem: the innermost loop calling the mmt4d ukernel means that from one ukernel call to the next the data is still sequential, so we are reading N (as in iree_uk_mmt4d_params_t.N, which was 32 in my testing) sequences of 1024 loads of 8 F32 values. The Arm Neoverse-V1 simulator I mentioned above shows that by the second ukernel call we are indeed hitting the L1 cache. However, at the start of each ukernel call we go back to the beginning of the LHS instead of continuing where we left off. That seems to throw the prefetcher off enough that, no matter which iteration of the inner loop it is, the ukernel always starts with some L1 cache misses on the LHS accesses, and those seem to last until the end of the ukernel when the reduction dimension is 1024. In other words, close to half of the loads in the ukernel end up as L1 cache misses.

* Since so much of cache-friendly behavior on modern CPUs rests on having these long stretches of contiguous memory accesses, splitting the inner reduction loop can easily be counter-productive. Everyone used to split it, and the GPU folks still do, but I stopped doing it ~5 years ago on CPU and I would find it surprising if it were beneficial to bring back now. It does mean exceeding L1 cache size in deep matmuls, and that should be OK for the reason stated above.

I might be missing something, but how would splitting the inner reduction lead to non-contiguous memory? For the LHS=384x1024 RHS=1024x4096 matmul found in BERT Large, IREE currently packs into LHS=48x1024x8x1 (MxKxM0xK0) and RHS=512x1024x8x1 (NxKxN0xK0). Instead we could pack into LHS=48x2x512x8 (MxKxK0xM0) and RHS=2x512x8x8 (KxNxK0xM0). The reduction loop in the ukernel would then iterate 512 times. The inner loop calling that ukernel would iterate 32 times (in my testing) over N, which for the RHS would be contiguous; the LHS would remain the same. That is, the inner loop calling the ukernel does the same i-th partial reduction for a bunch of tiles. Then a new loop would iterate over the K partial reductions, so we move to the next LHS slice, which is sequential. The RHS isn't, though, because 32 < 512, but that happens only once per 32 ukernel calls. Obviously this would require a revamped mmt4d to work with that new layout. Anyway, it's just a thought; there might be a more clever way to do it.

* PMU counters such as L1 cache refills are hard to interpret. We have some thoughts about that [here](https://github.com/openxla/iree/blob/main/docs/website/docs/developers/performance/profiling-cpu-events.md#interpreting-cpu-event-counts). In any case, minimizing the L1 cache refill count is a non-goal.

I was not looking at PMU counters but at a trace of the CPU simulation. That tool should simulate the CPU accurately, including the caches and the prefetcher.

bjacob commented 10 months ago

good topic! Experimentation welcome.

* I used to maintain a self-contained whole-matmul benchmark based on ukernels here: https://github.com/openxla/iree/blob/main/runtime/src/iree/builtins/ukernel/tools/e2e_matmul_benchmark.c . If you're interested in experimenting or just proving a theory in a self-contained, hackable environment, this is the place to be.

Thanks.

Actually, for what you are looking at here, it may be simpler and closer to what you're interested in to use the raw mmt4d ukernel microbenchmark instead:

https://github.com/openxla/iree/blob/main/runtime/src/iree/builtins/ukernel/tools/mmt4d_benchmark.c

This is running just the mmt4d ukernel (on pre-packed input data). By contrast, the above e2e_matmul_benchmark was doing a user-facing end-to-end matmul, on row-major data, but that's not actually what you were discussing. Also, it relies on having ukernels for pack and unpack, which we have for certain common element types and tile sizes but not for all cases, as we are not using these ukernels in IREE codegen (so they are here mostly for historical reasons).

* I'm open to anything that empirically helps wall-clock-time performance.

* `perf annotate` style instruction-level profiling is notoriously inaccurate on Arm CPUs. I have rarely found it interpretable when `perf` claimed this kind of blame on a few instructions.

Yes, that is my experience as well, which is why I took the assembly generated for the 384x1024 by 1024x4096 matmul and ran the resulting code in a CPU simulator for the Arm Neoverse-V1. That confirmed to me that cache misses are at play despite the sequential memory layout, even after several calls to the UKernel with the same LHS operand.

Ah, that's super interesting! I've never played with a CPU simulator.

* The mmt4d ukernel operates on tiled matrix data (what we call "data tiling"), allowing it to traverse only a single contiguous data stream of LHS data, and another one of RHS data. These long stretches of contiguous accesses tend to be well predicted and prefetched. If data is cycling beyond L1 cache size, it should be able to come back in from L2 in time for its next use, and performance shouldn't suffer. Except perhaps if we run into a bottleneck on L1->L2 victim eviction on in-order cores? The issue description only says Arm Neoverse-V1; I'm curious whether that is an in-order (like A510) or out-of-order (like A710 or X2) core.

Arm Neoverse-V1 is an out-of-order core aimed at HPC and AI/ML workloads. Sequential access can indeed be prefetched very well, but it can take some time to prime the prefetcher. For the RHS this is not a problem: the innermost loop calling the mmt4d ukernel means that from one ukernel call to the next the data is still sequential, so we are reading N (as in iree_uk_mmt4d_params_t.N, which was 32 in my testing) sequences of 1024 loads of 8 F32 values. The Arm Neoverse-V1 simulator I mentioned above shows that by the second ukernel call we are indeed hitting the L1 cache. However, at the start of each ukernel call we go back to the beginning of the LHS instead of continuing where we left off. That seems to throw the prefetcher off enough that, no matter which iteration of the inner loop it is, the ukernel always starts with some L1 cache misses on the LHS accesses, and those seem to last until the end of the ukernel when the reduction dimension is 1024. In other words, close to half of the loads in the ukernel end up as L1 cache misses.

That's really interesting, thanks for the explanation!

Unfortunate, as one would think that an "out-of-order core aimed at HPC and AI/ML workloads" would find it easy to speculate ahead of the control flow in this simple loop nest and be able to prefetch both LHS and RHS data.

If this is what we are working around, my first attempt would be prfm instructions?

* Since so much of cache-friendly behavior on modern CPUs rests on having these long stretches of contiguous memory accesses, splitting the inner reduction loop can easily be counter-productive. Everyone used to split it, and the GPU folks still do, but I stopped doing it ~5 years ago on CPU and I would find it surprising if it were beneficial to bring back now. It does mean exceeding L1 cache size in deep matmuls, and that should be OK for the reason stated above.

I might be missing something, but how would splitting the inner reduction lead to non-contiguous memory?

I was referring to memory accesses, not just layout, and by contiguous I meant contiguous and monotonically increasing. What I mean is that as long as we are in the middle of a for loop reading N bytes and incrementing pointers by N, we're in the ideal case to help the CPU prefetch data, so I try to stay in that case as much as possible --- that's one of my theories for why it helps to avoid splitting the reduction inner loop.

Actually, there is a bit of an unresolved mystery here that we've never fully understood, so maybe you'll be able to shed light. The performance of matrix multiplications with non-split inner reduction loop, on various Arm CPUs I've experimented with, increases greatly with the K reduction size, beyond just what I could explain by better prefetching. There is just something about a long inner reduction loop that the CPU loves, that I never understood, and was never able to correlate with PMU counters or anything. It's been an empirical fact.

For the LHS=384x1024 RHS=1024x4096 matmul found in BERT Large, IREE currently packs into LHS=48x1024x8x1 (MxKxM0xK0) and RHS=512x1024x8x1 (NxKxN0xK0). Instead we could pack into LHS=48x2x512x8 (MxKxK0xM0) and RHS=2x512x8x8 (KxNxK0xM0).

Wait, there's a little misunderstanding here. The LHS layout will remain (MxKxM0xK0), and that's always understood as row-major, meaning that the last-enumerated dimension K0 is the one that's inner-most in the layout. The inner tile size (LHS M0xK0, RHS N0xK0) is dictated by the SIMD ISA - it is the shape of tiles that we can simply load from memory into SIMD registers and feed to SIMD arithmetic instructions. It currently is (M0=8, N0=8, K0=1) because Arm NEON offers the right instructions to implement an outer-product kernel (K0=1), and (M0=8, N0=8) is the largest choice we can make while sticking to powers of two, under the constraint that there are only 32 NEON registers (2 for LHS, 2 for RHS, 16 for the accumulator, i.e. 20 used out of 32).
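
For illustration only, a bare-bones version of such an 8x8 outer-product tile in NEON intrinsics could look like the sketch below. This is not IREE's actual tile function (which keeps the LHS in two registers and uses lane-indexed FMLA rather than broadcasting each LHS element), but it shows the register budget and the K0=1 structure:

```c
#include <arm_neon.h>

// Rough sketch of an 8x8 F32 tile with K0 = 1 (outer product per k step):
// acc[i][j] += lhs[i] * rhs[j]. acc is an 8x8 row-major tile; lhs and rhs point
// at packed 8xK tile data. Not IREE's actual kernel, which lane-indexes the LHS
// from two registers instead of broadcasting.
static void tile_8x8_f32_sketch(float* acc, const float* lhs,
                                const float* rhs, int K) {
  float32x4_t acc_v[16];  // 16 accumulator registers: 8 rows x 8 f32 columns
  for (int i = 0; i < 8; ++i) {
    acc_v[2 * i + 0] = vld1q_f32(acc + 8 * i + 0);
    acc_v[2 * i + 1] = vld1q_f32(acc + 8 * i + 4);
  }
  for (int k = 0; k < K; ++k) {
    float32x4_t rhs_lo = vld1q_f32(rhs + 8 * k + 0);  // RHS columns 0..3
    float32x4_t rhs_hi = vld1q_f32(rhs + 8 * k + 4);  // RHS columns 4..7
    for (int i = 0; i < 8; ++i) {
      float32x4_t lhs_i = vdupq_n_f32(lhs[8 * k + i]);  // broadcast LHS row i
      acc_v[2 * i + 0] = vfmaq_f32(acc_v[2 * i + 0], lhs_i, rhs_lo);
      acc_v[2 * i + 1] = vfmaq_f32(acc_v[2 * i + 1], lhs_i, rhs_hi);
    }
  }
  for (int i = 0; i < 8; ++i) {
    vst1q_f32(acc + 8 * i + 0, acc_v[2 * i + 0]);
    vst1q_f32(acc + 8 * i + 4, acc_v[2 * i + 1]);
  }
}
```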

Splitting the reduction dimension, if we went down that route (which again I don't expect will be beneficial) would not entail any change to these shapes (and layouts, implicitly dictated by shape). It would be a change of how we traverse that data, not to its layout.

* PMU counters such as L1 cache refills are hard to interpret. We have some thoughts about that [here](https://github.com/openxla/iree/blob/main/docs/website/docs/developers/performance/profiling-cpu-events.md#interpreting-cpu-event-counts). In any case, minimizing the L1 cache refill count is a non-goal.

I was not looking at PMU counters but at a trace of the CPU simulation. That tool should simulate the CPU accurately, including the caches and the prefetcher.

That's really exciting to hear. We might learn things here that I was never able to!

bjacob commented 10 months ago

Here are some old slide decks from when I first introduced mmt4d. Much has changed since then (no code had been written yet), but they mostly still apply:

  1. Proposed lowering path for linalg.matmul to CPU codegen
    • This appendix is about why we don't recommend splitting the reduction dimension.
  2. A practical matmul tiling example

RoboTux commented 10 months ago

Arm Neoverse-V1 is an out-of-order core aimed at HPC and AI/ML workloads. Sequential access can indeed be prefetched very well, but it can take some time to prime the prefetcher. For the RHS this is not a problem: the innermost loop calling the mmt4d ukernel means that from one ukernel call to the next the data is still sequential, so we are reading N (as in iree_uk_mmt4d_params_t.N, which was 32 in my testing) sequences of 1024 loads of 8 F32 values. The Arm Neoverse-V1 simulator I mentioned above shows that by the second ukernel call we are indeed hitting the L1 cache. However, at the start of each ukernel call we go back to the beginning of the LHS instead of continuing where we left off. That seems to throw the prefetcher off enough that, no matter which iteration of the inner loop it is, the ukernel always starts with some L1 cache misses on the LHS accesses, and those seem to last until the end of the ukernel when the reduction dimension is 1024. In other words, close to half of the loads in the ukernel end up as L1 cache misses.

That's really interesting, thanks for the explanation!

Unfortunate, as one would think that an "out-of-order core aimed at HPC and AI/ML workloads" would find it easy to speculate ahead of the control flow in this simple loop nest and be able to prefetch both LHS and RHS data.

Good point, and yes, it absolutely does speculate ahead of the control flow. I was focused on the delay the FMLAs incur due to their dependency on the loads. However, you made me realize that this is irrelevant because they don't block the next load from starting to execute. So what matters is, given the limited number of load/store units, whether the load of a given iteration needs to wait for the one from the previous iteration to complete. And it seems that after a few ukernel calls and a few loop iterations in the ukernel it does wait, but not much. So there might not be much gain.

If this is what we are working around, my first attempt would be prfm instructions?

prfm instructions are hints to the CPU; some CPUs can ignore them if their prefetch unit is generally better than programmers at placing them judiciously. Remember that a big part of the problem is prefetching early enough for the data to have reached the L1 cache by the time it's loaded, but only just, so that we are not wasting L1 cache or, worse, having the data evicted before it is used.
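
If someone still wanted to experiment with explicit prefetching, the portable way to emit prfm from C is `__builtin_prefetch`. A hypothetical sketch follows; the distance constant is a guess and would need per-core tuning:

```c
// Hypothetical sketch, not IREE code: issue a prefetch for the LHS stream some
// distance ahead of the current reduction step. PREFETCH_AHEAD_BYTES is a guess
// and would need per-core tuning: too early wastes L1 (or the line is evicted
// before use), too late buys nothing.
#define PREFETCH_AHEAD_BYTES 512

static inline void lhs_prefetch_hint(const float* lhs, int k) {
  // __builtin_prefetch(addr, rw, locality): rw = 0 means read,
  // locality = 3 means high temporal locality (keep in cache).
  __builtin_prefetch((const char*)(lhs + 8 * k) + PREFETCH_AHEAD_BYTES, 0, 3);
}
```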

I might be missing something, but how would splitting the inner reduction lead to non-contiguous memory?

I was referring to memory accesses, not just layout, and by contiguous I meant contiguous and monotonically increasing. What I mean is that as long as we are in the middle of a for loop reading N bytes and incrementing pointers by N, we're in the ideal case to help the CPU prefetch data, so I try to stay in that case as much as possible --- that's one of my theories for why it helps to avoid splitting the reduction inner loop.

That's what I meant as well. With the right transpose you can make the reduction-dimension slice be of higher order than the tile columns, so that iterating over the tile columns for a given partial reduction is purely monotonically increasing addresses.

Actually, there is a bit of an unresolved mystery here that we've never fully understood, so maybe you'll be able to shed light. The performance of matrix multiplications with non-split inner reduction loop, on various Arm CPUs I've experimented with, increases greatly with the K reduction size, beyond just what I could explain by better prefetching. There is just something about a long inner reduction loop that the CPU loves, that I never understood, and was never able to correlate with PMU counters or anything. It's been an empirical fact.

It takes some time for a prefetcher to get primed. As mentioned above, it's not just about deciding to prefetch but about finding when to initiate a memory read so that it hits the L1 cache just before the data is needed by a load instruction. So for quite a few iterations, prefetching is happening but the data still arrives in the L1 cache too late: you get a shorter wait, but still a wait. At some point the prefetcher is fully primed and loads are immediate, but the majority of the loads that have happened so far had to wait. As you do more loads you increase the proportion of loads that didn't have to wait, and I would expect that after long enough you reach some sort of plateau in the average load time for a given loop.

For the LHS=384x1024 RHS=1024x4096 matmul found in BERT Large, IREE currently packs into LHS=48x1024x8x1 (MxKxM0xK0) and RHS=512x1024x8x1 (NxKxN0xK0). Instead we could pack into LHS=48x2x512x8 (MxKxK0xM0) and RHS=2x512x8x8 (KxNxK0xM0).

Typo here: it should have been 2x512x512x8 (KxNxK0xN0, obviously).

Wait, there's a little misunderstanding here. The LHS layout will remain (MxKxM0xK0), and that's always understood as row-major, meaning that the last-enumerated dimension K0 is the one that's inner-most in the layout. The inner tile size (LHS M0xK0, RHS N0xK0) is dictated by the SIMD ISA - it is the shape of tiles that we can simply load from memory into SIMD registers and feed to SIMD arithmetic instructions. It currently is (M0=8, N0=8, K0=1) because Arm NEON offers the right instructions to implement an outer-product kernel (K0=1), and (M0=8, N0=8) is the largest choice we can make while sticking to powers of two, under the constraint that there are only 32 NEON registers (2 for LHS, 2 for RHS, 16 for the accumulator, i.e. 20 used out of 32).

Splitting the reduction dimension, if we went down that route (which again I don't expect will be beneficial) would not entail any change to these shapes (and layouts, implicitly dictated by shape). It would be a change of how we traverse that data, not to its layout.

Without changing the layout, I completely agree it would not be beneficial. How about 5D matrices then? LHS=MxKxK1xM0xK0, where K is the splitting factor, and RHS=KxNxK1xN0xK0.

Anyway, it might not be worth it after all, as said above, since that cache miss does not seem to incur much delay from one iteration's load to the next thanks to out-of-order execution. I do see a little stall, but not much.

I was not looking at PMU counters but at a trace of the CPU simulation. That tool should simulate the CPU accurately, including the caches and the prefetcher.

That's really exciting to hear. We might learn things here that I was never able to!

Absolutely. If you are having performance issues on important workloads that you have a hard time understanding, please do reach out to us and we might be able to help.

bjacob commented 10 months ago

Good point, and yes, it absolutely does speculate ahead of the control flow. I was focused on the delay the FMLAs incur due to their dependency on the loads. However, you made me realize that this is irrelevant because they don't block the next load from starting to execute. So what matters is, given the limited number of load/store units, whether the load of a given iteration needs to wait for the one from the previous iteration to complete. And it seems that after a few ukernel calls and a few loop iterations in the ukernel it does wait, but not much. So there might not be much gain.

Right - this is one of the dimensions in which I was expecting it not to matter much - like you say, the next load isn't blocked on previous instructions - but in addition to this, I was also thinking that speculative execution should hit the first load well ahead of actual execution, making even the first load fast even if the data it's loading wasn't already in cache --- the cache miss would be experienced only by speculative execution, not by actual execution.

This has been one of the main sticking points for me in trying to understand any CPU cache event counts --- if a cache miss is experienced only by speculative execution, not by actual execution, is it counted as a cache miss? From a latency perspective, one would like it to be ignored. But from a power-measurement perspective, one would like it to be counted, as the power cost of a DRAM access is still just as high, regardless of whether the latency was hidden from actual execution. So this seems like an impossible problem to me: people designing these event counters --- whether in a CPU simulator or in the actual CPU's PMU --- have to decide on a meaning of these counters that will be misleading one way or another. Ultimately it's because these abstractions don't account for the nuance of speculative vs actual execution. One way to resolve that would be to duplicate every cache event type for actual vs speculative execution. By contrast, current event types in e.g. simpleperf only have -spec suffixed events for certain events and not for all the cache events that we're looking at here: https://android.googlesource.com/platform/system/extras/+/master/simpleperf/event_type_table.h

Without changing the layout, I completely agree it would not be beneficial. How about 5D matrices then? LHS=MxKxK1xM0xK0, where K is the splitting factor, and RHS=KxNxK1xN0xK0.

Regarding LHS=MxKxK1xM0xK0, note that when two consecutive dimensions are both K-something, as far as the layout is concerned, this describes the same layout as a single coalesced K-dimension. So here the two dimensions KxK1 are the same as a single (product) dimension, layout-wise.
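
To spell that out with hypothetical index helpers (illustration only, not IREE code):

```c
// Hypothetical index helpers, illustration only (not IREE code).
// Row-major offset for shape [M, K, K1, M0, K0]:
static int offset_split_k(int m, int k, int k1, int m0, int k0,
                          int K, int K1, int M0, int K0) {
  return (((m * K + k) * K1 + k1) * M0 + m0) * K0 + k0;
}
// Row-major offset for shape [M, K*K1, M0, K0] with coalesced index kk = k*K1 + k1:
static int offset_coalesced_k(int m, int kk, int m0, int k0,
                              int KK, int M0, int K0) {
  return ((m * KK + kk) * M0 + m0) * K0 + k0;
}
// For all valid indices:
//   offset_split_k(m, k, k1, m0, k0, K, K1, M0, K0)
//     == offset_coalesced_k(m, k * K1 + k1, m0, k0, K * K1, M0, K0)
// so the two descriptions denote byte-for-byte the same layout.
```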

Regarding RHS=KxNxK1xN0xK0... I see, yes, that is a logical thing to do if we take seriously split-reductions for cache-friendliness purposes. The idea there is to iterate over the whole N dimension (all columns) before moving on to the next split-reduction group (outer K dimension).

I can't rule out that this would be beneficial in some cases. I can see how that's more cache-friendly on RHS accesses. On the other hand, this means that horizontal slices of the accumulator matrix get traversed repeatedly, which may still be cache-friendly given well chosen tile sizes. There remains the inherent cost of additional repeated accumulator accesses, and the aforementioned mysterious penalty for lower inner loop trip counts. As mentioned earlier, this idea of splitting reductions for cache friendliness has been around since the original Goto paper (at least). I've chosen to ignore that legacy advice and enjoy the simplified mental model and immediate performance benefits, at the risk of missing out on some cache friendliness in some cases.

Absolutely. If you are having performance issues on important workloads that you have a hard time understanding, please do reach out to us and we might be able to help.

Thanks, it's good to have that possibility. What really helps me the most though is publicly available information that I can link to, read, and build my own improved understanding on. So nothing would be as impactful as if CPU vendors started publicly documenting more of the fine performance behavior of specific CPUs -- the current CPU software optimization guides are rather light on information, particularly on memory system performance. Without necessarily asking for a full model, at least understanding concretely which addresses will alias in the L1 cache would help, and I don't believe that this can be inferred from the limited cache documentation (e.g. just saying "4-way set associative VIPT etc" doesn't seem to allow me to infer that, or if it does, then I would be very interested to learn the formula).

It's also great to hear that there exists a performance-accurate CPU simulator. It would be great if it could be publicly available, easy to install and run, and free of any strings attached that would limit how I can use what I learn from it in open source contexts.

RoboTux commented 9 months ago

Good point, and yes, it absolutely does speculate ahead of the control flow. I was focused on the delay the FMLAs incur due to their dependency on the loads. However, you made me realize that this is irrelevant because they don't block the next load from starting to execute. So what matters is, given the limited number of load/store units, whether the load of a given iteration needs to wait for the one from the previous iteration to complete. And it seems that after a few ukernel calls and a few loop iterations in the ukernel it does wait, but not much. So there might not be much gain.

Right - this is one of the dimensions in which I was expecting it not to matter much - like you say, the next load isn't blocked on previous instructions - but in addition to this, I was also thinking that speculative execution should hit the first load well ahead of actual execution, making even the first load fast even if the data it's loading wasn't already in cache --- the cache miss would be experienced only by speculative execution, not by actual execution.

I don't follow; speculative execution exists because it speeds up code execution. If your speculative execution is waiting on a cache miss, you are not speculating on something else (e.g. instructions using the result of that load) and your code ends up being slower (potentially as if you didn't have speculative execution for a short while).

This has been one of the main sticking points for me in trying to understand any CPU cache event counts --- if a cache miss is experienced only by speculative execution, not by actual execution, is it counted as a cache miss? From a latency perspective, one would like it to be ignored. But from a power-measurement perspective, one would like it to be counted, as the power cost of a DRAM access is still just as high, regardless of whether the latency was hidden from actual execution. So this seems like an impossible problem to me: people designing these event counters --- whether in a CPU simulator or in the actual CPU's PMU --- have to decide on a meaning of these counters that will be misleading one way or another. Ultimately it's because these abstractions don't account for the nuance of speculative vs actual execution. One way to resolve that would be to duplicate every cache event type for actual vs speculative execution. By contrast, current event types in e.g. simpleperf only have -spec suffixed events for certain events and not for all the cache events that we're looking at here: https://android.googlesource.com/platform/system/extras/+/master/simpleperf/event_type_table.h

Sure.

Without changing the layout, I completely agree it would not be beneficial. How about 5D matrices then? LHS=MxKxK1xM0xK0, where K is the splitting factor, and RHS=KxNxK1xN0xK0.

Regarding LHS=MxKxK1xM0xK0, note that when two consecutive dimensions are both K-something, as far as the layout is concerned, this describes the same layout as a single coalesced K-dimension. So here the two dimensions KxK1 are the same as a single (product) dimension, layout-wise.

Yes, I could have had a single K (and thus a different rank for LHS and RHS) and taught the ukernel to only go over part of that dimension. It was just simpler and more explicit to split them: K1 is for the ukernel, K is iterated over by the caller.

Regarding RHS=KxNxK1xN0xK0... I see, yes, that is a logical thing to do if we take seriously split-reductions for cache-friendliness purposes. The idea there is to iterate over the whole N dimension (all columns) before moving on to the next split-reduction group (outer K dimension).

I can't rule out that this would be beneficial in some cases. I can see how that's more cache-friendly on RHS accesses. On the other hand, this means that horizontal slices of the accumulator matrix get traversed repeatedly, which may still be cache-friendly given well chosen tile sizes. There remains the inherent cost of additional repeated accumulator accesses, and the aforementioned mysterious penalty for lower inner loop trip counts. As mentioned earlier, this idea of splitting reductions for cache friendliness has been around since the original Goto paper (at least). I've chosen to ignore that legacy advice and enjoy the simplified mental model and immediate performance benefits, at the risk of missing out on some cache friendliness in some cases.

Absolutely. If you are having performance issues on important workloads that you have a hard time understanding, please do reach out to us and we might be able to help.

Thanks, it's good to have that possibility. What really helps me the most though is publicly available information that I can link to, read, and build my own improved understanding on. So nothing would be as impactful as if CPU vendors started publicly documenting more of the fine performance behavior of specific CPUs -- the current CPU software optimization guides are rather light on information, particularly on memory system performance. Without necessarily asking for a full model, at least understanding concretely which addresses will alias in the L1 cache would help, and I don't believe that this can be inferred from the limited cache documentation (e.g. just saying "4-way set associative VIPT etc" doesn't seem to allow me to infer that, or if it does, then I would be very interested to learn the formula).

Doesn't it? As per Wikipedia, having the index extend beyond the page boundary (e.g. beyond bit 12 for a 4KiB page) would require that bit to be both in the index and in the tag, since they could differ in the VIPT case. Also, given spatial locality, one could assume that consecutive cache lines do not map to the same set and thus that the index bits are the ones just after the offset within a cache line. It should also be possible to construct a test that checks this: if you make sure things fit in terms of cache size, you can check timing and determine whether a cache line was overwritten or not due to associativity.

I might have got things wrong, I just thought about it quickly.
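
For what it's worth, here is a back-of-the-envelope sketch assuming the publicly documented Neoverse-V1 L1D parameters (64KiB, 4-way set associative, 64-byte lines, hence 256 sets); the helper below is purely illustrative:

```c
#include <stdint.h>

// Back-of-the-envelope sketch assuming the documented Neoverse-V1 L1D parameters:
// 64KiB, 4-way set associative, 64-byte lines, hence 256 sets and an 8-bit set
// index taken from address bits [13:6] (bits 12-13 sit above the 4KiB page offset).
enum {
  kLineBytes = 64,
  kWays = 4,
  kCacheBytes = 64 * 1024,
  kSets = kCacheBytes / (kWays * kLineBytes),  // 256
};

static unsigned l1d_set_index(uintptr_t addr) {
  return (unsigned)((addr / kLineBytes) % kSets);  // address bits [13:6]
}

// Two cache lines can evict each other once more than kWays lines sharing
// l1d_set_index() are live at the same time.
```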

It's also great to hear that there exists a performance-accurate CPU simulator. It would be great if it could be publicly available, easy to install and run, and free of any strings attached that would limit how I can use what I learn from it in open source contexts.

Sure.

bjacob commented 9 months ago

Right - this is one of the dimensions in which I was expecting it not to matter much - like you say, the next load isn't blocked on previous instructions - but in addition to this, I was also thinking that speculative execution should hit the first load well ahead of actual execution, making even the first load fast even if the data it's loading wasn't already in cache --- the cache miss would be experienced only by speculative execution, not by actual execution.

I don't follow; speculative execution exists because it speeds up code execution. If your speculative execution is waiting on a cache miss, you are not speculating on something else (e.g. instructions using the result of that load) and your code ends up being slower (potentially as if you didn't have speculative execution for a short while).

Oh, that isn't what I was trying to say. Speculative execution doesn't wait on anything --- that's the point of speculative execution. On a cache miss, it just carries on without actual data, while initiating a cache refill in the background, so that hopefully the data will be in cache by the time actual execution needs it.

So I was just explaining another reason why this might not matter much. Not only are subsequent loads not blocked on the first load retiring, as you pointed out, but also the actual data movement from RAM is typically initiated earlier than even the first load in actual execution --- it is typically initiated by speculative execution / hardware prefetching.

RoboTux commented 8 months ago

Oh, that isn't what I was trying to say. Speculative execution doesn't wait on anything --- that's the point of speculative execution. On a cache miss, it just carries on without actual data, while initiating a cache refill in the background, so that hopefully the data will be in cache by the time actual execution needs it.

Sure, but only if there is no data dependency. It's not possible to execute an instruction that uses the result of a load as one of its operands if the load hasn't completed, unless you have value speculation as well. If there is a dependency, all speculation on things that depend on it will be stalled. At some point the buffer holding instructions with non-ready operands might be full, in which case I think no instruction can be issued to that unit even if all its operands are ready. That's only my understanding and I might be wrong.