Open dfyz opened 6 months ago
> an alternative approach that would make use of the cache is doing something like `let level = Ord::min(level, 3)`
I just tried that, but it appears to be trickier than I thought at first: `fs::read_dir()` doesn't guarantee any particular order, and on my system, `.../level4` comes before `.../level3`. Since ties are currently resolved by the cache line size (and they are all the same on my machine), the L3 cache "wins" and overwrites the last slot in the cache hierarchy.
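For illustration, here is a minimal, self-contained sketch of that idea (this is not the actual `gemm` code; the `CacheInfo` struct, its fields, and the sysfs parsing are made up for the example). It clamps the level with `Ord::min`, but breaks ties by total cache size instead of line size, so the `fs::read_dir()` order stops mattering:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Made-up stand-in for the per-level cache parameters; not gemm's real type.
#[derive(Clone, Copy, Debug, Default)]
struct CacheInfo {
    size_bytes: usize,
    line_bytes: usize,
}

// Parse sysfs sizes like "32K" or "128M" into bytes.
fn parse_size(s: &str) -> Option<usize> {
    let s = s.trim();
    if let Some(k) = s.strip_suffix('K') {
        k.parse::<usize>().ok().map(|v| v * 1024)
    } else if let Some(m) = s.strip_suffix('M') {
        m.parse::<usize>().ok().map(|v| v * 1024 * 1024)
    } else {
        s.parse().ok()
    }
}

fn read_sysfs(path: &Path, name: &str) -> Option<String> {
    fs::read_to_string(path.join(name)).ok()
}

// Probe /sys/devices/system/cpu/cpu0/cache/index*/ and fill a 3-slot hierarchy.
// Levels >= 3 are clamped into the last slot; ties are broken by keeping the
// larger cache, so the (unspecified) read_dir order no longer matters.
fn probe_caches(cache_dir: &Path) -> io::Result<[CacheInfo; 3]> {
    let mut info = [CacheInfo::default(); 3];

    for entry in fs::read_dir(cache_dir)? {
        let path = entry?.path();
        let is_index = path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| n.starts_with("index"));
        if !is_index {
            continue;
        }
        // Skip instruction caches; only data/unified caches matter for blocking.
        if read_sysfs(&path, "type").map_or(false, |t| t.trim() == "Instruction") {
            continue;
        }

        let Some(level) = read_sysfs(&path, "level").and_then(|s| s.trim().parse::<usize>().ok()) else {
            continue;
        };
        let Some(size_bytes) = read_sysfs(&path, "size").and_then(|s| parse_size(&s)) else {
            continue;
        };
        let Some(line_bytes) = read_sysfs(&path, "coherency_line_size")
            .and_then(|s| s.trim().parse::<usize>().ok())
        else {
            continue;
        };

        // The suggested clamp: an L4 (or deeper) cache maps into the last slot.
        let slot = Ord::min(level, 3).saturating_sub(1);
        let candidate = CacheInfo { size_bytes, line_bytes };

        // Tie-break by size so L4 beats L3 for the last slot, regardless of order.
        if candidate.size_bytes > info[slot].size_bytes {
            info[slot] = candidate;
        }
    }

    Ok(info)
}

fn main() -> io::Result<()> {
    let info = probe_caches(Path::new("/sys/devices/system/cpu/cpu0/cache"))?;
    println!("{info:?}");
    Ok(())
}
```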
With the `lscpu` code path (which preserves ordering), I noticed no performance improvement on a large matmul (more specifically, an 8K×8K×8K f32 NN GEMM). Either I did something wrong, or the larger macropanel size somehow doesn't help (I double-checked that it increased from 2736 to 8196).
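For reference, this is roughly how one could time such a matmul from `candle` (which is how I hit the crash in the first place); it is a hypothetical sketch assuming a `candle-core` dependency, not the exact benchmark behind the numbers above:

```rust
use std::time::Instant;

use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;

    // Two random 8192 x 8192 f32 matrices; CPU matmul is dispatched to gemm.
    let a = Tensor::randn(0f32, 1.0, (8192, 8192), &dev)?;
    let b = Tensor::randn(0f32, 1.0, (8192, 8192), &dev)?;

    // Warm-up run so one-time initialization doesn't skew the timing.
    let _ = a.matmul(&b)?;

    let start = Instant::now();
    let c = a.matmul(&b)?;
    let elapsed = start.elapsed();

    // 2 * M * N * K floating-point operations for an M x K by K x N multiply.
    let flops = 2.0 * 8192f64 * 8192.0 * 8192.0;
    println!(
        "{:?}, ~{:.1} GFLOP/s (checksum {})",
        elapsed,
        flops / elapsed.as_secs_f64() / 1e9,
        c.sum_all()?.to_scalar::<f32>()?,
    );
    Ok(())
}
```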
Perhaps it makes sense to merge the fix for the crashes first, and then think about exploiting the L4 cache. By the way, I also added an additional commit that prevents the `lscpu` code path from crashing (my bad, I completely forgot about it).
I recently found out that this is a thing when trying to run a `candle` program (which depends on `gemm`) on this machine:

The Linux-specific code path that probes cache sizes via `lscpu` and `sysfs` assumes that `level` can't be greater than 3, so without this PR anything using `gemm` crashes like this:

This PR fixes this by adding a guard identical to the one existing in the generic X86 cache size probing code.
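A minimal sketch of what such a guard amounts to (with a made-up `CacheInfo` type; the real layout in `gemm` differs): levels outside 1..=3 are simply ignored instead of indexing past the end of the fixed three-slot array.

```rust
#[derive(Clone, Copy, Debug, Default)]
struct CacheInfo {
    size_bytes: usize,
}

// Only levels 1..=3 map into the fixed three-slot hierarchy; anything else
// (e.g. the 128 MiB L4 reported on this machine) is skipped instead of
// causing an out-of-bounds panic.
fn record_cache(info: &mut [CacheInfo; 3], level: usize, entry: CacheInfo) {
    if (1..=3).contains(&level) {
        info[level - 1] = entry;
    }
}

fn main() {
    let mut info = [CacheInfo::default(); 3];
    record_cache(&mut info, 3, CacheInfo { size_bytes: 30 * 1024 * 1024 });
    record_cache(&mut info, 4, CacheInfo { size_bytes: 128 * 1024 * 1024 }); // ignored
    println!("{info:?}");
}
```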
(an interesting theoretical question is whether it is possible to somehow exploit this gigantic 128 MiB cache instead of ignoring it)