Open dfyz opened 6 months ago
> an alternative approach that would make use of the cache is doing something like `let level = Ord::min(level, 3)`
I just tried that, but it appears to be trickier than I thought at first: `fs::read_dir()` doesn't guarantee any particular order, and on my system, `.../level4` comes before `.../level3`. Since ties are currently resolved by the cache line size (and they are all the same on my machine), the L3 cache "wins" and overwrites the last slot in the cache hierarchy.
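For illustration, here is a minimal, self-contained sketch of that idea (this is not the actual `gemm` code; the `CacheInfo` struct, its fields, and the sysfs parsing are made up for the example). It clamps the level with `Ord::min`, but breaks ties by total cache size instead of line size, so the `fs::read_dir()` order stops mattering:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Made-up stand-in for the per-level cache parameters; not gemm's real type.
#[derive(Clone, Copy, Debug, Default)]
struct CacheInfo {
    size_bytes: usize,
    line_bytes: usize,
}

// Parse sysfs sizes like "32K" or "128M" into bytes.
fn parse_size(s: &str) -> Option<usize> {
    let s = s.trim();
    if let Some(k) = s.strip_suffix('K') {
        k.parse::<usize>().ok().map(|v| v * 1024)
    } else if let Some(m) = s.strip_suffix('M') {
        m.parse::<usize>().ok().map(|v| v * 1024 * 1024)
    } else {
        s.parse().ok()
    }
}

fn read_sysfs(path: &Path, name: &str) -> Option<String> {
    fs::read_to_string(path.join(name)).ok()
}

// Probe /sys/devices/system/cpu/cpu0/cache/index*/ and fill a 3-slot hierarchy.
// Levels >= 3 are clamped into the last slot; ties are broken by keeping the
// larger cache, so the (unspecified) read_dir order no longer matters.
fn probe_caches(cache_dir: &Path) -> io::Result<[CacheInfo; 3]> {
    let mut info = [CacheInfo::default(); 3];

    for entry in fs::read_dir(cache_dir)? {
        let path = entry?.path();
        let is_index = path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| n.starts_with("index"));
        if !is_index {
            continue;
        }
        // Skip instruction caches; only data/unified caches matter for blocking.
        if read_sysfs(&path, "type").map_or(false, |t| t.trim() == "Instruction") {
            continue;
        }

        let Some(level) = read_sysfs(&path, "level").and_then(|s| s.trim().parse::<usize>().ok()) else {
            continue;
        };
        let Some(size_bytes) = read_sysfs(&path, "size").and_then(|s| parse_size(&s)) else {
            continue;
        };
        let Some(line_bytes) = read_sysfs(&path, "coherency_line_size")
            .and_then(|s| s.trim().parse::<usize>().ok())
        else {
            continue;
        };

        // The suggested clamp: an L4 (or deeper) cache maps into the last slot.
        let slot = Ord::min(level, 3).saturating_sub(1);
        let candidate = CacheInfo { size_bytes, line_bytes };

        // Tie-break by size so L4 beats L3 for the last slot, regardless of order.
        if candidate.size_bytes > info[slot].size_bytes {
            info[slot] = candidate;
        }
    }

    Ok(info)
}

fn main() -> io::Result<()> {
    let info = probe_caches(Path::new("/sys/devices/system/cpu/cpu0/cache"))?;
    println!("{info:?}");
    Ok(())
}
```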
With the `lscpu` code path (which preserves ordering), I noticed no performance improvement on a large matmul (more specifically, an 8K×8K×8K f32 NN GEMM). Either I did something wrong, or the larger macropanel size somehow doesn't help (I double-checked that it increased from 2736 to 8196).
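For reference, this is roughly how one could time such a matmul from `candle` (which is how I hit the crash in the first place); it is a hypothetical sketch assuming a `candle-core` dependency, not the exact benchmark behind the numbers above:

```rust
use std::time::Instant;

use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    let dev = Device::Cpu;

    // Two random 8192 x 8192 f32 matrices; CPU matmul is dispatched to gemm.
    let a = Tensor::randn(0f32, 1.0, (8192, 8192), &dev)?;
    let b = Tensor::randn(0f32, 1.0, (8192, 8192), &dev)?;

    // Warm-up run so one-time initialization doesn't skew the timing.
    let _ = a.matmul(&b)?;

    let start = Instant::now();
    let c = a.matmul(&b)?;
    let elapsed = start.elapsed();

    // 2 * M * N * K floating-point operations for an M x K by K x N multiply.
    let flops = 2.0 * 8192f64 * 8192.0 * 8192.0;
    println!(
        "{:?}, ~{:.1} GFLOP/s (checksum {})",
        elapsed,
        flops / elapsed.as_secs_f64() / 1e9,
        c.sum_all()?.to_scalar::<f32>()?,
    );
    Ok(())
}
```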
Perhaps it makes sense to merge the fix for the crashes first, and then think about exploiting the L4 cache. By the way, I also added an additional commit that prevents the `lscpu` code path from crashing (my bad, I completely forgot about it).
I recently found out that this is a thing when trying to run a `candle` program (which depends on `gemm`) on this machine:

The Linux-specific code path that probes cache sizes via `lscpu` and `sysfs` assumes that `level` can't be greater than 3, so without this PR anything using `gemm` crashes like this:

This PR fixes this by adding a guard identical to the one existing in the generic X86 cache size probing code.
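A minimal sketch of what such a guard amounts to (with a made-up `CacheInfo` type; the real layout in `gemm` differs): levels outside 1..=3 are simply ignored instead of indexing past the end of the fixed three-slot array.

```rust
#[derive(Clone, Copy, Debug, Default)]
struct CacheInfo {
    size_bytes: usize,
}

// Only levels 1..=3 map into the fixed three-slot hierarchy; anything else
// (e.g. the 128 MiB L4 reported on this machine) is skipped instead of
// causing an out-of-bounds panic.
fn record_cache(info: &mut [CacheInfo; 3], level: usize, entry: CacheInfo) {
    if (1..=3).contains(&level) {
        info[level - 1] = entry;
    }
}

fn main() {
    let mut info = [CacheInfo::default(); 3];
    record_cache(&mut info, 3, CacheInfo { size_bytes: 30 * 1024 * 1024 });
    record_cache(&mut info, 4, CacheInfo { size_bytes: 128 * 1024 * 1024 }); // ignored
    println!("{info:?}");
}
```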
(an interesting theoretical question is whether it is possible to somehow exploit this gigantic 128 MiB cache instead of ignoring it)