Open goastler opened 1 month ago
Yes, it will differ per CPU: cache sizes vary, so the cache hit rate for a given unroll step size will differ too. Unroll too much and the extra instructions outweigh the benefits; unroll too little and you don't gain the cache locality or the reduction in branch overhead.
E.g. an unroll factor of 8 works nicely on my PC, 4 less so, and 16 very much less so.
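Since the best factor depends on the machine, the practical way to pick one is to measure it. Below is a rough sketch of how that comparison could be set up. It is not the code from this project: `make_unrolled_sum` and `benchmark` are hypothetical names made up for illustration, and Python is used only to keep it short. In an interpreter the timings mostly reflect loop overhead rather than cache behaviour, so treat the numbers as a demonstration of the method, not as evidence for any particular factor.

```python
import timeit


def make_unrolled_sum(factor):
    """Return a sum function whose main loop handles `factor` elements per iteration.

    The function body is generated as source text so that the loop is genuinely
    unrolled (no inner loop over the factor). Hypothetical helper for illustration.
    """
    terms = " + ".join(f"data[i + {k}]" for k in range(factor))
    src = (
        "def sum_unrolled(data):\n"
        "    n = len(data)\n"
        f"    limit = n - n % {factor}\n"  # largest multiple of factor <= n
        "    total = 0\n"
        f"    for i in range(0, limit, {factor}):\n"
        f"        total += {terms}\n"  # unrolled body
        "    for i in range(limit, n):\n"  # leftover elements
        "        total += data[i]\n"
        "    return total\n"
    )
    namespace = {}
    exec(src, namespace)
    return namespace["sum_unrolled"]


def benchmark(factors=(1, 4, 8, 16), size=100_000, repeats=5):
    """Time each unroll factor on the same data; return {factor: best seconds}."""
    data = list(range(size))
    results = {}
    for factor in factors:
        fn = make_unrolled_sum(factor)
        # Check correctness before timing anything.
        assert fn(data) == sum(data)
        # Take the minimum over several runs to reduce noise.
        results[factor] = min(
            timeit.repeat(lambda: fn(data), number=10, repeat=repeats)
        )
    return results


if __name__ == "__main__":
    for factor, seconds in benchmark().items():
        print(f"unroll x{factor}: {seconds:.4f} s")
```

Running this on a few different machines (or porting the same harness to the language the hot loop is actually written in) should show whether the sweet spot moves, which is the only reliable way to choose the factor.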