Add prefetch code, configurable, disabled right now

lemire commented 3 years ago

I am skeptical that a prefetch is useful when reading in memory sequentially. The only case where I think it can help is when crossing pages, but only if you prefetch really far, and you don't need to prefetch very often (only every 4kB).

https://lemire.me/blog/2018/04/30/is-software-prefetching-__builtin_prefetch-useful-for-performance/
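To make the page-crossing idea concrete, here is a minimal sketch, not code from this PR: it walks a buffer in 64-byte blocks and issues one prefetch per 4 kB page, a configurable number of pages ahead. The names `sum_prefetch_per_page`, `pages_ahead`, and the 4096-byte page size are assumptions for illustration; summing bytes stands in for the real work.

```rust
const PAGE: usize = 4096; // assumed page size (4 kB, as in the comment above)
const BLOCK: usize = 64; // one cache line per iteration

#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch(ptr: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: a prefetch is only a hint and does not fault on invalid addresses.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
}

#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch(_ptr: *const u8) {
    // No-op fallback so the sketch compiles on other targets.
}

/// Sums all bytes block by block, prefetching once per page, `pages_ahead` pages away.
pub fn sum_prefetch_per_page(data: &[u8], pages_ahead: usize) -> u64 {
    let base = data.as_ptr();
    let mut total = 0u64;
    for (i, block) in data.chunks(BLOCK).enumerate() {
        let offset = i * BLOCK;
        // Blocks are 64 bytes apart, so exactly one block per page starts
        // within the first 64 bytes of that page.
        if (base as usize).wrapping_add(offset) % PAGE < BLOCK {
            // wrapping_add: the target may lie past the end of the buffer,
            // and the pointer is never dereferenced.
            prefetch(base.wrapping_add(offset + pages_ahead * PAGE));
        }
        total += block.iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    total
}
```

With this shape the prefetch instruction runs once per 64 blocks rather than once per block, which keeps its instruction cost small. Whether it helps at all still has to be measured on the target machine.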

A software prefetch is not free. It counts as an instruction, and it requires work to complete.

If you are not reading sequentially, then it is something else...

hkratz commented 3 years ago

> I am skeptical that a prefetch is useful when reading in memory sequentially. The only case where I think it can help is when crossing pages, but only if you prefetch really far, and you don't need to prefetch very often (only every 4kB).

I was skeptical as well but it seems that the hardware prefetcher on my machine (Comet Lake) is deficient. Prefetching makes a measurable, consistent positive difference, if the algorithm is fed non-ASCII data. The difference is more pronounced if the data does not have 64-byte alignment (which is the cache line size of current x86-64 machines and the block size of the algorithm). On AMD Zen 2 prefetching is a tiny bit slower.
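For anyone who wants to reproduce the aligned and misaligned cases, one way to get an input at a chosen offset modulo 64 is to over-allocate and slice. This is a sketch under that assumption, not the benchmark harness used here; `slice_at_alignment` is a hypothetical helper name.

```rust
/// Returns a `len`-byte sub-slice of `backing` whose start address is
/// congruent to `offset` modulo `modulus`, or `None` if it does not fit.
pub fn slice_at_alignment(
    backing: &[u8],
    modulus: usize,
    offset: usize,
    len: usize,
) -> Option<&[u8]> {
    if modulus == 0 || offset >= modulus {
        return None;
    }
    let addr = backing.as_ptr() as usize;
    // Number of bytes to skip so that (addr + skip) % modulus == offset.
    let skip = (offset + modulus - addr % modulus) % modulus;
    backing.get(skip..skip.checked_add(len)?)
}
```

Allocating `len + modulus` bytes for `backing` is always enough, since at most `modulus - 1` bytes are skipped.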

The only difference in the following benchmarks is commenting out the call to the prefetch intrinsic.
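Roughly, the two variants being compared have the following shape. This is a simplified sketch, not the code in this PR: the const generic `PREFETCH` flag, the `distance` parameter, and `count_non_ascii` (a stand-in for the real per-block validation) are all assumptions for illustration.

```rust
const BLOCK: usize = 64; // block size of the algorithm and cache line size

#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch(ptr: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // SAFETY: a prefetch is only a hint and does not fault on invalid addresses.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
}

#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch(_ptr: *const u8) {
    // No-op fallback so the sketch compiles on other targets.
}

/// Counts non-ASCII bytes block by block. With `PREFETCH = true`, each
/// iteration also prefetches the address `distance` bytes ahead of the block.
pub fn count_non_ascii<const PREFETCH: bool>(data: &[u8], distance: usize) -> usize {
    let base = data.as_ptr();
    let mut count = 0;
    for (i, block) in data.chunks(BLOCK).enumerate() {
        // The flag is a compile-time constant, so the disabled variant
        // contains no prefetch instruction at all.
        if PREFETCH {
            prefetch(base.wrapping_add(i * BLOCK + distance));
        }
        count += block.iter().filter(|&&b| b >= 0x80).count();
    }
    count
}
```

Calling `count_non_ascii::<false>(bytes, 256)` corresponds to the "without prefetch" column and `count_non_ascii::<true>(bytes, 256)` to the "with prefetch" column; both must return the same result. The distance of 256 is an arbitrary example value.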

Intel Comet Lake, aligned to 0 mod 64

group                with_prefetch_0mod64                   without_prefetch_0mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.00      3.8±0.25µs    16.0 GB/sec    1.05      4.0±0.22µs    15.2 GB/sec
2-cyrillic/131072    1.00      7.6±0.47µs    16.1 GB/sec    1.05      8.0±0.57µs    15.3 GB/sec
3-chinese/065538     1.00      3.8±0.10µs    16.2 GB/sec    1.07      4.0±0.32µs    15.2 GB/sec
3-chinese/131072     1.00      7.6±0.43µs    16.0 GB/sec    1.05      8.0±0.54µs    15.2 GB/sec
4-emoji/065538       1.00      3.8±0.21µs    16.0 GB/sec    1.06      4.0±0.31µs    15.1 GB/sec
4-emoji/131072       1.00      7.6±0.48µs    16.0 GB/sec    1.04      8.0±0.33µs    15.3 GB/sec

Intel Comet Lake, misaligned to 8 mod 64

group                with_prefetch_8mod64                   without_prefetch_8mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.00      4.0±0.20µs    15.2 GB/sec    1.14      4.6±0.09µs    13.3 GB/sec
2-cyrillic/131072    1.00      8.0±0.50µs    15.2 GB/sec    1.15      9.2±0.38µs    13.2 GB/sec
3-chinese/065538     1.00      4.0±0.19µs    15.1 GB/sec    1.14      4.6±0.11µs    13.3 GB/sec
3-chinese/131072     1.00      8.0±0.44µs    15.2 GB/sec    1.14      9.1±0.18µs    13.4 GB/sec
4-emoji/065538       1.00      4.0±0.18µs    15.2 GB/sec    1.13      4.5±0.07µs    13.5 GB/sec
4-emoji/131072       1.00      8.0±0.44µs    15.3 GB/sec    1.14      9.1±0.23µs    13.5 GB/sec

AMD Zen 2, aligned to 0 mod 64

group                with_prefetch_0mod64                   without_prefetch_0mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.02      3.7±0.01µs    16.6 GB/sec    1.00      3.6±0.01µs    16.9 GB/sec
2-cyrillic/131072    1.01      7.3±0.04µs    16.7 GB/sec    1.00      7.2±0.01µs    16.9 GB/sec
3-chinese/065538     1.01      3.6±0.02µs    16.7 GB/sec    1.00      3.6±0.01µs    16.9 GB/sec
3-chinese/131072     1.00      7.3±0.02µs    16.8 GB/sec    1.00      7.2±0.01µs    16.9 GB/sec
4-emoji/065538       1.00      3.6±0.00µs    16.9 GB/sec    1.00      3.6±0.01µs    16.9 GB/sec
4-emoji/131072       1.01      7.2±0.01µs    16.8 GB/sec    1.00      7.2±0.02µs    16.9 GB/sec

AMD Zen 2, misaligned to 8 mod 64

group                with_prefetch_8mod64                   without_prefetch_8mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.00      3.6±0.00µs    16.9 GB/sec    1.01      3.6±0.01µs    16.9 GB/sec
2-cyrillic/131072    1.00      7.2±0.03µs    16.9 GB/sec    1.00      7.2±0.01µs    16.9 GB/sec
3-chinese/065538     1.00      3.6±0.01µs    16.9 GB/sec    1.00      3.6±0.01µs    16.8 GB/sec
3-chinese/131072     1.00      7.2±0.01µs    16.9 GB/sec    1.00      7.2±0.02µs    16.9 GB/sec
4-emoji/065538       1.00      3.6±0.00µs    16.9 GB/sec    1.00      3.6±0.01µs    16.8 GB/sec
4-emoji/131072       1.00      7.2±0.01µs    16.9 GB/sec    1.00      7.2±0.02µs    16.9 GB/sec
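A note on reading the throughput column: the "GB/sec" figures appear to be in units of 2^30 bytes rather than 10^9 bytes. That is an inference from the numbers, not something stated in the output. The small sketch below (function names are illustrative) redoes the arithmetic for a 65536-byte input from the rounded mean times shown.

```rust
/// Throughput in units of 2^30 bytes per second.
pub fn gib_per_sec(bytes: u64, micros: f64) -> f64 {
    let seconds = micros * 1e-6;
    bytes as f64 / seconds / (1u64 << 30) as f64
}

/// Throughput in units of 10^9 bytes per second, for comparison.
pub fn gb_per_sec(bytes: u64, micros: f64) -> f64 {
    let seconds = micros * 1e-6;
    bytes as f64 / seconds / 1e9
}
```

For 65536 bytes in 3.8 µs, `gib_per_sec` gives about 16.06, in line with the 16.0 to 16.2 reported for Comet Lake with prefetch, while `gb_per_sec` gives about 17.25. For 4.6 µs the binary figure is about 13.27, matching the 13.3 in the misaligned table. The means are rounded to one decimal, so the last digit can differ.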

lemire commented 3 years ago

> Prefetching makes a measurable, consistent positive difference, if the algorithm is fed non-ASCII data.

If loads were lagging, would you not expect this effect to first hit ASCII which is faster and thus more memory bound?

hkratz commented 3 years ago

> If loads were lagging, would you not expect this effect to first hit ASCII which is faster and thus more memory bound?

My current hypothesis is that the hardware prefetcher is just working a lot better for the pure ASCII loop, but I don't know why.

What I have checked so far:

I have confirmed by looking at the performance counters in VTune that there are many more L1 data cache misses without the prefetch.
I have looked at the assembly and there are no other loads or stores in the non-ASCII loop, just the two 32-byte loads followed by the calculation.

rusticstuff / simdutf8