hkratz closed this 3 years ago
> I am skeptical that a prefetch is useful when reading in memory sequentially. The only case where I think it can help is when crossing pages, but only if you prefetch really far, and you don't need to prefetch very often (only every 4kB).
I was skeptical as well, but it seems that the hardware prefetcher on my machine (Comet Lake) is deficient. Prefetching makes a measurable, consistent positive difference, if the algorithm is fed non-ASCII data. The difference is more pronounced if the data is not 64-byte aligned (64 bytes is both the cache line size of current x86-64 machines and the block size of the algorithm). On AMD Zen 2, prefetching is a tiny bit slower.
The only difference between the two columns in the following benchmarks is that the call to the prefetch intrinsic is commented out.
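To make the comparison concrete, here is a minimal sketch of a block loop with an optional software prefetch. It is not the code that was benchmarked: the function `count_non_ascii`, the `use_prefetch` flag, and the 256-byte `PREFETCH_DISTANCE` are hypothetical, and counting non-ASCII bytes only stands in for the real per-block work. The one real API used is `_mm_prefetch` from `core::arch::x86_64`.

```rust
/// Block size of the loop; 64 bytes matches the cache line size mentioned above.
const BLOCK: usize = 64;
/// Hypothetical prefetch distance in bytes (an assumption, not from the thread).
const PREFETCH_DISTANCE: usize = 256;

#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch(ptr: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // A prefetch is only a hint: it does not fault, even if `ptr` points
    // past the end of the buffer.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) }
}

#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch(_ptr: *const u8) {
    // No-op on other architectures so the sketch still compiles.
}

/// Walks `data` in 64-byte blocks and counts bytes >= 0x80.
/// When `use_prefetch` is true, one prefetch is issued per block.
pub fn count_non_ascii(data: &[u8], use_prefetch: bool) -> usize {
    let mut count = 0;
    for block in data.chunks(BLOCK) {
        if use_prefetch {
            // wrapping_add avoids undefined behavior when the target
            // address lies beyond the end of the allocation.
            prefetch(block.as_ptr().wrapping_add(PREFETCH_DISTANCE));
        }
        count += block.iter().filter(|&&b| b >= 0x80).count();
    }
    count
}
```

Toggling `use_prefetch` changes only whether the hint is issued, never the result, which mirrors the "comment out one call" setup of the benchmarks.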
```
group                with_prefetch_0mod64                   without_prefetch_0mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.00      3.8±0.25µs    16.0 GB/sec    1.05      4.0±0.22µs    15.2 GB/sec
2-cyrillic/131072    1.00      7.6±0.47µs    16.1 GB/sec    1.05      8.0±0.57µs    15.3 GB/sec
3-chinese/065538     1.00      3.8±0.10µs    16.2 GB/sec    1.07      4.0±0.32µs    15.2 GB/sec
3-chinese/131072     1.00      7.6±0.43µs    16.0 GB/sec    1.05      8.0±0.54µs    15.2 GB/sec
4-emoji/065538       1.00      3.8±0.21µs    16.0 GB/sec    1.06      4.0±0.31µs    15.1 GB/sec
4-emoji/131072       1.00      7.6±0.48µs    16.0 GB/sec    1.04      8.0±0.33µs    15.3 GB/sec
```

```
group                with_prefetch_8mod64                   without_prefetch_8mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.00      4.0±0.20µs    15.2 GB/sec    1.14      4.6±0.09µs    13.3 GB/sec
2-cyrillic/131072    1.00      8.0±0.50µs    15.2 GB/sec    1.15      9.2±0.38µs    13.2 GB/sec
3-chinese/065538     1.00      4.0±0.19µs    15.1 GB/sec    1.14      4.6±0.11µs    13.3 GB/sec
3-chinese/131072     1.00      8.0±0.44µs    15.2 GB/sec    1.14      9.1±0.18µs    13.4 GB/sec
4-emoji/065538       1.00      4.0±0.18µs    15.2 GB/sec    1.13      4.5±0.07µs    13.5 GB/sec
4-emoji/131072       1.00      8.0±0.44µs    15.3 GB/sec    1.14      9.1±0.23µs    13.5 GB/sec
```

```
group                with_prefetch_0mod64                   without_prefetch_0mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.02      3.7±0.01µs    16.6 GB/sec    1.00      3.6±0.01µs    16.9 GB/sec
2-cyrillic/131072    1.01      7.3±0.04µs    16.7 GB/sec    1.00      7.2±0.01µs    16.9 GB/sec
3-chinese/065538     1.01      3.6±0.02µs    16.7 GB/sec    1.00      3.6±0.01µs    16.9 GB/sec
3-chinese/131072     1.00      7.3±0.02µs    16.8 GB/sec    1.00      7.2±0.01µs    16.9 GB/sec
4-emoji/065538       1.00      3.6±0.00µs    16.9 GB/sec    1.00      3.6±0.01µs    16.9 GB/sec
4-emoji/131072       1.01      7.2±0.01µs    16.8 GB/sec    1.00      7.2±0.02µs    16.9 GB/sec
```

```
group                with_prefetch_8mod64                   without_prefetch_8mod64
-----                --------------------                   -----------------------
2-cyrillic/065536    1.00      3.6±0.00µs    16.9 GB/sec    1.01      3.6±0.01µs    16.9 GB/sec
2-cyrillic/131072    1.00      7.2±0.03µs    16.9 GB/sec    1.00      7.2±0.01µs    16.9 GB/sec
3-chinese/065538     1.00      3.6±0.01µs    16.9 GB/sec    1.00      3.6±0.01µs    16.8 GB/sec
3-chinese/131072     1.00      7.2±0.01µs    16.9 GB/sec    1.00      7.2±0.02µs    16.9 GB/sec
4-emoji/065538       1.00      3.6±0.00µs    16.9 GB/sec    1.00      3.6±0.01µs    16.8 GB/sec
4-emoji/131072       1.00      7.2±0.01µs    16.9 GB/sec    1.00      7.2±0.02µs    16.9 GB/sec
```
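The group suffixes `0mod64` and `8mod64` presumably refer to the start address of the input modulo 64; that reading is an assumption based on the alignment remark earlier in this comment. Under that assumption, a benchmark input with a chosen misalignment could be carved out of an over-allocated buffer roughly like this (the helper `slice_at_mod64` is hypothetical, not part of the benchmarked code):

```rust
/// Returns a subslice of `buf` with length `len` whose start address is
/// congruent to `offset` modulo 64. `buf` should be at least `len + 63`
/// bytes long so the slice always fits; otherwise indexing panics.
pub fn slice_at_mod64(buf: &[u8], offset: usize, len: usize) -> &[u8] {
    assert!(offset < 64, "offset must be in 0..64");
    let addr = buf.as_ptr() as usize;
    // Number of bytes to skip so that (addr + skip) % 64 == offset.
    let skip = (offset + 64 - addr % 64) % 64;
    &buf[skip..skip + len]
}
```

With a backing buffer of `len + 64` bytes, `slice_at_mod64(&backing, 0, len)` yields a cache-line-aligned input and `slice_at_mod64(&backing, 8, len)` yields one that straddles cache lines in every 64-byte block.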
> Prefetching makes a measurable, consistent positive difference, if the algorithm is fed non-ASCII data.
If loads were lagging, would you not expect this effect to first hit ASCII which is faster and thus more memory bound?
> If loads were lagging, would you not expect this effect to first hit ASCII which is faster and thus more memory bound?
My current hypothesis is that the hardware prefetcher is just working a lot better for the pure ASCII loop, but I don't know why.
What I have checked so far:
I am skeptical that a prefetch is useful when reading in memory sequentially. The only case where I think it can help is when crossing pages, but only if you prefetch really far, and you don't need to prefetch very often (only every 4kB).
https://lemire.me/blog/2018/04/30/is-software-prefetching-__builtin_prefetch-useful-for-performance/
A software prefetch is not free. It counts as an instruction, and it requires work to complete.
If you are not reading sequentially, then it is something else...
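For illustration, the "only every 4kB" idea could look like the sketch below: a sequential block loop that issues a single prefetch whenever it enters a new page, aimed one page ahead. The one-page distance, the 64-byte block size, and the function `sum_with_page_prefetch` are assumptions made for this example; summing bytes is a stand-in for real work, and the second return value exists only to show how rarely the hint is issued.

```rust
const BLOCK: usize = 64;
const PAGE: usize = 4096;

#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch(ptr: *const u8) {
    use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // A prefetch is only a hint and does not fault on invalid addresses.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) }
}

#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch(_ptr: *const u8) {
    // No-op on other architectures so the sketch still compiles.
}

/// Sums all bytes of `data` in 64-byte blocks. A prefetch is issued only
/// for the first block that starts inside each 4 KiB page.
/// Returns (sum, number of prefetches issued).
pub fn sum_with_page_prefetch(data: &[u8]) -> (u64, usize) {
    let mut sum = 0u64;
    let mut issued = 0usize;
    for block in data.chunks(BLOCK) {
        let addr = block.as_ptr() as usize;
        // True for exactly one block per page when walking in 64-byte steps.
        if addr % PAGE < BLOCK {
            // Target one page ahead; wrapping_add keeps the pointer
            // arithmetic well-defined past the end of the allocation.
            prefetch(block.as_ptr().wrapping_add(PAGE));
            issued += 1;
        }
        sum += block.iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    (sum, issued)
}
```

For a 16 KiB input this issues 4 prefetch instructions instead of 256, which keeps the instruction overhead mentioned above close to zero.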