On Intel CPU Skylake and newer __builtin_prefetch doesn't need to be protected with a null check

google / tcmalloc

Apache License 2.0

On Intel CPU Skylake and newer __builtin_prefetch doesn't need to be protected with a null check #67

Open goldsteinn opened 3 years ago

goldsteinn commented 3 years ago

See: https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linked_list.h#L48

From Intel Manual:

The cache hierarchy of the Skylake microarchitecture has the following enhancements:
• Higher Cache bandwidth compared to previous generations.
• Simultaneous handling of more loads and stores enabled by enlarged buffers.
• Processor can do two page walks in parallel compared to one in Haswell microarchitecture and earlier generations.
• Page split load penalty down from 100 cycles in previous generation to 5 cycles.
• L3 write bandwidth increased from 4 cycles per line in previous generation to 2 per line.
• Support for the CLFLUSHOPT instruction to flush cache lines and manage memory ordering of flushed data using SFENCE.
• Reduced performance penalty for a software prefetch that specifies a NULL pointer.
• L2 associativity changed from 8 ways to 4 ways.

I did a quick test on my machine and saw no DTLB_LOAD_MISSES events when prefetching NULL (or any address below 4096, for that matter).
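For readers who have not opened the linked file, the pattern in question looks roughly like the sketch below. This is not the tcmalloc source: `ListNext`, `ListPush` and `ListPop` are hypothetical names for an intrusive singly linked free list, written only to show where the null check sits relative to `__builtin_prefetch` (a GCC/Clang builtin).

```cpp
#include <cassert>

// Hypothetical intrusive free list: the first word of every free object
// stores the pointer to the next free object.
inline void* ListNext(void* node) { return *static_cast<void**>(node); }

inline void ListPush(void** head, void* node) {
  *static_cast<void**>(node) = *head;
  *head = node;
}

inline void* ListPop(void** head) {
  assert(head != nullptr && *head != nullptr);
  void* result = *head;
  void* next = ListNext(result);
  *head = next;
  // The guard under discussion: skip the prefetch when the list is now
  // empty, so that no prefetch of NULL is ever issued.
  if (next != nullptr) {
    __builtin_prefetch(next, /*rw=*/0, /*locality=*/3);
  }
  return result;
}
```

If the Intel note applies, the `if (next != nullptr)` branch is the only thing that could be dropped on Skylake and newer; the rest of the pop is unaffected.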

rojkov commented 3 years ago

I wonder what the best way is to detect Skylake and newer CPUs at compile time. GCC seems to set some useful defines:

$ gcc -march=native -dM -E - < /dev/null | grep -i skylake
#define __skylake_avx512__ 1
#define __skylake_avx512 1
#define __tune_skylake_avx512__ 1

But clang doesn't do that. Both compilers set defines for Skylake features like __AVX512VL__ though.

goldsteinn commented 3 years ago

I think that will only cover Skylake server. Compare:

$ gcc -march=skylake -dM -E - < /dev/null | grep -i skylake
#define __skylake__ 1
#define __tune_skylake__ 1
#define __skylake 1

vs.

$ gcc -march=skylake-avx512 -dM -E - < /dev/null | grep -i skylake
#define __skylake_avx512__ 1
#define __skylake_avx512 1
#define __tune_skylake_avx512__ 1

But you could probably hack it together with something along the lines of:

// The !defined(__knl__) check can be dropped if Knights Landing also optimizes out prefetch NULL.
#if (defined(__skylake__) || defined(__AVX512F__)) && !defined(__knl__)
#define USE_PREFETCH_NULL 1
#endif

AFAIK any non-Knights Landing microarchitecture with AVX512 is Skylake or newer, so checking __AVX512F__ or __skylake__ should work. There may be something I'm missing, though.
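A minimal sketch of how such a macro might be consumed. `PrefetchMaybeNull` is a hypothetical helper, not an existing tcmalloc function, and the macro condition carries the same unverified assumption as above (that these GCC defines really identify "Skylake or newer"). The helper returns whether a prefetch was issued purely so the behavior can be checked; a real version would likely return void.

```cpp
// Assumption: these GCC-defined macros are a good enough proxy for
// "Skylake or newer". Clang may not define __skylake__ at all.
#if (defined(__skylake__) || defined(__AVX512F__)) && !defined(__knl__)
#define USE_PREFETCH_NULL 1
#else
#define USE_PREFETCH_NULL 0
#endif

// Hypothetical helper: issues a read prefetch for p and reports whether
// one was issued. The null check is compiled out when USE_PREFETCH_NULL
// is set.
inline bool PrefetchMaybeNull(const void* p) {
#if USE_PREFETCH_NULL
  __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);
  return true;
#else
  if (p == nullptr) return false;
  __builtin_prefetch(p, /*rw=*/0, /*locality=*/3);
  return true;
#endif
}
```

Note that this only helps binaries built with a matching -march; a generic x86-64 build would keep the null check.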

That said, rather than a GCC hack, the best bet is probably to use CPUID.
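A rough sketch of what a CPUID-based check could look like, using `__get_cpuid` from GCC/Clang's `<cpuid.h>`. All function names here are hypothetical. The model list (0x4E and 0x5E for Skylake client, 0x55 for Skylake server) is an illustrative assumption and deliberately incomplete: it does not cover the "and newer" part, which would need an explicit list of later models because model numbers are not ordered by generation.

```cpp
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang wrapper around the CPUID instruction
#endif

struct CpuSignature {
  uint32_t family;
  uint32_t model;
};

// Decodes the display family/model from CPUID leaf 1 EAX.
inline CpuSignature DecodeSignature(uint32_t eax) {
  uint32_t family = (eax >> 8) & 0xF;
  uint32_t model = (eax >> 4) & 0xF;
  if (family == 6 || family == 15) model |= ((eax >> 16) & 0xF) << 4;
  if (family == 15) family += (eax >> 20) & 0xFF;
  return {family, model};
}

// Assumption: an illustrative, incomplete list of Skylake models.
inline bool IsSkylakeModel(const CpuSignature& s) {
  return s.family == 6 &&
         (s.model == 0x4E || s.model == 0x5E || s.model == 0x55);
}

// Runtime check; returns false on non-x86 targets and non-Intel vendors.
inline bool CurrentCpuIsSkylake() {
#if defined(__x86_64__) || defined(__i386__)
  unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) == 0) return false;
  // Vendor string "GenuineIntel" is returned in EBX, EDX, ECX.
  if (ebx != 0x756e6547u || edx != 0x49656e69u || ecx != 0x6c65746eu) {
    return false;
  }
  if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0) return false;
  return IsSkylakeModel(DecodeSignature(eax));
#else
  return false;
#endif
}
```

The result would typically be computed once at startup and cached, since a runtime branch on it replaces the null check rather than removing a branch outright.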