Closed abbriggs closed 1 month ago
Can confirm that I'm also seeing a ~5% average bump in PerformanceTest
when compiled with clang-cl (LLVM 17) and running on an AMD 7940HS.
Oddly enough I'm not seeing much of a difference at all when compiled with MSVC (v14.39-17.9).
Hello,
I did some comparisons on my i9-12900H (6 P, 8 E-cores) and the general gain seems to be around 5-10% (both on clang and MSVC). The gap seems to increase as the E-cores start doing their work:
(note that this data is just from a single run so there's some variation in it)
Nice work! 🍾
Summary
Currently, the bookkeeping variables used by `QuadTree` to access the storage of its `FixedSizeFreeList` allocator can reside in the same cache line as `atomic` variables which are modified on allocation. If one thread allocates a new object and another thread calls `mAllocator.Get()`, a cross-thread cache sync can occur, which causes a stall and reduces performance in multi-threaded use cases.

This PR reorders some variables in `QuadTree` and `FixedSizeFreeList` such that the hot path in most use cases (`FixedSizeFreeList::Get()`) should avoid loading unrelated/contentious `atomic` variables.

Performance Improvement
[Figure: `PerformanceTest` before changes (AMD RZ9-7950X)]

[Figure: `PerformanceTest` after changes (AMD RZ9-7950X)]

Testing across thread counts on modern desktop x86 processors using the built-in `PerformanceTest` (compiled with LLVM 17 via Clang-CL on Windows), this seems to give around a 5% average improvement in the general case. On other workloads I've tested which are heavily memory-bound, I've seen improvements of up to 15% at specific thread counts. This has also been tested on Apple silicon with no regressions found.
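
For readers unfamiliar with the cache-line issue described in the Summary, here is a minimal sketch of the idea. It is not the code from this PR: the struct and member names (`PackedFreeList`, `SeparatedFreeList`, `pages`, `page_shift`, `first_free`, `num_allocated`) are hypothetical, the 64-byte line size is an assumption that holds on most current x86 parts but not everywhere, and it uses `alignas` to force the separation, whereas the PR reorders existing members to get a similar effect.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line size (typical for current x86; not guaranteed elsewhere).
constexpr std::size_t kCacheLineSize = 64;

// True if two byte offsets fall in the same cache line, assuming the
// enclosing object itself starts on a cache-line boundary.
constexpr bool SameCacheLine(std::size_t a, std::size_t b)
{
    return a / kCacheLineSize == b / kCacheLineSize;
}

// Hypothetical "before" layout: the read-mostly bookkeeping used for lookups
// sits right next to counters that are written on every allocation, so a
// writer on one thread can invalidate the line a reader on another needs.
struct PackedFreeList
{
    void **                    pages = nullptr;     // read on lookup
    std::uint32_t              page_shift = 0;      // read on lookup
    std::atomic<std::uint32_t> first_free { 0 };    // written on allocation
    std::atomic<std::uint32_t> num_allocated { 0 }; // written on allocation
};

// Hypothetical "after" layout: the frequently written atomics start on their
// own cache line, so lookups no longer share a line with them.
struct SeparatedFreeList
{
    void **       pages = nullptr; // read on lookup
    std::uint32_t page_shift = 0;  // read on lookup

    alignas(kCacheLineSize) std::atomic<std::uint32_t> first_free { 0 };
    std::atomic<std::uint32_t> num_allocated { 0 };

    // Lookup path: touches only the read-mostly members above.
    void *GetPage(std::uint32_t index) const
    {
        return pages[index >> page_shift];
    }
};

// Compile-time checks of the two layouts.
static_assert(SameCacheLine(offsetof(PackedFreeList, pages),
                            offsetof(PackedFreeList, first_free)),
              "packed layout: lookup data and atomics share a line");
static_assert(!SameCacheLine(offsetof(SeparatedFreeList, pages),
                             offsetof(SeparatedFreeList, first_free)),
              "separated layout: lookup data and atomics are on different lines");
```

The trade-off is size: `alignas` pads `SeparatedFreeList` out to a multiple of the cache line size. Reordering members as this PR does can avoid the extra padding, but it depends on what else already sits between the hot fields and the atomics.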