jrouwe / JoltPhysics

A multi core friendly rigid body physics and collision detection library. Written in C++. Suitable for games and VR applications. Used by Horizon Forbidden West.
MIT License
6k stars 374 forks

QuadTree / FixedSizeFreeList: Reorder variable layout to reduce false sharing & thread syncs #1136

Closed: abbriggs closed this 1 month ago

abbriggs commented 1 month ago

Summary

Currently, the bookkeeping variables that QuadTree uses to access the storage of its FixedSizeFreeList allocator can reside in the same cache line as atomic variables that are modified on allocation. If one thread allocates a new object while another thread calls mAllocator.Get(), the write invalidates the reader's copy of that cache line (false sharing). The reader then stalls on a cross-thread cache sync, which reduces performance in multi-threaded use cases.

This PR reorders some variables in QuadTree and FixedSizeFreeList such that the hot path in most use cases (FixedSizeFreeList::Get()) should avoid loading unrelated/contentious atomic variables.
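For readers unfamiliar with the technique, here is a minimal sketch of this kind of reordering. It is not the actual diff from this PR: the struct names, member names, and the 64-byte cache line size are all assumptions chosen for illustration, and the real FixedSizeFreeList has more members. The idea is to group the read-mostly fields used by the lookup path at the front, and push the atomics written on allocation onto their own cache line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical on x86-64; other CPUs may differ).
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical "before" layout: fields read by a Get()-style lookup sit
// right next to atomics that are written on every allocation.
struct FreeListBefore
{
    std::atomic<std::uint32_t> mFirstFreeObject{0}; // written on allocate/free
    std::atomic<std::uint32_t> mNumFreeObjects{0};  // written on allocate/free
    void **                    mPages = nullptr;    // read by the lookup path
    std::uint32_t              mPageShift = 0;      // read by the lookup path
    std::uint32_t              mObjectMask = 0;     // read by the lookup path
};

// Hypothetical "after" layout: read-mostly fields first, write-heavy
// atomics forced onto a separate cache line with alignas.
struct FreeListAfter
{
    void **       mPages = nullptr;
    std::uint32_t mPageShift = 0;
    std::uint32_t mObjectMask = 0;

    alignas(kCacheLineSize) std::atomic<std::uint32_t> mFirstFreeObject{0};
    std::atomic<std::uint32_t>                         mNumFreeObjects{0};
};

// Cache line index of a member, relative to the start of the object.
constexpr std::size_t CacheLineIndex(std::size_t inOffset)
{
    return inOffset / kCacheLineSize;
}

// True if the lookup data and the allocation atomics share a cache line
// (assuming the object itself starts on a cache line boundary).
constexpr bool kBeforeShares =
    CacheLineIndex(offsetof(FreeListBefore, mPages)) ==
    CacheLineIndex(offsetof(FreeListBefore, mFirstFreeObject));

constexpr bool kAfterShares =
    CacheLineIndex(offsetof(FreeListAfter, mObjectMask)) ==
    CacheLineIndex(offsetof(FreeListAfter, mFirstFreeObject));

static_assert(kBeforeShares, "before: lookup data shares a line with the atomics");
static_assert(!kAfterShares, "after: lookup data is isolated from the atomics");

// Runtime version of the same checks, usable from a unit test.
inline void VerifyLayouts()
{
    assert(kBeforeShares);
    assert(!kAfterShares);
    assert(alignof(FreeListAfter) == kCacheLineSize);
}
```

The trade-off in this sketch is size: the alignas padding grows the struct to two cache lines, which is usually acceptable for a long-lived allocator but would be wasteful for small, numerous objects.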

Performance Improvement

PerformanceTest before changes (AMD RZ9-7950X)

PerformanceTest after changes (AMD RZ9-7950X)

Testing across thread counts on modern desktop x86 processors using the built-in PerformanceTest (compiled with LLVM 17 via Clang-CL on Windows), this seems to give around a 5% average improvement in the general case. On other workloads I've tested which are heavily memory-bound, I've seen improvements of up to 15% at specific thread counts. This has also been tested on Apple silicon with no regressions found.

CLAassistant commented 1 month ago

All committers have signed the CLA.

mihe commented 1 month ago

Can confirm that I'm also seeing a ~5% average bump in PerformanceTest when compiled with clang-cl (LLVM 17) and running on an AMD 7940HS.

Oddly enough I'm not seeing much of a difference at all when compiled with MSVC (v14.39-17.9).

jrouwe commented 1 month ago

Hello,

I did some comparisons on my i9-12900H (6 P-cores, 8 E-cores) and the general gain seems to be around 5-10% (on both Clang and MSVC). The gap seems to increase as the E-cores start doing their work:

image

(note that this data is just from a single run so there's some variation in it)

Nice work! 🍾