Align functions on a 16 bytes boundary

sylveon commented 4 years ago

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned

Should help the prefetcher, and is also required for #7, because SetProcessValidCallTargets wants 16-byte alignment. (Should we also pad thunk objects to the nearest multiple of 16 with nops?)

HeapAlloc can't provide custom alignment, but VirtualAlloc2 (which can) allocates in blocks of 4 KB, so it is hugely inefficient (although it would let us put write protection back once the thunk is generated). We can either keep using HeapAlloc and pad allocations, or implement our own heap on top of VirtualAlloc2 pages.

ARM should be fine with that alignment too
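A minimal sketch of the "pad allocations" option, assuming nothing about the underlying heap's alignment. The names `padded_alloc` and `padded_free` are hypothetical, and `std::malloc`/`std::free` stand in for HeapAlloc/HeapFree so that the sketch stays portable; the same arithmetic would apply to a pointer returned by HeapAlloc.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helpers, not part of member_thunk.
// std::malloc/std::free are stand-ins for HeapAlloc/HeapFree.
constexpr std::size_t thunk_alignment = 16;

void* padded_alloc(std::size_t size)
{
    // Room for the payload, the worst-case alignment slack,
    // and one slot to remember the pointer the heap really returned.
    const std::size_t total = size + (thunk_alignment - 1) + sizeof(void*);
    void* raw = std::malloc(total);
    if (!raw)
        return nullptr;

    // Skip the slot for the raw pointer, then round up to a multiple of 16.
    auto addr = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    addr = (addr + (thunk_alignment - 1))
         & ~static_cast<std::uintptr_t>(thunk_alignment - 1);

    // Store the raw pointer just below the aligned block.
    void* aligned = reinterpret_cast<void*>(addr);
    static_cast<void**>(aligned)[-1] = raw;
    return aligned;
}

void padded_free(void* aligned)
{
    // Recover the raw pointer saved by padded_alloc and release it.
    if (aligned)
        std::free(static_cast<void**>(aligned)[-1]);
}
```

The cost is up to 15 bytes of slack plus one pointer per allocation, which is small next to dedicating a 4 KB page to each thunk.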

sylveon commented 4 years ago

Use alignas(16) on the base class and add an assert on the derived classes to make sure it is respected. Could std::align work for the allocator?
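A rough sketch of that idea, assuming a simple bump-style allocator over a raw buffer. `thunk_base`, `derived_thunk` and `take_aligned` are hypothetical names made up for illustration, not the library's actual types.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for the library's thunk types.
struct alignas(16) thunk_base
{
    std::uint8_t code[24]; // placeholder for the generated instructions
};

struct derived_thunk : thunk_base
{
    void* target; // placeholder for per-thunk data
};

// The derived type inherits the base's alignment requirement;
// these checks fail at compile time if that ever stops being true.
static_assert(alignof(derived_thunk) >= 16, "thunks must be 16-byte aligned");
static_assert(sizeof(derived_thunk) % 16 == 0, "thunk size must be a multiple of 16");

// Carves one 16-byte-aligned slot of `size` bytes out of the buffer
// described by `cursor`/`space`, or returns nullptr if it does not fit.
void* take_aligned(void*& cursor, std::size_t& space, std::size_t size)
{
    // std::align bumps `cursor` up to the next aligned address and
    // shrinks `space` by the bytes it skipped.
    void* p = std::align(16, size, cursor, space);
    if (!p)
        return nullptr;

    // Consume the slot so that the next call starts after it.
    cursor = static_cast<std::uint8_t*>(p) + size;
    space -= size;
    return p;
}
```

Since `sizeof(derived_thunk)` is already a multiple of 16, consecutive slots end up back to back with no extra slack after the first one.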

sylveon commented 4 years ago

Jesse Natalie said on the DirectX server that HeapAlloc allocations are 16-byte aligned on x64, but did not specify for other architectures. Since this isn't really documented, I'm not sure how much we can rely on it.

sylveon commented 4 years ago

Also, we actually need to pad with int3 or ud2 after an indirect jump to stop speculative execution, and to avoid potential pipeline stalls or fetching the next (also junk) 16-byte block (in case the junk decodes to a multi-byte instruction that extends into the next 16-byte block).

No idea about ARM, needs investigation.
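For the x86 side, the padding could look roughly like this. It is a sketch only: `pad_with_int3` is a hypothetical helper, and the choice to always leave at least one trap byte after the code (even when the code already ends on a 16-byte boundary) is an assumption, not something decided in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// 0xCC is the single-byte x86 int3 instruction.
constexpr std::uint8_t int3_opcode = 0xCC;

// Hypothetical helper: extends `code` with int3 bytes up to the next
// 16-byte boundary, always adding at least one so that the byte right
// after the final jump is a trap rather than junk.
std::vector<std::uint8_t> pad_with_int3(std::vector<std::uint8_t> code)
{
    const std::size_t padded_size = (code.size() / 16 + 1) * 16;
    code.resize(padded_size, int3_opcode);
    return code;
}
```

For example, passing the two bytes `FF E0` (`jmp rax`) yields a 16-byte block whose remaining 14 bytes are all `CC`.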

sylveon / member_thunk

Align functions on a 16 bytes boundary #10