256-bit wide AVX register optimizations

walbourn commented 4 years ago

This adds __m256 code paths for XMMatrixMultiply, XMMatrixMultiplyTranspose, and XMMatrixTranspose, plus the Stream methods.

Some of these implementations would technically work on systems with only AVX, but that class of hardware often lacks a fully 256-bit wide bus, so there is little to gain. I'm therefore only using __m256 registers when building with /arch:AVX2.
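
For readers unfamiliar with this kind of gating, here is a minimal sketch of the pattern, not the code from this PR. It assumes the compiler defines `__AVX2__` when AVX2 code generation is enabled (MSVC does so under /arch:AVX2, GCC and Clang under -mavx2). `ScaleFloats` is a hypothetical helper used only for illustration.

```cpp
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper, not part of DirectXMath: dst[i] = src[i] * s.
inline void ScaleFloats(float* dst, const float* src, std::size_t count, float s)
{
    std::size_t i = 0;
#if defined(__AVX2__)
    // 256-bit path: only compiled when the build targets AVX2.
    const __m256 vs = _mm256_set1_ps(s);
    for (; i + 8 <= count; i += 8)
    {
        const __m256 v = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, vs));
    }
#endif
    // Portable path: handles everything without AVX2, and the tail with it.
    for (; i < count; ++i)
    {
        dst[i] = src[i] * s;
    }
}
```

The same source builds with or without AVX2; the wide path simply disappears when the macro is not defined.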

It might be worthwhile to look at XMMatrixInverse for this optimization, but for now I'm just focused on the scenarios above.

walbourn commented 4 years ago

I did not add __m256 versions of the XMVector3*Stream functions, since they need 8 XMFLOAT3 values in flight to be efficient. Those functions will continue to use the existing 128-bit paths, which keep 3 in flight at a time.
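
One way to read the "8 XMFLOAT3" figure is as plain arithmetic: a __m256 register holds 8 floats and a packed XMFLOAT3 is 3 floats, so the smallest batch that fills whole registers is 24 floats, or 8 vectors spread over 3 registers. The sketch below only illustrates that counting. `ElementsPerBatch` and `RegistersPerBatch` are hypothetical names, not DirectXMath functions.

```cpp
#include <cstddef>

// Greatest common divisor, written so it is valid in a C++11 constexpr context.
constexpr std::size_t Gcd(std::size_t a, std::size_t b)
{
    return b == 0 ? a : Gcd(b, a % b);
}

// Smallest number of packed elements (each floatsPerElement floats wide)
// whose total float count is a whole multiple of the register width.
constexpr std::size_t ElementsPerBatch(std::size_t floatsPerElement, std::size_t lanes)
{
    return lanes / Gcd(floatsPerElement, lanes);
}

// Number of full registers that one such batch occupies.
constexpr std::size_t RegistersPerBatch(std::size_t floatsPerElement, std::size_t lanes)
{
    return floatsPerElement * ElementsPerBatch(floatsPerElement, lanes) / lanes;
}

// With 8 float lanes per __m256:
static_assert(ElementsPerBatch(3, 8) == 8, "8 three-float vectors per batch");
static_assert(RegistersPerBatch(3, 8) == 3, "spread across 3 registers");
static_assert(ElementsPerBatch(2, 8) == 4, "4 two-float vectors fill one register");
static_assert(ElementsPerBatch(4, 8) == 2, "2 four-float vectors fill one register");
```

By the same count, two-float and four-float elements pack evenly into a single 256-bit register, which fits with the Vector2 and Vector4 Stream functions being the ones that got 256-bit versions.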

walbourn commented 4 years ago

256-bit register versions of:

XMMatrixMultiply
XMMatrixMultiplyTranspose
XMMatrixTranspose
XMVector2TransformStream
XMVector2TransformCoordStream
XMVector2TransformNormalStream
XMVector4TransformStream
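
To give a feel for what a 256-bit matrix multiply can look like, here is a rough sketch that computes two result rows per __m256. It is not the implementation in this PR. `Mat4`, `MulTwoRows`, and `Multiply` are hypothetical names, the matrix is assumed to be row-major, and a scalar fallback is used when `__AVX2__` is not defined.

```cpp
#include <cstddef>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical row-major 4x4 matrix; not the DirectXMath XMMATRIX type.
struct Mat4 { float m[4][4]; };

#if defined(__AVX2__)
// `rows` holds two consecutive rows of the left matrix, one per 128-bit lane.
// Returns the matching two rows of (left * b).
static inline __m256 MulTwoRows(__m256 rows, const Mat4& b)
{
    // Splat element k of each row across its own lane (the shuffle is per lane).
    const __m256 a0 = _mm256_shuffle_ps(rows, rows, _MM_SHUFFLE(0, 0, 0, 0));
    const __m256 a1 = _mm256_shuffle_ps(rows, rows, _MM_SHUFFLE(1, 1, 1, 1));
    const __m256 a2 = _mm256_shuffle_ps(rows, rows, _MM_SHUFFLE(2, 2, 2, 2));
    const __m256 a3 = _mm256_shuffle_ps(rows, rows, _MM_SHUFFLE(3, 3, 3, 3));
    // Duplicate row k of b into both lanes.
    const __m256 b0 = _mm256_broadcast_ps(reinterpret_cast<const __m128*>(b.m[0]));
    const __m256 b1 = _mm256_broadcast_ps(reinterpret_cast<const __m128*>(b.m[1]));
    const __m256 b2 = _mm256_broadcast_ps(reinterpret_cast<const __m128*>(b.m[2]));
    const __m256 b3 = _mm256_broadcast_ps(reinterpret_cast<const __m128*>(b.m[3]));
    // result row i = sum over k of a[i][k] * (row k of b)
    const __m256 lo = _mm256_add_ps(_mm256_mul_ps(a0, b0), _mm256_mul_ps(a1, b1));
    const __m256 hi = _mm256_add_ps(_mm256_mul_ps(a2, b2), _mm256_mul_ps(a3, b3));
    return _mm256_add_ps(lo, hi);
}
#endif

inline Mat4 Multiply(const Mat4& a, const Mat4& b)
{
    Mat4 r;
#if defined(__AVX2__)
    // Rows 0-1 and rows 2-3 are each handled in a single 256-bit register.
    _mm256_storeu_ps(&r.m[0][0], MulTwoRows(_mm256_loadu_ps(&a.m[0][0]), b));
    _mm256_storeu_ps(&r.m[2][0], MulTwoRows(_mm256_loadu_ps(&a.m[2][0]), b));
#else
    // Scalar fallback for builds without AVX2.
    for (std::size_t i = 0; i < 4; ++i)
    {
        for (std::size_t j = 0; j < 4; ++j)
        {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k)
            {
                sum += a.m[i][k] * b.m[k][j];
            }
            r.m[i][j] = sum;
        }
    }
#endif
    return r;
}
```

The sketch uses separate multiply and add instructions rather than FMA so that it only depends on the AVX2 build flag.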

microsoft / DirectXMath

256-bit wide AVX register optimizations #101