Closed walbourn closed 4 years ago
I did not do __m256 versions of the XMVector3*Stream functions, as that requires having 8 XMFLOAT3 values in flight to really be efficient. Those will just use the existing 128-bit paths that have 3 in flight at a time.
256-bit register versions of:
XMMatrixMultiply
XMMatrixMultiplyTranspose
XMMatrixTranspose
XMVector2TransformStream
XMVector2TransformCoordStream
XMVector2TransformNormalStream
XMVector4TransformStream
These add some __m256 use cases for XMMatrixMultiply, XMMatrixMultiplyTranspose, and XMMatrixTranspose, plus the Stream methods. While these implementations can technically work on systems with just AVX in some cases, this class of hardware often doesn't have a fully 256-bit wide bus, so it's not really much of a win. Therefore, I'm only doing __m256 register usage when building for /arch:AVX2.
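For illustration only, here is a minimal sketch of what gating a __m256 path on the AVX2 build setting can look like for a 4x4 transpose. It is not the code in this PR: the function name Transpose4x4 and the plain row-major float[16] layout are assumptions made for the example, and the only thing it relies on is that the compiler defines __AVX2__ when AVX2 code generation is enabled (as MSVC does for /arch:AVX2). Without that define it falls back to a scalar loop.

```cpp
#include <cassert>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper (not a DirectXMath API): transposes a row-major
// 4x4 float matrix from src into dst. src and dst must not overlap.
inline void Transpose4x4(const float* src, float* dst)
{
    assert(src != dst);
#if defined(__AVX2__)
    // Two 256-bit registers hold the whole matrix: rows 0-1 and rows 2-3.
    __m256 t0 = _mm256_loadu_ps(src);      // a0 a1 a2 a3 | b0 b1 b2 b3
    __m256 t1 = _mm256_loadu_ps(src + 8);  // c0 c1 c2 c3 | d0 d1 d2 d3

    // Interleave within each 128-bit lane.
    __m256 lo = _mm256_unpacklo_ps(t0, t1);  // a0 c0 a1 c1 | b0 d0 b1 d1
    __m256 hi = _mm256_unpackhi_ps(t0, t1);  // a2 c2 a3 c3 | b2 d2 b3 d3

    // Regroup the 128-bit lanes so the a/c and b/d halves line up.
    __m256 ac = _mm256_permute2f128_ps(lo, hi, 0x20);  // a0 c0 a1 c1 | a2 c2 a3 c3
    __m256 bd = _mm256_permute2f128_ps(lo, hi, 0x31);  // b0 d0 b1 d1 | b2 d2 b3 d3

    // Second interleave produces complete columns of the source.
    lo = _mm256_unpacklo_ps(ac, bd);  // a0 b0 c0 d0 | a2 b2 c2 d2
    hi = _mm256_unpackhi_ps(ac, bd);  // a1 b1 c1 d1 | a3 b3 c3 d3

    // Final lane shuffle puts the transposed rows in order.
    _mm256_storeu_ps(dst,     _mm256_permute2f128_ps(lo, hi, 0x20));  // rows 0-1
    _mm256_storeu_ps(dst + 8, _mm256_permute2f128_ps(lo, hi, 0x31));  // rows 2-3
#else
    // Fallback when AVX2 code generation is not enabled.
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            dst[c * 4 + r] = src[r * 4 + c];
#endif
}
```

The intrinsics used in this sketch are all available with plain AVX, which matches the note above that such code can technically run on AVX-only hardware; keying the fast path to __AVX2__ is simply one way to express the "only when building for /arch:AVX2" policy.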