Open GoogleCodeExporter opened 8 years ago
ESUM isn't called often enough to make a difference speed-wise,
so small optimizations probably won't be noticeable.
I'll keep this open in case any other dev wants to implement the code.
Original comment by cottonvibes on 2 Feb 2009 at 8:27
Yes. For a dot product, scalar adds (ADDSS) would be faster than vector adds (ADDPS), except when using DPPS.
SHUFPS and PSHUFD have bad latency only on Core 2 Duo (Conroe),
but using MOVSHDUP instead gives lower latency.
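// Input vector in xmm0, result pointer in eax. Squares each lane and sums them.
__declspec(naked) void Vec4Dot_SSE2() {
    __asm {
        mulps   xmm0, xmm0              // xmm0 = [s0, s1, s2, s3], s_i = v_i * v_i
        movhlps xmm1, xmm0              // low half of xmm1 = [s2, s3]
        addps   xmm1, xmm0              // xmm1 = [s0+s2, s1+s3, ...]
        pshufd  xmm0, xmm1, 00000001b   // lane 0 of xmm0 = s1+s3
        addss   xmm0, xmm1              // lane 0 = s0+s1+s2+s3
        movss   [eax], xmm0             // store the scalar result
        ret
    }
}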
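For anyone who would rather not use naked inline asm, here is a rough intrinsics sketch of the same SSE2 reduction. The function name `vec4_dot_self_sse2` and the pointer-in, float-out signature are my own assumptions for illustration; the routine above takes its input in xmm0 and writes to [eax] instead.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* Hypothetical helper: returns v[0]^2 + v[1]^2 + v[2]^2 + v[3]^2. */
float vec4_dot_self_sse2(const float *v)
{
    assert(v != NULL);

    __m128 x  = _mm_loadu_ps(v);          /* [v0, v1, v2, v3]              */
    __m128 sq = _mm_mul_ps(x, x);         /* mulps:   [s0, s1, s2, s3]     */
    __m128 hi = _mm_movehl_ps(sq, sq);    /* movhlps: [s2, s3, s2, s3]     */
    __m128 s2 = _mm_add_ps(hi, sq);       /* addps:   [s0+s2, s1+s3, ...]  */

    /* pshufd with immediate 1: lane 0 of the result is lane 1 of s2. */
    __m128 odd = _mm_castsi128_ps(
        _mm_shuffle_epi32(_mm_castps_si128(s2), 1));

    /* addss: only lane 0 is added, giving s0+s1+s2+s3. */
    return _mm_cvtss_f32(_mm_add_ss(odd, s2));
}
```

The compiler picks the registers here, so the exact instruction sequence it emits may differ from the hand-written version.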
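// Same calling convention as above. Note: MOVSHDUP is an SSE3 instruction.
__declspec(naked) void Vec4Dot_SSSE3() {
    __asm {
        mulps    xmm0, xmm0     // xmm0 = [s0, s1, s2, s3], s_i = v_i * v_i
        movshdup xmm1, xmm0     // xmm1 = [s1, s1, s3, s3]
        addps    xmm1, xmm0     // xmm1 = [s0+s1, 2*s1, s2+s3, 2*s3]
        movhlps  xmm0, xmm1     // lane 0 of xmm0 = s2+s3
        addss    xmm0, xmm1     // lane 0 = s0+s1+s2+s3
        movss    [eax], xmm0    // store the scalar result
        ret
    }
}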
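To make it easier to check that the MOVSHDUP ordering really ends up with the full sum in lane 0, here is a plain C model that walks the lanes one instruction at a time. It is only a sketch for checking the data movement, not SIMD code; the name `vec4_dot_self_movshdup_model` and the arrays `x0`/`x1` standing in for xmm0/xmm1 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lane-by-lane model of the MOVSHDUP routine above. */
float vec4_dot_self_movshdup_model(const float *v)
{
    float x0[4], x1[4];   /* stand-ins for xmm0 and xmm1 */
    int i;

    assert(v != NULL);

    /* mulps xmm0, xmm0 -> [s0, s1, s2, s3] */
    for (i = 0; i < 4; i++)
        x0[i] = v[i] * v[i];

    /* movshdup xmm1, xmm0 -> duplicate the odd lanes: [s1, s1, s3, s3] */
    x1[0] = x0[1]; x1[1] = x0[1];
    x1[2] = x0[3]; x1[3] = x0[3];

    /* addps xmm1, xmm0 -> [s0+s1, 2*s1, s2+s3, 2*s3] */
    for (i = 0; i < 4; i++)
        x1[i] += x0[i];

    /* movhlps xmm0, xmm1 -> low two lanes of xmm0 = high two lanes of xmm1 */
    x0[0] = x1[2];
    x0[1] = x1[3];

    /* addss xmm0, xmm1 -> lane 0 only: (s2+s3) + (s0+s1) */
    x0[0] += x1[0];

    return x0[0];
}
```

With the input {1, 2, 3, 4} the squares are {1, 4, 9, 16}, lane 0 holds 5 after the ADDPS step, and the final ADDSS adds 25 to give 30.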
Original comment by w0w71...@gmail.com on 3 Feb 2009 at 3:48
Original issue reported on code.google.com by w0w71...@gmail.com on 1 Feb 2009 at 10:55