gonum / internal

Internal routines for the gonum project [DEPRECATED]
asm: unroll 4x DdotUnitary #19

Closed vladimir-ch closed 8 years ago

vladimir-ch commented 8 years ago

PTAL, @kortschak

name                old time/op  new time/op  delta
DdotUnitaryN1       3.59ns ± 0%  4.13ns ± 0%  +15.04%  (p=0.029 n=4+4)
DdotUnitaryN2       3.69ns ± 0%  4.65ns ± 0%  +26.02%  (p=0.029 n=4+4)
DdotUnitaryN3       4.14ns ± 0%  5.20ns ± 0%  +25.48%  (p=0.029 n=4+4)
DdotUnitaryN4       4.59ns ± 0%  4.57ns ± 0%   -0.49%  (p=0.029 n=4+4)
DdotUnitaryN10      6.07ns ± 0%  6.53ns ± 0%   +7.58%  (p=0.029 n=4+4)
DdotUnitaryN100     38.2ns ± 0%  24.4ns ± 0%  -36.13%  (p=0.029 n=4+4)
DdotUnitaryN1000     424ns ± 0%   219ns ± 0%  -48.35%  (p=0.029 n=4+4)
DdotUnitaryN10000   4.31µs ± 0%  2.79µs ± 0%  -35.30%  (p=0.029 n=4+4)
DdotUnitaryN100000  45.5µs ± 1%  40.7µs ± 0%  -10.50%  (p=0.029 n=4+4)

kortschak commented 8 years ago

LGTM with a little trepidation. It would be good to have the logic changes separate from the comment and code reordering.

vladimir-ch commented 8 years ago

In general I try to submit small, comprehensive commits; I don't know why I didn't do that here. Sorry. I will split the changes and let you know when it's ready.

vladimir-ch commented 8 years ago

Ok, it's ready. PTAL @kortschak

A small remark: using an extra register (X8) as a second accumulator is the key to getting improved performance from this 4x unrolling. Because of instruction pipelining, additions into X7 and X8 do not depend on each other and can overlap. If the line

ADDPD  X2, X8

were replaced by

ADDPD  X2, X7

this PR would bring no benefit.
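The same idea can be sketched in plain Go. This is only an illustration of the dependency-chain argument, not the PR's assembly: the function names `dotOneAcc` and `dotTwoAcc` are hypothetical, and whether the Go compiler gains anything from the second accumulator depends on the target and compiler version.

```go
package main

import "fmt"

// dotOneAcc adds every product into a single sum, so each addition
// has to wait for the result of the previous one.
func dotOneAcc(x, y []float64) float64 {
	var sum float64
	for i, v := range x {
		sum += v * y[i]
	}
	return sum
}

// dotTwoAcc unrolls the loop 4x and alternates between two
// accumulators, so consecutive additions are independent.
// It assumes len(y) >= len(x).
func dotTwoAcc(x, y []float64) float64 {
	var s0, s1 float64
	i := 0
	for ; i+4 <= len(x); i += 4 {
		s0 += x[i] * y[i]
		s1 += x[i+1] * y[i+1]
		s0 += x[i+2] * y[i+2]
		s1 += x[i+3] * y[i+3]
	}
	// Handle the remaining 0 to 3 elements.
	for ; i < len(x); i++ {
		s0 += x[i] * y[i]
	}
	return s0 + s1
}

func main() {
	x := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	y := []float64{2, 2, 2, 2, 2, 2, 2, 2, 2, 2}
	fmt.Println(dotOneAcc(x, y), dotTwoAcc(x, y))
}
```

Replacing every `s1` with `s0` in `dotTwoAcc` would keep the unrolling but put all four additions back on one dependency chain, which mirrors the `X8` versus `X7` substitution described above.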

kortschak commented 8 years ago

Thanks for that note. Makes sense.

kortschak commented 8 years ago

LGTM

vladimir-ch commented 8 years ago

One more remark: since the assembly loop and the Go loop accumulate the products in a different order, their results will not always be identical, due to the non-associativity of floating-point arithmetic. Perhaps obvious, but I wanted to point it out explicitly.
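A small Go sketch makes the effect concrete. The two functions below are hypothetical stand-ins, not the code from this PR: one sums the products in index order, the other splits them across two accumulators the way an unrolled loop might. The input is chosen deliberately to exaggerate the difference.

```go
package main

import "fmt"

// dotSequential adds the products strictly in index order.
func dotSequential(x, y []float64) float64 {
	var sum float64
	for i, v := range x {
		sum += v * y[i]
	}
	return sum
}

// dotInterleaved sends even-indexed products to one accumulator and
// odd-indexed products to another, then combines them at the end.
// It assumes len(y) >= len(x).
func dotInterleaved(x, y []float64) float64 {
	var even, odd float64
	i := 0
	for ; i+2 <= len(x); i += 2 {
		even += x[i] * y[i]
		odd += x[i+1] * y[i+1]
	}
	if i < len(x) {
		even += x[i] * y[i]
	}
	return even + odd
}

func main() {
	// 1e16 + 1 rounds back to 1e16 in float64, so the order of the
	// additions decides whether that 1 survives.
	x := []float64{1e16, 1, -1e16, 1}
	y := []float64{1, 1, 1, 1}
	fmt.Println(dotSequential(x, y))  // prints 1
	fmt.Println(dotInterleaved(x, y)) // prints 2
}
```

For typical inputs the discrepancy is in the last few bits rather than this large, so tests comparing the two implementations would need a tolerance instead of exact equality.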