vladimir-ch closed 8 years ago
LGTM with a little trepidation. It would be good to have the logic changes separate from the comment and code reordering.
In general I try to submit small, comprehensive commits; I don't know why I didn't do it here. Sorry. I will split the changes and let you know when it's ready.
Ok, it's ready. PTAL @kortschak
A small remark: using an extra register (X8) is the key to getting improved performance from this 4x unrolling (due to instruction pipelining). If the line
ADDPD X2, X8
were replaced by
ADDPD X2, X7
this PR would bring no benefit.
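For readers less familiar with the idea, here is a rough Go sketch of the same principle. It is not the assembly from this PR: `dotTwoAcc` is a hypothetical name, and the two variables `s0` and `s1` only stand in for the two accumulator registers. Because the two sums do not depend on each other, the additions in one iteration need not wait for each other, which is the assumption behind the pipelining argument above.

```go
package main

import "fmt"

// dotTwoAcc is a hypothetical illustration of a 4x unrolled dot product
// that keeps two independent partial sums. It assumes len(y) >= len(x).
func dotTwoAcc(x, y []float64) float64 {
	var s0, s1 float64 // independent accumulators, no dependency between them
	i := 0
	// Main loop: four elements per iteration, two per accumulator.
	for ; i+4 <= len(x); i += 4 {
		s0 += x[i]*y[i] + x[i+1]*y[i+1]
		s1 += x[i+2]*y[i+2] + x[i+3]*y[i+3]
	}
	// Tail loop: handle the remaining 0 to 3 elements.
	for ; i < len(x); i++ {
		s0 += x[i] * y[i]
	}
	// Combine the partial sums once, at the end.
	return s0 + s1
}

func main() {
	x := []float64{1, 2, 3, 4, 5}
	y := []float64{1, 1, 1, 1, 1}
	fmt.Println(dotTwoAcc(x, y))
}
```

If both additions targeted the same accumulator, each would have to wait for the previous one to finish, which corresponds to the `ADDPD X2, X7` variant mentioned above.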
Thanks for that note. Makes sense.
LGTM
One more remark: since the assembly loop and the Go loop accumulate the products in different order, their results will not be identical due to non-associativity of floating-point arithmetic. Perhaps obvious, but I wanted to point it out explicitly.
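A minimal Go sketch of this effect, with hypothetical function names and input values chosen only to make the difference visible (it does not reproduce the exact summation order of either loop in this PR):

```go
package main

import "fmt"

// sumForward adds the elements strictly left to right.
func sumForward(v []float64) float64 {
	var s float64
	for _, x := range v {
		s += x
	}
	return s
}

// sumTwoAcc adds even-indexed and odd-indexed elements into separate
// accumulators and combines them at the end, as an unrolled loop might.
func sumTwoAcc(v []float64) float64 {
	var s0, s1 float64
	for i, x := range v {
		if i%2 == 0 {
			s0 += x
		} else {
			s1 += x
		}
	}
	return s0 + s1
}

func main() {
	// 1e17 is so large that adding 1 to it is lost to rounding.
	v := []float64{1e17, 1, -1e17, 1}
	fmt.Println(sumForward(v)) // the first 1 is absorbed by 1e17
	fmt.Println(sumTwoAcc(v))  // the large terms cancel before the 1s are added
}
```

Both functions add the same four numbers, yet the forward sum gives 1 and the two-accumulator sum gives 2, so tests comparing the assembly and Go paths should use a tolerance rather than exact equality.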
PTAL, @kortschak