Add a TransposePerLane function. It can call the standard transpose functions, but the test needs to make sure that both lanes are transposed the same way. This can be used to calculate multiple horizontal sums faster by calculating the sum per lane first.
Keep in mind: For rectangular matrices with more rows than columns, the horizontal sum can often be calculated faster than transposing and adding the results.
Actually, transposing first and then adding the results is slower than a "swizzle -> add -> swizzle -> add -> ..." approach. The number of additions stays the same, but the number of swizzle operations needed, relative to transposing first, decreases more and more with increasing matrix size.
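A minimal sketch of the "swizzle -> add -> swizzle -> add" idea for four SSE registers, assuming an x86 target. The name `HorizontalSum4` is hypothetical and not part of this code base. Lane i of the result holds the horizontal sum of the i-th input register. It uses 6 swizzles and 3 additions, whereas a full 4x4 transpose via `_MM_TRANSPOSE4_PS` followed by 3 additions needs 8 shuffles.

```cpp
#include <xmmintrin.h> // SSE intrinsics (assumes an x86/x86-64 target)

// Hypothetical helper: returns {sum(a), sum(b), sum(c), sum(d)}.
inline __m128 HorizontalSum4(__m128 a, __m128 b, __m128 c, __m128 d)
{
    // Swizzle: interleave pairs of registers.
    __m128 abLo = _mm_unpacklo_ps(a, b); // a0 b0 a1 b1
    __m128 abHi = _mm_unpackhi_ps(a, b); // a2 b2 a3 b3
    __m128 cdLo = _mm_unpacklo_ps(c, d); // c0 d0 c1 d1
    __m128 cdHi = _mm_unpackhi_ps(c, d); // c2 d2 c3 d3

    // Add: the two additions are independent of each other.
    __m128 ab = _mm_add_ps(abLo, abHi); // a0+a2 b0+b2 a1+a3 b1+b3
    __m128 cd = _mm_add_ps(cdLo, cdHi); // c0+c2 d0+d2 c1+c3 d1+d3

    // Swizzle: gather the partial sums that belong to the same input register.
    __m128 lo = _mm_movelh_ps(ab, cd); // a0+a2 b0+b2 c0+c2 d0+d2
    __m128 hi = _mm_movehl_ps(cd, ab); // a1+a3 b1+b3 c1+c3 d1+d3

    // Add: final horizontal sums.
    return _mm_add_ps(lo, hi);
}
```

On AVX registers the same pattern would run per 128-bit lane first, which is where a per-lane transpose or swizzle would come in.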
[x] Rename Sum functions to RegisterSum and MultiRegisterSum
[x] Rename sum.h to registerSum.h
[x] Use multiple accumulators: calculate independent sums to minimize latency, e.g. t1 = a + b; t2 = c + d; r = t1 + t2;
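The multiple-accumulator item can be sketched in portable C++ as a pairwise (tree) reduction. `PairwiseSum` is a hypothetical name used only for illustration, and plain values stand in for SIMD registers. Within each pass the additions do not depend on each other, so the dependency chain has about log2(N) steps instead of the N - 1 steps of a single running sum.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: sums N values with independent partial sums per pass,
// e.g. for N = 4: t1 = a + b; t2 = c + d; r = t1 + t2;
template <typename T, std::size_t N>
T PairwiseSum(std::array<T, N> values) // taken by value, used as scratch space
{
    static_assert(N > 0, "at least one value is required");

    for (std::size_t count = N; count > 1; count = (count + 1) / 2)
    {
        const std::size_t half = count / 2;

        // All additions in this pass are independent of each other.
        for (std::size_t i = 0; i < half; ++i)
            values[i] = values[2 * i] + values[2 * i + 1];

        // An odd element out is carried over to the next pass unchanged.
        if (count % 2 == 1)
            values[half] = values[count - 1];
    }
    return values[0];
}
```

For floating-point types the summation order differs from a sequential sum, so results can differ by rounding.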