unixpickle closed this issue 7 years ago.
Single-precision routines do not have ASM primitive implementations to speed them up; look at the implementations of asm.?{axpyInc,axpyUnitaryTo,dotUnitary}. There is a long-standing open PR to add float32 ASM support, among other things.
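For reference, the primitives in question are small kernels such as axpy (y ← alpha·x + y) and dot. A pure-Go sketch of what the single-precision routines currently fall back to (the function name is illustrative, modeled on the asm package's unit-stride variants):

func axpyUnitary(alpha float32, x, y []float32) {
	// y += alpha * x over unit-stride slices. The float64 assembly
	// kernels vectorize exactly this loop; float32 has no such kernels.
	for i, v := range x {
		y[i] += alpha * v
	}
}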
Ah, that makes sense. Ironically, blas32.Gemm could probably be sped up in the meantime by casting to float64 and then using blas64.Gemm.
Converting the data between []float32 and []float64 would likely offset any wins.
@kortschak Not for matrix-matrix multiplications. Imagine multiplying two 300x300 matrices: the product requires 300^3 = 27M operations, but only 2 × 300^2 = 180K conversions, a ratio of 150:1. The conversions are by far not the bottleneck. This is also why using the GPU for matrix multiplications is feasible in the first place (i.e. why memory bandwidth isn't the bottleneck).
The following benchmark is an example of a matrix-matrix multiply where converting to 64-bit and then using blas64 is about twice as fast as using blas32.
Output:
Benchmarking 32-bit
Took 9000517 nanos
Benchmarking 32-bit (conversion)
Took 4312173 nanos
Code:
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/gonum/blas"
	"github.com/gonum/blas/blas32"
	"github.com/gonum/blas/blas64"
)

const Iterations = 5

// convert64 widens a []float32 into a newly allocated []float64.
func convert64(in []float32) []float64 {
	res := make([]float64, len(in))
	for i, x := range in {
		res[i] = float64(x)
	}
	return res
}

func main() {
	in1 := make([]float64, 300*300)
	in2 := make([]float64, 300*300)
	for i := range in1 {
		in1[i] = rand.NormFloat64()
		in2[i] = rand.NormFloat64()
	}

	in1f32 := make([]float32, 300*300)
	in2f32 := make([]float32, 300*300)
	outf32 := make([]float32, 300*300)
	for i, x := range in1 {
		in1f32[i] = float32(x)
	}
	for i, x := range in2 {
		in2f32[i] = float32(x)
	}

	fmt.Println("Benchmarking 32-bit")
	benchmarkFunc(func() {
		blas32.Gemm(blas.NoTrans, blas.NoTrans, 1, blas32.General{
			Rows:   300,
			Cols:   300,
			Stride: 300,
			Data:   in1f32,
		}, blas32.General{
			Rows:   300,
			Cols:   300,
			Stride: 300,
			Data:   in2f32,
		}, 1, blas32.General{
			Rows:   300,
			Cols:   300,
			Stride: 300,
			Data:   outf32,
		})
	})

	fmt.Println("Benchmarking 32-bit (conversion)")
	benchmarkFunc(func() {
		// The conversions are deliberately inside the timed region, so
		// their cost counts against the 64-bit path. Note that the
		// product lands in the temporary slice returned by
		// convert64(outf32) and is never copied back to outf32; a real
		// wrapper would need one more O(n^2) pass for that.
		blas64.Gemm(blas.NoTrans, blas.NoTrans, 1, blas64.General{
			Rows:   300,
			Cols:   300,
			Stride: 300,
			Data:   convert64(in1f32),
		}, blas64.General{
			Rows:   300,
			Cols:   300,
			Stride: 300,
			Data:   convert64(in2f32),
		}, 1, blas64.General{
			Rows:   300,
			Cols:   300,
			Stride: 300,
			Data:   convert64(outf32),
		})
	})
}

func benchmarkFunc(f func()) {
	t := time.Now().UnixNano()
	for i := 0; i < Iterations; i++ {
		f()
	}
	elapsed := time.Now().UnixNano() - t
	fmt.Println("Took", elapsed/Iterations, "nanos")
}
This is true, though the constants are important here.
Right. Before doing the conversion, there should be a check that the overhead of converting would be dwarfed by the cost of computing the product, along the lines of the sketch below.
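A hypothetical wrapper (reusing convert64 from the benchmark above; the crossover constant is an assumed placeholder that would need to be measured, and transposes are simplified to NoTrans):

// gemm32Via64 routes a float32 GEMM through the tuned float64 kernel when
// the product is large enough that the O(n^2) conversions are dwarfed by
// the O(n^3) multiply. Sketch only.
const crossover = 1 << 21 // ~2M multiply-adds; assumed threshold

func gemm32Via64(alpha float32, a, b blas32.General, beta float32, c blas32.General) {
	// m*n*k multiply-adds; below the threshold, conversion overhead wins.
	if a.Rows*a.Cols*b.Cols < crossover {
		blas32.Gemm(blas.NoTrans, blas.NoTrans, alpha, a, b, beta, c)
		return
	}
	c64 := blas64.General{Rows: c.Rows, Cols: c.Cols, Stride: c.Stride, Data: convert64(c.Data)}
	blas64.Gemm(blas.NoTrans, blas.NoTrans, float64(alpha),
		blas64.General{Rows: a.Rows, Cols: a.Cols, Stride: a.Stride, Data: convert64(a.Data)},
		blas64.General{Rows: b.Rows, Cols: b.Cols, Stride: b.Stride, Data: convert64(b.Data)},
		float64(beta), c64)
	for i, v := range c64.Data {
		c.Data[i] = float32(v) // copy the product back to the caller's float32 buffer
	}
}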
This is now fixed in master. New benchmark results (slight differences might be due to noise in the benchmarks):
Benchmarking 32-bit
Took 2341706 nanos
Benchmarking 64-bit
Took 4877699 nanos
I ran an ad hoc benchmark and found, to my surprise, that blas32.Gemm is slower than blas64.Gemm on my quad-core i7. I notice that the native package includes some benchmarks for dgemm, but none for sgemm. Perhaps sgemm was never properly tuned for performance?
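An sgemm benchmark could mirror the existing dgemm ones. A sketch using the blas.Float32 Sgemm signature (the harness details here are assumed, not copied from the native package):

package native_test

import (
	"testing"

	"github.com/gonum/blas"
	"github.com/gonum/blas/native"
)

// BenchmarkSgemm300 times a 300x300 single-precision GEMM through the
// pure-Go implementation.
func BenchmarkSgemm300(b *testing.B) {
	const n = 300
	var impl native.Implementation
	x := make([]float32, n*n)
	y := make([]float32, n*n)
	z := make([]float32, n*n)
	for i := range x {
		x[i] = 1
		y[i] = 2
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		impl.Sgemm(blas.NoTrans, blas.NoTrans, n, n, n, 1, x, n, y, n, 0, z, n)
	}
}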