gonum / blas

A BLAS implementation for Go [DEPRECATED]

Native sgemm slower than dgemm #196

Closed · unixpickle closed this issue 7 years ago

unixpickle commented 7 years ago

I ran an ad hoc benchmark and found, to my surprise, that blas32.Gemm is slower than blas64.Gemm on my quad-core i7. I notice that the native package includes some benchmarks for dgemm, but none for sgemm. Perhaps sgemm was never properly tuned for performance?

Output of benchmark:

Benchmarking 32-bit
Took 12027724 nanos
Benchmarking 64-bit
Took 5117921 nanos

Code for benchmark

package main

import (
    "fmt"
    "math/rand"
    "time"

    "github.com/gonum/blas"
    "github.com/gonum/blas/blas32"
    "github.com/gonum/blas/blas64"
)

const Iterations = 5

func main() {
    in1 := make([]float64, 300*300)
    in2 := make([]float64, 300*300)
    for i := range in1 {
        in1[i] = rand.NormFloat64()
        in2[i] = rand.NormFloat64()
    }
    out := make([]float64, 300*300)
    in1f32 := make([]float32, 300*300)
    in2f32 := make([]float32, 300*300)
    outf32 := make([]float32, 300*300)
    for i, x := range in1 {
        in1f32[i] = float32(x)
    }
    for i, x := range in2 {
        in2f32[i] = float32(x)
    }

    fmt.Println("Benchmarking 32-bit")
    benchmarkFunc(func() {
        blas32.Gemm(blas.NoTrans, blas.NoTrans, 1, blas32.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   in1f32,
        }, blas32.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   in2f32,
        }, 1, blas32.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   outf32,
        })
    })

    fmt.Println("Benchmarking 64-bit")
    benchmarkFunc(func() {
        blas64.Gemm(blas.NoTrans, blas.NoTrans, 1, blas64.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   in1,
        }, blas64.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   in2,
        }, 1, blas64.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   out,
        })
    })
}

func benchmarkFunc(f func()) {
    t := time.Now().UnixNano()
    for i := 0; i < Iterations; i++ {
        f()
    }
    elapsed := time.Now().UnixNano() - t
    fmt.Println("Took", elapsed/Iterations, "nanos")
}
kortschak commented 7 years ago

Single precision routines do not have ASM primitive implementations to speed them up; look at the implementations of asm.?{axpyInc,axpyUnitaryTo,dotUnitary}. There is a long-standing open PR to add float32 ASM support, among other things.
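
For a sense of what is involved, the float32 fallbacks are plain Go loops along the lines of the sketch below (illustrative only, not the actual gonum source; the name axpyUnitary32 is made up), while the float64 versions of the same kernels are hand-written assembly:

package asmsketch

// axpyUnitary32 sketches a pure-Go float32 axpy kernel (y += alpha*x).
// In gonum the float64 counterparts of kernels like this are backed by
// hand-written assembly, which is where dgemm gets its speed.
func axpyUnitary32(alpha float32, x, y []float32) {
    for i, v := range x {
        y[i] += alpha * v
    }
}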

unixpickle commented 7 years ago

Ah, that makes sense. Ironically, blas32.Gemm could probably be sped up in the meantime by casting to float64 and then using blas64.Gemm.
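
A minimal sketch of that idea, assuming a hypothetical helper gemm32Via64 (only the NoTrans/NoTrans case; the names are made up for illustration):

package gemmsketch

import (
    "github.com/gonum/blas"
    "github.com/gonum/blas/blas32"
    "github.com/gonum/blas/blas64"
)

// gemm32Via64 widens the float32 operands to float64, runs blas64.Gemm,
// and narrows the result back into c.Data. For large matrices the O(n^2)
// conversions are cheap relative to the O(n^3) multiply.
func gemm32Via64(alpha float32, a, b blas32.General, beta float32, c blas32.General) {
    widen := func(g blas32.General) blas64.General {
        d := make([]float64, len(g.Data))
        for i, v := range g.Data {
            d[i] = float64(v)
        }
        return blas64.General{Rows: g.Rows, Cols: g.Cols, Stride: g.Stride, Data: d}
    }
    a64, b64, c64 := widen(a), widen(b), widen(c)
    blas64.Gemm(blas.NoTrans, blas.NoTrans, float64(alpha), a64, b64, float64(beta), c64)
    // Narrow the result back into the caller's float32 buffer.
    for i, v := range c64.Data {
        c.Data[i] = float32(v)
    }
}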

kortschak commented 7 years ago

The conversion of the data from []float32 to []float64 (and back) would likely mitigate any wins.

unixpickle commented 7 years ago

@kortschak Not for matrix-matrix multiplications. Imagine multiplying two 300x300 matrices. There are 300^3 = 27M operations required to do the product, but only 2*300^2 = 180K conversions. The conversions are by far not the bottleneck. This is also why using the GPU for matrix multiplications is feasible in the first place (i.e. why the memory bandwidth isn't the bottleneck).

The following benchmark is an example of a matrix-matrix multiply where converting to 64-bit and then using blas64 is about twice as fast as using blas32.

Output:

Benchmarking 32-bit
Took 9000517 nanos
Benchmarking 32-bit (conversion)
Took 4312173 nanos

Code:

package main

import (
    "fmt"
    "math/rand"
    "time"

    "github.com/gonum/blas"
    "github.com/gonum/blas/blas32"
    "github.com/gonum/blas/blas64"
)

const Iterations = 5

func convert64(in []float32) []float64 {
    res := make([]float64, len(in))
    for i, x := range in {
        res[i] = float64(x)
    }
    return res
}

func main() {
    in1 := make([]float64, 300*300)
    in2 := make([]float64, 300*300)
    for i := range in1 {
        in1[i] = rand.NormFloat64()
        in2[i] = rand.NormFloat64()
    }
    in1f32 := make([]float32, 300*300)
    in2f32 := make([]float32, 300*300)
    outf32 := make([]float32, 300*300)
    for i, x := range in1 {
        in1f32[i] = float32(x)
    }
    for i, x := range in2 {
        in2f32[i] = float32(x)
    }

    fmt.Println("Benchmarking 32-bit")
    benchmarkFunc(func() {
        blas32.Gemm(blas.NoTrans, blas.NoTrans, 1, blas32.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   in1f32,
        }, blas32.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   in2f32,
        }, 1, blas32.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   outf32,
        })
    })

    fmt.Println("Benchmarking 32-bit (conversion)")
    benchmarkFunc(func() {
        blas64.Gemm(blas.NoTrans, blas.NoTrans, 1, blas64.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   convert64(in1f32),
        }, blas64.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   convert64(in2f32),
        }, 1, blas64.General{
            Rows:   300,
            Cols:   300,
            Stride: 300,
            Data:   convert64(outf32),
        })
    })
}

func benchmarkFunc(f func()) {
    t := time.Now().UnixNano()
    for i := 0; i < Iterations; i++ {
        f()
    }
    elapsed := time.Now().UnixNano() - t
    fmt.Println("Took", elapsed/Iterations, "nanos")
}
kortschak commented 7 years ago

This is true, though the constants are important here.

unixpickle commented 7 years ago

Right. Before doing the cast, there should be a check that the overhead of the conversion would be dwarfed by the cost of computing the product.
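
For example, a guard along these lines could gate the conversion (the function name and the threshold of 64 are arbitrary, purely illustrative):

package gemmsketch

// worthConverting sketches the check described above: convert only when
// the O(m*k*n) multiply clearly dominates the O(m*k + k*n + m*n) cost of
// converting the operands and the result.
func worthConverting(m, n, k int) bool {
    convWork := m*k + k*n + m*n // elements converted between float32 and float64
    mulWork := m * k * n        // multiply-adds in the product
    return mulWork > 64*convWork
}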

unixpickle commented 7 years ago

This is now fixed in master. New benchmark results (slight differences might be due to noise in the benchmarks):

Benchmarking 32-bit
Took 2341706 nanos
Benchmarking 64-bit
Took 4877699 nanos