ryan961 / memo

https://github.com/ryan961/memo/issues
Creative Commons Zero v1.0 Universal

Go Perf: From slow to SIMD: A Go optimization story #9

Open ryan961 opened 8 months ago

ryan961 commented 8 months ago

🤖 AI Summary

The article "From slow to SIMD: A Go optimization story" on Sourcegraph's blog, written by Camden Cheek, walks through optimizing a hot Go function with SIMD. It retraces the author's performance journey step by step, beginning with a slow but crucial piece of code: the dot product at the heart of a cosine-similarity calculation. From there the author explores loop unrolling, bounds-checking elimination, quantization, and finally SIMD itself to improve speed, and links helpful resources for further reading at each step.

🖇️ Details

🔖 Note

This article offers a detailed, content-rich look at performance optimization, focused on speeding up a deceptively simple dot product function.

func DotNaive(a, b []float32) float32 {
    sum := float32(0)
    for i := 0; i < len(a) && i < len(b); i++ {
        sum += a[i] * b[i]
    }
    return sum
}

The dot product itself needs no introduction; anyone with a basic grasp of Go can read the function above. One might well wonder what there is left to optimize in these seven lines of multiplication and addition.

Things get interesting in the first section, "Loop unrolling". The author rewrote the loop so that several high-latency multiply instructions can execute concurrently (for background, see Latency vs. Throughput), and spread the fixed per-iteration loop costs (increment and comparison) across multiple operations. Benchmarking showed that unrolling by 4 was optimal in the author's environment, and the change lifted throughput by 37%.

func DotUnroll4(a, b []float32) float32 {
    sum := float32(0)
    for i := 0; i < len(a); i += 4 {
        s0 := a[i] * b[i]
        s1 := a[i+1] * b[i+1]
        s2 := a[i+2] * b[i+2]
        s3 := a[i+3] * b[i+3]
        sum += s0 + s1 + s2 + s3
    }
    return sum
}
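Worth noting: as written, DotUnroll4 reads a[i+3], so it will panic if the length is not a multiple of 4 (the article later makes that a stated precondition). As my own illustration, not code from the article, here is a minimal sketch of the same unrolling idea with a scalar tail loop that tolerates any length:

```go
package main

import "fmt"

// DotUnroll4Tail applies the same 4-way unrolling but falls back to a
// scalar loop for the last len%4 elements, so slices of any length
// are handled.
func DotUnroll4Tail(a, b []float32) float32 {
    sum := float32(0)
    i := 0
    for ; i+3 < len(a) && i+3 < len(b); i += 4 {
        s0 := a[i] * b[i]
        s1 := a[i+1] * b[i+1]
        s2 := a[i+2] * b[i+2]
        s3 := a[i+3] * b[i+3]
        sum += s0 + s1 + s2 + s3
    }
    // Scalar tail for the remaining 0-3 elements.
    for ; i < len(a) && i < len(b); i++ {
        sum += a[i] * b[i]
    }
    return sum
}

func main() {
    a := []float32{1, 2, 3, 4, 5}
    b := []float32{5, 4, 3, 2, 1}
    fmt.Println(DotUnroll4Tail(a, b)) // 5+8+9+8+5 = 35
}
```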

The next section, "Bounds-checking elimination", heads into territory that was less familiar to me.

To keep out-of-bounds slice accesses from becoming a security vulnerability (like the famous Heartbleed exploit), the Go compiler inserts checks before each read. You can see them in the generated assembly (look for runtime.panic).

The compiled code makes it look like we wrote something like this:

func DotUnroll4(a, b []float32) float32 {
    sum := float32(0)
    for i := 0; i < len(a); i += 4 {
        if i >= len(b) {
            panic("out of bounds")
        }
        s0 := a[i] * b[i]
        if i+1 >= len(a) || i+1 >= len(b) {
            panic("out of bounds")
        }
        s1 := a[i+1] * b[i+1]
        if i+2 >= len(a) || i+2 >= len(b) {
            panic("out of bounds")
        }
        s2 := a[i+2] * b[i+2]
        if i+3 >= len(a) || i+3 >= len(b) {
            panic("out of bounds")
        }
        s3 := a[i+3] * b[i+3]
        sum += s0 + s1 + s2 + s3
    }
    return sum
}

To pare the checks down, the author asserts up front that the two slices have equal lengths, then uses full slice expressions to move all the bounds checks to the top of each loop iteration:

func DotBCE(a, b []float32) float32 {
    if len(a) != len(b) {
        panic("slices must have equal lengths")
    }

    if len(a)%4 != 0 {
        panic("slice length must be multiple of 4")
    }

    sum := float32(0)
    for i := 0; i < len(a); i += 4 {
        aTmp := a[i : i+4 : i+4]
        bTmp := b[i : i+4 : i+4]
        s0 := aTmp[0] * bTmp[0]
        s1 := aTmp[1] * bTmp[1]
        s2 := aTmp[2] * bTmp[2]
        s3 := aTmp[3] * bTmp[3]
        sum += s0 + s1 + s2 + s3
    }
    return sum
}

Great: minimizing the bounds checks nets a further 9% improvement.
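A related idiom, not used in the article but commonly cited for bounds-check elimination, is to re-slice one operand to the other's length before the loop; the compiler can then prove b[i] is in range whenever a[i] is. A hedged sketch (DotHint is my own name for it):

```go
package main

import "fmt"

// DotHint uses a re-slice hint: after b = b[:len(a)], the compiler can
// prove b[i] is in bounds whenever a[i] is, removing the check on b[i]
// inside the loop. On recent toolchains this can be verified with e.g.
// go build -gcflags='-d=ssa/check_bce/debug=1'.
func DotHint(a, b []float32) float32 {
    if len(a) != len(b) {
        panic("slices must have equal lengths")
    }
    b = b[:len(a)] // bounds-check hint
    sum := float32(0)
    for i := range a {
        sum += a[i] * b[i]
    }
    return sum
}

func main() {
    fmt.Println(DotHint([]float32{1, 2, 3}, []float32{4, 5, 6})) // 4+10+18 = 32
}
```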

The third section, "Quantization", left me somewhat perplexed at first. The focus here is compressing the vectors ahead of time through type conversion (float32 => int8; for details, see integer quantization).

func DotInt8BCE(a, b []int8) int32 {
    if len(a) != len(b) {
        panic("slices must have equal lengths")
    }

    sum := int32(0)
    for i := 0; i < len(a); i += 4 {
        aTmp := a[i : i+4 : i+4]
        bTmp := b[i : i+4 : i+4]
        s0 := int32(aTmp[0]) * int32(bTmp[0])
        s1 := int32(aTmp[1]) * int32(bTmp[1])
        s2 := int32(aTmp[2]) * int32(bTmp[2])
        s3 := int32(aTmp[3]) * int32(bTmp[3])
        sum += s0 + s1 + s2 + s3
    }
    return sum
}

This change yields a 4x reduction in memory usage at the cost of some accuracy.
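The article does not show the conversion step itself; here is a minimal sketch of what a float32 => int8 quantizer might look like (QuantizeSymmetric and the scale-by-largest-absolute-value scheme are my own assumptions, not taken from the article):

```go
package main

import (
    "fmt"
    "math"
)

// QuantizeSymmetric maps each float32 into [-127, 127] by dividing by a
// shared scale derived from the largest absolute value in the vector.
// Dot products over the int8 values then approximate the float32 dot
// product up to a factor of scaleA*scaleB, losing some precision.
func QuantizeSymmetric(v []float32) (q []int8, scale float32) {
    maxAbs := float32(0)
    for _, x := range v {
        if a := float32(math.Abs(float64(x))); a > maxAbs {
            maxAbs = a
        }
    }
    if maxAbs == 0 {
        return make([]int8, len(v)), 1
    }
    scale = maxAbs / 127
    q = make([]int8, len(v))
    for i, x := range v {
        q[i] = int8(math.Round(float64(x / scale)))
    }
    return q, scale
}

func main() {
    q, _ := QuantizeSymmetric([]float32{0.6, -1.0, 0.25})
    fmt.Println(q) // [76 -127 32]
}
```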

In the fourth section, "SIMD", the author rewrote this code using AVX2 instructions, gaining performance while keeping portability intact. The result was a whopping 530% increase in throughput, and switching to a better-suited instruction (VPDPBUSD) brought a further 21% improvement.
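The portability comes from runtime dispatch: the SIMD path is taken only when the CPU actually supports the required instructions. A sketch of that dispatch pattern (hasAVX2, chooseDot, and dotAVX2 are hypothetical stand-ins; real code might probe cpu.X86.HasAVX2 from golang.org/x/sys/cpu and call into a Go-assembly kernel):

```go
package main

import "fmt"

// hasAVX2 is a stand-in for a real CPU-feature probe such as
// cpu.X86.HasAVX2 from golang.org/x/sys/cpu. Hard-coded false here so
// this sketch is self-contained and always takes the portable path.
func hasAVX2() bool { return false }

// dotGeneric is the portable fallback implementation.
func dotGeneric(a, b []int8) int32 {
    sum := int32(0)
    for i := 0; i < len(a) && i < len(b); i++ {
        sum += int32(a[i]) * int32(b[i])
    }
    return sum
}

// chooseDot picks the best implementation for this CPU once, at startup.
func chooseDot() func(a, b []int8) int32 {
    if hasAVX2() {
        // In real code this would return the assembly-backed AVX2
        // kernel, e.g.: return dotAVX2 (hypothetical).
    }
    return dotGeneric
}

var dotVector = chooseDot()

func main() {
    fmt.Println(dotVector([]int8{1, 2, 3, 4}, []int8{4, 3, 2, 1})) // 4+6+6+4 = 20
}
```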

Ultimately, the results of the optimization are truly impressive: a 9.3x increase in throughput and a 4x reduction in memory usage. The Chinese idiom "精益求精", akin to the English "pursuit of excellence", fits this journey well. This insightful article has been a rich source of learning, inspiration, and fresh perspectives; many thanks to the author for sharing it.

🧭 Related