golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

encoding/base64: decoding is slow #19636

Open josselin-c opened 7 years ago

josselin-c commented 7 years ago

What version of Go are you using (go version)?

Go 1.8

What operating system and processor architecture are you using (go env)?

amd64

What did you do?

On my slow computer, using encoding/base64, I can decode data at ~100 MB/s. It should be much faster, as demonstrated by https://github.com/aklomp/base64
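
For reference, throughput figures like this can be measured with a small harness along these lines (a rough sketch: `decodeThroughput` is a made-up helper, and absolute numbers will vary widely by machine and input size):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"time"
)

// decodeThroughput base64-encodes raw, then decodes the result `rounds`
// times and returns the measured decode rate in MB/s.
func decodeThroughput(raw []byte, rounds int) (float64, error) {
	enc := []byte(base64.StdEncoding.EncodeToString(raw))
	dst := make([]byte, base64.StdEncoding.DecodedLen(len(enc)))
	start := time.Now()
	for i := 0; i < rounds; i++ {
		if _, err := base64.StdEncoding.Decode(dst, enc); err != nil {
			return 0, err
		}
	}
	return float64(len(raw)*rounds) / time.Since(start).Seconds() / 1e6, nil
}

func main() {
	mbps, err := decodeThroughput(make([]byte, 1<<20), 50)
	if err != nil {
		panic(err)
	}
	fmt.Printf("~%.0f MB/s\n", mbps)
}
```

In real code `testing.B` with `b.SetBytes` is the idiomatic way to get MB/s figures out of `go test -bench`; the standalone program above just makes the arithmetic explicit.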

I'm planning to work on this in my spare time. This issue tracks this effort.

gopherbot commented 7 years ago

CL https://golang.org/cl/34950 mentions this issue.

dgryski commented 7 years ago

Also https://github.com/powturbo/TurboBase64 for a fast encoder/decoder that doesn't need assembly.

gopherbot commented 7 years ago

CL https://golang.org/cl/38632 mentions this issue.

josselin-c commented 7 years ago

@dgryski I've seen such LUT-based implementations and I'm wondering whether they're acceptable. What's Go's philosophy on embedding "large" LUTs? Won't such tables thrash the L1 cache when used and penalize the rest of the application? I have an SSE implementation in the pipeline that will be a further improvement over CL 38632, but it's limited to amd64.
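
For context, the "large LUT" approach under discussion can be sketched in pure Go: instead of a 256-byte table consulted once per input byte, a table indexed by a *pair* of characters halves the number of lookups, at the cost of a 128 KiB table — exactly the cache-footprint trade-off raised above. This is an illustrative sketch (`decode2`/`decode4` are made-up names, not TurboBase64's actual code):

```go
package main

import "fmt"

const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

// decode1 maps one base64 character to its 6-bit value (0xFF = invalid).
var decode1 [256]byte

// decode2 maps a pair of base64 characters to their combined 12-bit
// value (0xFFFF = invalid). At 128 KiB this is the kind of table that
// raises the L1-cache concern discussed above.
var decode2 [65536]uint16

func init() {
	for i := range decode1 {
		decode1[i] = 0xFF
	}
	for i, c := range []byte(alphabet) {
		decode1[c] = byte(i)
	}
	for i := range decode2 {
		hi, lo := decode1[i>>8], decode1[i&0xFF]
		if hi == 0xFF || lo == 0xFF {
			decode2[i] = 0xFFFF
		} else {
			decode2[i] = uint16(hi)<<6 | uint16(lo)
		}
	}
}

// decode4 decodes four base64 characters into three bytes using two
// table lookups instead of four.
func decode4(src []byte) (b0, b1, b2 byte, ok bool) {
	v1 := decode2[uint16(src[0])<<8|uint16(src[1])]
	v2 := decode2[uint16(src[2])<<8|uint16(src[3])]
	if v1 == 0xFFFF || v2 == 0xFFFF {
		return 0, 0, 0, false
	}
	n := uint32(v1)<<12 | uint32(v2)
	return byte(n >> 16), byte(n >> 8), byte(n), true
}

func main() {
	b0, b1, b2, ok := decode4([]byte("TWFu"))
	fmt.Println(string([]byte{b0, b1, b2}), ok)
}
```

The error check also gets cheaper: one comparison per pair of characters instead of one per character.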

powturbo commented 6 years ago

@josselin-c: I've seen such LUT-based implementations and I'm wondering whether they're acceptable.

Strange rumours. The short scalar "turbob64encs" encoder uses a 64-byte table, and the "turbob64decs" decoder uses just 80 bytes (of a 256-byte LUT) while delivering 2 GB/s in practical scenarios. That is less memory than other SIMD/AVX2 base64 functions use. The fast "turbob64dec" function in fact needs less than 1 KB of tables and decodes faster than other scalar/SSE functions. See the benchmark: Turbo-Base64

gopherbot commented 6 years ago

Change https://golang.org/cl/113776 mentions this issue: encoding/base64: slight decoding speed-up

gopherbot commented 5 years ago

Change https://golang.org/cl/151177 mentions this issue: encoding/base64: lift nil check out of decode loop
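
The idea behind "lift nil check out of decode loop" can be illustrated with a sketch (a hypothetical `encoding` type, not the actual stdlib code): touching the receiver's field once before the loop forces the nil check on the pointer to happen a single time, so the compiler can drop it from the loop body.

```go
package main

import "fmt"

// encoding is a stand-in for base64.Encoding: a pointer receiver
// whose decode table is consulted once per input byte.
type encoding struct {
	decodeMap [256]byte
}

func (e *encoding) decode(src []byte) []byte {
	_ = e.decodeMap // nil-check e once here instead of on every iteration
	dst := make([]byte, 0, len(src))
	for _, c := range src {
		dst = append(dst, e.decodeMap[c])
	}
	return dst
}

func main() {
	var e encoding
	for i := range e.decodeMap {
		e.decodeMap[i] = byte(i) // identity table, so decode is a copy
	}
	fmt.Println(string(e.decode([]byte("abc"))))
}
```

The blank assignment compiles to no data movement; its only effect is the implied dereference of `e`, which is what hoists the check.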

gopherbot commented 5 years ago

Change https://golang.org/cl/151197 mentions this issue: encoding/json: make decode32/decode64 inlineable

mvdan commented 5 years ago

With the two CLs above, the decoder goes from ~500 MB/s to ~630 MB/s on an 8 KiB input. Note that this is on a 2014 ultrabook locked at 70% CPU frequency, to prevent overheating and throttling.

The only remaining bottleneck I can see in the pure Go code is that it performs more bounds checks than needed per decoded chunk of bytes. A couple for every 8 bytes should be enough, but it currently does at least eight. If we can fix that via #28942, I think it should give another nice 5-10% speed-up.
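
The pattern in question looks like this (a generic illustration, not the base64 decoder's actual code): indexing the last needed element up front proves to the compiler that all smaller indexes are in range, so one bounds check covers the whole 8-byte chunk.

```go
package main

import "fmt"

// load8 reads 8 bytes as a little-endian uint64. The `_ = src[7]`
// statement is a bounds-check hint: it panics early if src is too
// short, and lets the compiler elide the checks on the eight loads
// that follow. This is the same idiom encoding/binary uses.
func load8(src []byte) uint64 {
	_ = src[7] // single bounds check for the chunk
	return uint64(src[0]) | uint64(src[1])<<8 | uint64(src[2])<<16 |
		uint64(src[3])<<24 | uint64(src[4])<<32 | uint64(src[5])<<40 |
		uint64(src[6])<<48 | uint64(src[7])<<56
}

func main() {
	fmt.Printf("%#x\n", load8([]byte{1, 2, 3, 4, 5, 6, 7, 8}))
}
```

Whether the compiler actually eliminates the checks can be verified with `go build -gcflags=-d=ssa/check_bce`.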

robfig commented 4 years ago

A recent paper describes an algorithm for base64 encoding that uses three AVX-512 instructions per 48 input / 64 output bytes: https://arxiv.org/pdf/1910.05109.pdf

C code implementing the algorithm is here: https://github.com/WojciechMula/base64simd

It looks like the Go assembler does not have these AVX-512 instructions defined.

mvdan commented 4 years ago

I'm not sure if using AVX512 is worth it for the standard library. We'd probably need to write assembly and have to start worrying about hardware compatibility.

The current decoder is already capable of reaching 1GiB/s on a modern CPU. Is that still considered slow for the standard library? If so, I think we should continue looking at pure Go improvements.

powturbo commented 4 years ago

The AVX-512 version requires AVX512VBMI, which is currently available only on Cannon Lake and Ice Lake (notebook) CPUs. In the paper they state: "The speed of the new AVX-512 codec is more than twice that of the state-of-the-art AVX2 codec". However, the AVX2 Turbo-Base64 version is also 2x faster than their AVX2 fastbase64 (see the benchmark). For short strings, Turbo-Base64 is 4 times faster than the next fastest base64 SIMD implementation. And recently, ClickHouse replaced the most popular base64 SIMD library on GitHub with Turbo-Base64. This is perhaps not for a standard library, but if you're looking for more speed you should consider a base64 SIMD implementation like Turbo-Base64, which delivers 26 GB/s on a modestly clocked 3.4 GHz Skylake. 1 GB/s is generally fast enough, but it can be too slow for databases, for example, as in the ClickHouse case.

robfig commented 4 years ago

cc @mvdan

Makes sense; it does seem like a significant and difficult assembly job. I spent some time looking into it, but as a first-time assembly coder I didn't make any progress.

I don't have a use case that requires it to be faster. I assumed there was interest in making it faster from the existence of an open issue, which I came upon when reviewing recent developments.

mvdan commented 4 years ago

Easy wins are always welcome, but I don't think adding hundreds of lines of assembly is in that camp :) A base64 package optimized for performance (instead of safety) should probably live outside the standard library.