achille-roussel opened this issue 1 year ago
This is not an API change, so taking it out of the proposal process.
The https://go.dev/wiki/AssemblyPolicy page is relevant here.
Probably related to #47120
Thanks for refining the category @ianlancetaylor, and for pointing out the Assembly Policy (somehow I had not found it); the document is extremely useful to help answer some of the questions I had:
- Minimize use of assembly. We'd rather have a small amount of assembly for a 50% speedup rather than twice as much assembly for a 55% speedup. Explain the decision to place the assembly/Go boundary where it is in the commit message, and support it with benchmarks.
Based on that example, it seems the change would be worth it: a ~900% increase in throughput is measurable and likely beneficial to any application that runs utf8.Valid on a non-trivial amount of data. We are adding ~300 lines of assembly to a package (unicode/utf8) that previously had none, but the amount of code remains manageable: there is a research paper documenting the algorithm, and most of the assembly consists of global masks used by SIMD instructions.
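To make the "support it with benchmarks" requirement concrete, here is a minimal size-swept benchmark sketch; the bucket sizes and sub-benchmark names are illustrative assumptions, not the benchmarks from the CL:

```go
package utf8_test

import (
	"bytes"
	"fmt"
	"testing"
	"unicode/utf8"
)

// BenchmarkValid sweeps input sizes so benchstat can compare the
// pure-Go and AVX2 paths bucket by bucket. Sizes are illustrative.
func BenchmarkValid(b *testing.B) {
	for _, size := range []int{0, 8, 16, 64, 512, 4096} {
		input := bytes.Repeat([]byte("a"), size)
		b.Run(fmt.Sprintf("size%d", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				utf8.Valid(input)
			}
		})
	}
}
```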
- Use higher level programs to generate non-trivial amounts of assembly, either standalone Go programs or go get-able programs, like avo. Output of other reproducible processes (like formally verified code generators) will also be considered. Discuss the implementation strategy on the issue tracker in advance.
The implementation was generated with Avo; I will look for examples of how/if Avo has been used in the stdlib (for example, I'm not sure whether we should add the Avo generator to the stdlib code somewhere). If anyone has material to share on this topic, it would be very useful.
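For anyone unfamiliar with Avo, this is essentially the minimal generator from Avo's own README (not the utf8 generator itself); running it with `go run asm.go -out add_amd64.s -stubs stub.go` emits the assembly file and the Go stub declarations:

```go
//go:build ignore

package main

import . "github.com/mmcloughlin/avo/build"

func main() {
	// Declare the Go signature the generated assembly will implement.
	TEXT("Add", NOSPLIT, "func(x, y uint64) uint64")
	Doc("Add returns x + y.")
	x := Load(Param("x"), GP64())
	y := Load(Param("y"), GP64())
	ADDQ(x, y)
	Store(y, ReturnIndex(0))
	RET()
	Generate()
}
```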
- Use small, testable units (25–75 lines) called from higher-level logic written in Go. If using small, testable functions called from logic written in Go is too slow, use small, testable assembly units with Go-compatible wrappers, so that Go tests can still test the individual units.
- Any assembly function needs a reference Go implementation, that’s tested side-by-side with the assembly. Follow golang.org/wiki/TargetSpecific for structure and testing practices.
- The interface of the assembly units and of the reference Go implementation must be the same across architectures, unless the platforms have fundamentally different capabilities (such as high-level cryptographic instructions).
This pretty much covers all the questions I had about how to shape the implementation and what kind of testing will be expected 👍
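As a sketch of the side-by-side testing described above (the names validAVX2 and validGo are assumptions, standing in for the assembly unit and the Go reference implementation):

```go
package utf8

import (
	"testing"
	"testing/quick"
)

// TestValidImplementationsAgree cross-checks the assembly path against
// the pure-Go reference on randomized inputs.
func TestValidImplementationsAgree(t *testing.T) {
	agree := func(p []byte) bool {
		return validAVX2(p) == validGo(p)
	}
	if err := quick.Check(agree, nil); err != nil {
		t.Error(err)
	}
}
```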
- Unless the Go Security team explicitly commits to owning the specific implementation, an external contributor must commit to maintaining it. If changes are required (for example as part of a broader refactor) and the maintainer is not available, the assembly will be removed.
- The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
- Document in the Go code why the implementation requires assembly (specific performance benefit, access to instructions, etc), so we can reevaluate as the compiler improves.
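To make the CI point concrete, the dispatch inside the standard library would look roughly like the sketch below (validAVX2/validGo are assumed names; internal/cpu is only importable from within the standard library). Running the tests with GODEBUG=cpu.avx2=off clears cpu.X86.HasAVX2, which forces the fallback so both paths can be tested separately:

```go
package utf8

import "internal/cpu"

// Valid dispatches to the AVX2 implementation when the CPU supports it;
// GODEBUG=cpu.avx2=off disables the feature bit and exercises the fallback.
func Valid(p []byte) bool {
	if cpu.X86.HasAVX2 {
		return validAVX2(p)
	}
	return validGo(p)
}
```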
cc @golang/security
@achille-roussel avo has already been used in crypto/internal, for example.
Thanks @arl, this is exactly the example I was looking for 👍
Change https://go.dev/cl/535838 mentions this issue: unicode/utf8.Valid: use AVX2 on amd64
It would be interesting to instrument utf8.Valid to build a histogram of the lengths of strings passed to it, run a handful of applications that use Valid, and multiply each bucket by its expected benefit according to the graph above. If all strings are short, the benefit will be negative, but it wouldn't take many longer strings to make the balance significantly positive.
My guess is that this call in protobuf alone might justify the new implementation.
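A sketch of what that instrumentation could look like, with invented names (utf8stats, lengthHist): a wrapper records a power-of-two histogram of input lengths, which can then be weighted by the per-bucket speedup from the graph:

```go
package utf8stats

import (
	"math/bits"
	"sync/atomic"
	"unicode/utf8"
)

// lengthHist[i] counts inputs of length n with bits.Len(n) == i,
// i.e. bucket 0 is n == 0 and bucket i covers [2^(i-1), 2^i).
var lengthHist [65]atomic.Uint64

// Valid records the input length, then defers to the real validator.
func Valid(p []byte) bool {
	lengthHist[bits.Len(uint(len(p)))].Add(1)
	return utf8.Valid(p)
}
```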
That's a great suggestion!
Note that the graph I posted above is a bit outdated, in the CL https://go-review.googlesource.com/c/go/+/535838 the implementation was improved and doesn't suffer regressions on small strings anymore.
@pelletier do you have an updated graph that would be more representative of the latest code version?
Here's an updated graph generated on my machine:
If anyone is interested in reproducing the results, I've uploaded a small Python script I used to generate this plot: https://gist.github.com/pelletier/46320448776c9ba9163270130ef556aa
For context, the graph initially posted on this issue showed the AVX2 version being slower than pure Go for inputs shorter than 32 bytes. This was because the initial implementation called the pure Go version of Valid for those inputs. That behavior was a trade-off we made to simplify the maintenance of this code: our use case seldom had small inputs, and it required less assembly. The version in CL 535838 doesn't have the same problem, as we decided a little more assembly was worth avoiding the small-input penalty if the code were to be included in the standard library.
Technically the benchmark above reports the AVX2 version being slower for empty inputs, and I haven't benchmarked it on inputs of sizes 1 through 7. Here are the first few lines of the benchstat output:
```
goos: linux
goarch: amd64
pkg: unicode/utf8
cpu: AMD Ryzen 9 5950X 16-Core Processor
                  │  base.txt   │              avx2.txt              │
                  │   sec/op    │    sec/op     vs base              │
Valid/small0-16     2.087n ± 0%    2.523n ± 0%  +20.94% (p=0.002 n=6)
Valid/small8-16     4.828n ± 0%    4.215n ± 0%  -12.70% (p=0.002 n=6)
Valid/small16-16    9.149n ± 0%    4.210n ± 0%  -53.98% (p=0.002 n=6)
```
UTF-8 validation is employed in many applications using text formats such as JSON.
An implementation of "Ridiculously fast unicode (UTF-8) validation" has been available in https://github.com/segmentio/asm/tree/main/utf8 since early 2022; authored by @pelletier and myself, it was published under the MIT license.
The algorithm described in the research paper yields up to a 10x throughput improvement on UTF-8 validation of medium-to-large strings, as shown in this preliminary comparison:
The algorithm also scales much better than the current implementation of utf8.Valid as the input size grows, as shown by this graph plotting the compute time of the two versions for input sizes of 0 to 400 bytes:

The main trade-off here is complexity: while the code has been fuzz-tested against the current version of utf8.Valid, and has been used successfully on production workloads, a bug will always be possible in this implementation or in ports to other platforms added later on.

I looked for prior discussions on improving the performance of unicode/utf8, but could not find much content on the topic. If accepted, this proposal will likely set a precedent for porting the algorithm to other architectures.

Part of the requirements to integrate it into the standard library could be to expand test coverage in the unicode/utf8 package, to ensure the correctness of this implementation and of hypothetical future additions for other architectures.
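For completeness, the pure-Go baseline that the assembly would be fuzzed and benchmarked against can be as simple as this sketch built on the stdlib decoder (the real reference would be the existing pure-Go utf8.Valid itself; this only illustrates the side-by-side baseline idea):

```go
package utf8ref

import "unicode/utf8"

// valid reports whether p consists entirely of valid UTF-8 sequences.
func valid(p []byte) bool {
	for i := 0; i < len(p); {
		r, size := utf8.DecodeRune(p[i:])
		// DecodeRune reports malformed input as (RuneError, 1), or size 0
		// on empty input; a genuine U+FFFD in the input decodes with size 3.
		if r == utf8.RuneError && size <= 1 {
			return false
		}
		i += size
	}
	return true
}
```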