achille-roussel opened this issue 1 year ago
This is not an API change, so taking it out of the proposal process.
The https://go.dev/wiki/AssemblyPolicy page is relevant here.
Probably related to #47120
Thanks for refining the category @ianlancetaylor, and for pointing out the Assembly Policy (somehow I had not found it); the document is extremely useful to help answer some of the questions I had:
- Minimize use of assembly. We'd rather have a small amount of assembly for a 50% speedup rather than twice as much assembly for a 55% speedup. Explain the decision to place the assembly/Go boundary where it is in the commit message, and support it with benchmarks.
Based on that example, it seems the change would be worth it: a ~900% increase in throughput is measurable and likely beneficial to any application that runs utf8.Valid on a non-trivial amount of data. We are adding ~300 lines of assembly to a package (unicode/utf8) that previously had none, but the amount of code remains manageable: there is a research paper documenting the algorithm, and most of the assembly consists of global masks used by SIMD instructions.
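To make the "support it with benchmarks" requirement concrete, here is a minimal size-swept benchmark sketch; the bucket sizes and sub-benchmark names are illustrative assumptions, not the benchmarks from the CL:

```go
package utf8_test

import (
	"bytes"
	"fmt"
	"testing"
	"unicode/utf8"
)

// BenchmarkValid sweeps input sizes so benchstat can compare the
// pure-Go and AVX2 paths bucket by bucket. Sizes are illustrative.
func BenchmarkValid(b *testing.B) {
	for _, size := range []int{0, 8, 16, 64, 512, 4096} {
		input := bytes.Repeat([]byte("a"), size)
		b.Run(fmt.Sprintf("size%d", size), func(b *testing.B) {
			b.SetBytes(int64(size))
			for i := 0; i < b.N; i++ {
				utf8.Valid(input)
			}
		})
	}
}
```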
- Use higher level programs to generate non-trivial amounts of assembly, either standalone Go programs or go get-able programs, like avo. Output of other reproducible processes (like formally verified code generators) will also be considered. Discuss the implementation strategy on the issue tracker in advance.
The implementation was generated with Avo; I will look for examples of how/if Avo has been used in the stdlib (for example, I'm not sure whether we should add the Avo generator to the stdlib code somewhere). If anyone has material to share on this topic, it would be very useful.
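For anyone unfamiliar with Avo, this is essentially the minimal generator from Avo's own README (not the utf8 generator itself); running it with `go run asm.go -out add_amd64.s -stubs stub.go` emits the assembly file and the Go stub declarations:

```go
//go:build ignore

package main

import . "github.com/mmcloughlin/avo/build"

func main() {
	// Declare the Go signature the generated assembly will implement.
	TEXT("Add", NOSPLIT, "func(x, y uint64) uint64")
	Doc("Add returns x + y.")
	x := Load(Param("x"), GP64())
	y := Load(Param("y"), GP64())
	ADDQ(x, y)
	Store(y, ReturnIndex(0))
	RET()
	Generate()
}
```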
- Use small, testable units (25–75 lines) called from higher-level logic written in Go. If using small, testable functions called from logic written in Go is too slow, use small, testable assembly units with Go-compatible wrappers, so that Go tests can still test the individual units.
- Any assembly function needs a reference Go implementation, that’s tested side-by-side with the assembly. Follow golang.org/wiki/TargetSpecific for structure and testing practices.
- The interface of the assembly units and of the reference Go implementation must be the same across architectures, unless the platforms have fundamentally different capabilities (such as high-level cryptographic instructions).
This pretty much covers all the questions I had about how to shape the implementation and what kind of testing will be expected 👍
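As a sketch of the side-by-side testing described above (the names validAVX2 and validGo are assumptions, standing in for the assembly unit and the Go reference implementation):

```go
package utf8

import (
	"testing"
	"testing/quick"
)

// TestValidImplementationsAgree cross-checks the assembly path against
// the pure-Go reference on randomized inputs.
func TestValidImplementationsAgree(t *testing.T) {
	agree := func(p []byte) bool {
		return validAVX2(p) == validGo(p)
	}
	if err := quick.Check(agree, nil); err != nil {
		t.Error(err)
	}
}
```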
- Unless the Go Security team explicitly commits to owning the specific implementation, an external contributor must commit to maintaining it. If changes are required (for example as part of a broader refactor) and the maintainer is not available, the assembly will be removed.
- The code must be tested in our CI. This means there need to be builders that support the instructions, and if there are multiple (or fallback) paths they must be tested separately. (Tip: use GODEBUG=cpu.X=off to disable detection of CPU features.)
- Document in the Go code why the implementation requires assembly (specific performance benefit, access to instructions, etc), so we can reevaluate as the compiler improves.
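To make the CI point concrete, the dispatch inside the standard library would look roughly like the sketch below (validAVX2/validGo are assumed names; internal/cpu is only importable from within the standard library). Running the tests with GODEBUG=cpu.avx2=off clears cpu.X86.HasAVX2, which forces the fallback so both paths can be tested separately:

```go
package utf8

import "internal/cpu"

// Valid dispatches to the AVX2 implementation when the CPU supports it;
// GODEBUG=cpu.avx2=off disables the feature bit and exercises the fallback.
func Valid(p []byte) bool {
	if cpu.X86.HasAVX2 {
		return validAVX2(p)
	}
	return validGo(p)
}
```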
cc @golang/security
@achille-roussel avo has already been used in crypto/internal, for example.
Thanks @arl, this is exactly the example I was looking for 👍
Change https://go.dev/cl/535838 mentions this issue: unicode/utf8.Valid: use AVX2 on amd64
It would be interesting to instrument utf8.Valid to build a histogram of the lengths of strings passed to it, run a handful of applications that use Valid, and multiply each bucket by its expected benefit according to the graph above. If all strings are short, the benefit will be negative, but it wouldn't take many longer strings to make the balance significantly positive.
My guess is that this call in protobuf alone might justify the new implementation.
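A sketch of what that instrumentation could look like, with invented names (utf8stats, lengthHist): a wrapper records a power-of-two histogram of input lengths, which can then be weighted by the per-bucket speedup from the graph:

```go
package utf8stats

import (
	"math/bits"
	"sync/atomic"
	"unicode/utf8"
)

// lengthHist[i] counts inputs of length n with bits.Len(n) == i,
// i.e. bucket 0 is n == 0 and bucket i covers [2^(i-1), 2^i).
var lengthHist [65]atomic.Uint64

// Valid records the input length, then defers to the real validator.
func Valid(p []byte) bool {
	lengthHist[bits.Len(uint(len(p)))].Add(1)
	return utf8.Valid(p)
}
```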
That's a great suggestion!
Note that the graph I posted above is a bit outdated, in the CL https://go-review.googlesource.com/c/go/+/535838 the implementation was improved and doesn't suffer regressions on small strings anymore.
@pelletier do you have an updated graph that would be more representative of the latest code version?
Here's an updated graph generated on my machine:
If anyone is interested in reproducing the results, I've uploaded a small Python script I used to generate this plot: https://gist.github.com/pelletier/46320448776c9ba9163270130ef556aa
For context, the graph initially posted on this issue showed the AVX2 version being slower than pure Go for inputs shorter than 32 bytes. This was because the initial implementation called the pure Go version of Valid for those inputs. That behavior was a trade-off we made to simplify the maintenance of this code: our use case seldom had small inputs, and it required less assembly. The version in CL 535838 doesn't have the same problem, as we decided a little more assembly was worth avoiding the small-input penalty if the code were to be included in the standard library.
Technically the benchmark above reports the AVX2 version being slower for empty inputs, and I haven't benchmarked it on inputs of sizes 1 through 7. Here are the first few lines of the benchstat output:
```
goos: linux
goarch: amd64
pkg: unicode/utf8
cpu: AMD Ryzen 9 5950X 16-Core Processor
                  │  base.txt   │              avx2.txt              │
                  │   sec/op    │    sec/op     vs base              │
Valid/small0-16     2.087n ± 0%    2.523n ± 0%  +20.94% (p=0.002 n=6)
Valid/small8-16     4.828n ± 0%    4.215n ± 0%  -12.70% (p=0.002 n=6)
Valid/small16-16    9.149n ± 0%    4.210n ± 0%  -53.98% (p=0.002 n=6)
```
UTF-8 validation is employed in many applications using text formats such as JSON.
An implementation of "Ridiculously fast unicode (UTF-8) validation" has been available in https://github.com/segmentio/asm/tree/main/utf8 since early 2022; authored by @pelletier and myself, it was published under the MIT license.
The algorithm described in the research paper yields up to a 10x throughput improvement on UTF-8 validation of medium-to-large strings, as shown in this preliminary comparison:
The algorithm also scales much better than the current implementation of utf8.Valid as the input size grows, as shown by this graph plotting the compute time of the two versions for input sizes of 0 to 400 bytes:

The main trade-off here is complexity: while the code has been fuzz-tested against the current version of utf8.Valid, and has been used successfully on production workloads, a bug will always be possible in this implementation or in ports to other platforms added later on.

I looked for prior discussions on improving the performance of unicode/utf8, but could not find much content on the topic. If accepted, this proposal will likely set a precedent for porting the algorithm to other architectures.

Part of the requirements to integrate it into the standard library could be to expand test coverage in the unicode/utf8 package, to ensure the correctness of this implementation and of hypothetical future additions for other architectures.
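For completeness, the pure-Go baseline that the assembly would be fuzzed and benchmarked against can be as simple as this sketch built on the stdlib decoder (the real reference would be the existing pure-Go utf8.Valid itself; this only illustrates the side-by-side baseline idea):

```go
package utf8ref

import "unicode/utf8"

// valid reports whether p consists entirely of valid UTF-8 sequences.
func valid(p []byte) bool {
	for i := 0; i < len(p); {
		r, size := utf8.DecodeRune(p[i:])
		// DecodeRune reports malformed input as (RuneError, 1), or size 0
		// on empty input; a genuine U+FFFD in the input decodes with size 3.
		if r == utf8.RuneError && size <= 1 {
			return false
		}
		i += size
	}
	return true
}
```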