utf8: AVX2 implementation of Valid

pelletier commented 2 years ago

This branch is a Go implementation of the Keiser-Lemire "Validating UTF-8 In Less Than One Instruction Per Byte" paper. For inputs under 32 bytes or on machines without AVX2 support, a re-implementation of the stdlib algorithm is used.

For incomplete blocks of 32 bytes, this version still uses the vector registers.

This code exposes two functions Valid([]byte) bool and Validate([]byte) (bool, bool). Valid is a drop-in replacement for the standard library's unicode.Valid. Validate is a more precise function that also returns whether the input was valid ASCII. For small strings, ascii.Valid is used as a first pass, then stdlib's utf8.Valid is used. This is possibly responsible for the overhead we are seeing for inputs < 32 bytes.

Current results:

goos: darwin
goarch: amd64
pkg: github.com/segmentio/asm/utf8
cpu: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

name                    time/op
Valid/1kValid/AVX-8       80.0ns ± 2%
Valid/1kValid/Stdlib-8     733ns ± 2%
Valid/1MValid/AVX-8       76.8µs ± 2%
Valid/1MValid/Stdlib-8     751µs ± 1%
Valid/10ASCII/Stdlib-8    4.07ns ± 0%
Valid/10ASCII/AVX-8       7.70ns ± 2%
Valid/10Japan/AVX-8       28.6ns ± 1%
Valid/10Japan/Stdlib-8    27.0ns ± 1%

name                    speed
Valid/1kValid/AVX-8     12.8GB/s ± 2%
Valid/1kValid/Stdlib-8  1.40GB/s ± 2%
Valid/1MValid/AVX-8     13.7GB/s ± 2%
Valid/1MValid/Stdlib-8  1.40GB/s ± 1%
Valid/10ASCII/Stdlib-8  2.46GB/s ± 0%
Valid/10ASCII/AVX-8     1.30GB/s ± 2%
Valid/10Japan/AVX-8     1.05GB/s ± 1%
Valid/10Japan/Stdlib-8  1.11GB/s ± 1%

This is my first time writing Go assembly, so I'd appreciate any kind of feedback!

ns/op, for arrays up to 400 bytes (lower is better):

ns/op, for arrays up to 64MiB (lower is better):

data-large

Machine: specs Code used to generate graphs: plot.py

Todo

[x] Generate code with AVO.
[x] Check AVX2 support.
[x] Use lower overhead algorithm for < 32B.
[x] ~Understand why the low overhead algorithm is slower than stdlib.~ Not understood, but after iterating on the code, the low overhead algorithm is as fast as the standard library one of an Intel CPU (not AMD, somehow).
[x] Make the test suite faster.
[x] Also returns whether the input was ascii only.
[x] Fix table generation (see https://github.com/segmentio/asm/pull/58/commits/3716cfde7c718e41aa11ba97b2ba7d2cb079a738)
[x] Reuse stdlib's utf8.first and acceptRanges tables.
[x] ~Cover profile for generated asm code~. I don't think that's possible.

Further work

pelletier commented 2 years ago

First bug found by the Go1.18 fuzzing system:

[tpelletier@thinkpad utf8]$ gotip test -run _ -fuzz ./
warning: starting with empty corpus
fuzz: elapsed: 0s, execs: 0 (0/sec), new interesting: 0 (total: 0)
fuzz: minimizing 1863-byte failing input file
fuzz: elapsed: 0s, minimizing
--- FAIL: FuzzValid (0.41s)
    --- FAIL: FuzzValid (0.00s)
        valid_fuzz_test.go:16: Valid("0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\xc60") = true; want false

    Failing input written to testdata/fuzz/FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
    To re-run:
    go test -run=FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
FAIL
exit status 1
FAIL    github.com/segmentio/asm/utf8   0.416s

pelletier commented 2 years ago

The last two remaining tasks are:

Reuse stdlib's utf8.first and acceptRanges tables.
Cover profile for generated asm code.

I haven't found a way to do either. If anyone has an idea I'm open to suggestions!

Otherwise, I think the PR is ready for reviews.

achille-roussel commented 2 years ago

I don't feel bad if we don't reuse the stdlib symbols, taking dependencies on unexpired APIs always has a hire maintenance cost.

pelletier commented 2 years ago

As an experiment, commit https://github.com/segmentio/asm/pull/58/commits/4a7bb03c6c885fc6973db0359900ff454c717ffa shows what it would look like to call the stdlib directly as opposed to re-implementing it. It's slightly slower on the current benchmarks, but the easier maintenance is probably worth it.

pelletier commented 2 years ago

Difference validating inputs using AVX with leftover bytes, between the memory scratch and fully in vector registers:

benchstat out-old.txt out-new.txt
name                 old time/op    new time/op     delta
Valid/tail300/AVX-8    32.4ns ± 2%     28.0ns ± 2%  -13.74%  (p=0.008 n=5+5)
Valid/tail316/AVX-8    32.6ns ± 0%     28.1ns ± 0%  -13.74%  (p=0.008 n=5+5)

name                 old speed      new speed       delta
Valid/tail300/AVX-8  9.26GB/s ± 2%  10.73GB/s ± 2%  +15.93%  (p=0.008 n=5+5)
Valid/tail316/AVX-8  9.69GB/s ± 0%  11.24GB/s ± 0%  +15.93%  (p=0.008 n=5+5)

pelletier commented 2 years ago

Nice to see the duration variations between multiples of 32 being dampened:

data

segmentio / asm

utf8: AVX2 implementation of Valid #58

Todo

Further work