Closed pelletier closed 2 years ago
First bug found by the Go1.18 fuzzing system:
[tpelletier@thinkpad utf8]$ gotip test -run _ -fuzz ./
warning: starting with empty corpus
fuzz: elapsed: 0s, execs: 0 (0/sec), new interesting: 0 (total: 0)
fuzz: minimizing 1863-byte failing input file
fuzz: elapsed: 0s, minimizing
--- FAIL: FuzzValid (0.41s)
--- FAIL: FuzzValid (0.00s)
valid_fuzz_test.go:16: Valid("0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000\xc60") = true; want false
Failing input written to testdata/fuzz/FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
To re-run:
go test -run=FuzzValid/10d8eaee7858193ed8118cacee74232872e061aa7a7a768ba0792bf7bbb22b72
FAIL
exit status 1
FAIL github.com/segmentio/asm/utf8 0.416s
The last two remaining tasks are:
I haven't found a way to do either. If anyone has an idea I'm open to suggestions!
Otherwise, I think the PR is ready for reviews.
I don't feel bad if we don't reuse the stdlib symbols, taking dependencies on unexpired APIs always has a hire maintenance cost.
As an experiment, commit https://github.com/segmentio/asm/pull/58/commits/4a7bb03c6c885fc6973db0359900ff454c717ffa shows what it would look like to call the stdlib directly as opposed to re-implementing it. It's slightly slower on the current benchmarks, but the easier maintenance is probably worth it.
Difference validating inputs using AVX with leftover bytes, between the memory scratch and fully in vector registers:
benchstat out-old.txt out-new.txt
name old time/op new time/op delta
Valid/tail300/AVX-8 32.4ns ± 2% 28.0ns ± 2% -13.74% (p=0.008 n=5+5)
Valid/tail316/AVX-8 32.6ns ± 0% 28.1ns ± 0% -13.74% (p=0.008 n=5+5)
name old speed new speed delta
Valid/tail300/AVX-8 9.26GB/s ± 2% 10.73GB/s ± 2% +15.93% (p=0.008 n=5+5)
Valid/tail316/AVX-8 9.69GB/s ± 0% 11.24GB/s ± 0% +15.93% (p=0.008 n=5+5)
Nice to see the duration variations between multiples of 32 being dampened:
This branch is a Go implementation of the Keiser-Lemire "Validating UTF-8 In Less Than One Instruction Per Byte" paper. For inputs under 32 bytes or on machines without AVX2 support, a re-implementation of the stdlib algorithm is used.
For incomplete blocks of 32 bytes, this version still uses the vector registers.
This code exposes two functions
Valid([]byte) bool
andValidate([]byte) (bool, bool)
.Valid
is a drop-in replacement for the standard library'sunicode.Valid
.Validate
is a more precise function that also returns whether the input was valid ASCII. For small strings,ascii.Valid
is used as a first pass, then stdlib'sutf8.Valid
is used. This is possibly responsible for the overhead we are seeing for inputs < 32 bytes.Current results:
This is my first time writing Go assembly, so I'd appreciate any kind of feedback!
ns/op, for arrays up to 400 bytes (lower is better):
ns/op, for arrays up to 64MiB (lower is better):
Machine: specs Code used to generate graphs: plot.py
Todo
utf8.first
andacceptRanges
tables.Further work