Open kirillx opened 4 years ago
Thanks for reporting this.
IndexByte has been optimized significantly more than LastIndexByte, and it appears to be ~7x faster in comparable benchmarks (i.e. IndexByte(aaaa...ax) vs LastIndexByte(xa...aaaa), when looking for x).
I guess we could port the same implementation to LastIndex.
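For anyone who wants to reproduce the comparison, here is a minimal sketch of such a benchmark. The helper names and the 4096-byte input size are assumptions (the comment above does not say which sizes were measured), and the ratio you see will depend on Go version and CPU.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// forwardInput returns n bytes of 'a' with a single 'x' at the end,
// so a front-to-back search has to scan the whole slice.
func forwardInput(n int) []byte {
	b := bytes.Repeat([]byte{'a'}, n)
	b[n-1] = 'x'
	return b
}

// backwardInput returns n bytes of 'a' with a single 'x' at the front,
// so a back-to-front search has to scan the whole slice.
func backwardInput(n int) []byte {
	b := bytes.Repeat([]byte{'a'}, n)
	b[0] = 'x'
	return b
}

var sink int // keeps the compiler from discarding the calls

func main() {
	const size = 4096 // assumed input size, not taken from the thread
	fwd, bwd := forwardInput(size), backwardInput(size)

	idx := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size)
		for i := 0; i < b.N; i++ {
			sink = bytes.IndexByte(fwd, 'x')
		}
	})
	last := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size)
		for i := 0; i < b.N; i++ {
			sink = bytes.LastIndexByte(bwd, 'x')
		}
	})

	fmt.Println("IndexByte:    ", idx)
	fmt.Println("LastIndexByte:", last)
}
```

Both inputs force a full scan, so the difference in ns/op should mostly reflect the search implementation rather than where the match sits.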
I was just about to create a new issue for this but noticed this one already exists. I came across it because profiling data shows LastIndexByte is used a lot nowadays (e.g. in proto code) and accounts for a good chunk of the overall CPU time profiled.
I agree we should optimize strings/bytes.LastIndexByte similar to IndexByte: https://github.com/golang/go/blob/master/src/internal/bytealg/index_amd64.s
Maybe status = NeedsFix?
I assigned myself earlier because I already have a prototype that passes ./all.bash, but it needs benchmarking on a quiet machine and double-checking of page boundary handling before I send it for review.
Change https://golang.org/cl/266538 mentions this issue: strings, bytes: use SIMD for LastIndexByte on amd64
@martisch are you planning on merging that CL at some point? Do you need any help with testing/review/benchmarking?
I can plan to merge it next cycle.
The last thing I was missing is a test that the page boundary at the beginning of the data is honoured. If someone could amend the existing test (or helpers) to create a test string/byte slice where the pages before (and after) the data are protected, that would help. Last time I checked it only tested one direction, but the tests have changed recently and I did not check again.
Testing with protected pages at both the beginning and the end, to make sure operations using SIMD do not read too much data, would also help existing code and can be an independent CL.
Change https://go.dev/cl/522475 mentions this issue: internal/bytealg: add generic LastIndexByte{,String}
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
yes.
What operating system and processor architecture are you using (go env)?
Both Linux/MacOS
What did you do?
I was using multipart.NewReader() to process multi-part responses from Cloud REST API. It turned out that ~1/3 of profile is spent in mime/multipart/multipart.go :: scanUntilBoundary() -> bytes.LastIndexByte().
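A minimal sketch of this kind of workload, for anyone who wants to profile it locally. It builds a multipart body in memory instead of calling a real REST API; buildBody and drain are hypothetical helpers, and the part count and sizes are arbitrary assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
)

// buildBody encodes `parts` form fields of `size` bytes each and returns the
// body together with the boundary string chosen by the writer.
func buildBody(parts, size int) ([]byte, string, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	payload := bytes.Repeat([]byte("a"), size)
	for i := 0; i < parts; i++ {
		pw, err := w.CreateFormField(fmt.Sprintf("part%d", i))
		if err != nil {
			return nil, "", err
		}
		if _, err := pw.Write(payload); err != nil {
			return nil, "", err
		}
	}
	if err := w.Close(); err != nil {
		return nil, "", err
	}
	return buf.Bytes(), w.Boundary(), nil
}

// drain reads every part to the end and returns the total number of payload
// bytes seen. Reading part bodies is what exercises the boundary scanning.
func drain(body []byte, boundary string) (int64, error) {
	r := multipart.NewReader(bytes.NewReader(body), boundary)
	var total int64
	for {
		p, err := r.NextPart()
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
		n, err := io.Copy(io.Discard, p)
		total += n
		if err != nil {
			return total, err
		}
	}
}

func main() {
	body, boundary, err := buildBody(4, 1<<20) // assumed: 4 parts of 1 MiB
	if err != nil {
		panic(err)
	}
	n, err := drain(body, boundary)
	if err != nil {
		panic(err)
	}
	fmt.Println("payload bytes read:", n)
}
```

Wrapping the drain call in a loop and running it under a CPU profiler should show whether the boundary scan dominates on a given Go version.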
After looking into it, this is no wonder: bytes.LastIndexByte() does not use any optimisations and is compiled into a simple loop iterating over bytes; no REP SCASB instruction is used on Intel (nor SSE).
What did you expect to see?
bytes.LastIndexByte() to use SSE or at least REP SCASB optimised code.
What did you see instead?
A simple byte-by-byte loop in the asm code.
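For reference, the loop in question is roughly equivalent to the following pure-Go sketch. lastIndexByteGeneric is a hypothetical name used for illustration; it mirrors the behaviour described above and is not copied from the standard library source.

```go
package main

import "fmt"

// lastIndexByteGeneric scans s from the end and returns the index of the
// last occurrence of c, or -1 if c is not present. It examines one byte per
// iteration, with no SIMD or string instructions involved.
func lastIndexByteGeneric(s []byte, c byte) int {
	for i := len(s) - 1; i >= 0; i-- {
		if s[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(lastIndexByteGeneric([]byte("xaaax"), 'x'))
	fmt.Println(lastIndexByteGeneric([]byte("aaaa"), 'x'))
}
```

A vectorised version would compare 16 or more bytes per iteration, which is where the gap to IndexByte comes from.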