golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.94k stars 17.66k forks source link

unicode/utf8: potential performance improvement of UTF8 validation #69915

Open romshark opened 3 weeks ago

romshark commented 3 weeks ago

Proposal Details

I've been experimenting with different kinds of optimization techniques in order to speed up utf8.Valid and utf8.ValidString and I'd like to share my findings that appear to be low-hanging fruit, at least on ARM64:

Screenshot 2024-10-16 at 23 42 49

The change is minimal but yields performance improvements of up to 41% on Apple M1 Max for all-UTF8 inputs while not showing any significant negative side-effects. Though on AMD Ryzen 7 5700x the geomean is significantly lower at just -1.43%, with some benchmarks even showing significant performance degradation ⚠️.

The test setup with benchmarks can be found here: https://github.com/romshark/utf8valid

Relation to Other Proposals

It might be related to https://github.com/golang/go/issues/47120 but I haven't had the opportunity to dig into the details of Russ' proposal. At first glance it looks like Russ tackled exactly that problem and my proposal may just be a duplicate.

Compared to https://github.com/golang/go/issues/63347 this proposal doesn't require changes of significant complexity, it's a 9-line code change.

gabyhelp commented 3 weeks ago

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

ianlancetaylor commented 3 weeks ago

I don't see any API change here, so taking this out of the proposal process.

randall77 commented 3 weeks ago

You've gotten performance improvements but I don't see any description of the changes you made to get those performance improvements (unless the changes are all in that screenshot?).

It would be nice to post a CL, together with some Benchmarks that demonstrate the improvements (if those benchmarks are not already present in the utf8 package).

adonovan commented 2 weeks ago

@randall77 See the linked repo, which contains the patched code and the benchmark results. The diff is hard to see (it's not even in the history of the repo), but it really is just this:

@@ -473,16 +473,22 @@ func Valid(p []byte) bool {
                        return false // Short or invalid.
                }
                accept := acceptRanges[x>>4]
+               // The change is here. Instead of reusing `size` constants 2, 3 and 4
+               // are used which leads to performance improvement of up to 41%.
                if c := p[i+1]; c < accept.lo || accept.hi < c {
                        return false
                } else if size == 2 {
+                       i += 2
                } else if c := p[i+2]; c < locb || hicb < c {
                        return false
                } else if size == 3 {
+                       i += 3
                } else if c := p[i+3]; c < locb || hicb < c {
                        return false
+               } else {
+                       i += 4
                }
-               i += size
+
        }
        return true
 }
@@ -519,16 +525,21 @@ func ValidString(s string) bool {
                        return false // Short or invalid.
                }
                accept := acceptRanges[x>>4]
+               // The change is here. Instead of reusing `size` constants 2, 3 and 4
+               // are used which leads to performance improvement of up to 41%.
                if c := s[i+1]; c < accept.lo || accept.hi < c {
                        return false
                } else if size == 2 {
+                       i += 2
                } else if c := s[i+2]; c < locb || hicb < c {
                        return false
                } else if size == 3 {
+                       i += 3
                } else if c := s[i+3]; c < locb || hicb < c {
                        return false
+               } else {
+                       i += 4
                }
-               i += size
        }
        return true
 }
ldemailly commented 2 weeks ago

I guess the speedup comes from https://cs.opensource.google/go/go/+/refs/tags/go1.23.2:src/unicode/utf8/utf8.go;l=508

        size := int(x & 7)

and the compiler can't know that 5,6,7 are never supposed to happen (vs hardcoding += 4) (thanks to how first[] is constructed)

nor 1 because of

        if si < RuneSelf {
ldemailly commented 2 weeks ago

I think you should make a CL but also add comments to explain why it works / why +4 is the only possible case in the else

also I wonder if you could benchmark that vs a switch (which would be more readable and equivalent I think)

AlexanderYastrebov commented 2 weeks ago

It might be related to #47120 but I haven't had the opportunity to dig into the details of Russ' proposal. At first glance it looks like Russ tackled exactly that problem and my proposal may just be a duplicate.

DFA implementation from #47120 looks much simpler than existing one. It would be interesting to see how this change benchmark results compare to DFA results.

randall77 commented 2 weeks ago

I'm not sure I understand how that change makes things any faster, but if there are benchmarks that demonstrate it I'm ok with it.

An aside, feel free to ignore...

size is already needed in a register for the conditional, so it isn't clear to me why += const is any faster than += register. Both are identical speedwise on all chips I've seen. Maybe because we have to merge from each case back to a single += instruction before jumping to the top of the loop, instead of jumping to the top of the loop directly? It might be worth trying += size in each if block instead of after, just to see what happens. Not that the final code should do that, but just to tease out why there is a speed improvement.