dsnet opened this issue 7 years ago (status: open)
Based on the profiling data that you mention, it looks like the unescape time is split evenly between `runtime.makeslice` and `runtime.slicebytetostring`.
The `net/url.shouldEscape` function is responsible for only ~7% of the CPU time, so LUTs might not help as much as we would like.
Regarding the allocations and `slicebytetostring`, see compiler feature request #6714.
/cc @randall77 @cherrymui @josharian @dr2chase
See also: #18990
I'm not sure it's the kind of optimization you want, but I find that if I do some "hand optimization", just eliminating branches, I get pretty significant speed increases:
```go
const ESCAPE_ME = "net/url: optimize unescape and escape"

var s string

func BenchmarkPathEscape(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = url.PathEscape(ESCAPE_ME)
	}
}

func shouldPathEscape(c byte) bool {
	if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
		return false
	}
	switch c {
	case '-', '_', '.', '~', ':', '@', '&', '=', '+', '$':
		return false
	}
	return true
}

func MyPathEscape(s string) string {
	hexCount := 0
	for i := 0; i < len(s); i++ {
		if shouldPathEscape(s[i]) {
			hexCount++
		}
	}
	if hexCount == 0 {
		return s
	}
	t := make([]byte, len(s)+2*hexCount)
	j := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if shouldPathEscape(c) {
			t[j] = '%'
			t[j+1] = "0123456789ABCDEF"[c>>4]
			t[j+2] = "0123456789ABCDEF"[c&15]
			j += 3
		} else {
			t[j] = c
			j++
		}
	}
	return string(t)
}

func BenchmarkMyPathEscape(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = MyPathEscape(ESCAPE_ME)
	}
}
```
```
goos: darwin
goarch: amd64
BenchmarkPathEscape-8      3000000    438 ns/op
BenchmarkMyPathEscape-8   10000000    147 ns/op
```
It seems to me that calls to functions like `escape` should probably be inlined when a parameter on which many branches depend is given a constant argument:
```go
func PathEscape(s string) string {
	return escape(s, encodePathSegment) // not inlined, but probably should be
}
```
I also experimented with some optimizations (mostly unescape) but have not cleaned them up into a CL yet, so I am interested in what we can improve too. Maybe I will still get around to finishing them for 1.10.
I was thinking of creating a `[256]uint8` table where each byte contains some flags, e.g. shouldPathEscape (if we need more flags this can become `uint32`). That way the same lookup table can be used in multiple functions to quickly look up properties of ASCII characters, without wasting RAM on a separate lookup table each time. This could even be stored in a separate package so multiple std lib packages (e.g. `unicode.IsSpace` or `strings.Fields`) can share the same table.
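A minimal sketch of that packed-flag idea, under stated assumptions: the names `asciiProps`, `propShouldPathEscape`, and `propIsSpace` are hypothetical, and a real implementation would likely use a generated constant table rather than building it at init time:

```go
package main

import "fmt"

// Hypothetical per-character property flags, packed one byte per ASCII value.
const (
	propShouldPathEscape = 1 << iota
	propIsSpace
)

// asciiProps is the shared [256]uint8 table; each entry ORs together the
// flags that apply to that byte value.
var asciiProps [256]uint8

func shouldPathEscape(c byte) bool {
	if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
		return false
	}
	switch c {
	case '-', '_', '.', '~', ':', '@', '&', '=', '+', '$':
		return false
	}
	return true
}

func init() {
	for i := 0; i < 256; i++ {
		c := byte(i)
		if shouldPathEscape(c) {
			asciiProps[i] |= propShouldPathEscape
		}
		switch c {
		case ' ', '\t', '\n', '\v', '\f', '\r':
			asciiProps[i] |= propIsSpace
		}
	}
}

func main() {
	// Callers test a property with a single load and mask.
	fmt.Println(asciiProps['a']&propShouldPathEscape != 0) // false
	fmt.Println(asciiProps['%']&propShouldPathEscape != 0) // true
	fmt.Println(asciiProps[' ']&propIsSpace != 0)          // true
}
```

The point of the design is that adding a new predicate costs one more flag bit, not another 256-byte table.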
For unhex I prefer using `9*(c>>6) + (c&15)` over a LUT. I will have to benchmark this again, but I will send it as a small separate improvement for the std lib occurrences if it still checks out to be faster than the current branch-based version.
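As a sketch, the arithmetic version works because of how ASCII lays out the hex digits; like a LUT it assumes the input has already been validated as a hex digit (as net/url does with `ishex` before calling `unhex`):

```go
package main

import "fmt"

// unhexArith decodes a single validated hex digit without branches or a table.
// For '0'-'9' (0x30-0x39), c>>6 is 0, so the result is c&15, i.e. 0-9.
// For 'A'-'F' (0x41-0x46) and 'a'-'f' (0x61-0x66), c>>6 is 1, so the result
// is 9 + (c&15) = 9 + 1..6 = 10..15.
func unhexArith(c byte) byte {
	return 9*(c>>6) + (c & 15)
}

func main() {
	for _, c := range []byte("09aF") {
		fmt.Println(unhexArith(c))
	}
}
```

This prints 0, 9, 10, 15. Invalid bytes produce garbage rather than an error, which is why the prior validation matters.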
@ccbrown Some comments while briefly looking at the code from the above comment:
```go
if shouldPathEscape(s[i]) {
	hexCount++
}
```
If this is not already recognized by the compiler and turned into a setCC instruction plus an add on AMD64 instead of a branch, we should have a look at optimizing that. (Maybe there is even an add that just takes the flag without setCC.)
Does your new version pass all the tests?
Tried a few more things.
Using a boolean LUT does make things faster:
```go
func MyBoolLUTEscape(s string, shouldEscapeLUT [256]bool) string {
	hexCount := 0
	for i := 0; i < len(s); i++ {
		if shouldEscapeLUT[s[i]] {
			hexCount++
		}
	}
	if hexCount == 0 {
		return s
	}
	t := make([]byte, len(s)+2*hexCount)
	j := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if shouldEscapeLUT[c] {
			t[j] = '%'
			t[j+1] = "0123456789ABCDEF"[c>>4]
			t[j+2] = "0123456789ABCDEF"[c&15]
			j += 3
		} else {
			t[j] = c
			j++
		}
	}
	return string(t)
}
```
```
BenchmarkPathEscape-8           3000000    438 ns/op
BenchmarkMyPathEscape-8        10000000    144 ns/op
BenchmarkMyBoolLUTPathEscape-8 10000000    128 ns/op
```
Compacting all of the different encodings into a single `[256]byte` LUT is slower, but still reasonably fast (and would potentially save 6*256 bytes of memory):
```go
func MyLUTEscape(s string, mode encoding) string {
	hexCount := 0
	mask := byte(1 << byte(mode))
	for i := 0; i < len(s); i++ {
		if shouldEscapeLUT[s[i]]&mask != 0 {
			hexCount++
		}
	}
	if hexCount == 0 {
		return s
	}
	t := make([]byte, len(s)+2*hexCount)
	j := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if shouldEscapeLUT[c]&mask != 0 {
			t[j] = '%'
			t[j+1] = "0123456789ABCDEF"[c>>4]
			t[j+2] = "0123456789ABCDEF"[c&15]
			j += 3
		} else {
			t[j] = c
			j++
		}
	}
	return string(t)
}
```
```
BenchmarkMyLUTEscape-8         10000000    151 ns/op
```
Note that while these functions can be reused by all of the encodings, I'm still excluding the `c == ' ' && mode == encodeQueryComponent` case for now. If needed, query component encoding can get its own function implementation.
@martisch
The `t[j+2]` store could maybe avoid some bounds checks here if we evaluate `_ = t[j+2]` earlier in the block.
We could remember the first position where a to-be-escaped symbol occurs and skip ahead to it in the second loop, after copying the start of the string up to that symbol. But that would make performance worse when the position is very near the start of the string.
Tried both. Neither really made a measurable difference for me.
> if this is not already recognized by the compiler to become a setCC instruction with an add on AMD64 instead of a branch we should have a look at optimizing that.
That function actually got inlined for me. So it didn't use any conditional instructions like that there.
But in the LUT version, maybe that sort of thing could work here:
```
perf_test.go:207  0x10f004d  4084f8  TESTL DI, AL
perf_test.go:207  0x10f0050  74e3    JE 0x10f0035
perf_test.go:208  0x10f0052  48ffc6  INCQ SI
perf_test.go:207  0x10f0055  ebde    JMP 0x10f0035
```
I'm not really going to try to optimize at such a low level yet, though.
Change https://golang.org/cl/134296 mentions this issue: net/url: remove an allocation for short strings in escape
While working on CL 134296 I saw that the benchmark for the `String()` method still shows about 60 allocations being made. This is because we don't pre-size the `strings.Builder` and also call `escape` at least 5 times for different URL components.
I think this could be optimized by pre-sizing the builder with the known lengths (scheme, opaque, rawquery) and passing the builder to a new escape function that writes into it. The existing escape function could be modified to create a builder and call the new function, or the existing callers of escape (`QueryEscape`, `PathEscape`, etc.) could do it.
I can work on this if there is interest in this approach.
@iand, sounds reasonable.
After spending several more minutes examining the String benchmark I realised that the reported timings and allocations are the cumulative totals of all 20 test cases.
I investigated modifying the various escape functions to write to a shared strings.Builder, but couldn't reduce the allocations and timings by enough to justify the extra complexity.
I also looked at pre-sizing the builder. It's easy enough to calculate the capacity needed for the fixed components and estimate the length of most of the encoded parts, but path is a little more tricky. Performing these calculations adds timing overhead which makes the function slower overall, even with the one or two allocations it saves.
I found I could get just as good a benefit by simply growing the builder to a fixed 64 bytes. Some timings for variants on this approach are below. However, it's likely that I'm just fitting to the test cases.
I'm happy to send a CL for a one-line change setting the initial builder size to 64 bytes, but I'm not convinced it adds much value.
Current master:

```
BenchmarkString-8    500000    3393 ns/op    1664 B/op    61 allocs/op
```

Pre-sized builder:

```
32 bytes:  BenchmarkString-8    500000    2966 ns/op    1632 B/op    43 allocs/op
64 bytes:  BenchmarkString-8    500000    2837 ns/op    1648 B/op    33 allocs/op
128 bytes: BenchmarkString-8    500000    2891 ns/op    2800 B/op    32 allocs/op
```
Change https://golang.org/cl/174998 mentions this issue: net/url: rework shouldEscape func to bitmask flag for performance boost
What is the state of this issue? It looks like it stalled after the 1.13 freeze, but no other problems are mentioned. Kindly pinging @iand.
I wrote a new version, `FastQueryEscape()`; it's 7.33 times faster than `url.QueryEscape()`. Of course, it's based on your code.
The detail is here: https://www.cnblogs.com/ahfuzhang/p/17473178.html
Change https://go.dev/cl/514335 mentions this issue: net/url: optimize QueryEscape and PathEscape
According to profiling data from a large number of servers at Google, URL-related functions are among the top 150 functions by CPU time. The simplest optimizations I see are around using LUTs in `shouldEscape`, `ishex`, and `unhex`.