golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

net/url: optimize unescape and escape #17860

Open dsnet opened 7 years ago

dsnet commented 7 years ago

According to profiling data from a large number of servers at Google, URL-related functions are among the top 150 functions by CPU time. The simplest optimizations I see are around using LUTs in shouldEscape, isHex, and unHex.
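As a rough illustration of the LUT idea (not the actual patch; `isHexTable` and `isHex` are hypothetical names for this sketch), a table-driven isHex replaces three range comparisons with a single array index:

```go
package main

import "fmt"

// A 256-entry lookup table: one bounds-free array index replaces
// the three range checks of a branchy isHex.
var isHexTable [256]bool

func init() {
	for c := '0'; c <= '9'; c++ {
		isHexTable[c] = true
	}
	for c := 'a'; c <= 'f'; c++ {
		isHexTable[c] = true
	}
	for c := 'A'; c <= 'F'; c++ {
		isHexTable[c] = true
	}
}

func isHex(c byte) bool { return isHexTable[c] }

func main() {
	fmt.Println(isHex('a'), isHex('G'))
	// Output: true false
}
```

shouldEscape and unHex could be table-driven the same way, at the cost of a few hundred bytes of static data each.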

LMMilewski commented 7 years ago

Based on the profiling data that you mention, it looks like the unescape time is split evenly between runtime.makeslice and runtime.slicebytetostring. The net/url.shouldEscape function is responsible only for ~7% of the CPU time. LUTs might not help as much as we would like.

bradfitz commented 7 years ago

Regarding the allocations and slicebytetostring, see compiler feature request #6714.

/cc @randall77 @cherrymui @josharian @dr2chase

bradfitz commented 7 years ago

See also: #18990

ccbrown commented 7 years ago

I'm not sure it's the kind of optimization you want, but I find that if I do some "hand optimization", just eliminating branches, I get pretty significant speed increases:

package url_test

import (
    "net/url"
    "testing"
)

const ESCAPE_ME = "net/url: optimize unescape and escape"

var s string

func BenchmarkPathEscape(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s = url.PathEscape(ESCAPE_ME)
    }
}

func shouldPathEscape(c byte) bool {
    if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
        return false
    }
    switch c {
    case '-', '_', '.', '~', ':', '@', '&', '=', '+', '$':
        return false
    }
    return true
}

func MyPathEscape(s string) string {
    hexCount := 0
    for i := 0; i < len(s); i++ {
        if shouldPathEscape(s[i]) {
            hexCount++
        }
    }
    if hexCount == 0 {
        return s
    }
    t := make([]byte, len(s)+2*hexCount)
    j := 0
    for i := 0; i < len(s); i++ {
        c := s[i]
        if shouldPathEscape(c) {
            t[j] = '%'
            t[j+1] = "0123456789ABCDEF"[c>>4]
            t[j+2] = "0123456789ABCDEF"[c&15]
            j += 3
        } else {
            t[j] = c
            j++
        }
    }
    return string(t)
}

func BenchmarkMyPathEscape(b *testing.B) {
    for i := 0; i < b.N; i++ {
        s = MyPathEscape(ESCAPE_ME)
    }
}
goos: darwin
goarch: amd64
BenchmarkPathEscape-8        3000000           438 ns/op
BenchmarkMyPathEscape-8     10000000           147 ns/op

It seems to me like calls to functions like escape should probably be inlined when a parameter on which many branches depend is given a const argument:

func PathEscape(s string) string {
    return escape(s, encodePathSegment) // not inlined, but probably should be
}

martisch commented 7 years ago

I also experimented with some optimizations (mostly unescape) but have not cleaned them up into a CL yet. So I am interested in what we can improve too. Maybe I'll still get around to finishing them for 1.10.

I was thinking of creating a [256]uint8 table where each byte contains some flags, e.g. shouldPathEscape (if we need more flags this can become uint32), so we can use the same lookup table in multiple functions to quickly look up properties of ASCII characters while not wasting RAM on a separate lookup table each time. This could even be stored in a separate package so multiple std lib packages (e.g. unicode.IsSpace or strings.Fields) can use the same table.

For unhex I prefer `9*(c>>6) + (c&15)` over a LUT. I will have to benchmark this again, but will send it as a small separate improvement for the std lib occurrences if it still checks out to be faster than the current branch-based version.
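For reference, the branch-free expression checks out on all three digit ranges: for '0'-'9' the high bits (c>>6) are 0, while for 'a'-'f' and 'A'-'F' they contribute the +9 offset on top of the low nibble. A quick sketch (`unhex` is a stand-in name, not the internal helper):

```go
package main

import "fmt"

// unhex converts an ASCII hex digit to its value without branching:
// '0'-'9' (0x30-0x39) have c>>6 == 0, so the result is just c&15;
// 'a'-'f' (0x61-0x66) and 'A'-'F' (0x41-0x46) have c>>6 == 1, so
// the 9*(c>>6) term supplies the offset from the low nibble (1-6)
// up to the values 10-15.
func unhex(c byte) byte {
	return 9*(c>>6) + (c & 15)
}

func main() {
	for _, c := range []byte("0fAb9") {
		fmt.Printf("%c=%d ", c, unhex(c))
	}
	fmt.Println()
	// Output: 0=0 f=15 A=10 b=11 9=9
}
```

Note this assumes the input is already validated as a hex digit; for arbitrary bytes it returns garbage rather than an error, so it only replaces the conversion, not the isHex check.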

@ccbrown Some comments while briefly looking at the code from the above comment:

Does your new version pass all the tests?

ccbrown commented 7 years ago

Tried a few more things.

Using a boolean LUT does make things faster:

func MyBoolLUTEscape(s string, shouldEscapeLUT [256]bool) string {
    hexCount := 0
    for i := 0; i < len(s); i++ {
        if shouldEscapeLUT[s[i]] {
            hexCount++
        }
    }
    if hexCount == 0 {
        return s
    }
    t := make([]byte, len(s)+2*hexCount)
    j := 0
    for i := 0; i < len(s); i++ {
        c := s[i]
        if shouldEscapeLUT[c] {
            t[j] = '%'
            t[j+1] = "0123456789ABCDEF"[c>>4]
            t[j+2] = "0123456789ABCDEF"[c&15]
            j += 3
        } else {
            t[j] = c
            j++
        }
    }
    return string(t)
}
BenchmarkPathEscape-8                3000000           438 ns/op
BenchmarkMyPathEscape-8             10000000           144 ns/op
BenchmarkMyBoolLUTPathEscape-8      10000000           128 ns/op

Compacting all of the different encodings into a single [256]byte LUT is slower, but still reasonably fast (and would potentially save 6*256 bytes in memory):

func MyLUTEscape(s string, mode encoding) string {
    hexCount := 0
    mask := byte(1 << byte(mode))
    for i := 0; i < len(s); i++ {
        if shouldEscapeLUT[s[i]]&mask != 0 {
            hexCount++
        }
    }
    if hexCount == 0 {
        return s
    }
    t := make([]byte, len(s)+2*hexCount)
    j := 0
    for i := 0; i < len(s); i++ {
        c := s[i]
        if shouldEscapeLUT[c]&mask != 0 {
            t[j] = '%'
            t[j+1] = "0123456789ABCDEF"[c>>4]
            t[j+2] = "0123456789ABCDEF"[c&15]
            j += 3
        } else {
            t[j] = c
            j++
        }
    }
    return string(t)
}
BenchmarkMyLUTEscape-8              10000000           151 ns/op

Note that while these functions can be re-used by all of the encodings, I'm still excluding the c == ' ' && mode == encodeQueryComponent case for now. If needed, query component encoding can get its own function implementation.
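For context, the part not shown above is how such a shared mask table might be populated. A minimal sketch, where the predicates are simplified stand-ins for the real per-mode rules in net/url and the '~' tweak is purely illustrative:

```go
package main

import "fmt"

type encoding int

const (
	encodePath encoding = iota
	encodeQueryComponent
)

// shouldEscapeLUT packs one bit per encoding mode into each byte:
// bit n of shouldEscapeLUT[c] is set if c must be escaped in mode n.
var shouldEscapeLUT [256]byte

func init() {
	for i := 0; i < 256; i++ {
		c := byte(i)
		unreserved := 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' ||
			'0' <= c && c <= '9'
		if !unreserved {
			// Simplified: escape everything non-alphanumeric in both modes.
			shouldEscapeLUT[i] |= 1 << byte(encodePath)
			shouldEscapeLUT[i] |= 1 << byte(encodeQueryComponent)
		}
	}
	// Illustrative per-mode difference: leave '~' literal in paths only.
	shouldEscapeLUT['~'] &^= 1 << byte(encodePath)
}

func main() {
	pathMask := byte(1 << byte(encodePath))
	queryMask := byte(1 << byte(encodeQueryComponent))
	fmt.Println(shouldEscapeLUT['~']&pathMask != 0,
		shouldEscapeLUT['~']&queryMask != 0)
	// Output: false true
}
```

The escape loop then tests `shouldEscapeLUT[c]&mask != 0` with the mode-specific mask hoisted out of the loop, exactly as in MyLUTEscape above.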

@martisch

> t[j+2] could maybe avoid some bounds checks here by checking _ = t[j+2] earlier in the block.

> We could remember the first position of a to-be-escaped symbol and skip ahead to it in the second loop after copying the start of the string up to that symbol. But that would make performance worse when the position is very near the start of the string.

Tried both. Neither really made a measurable difference for me.

> If this is not already recognized by the compiler and turned into a SETcc instruction plus an add on AMD64 instead of a branch, we should have a look at optimizing that.

That function actually got inlined for me. So it didn't use any conditional instructions like that there.

But in the LUT version, maybe that sort of thing could work here:

perf_test.go:207    0x10f004d       4084f8          TESTL DI, AL
perf_test.go:207    0x10f0050       74e3            JE 0x10f0035
perf_test.go:208    0x10f0052       48ffc6          INCQ SI
perf_test.go:207    0x10f0055       ebde            JMP 0x10f0035

I'm not really gonna try to optimize this at such a low level yet though.

gopherbot commented 6 years ago

Change https://golang.org/cl/134296 mentions this issue: net/url: remove an allocation for short strings in escape

iand commented 6 years ago

While working on 134296 I saw that the benchmark for the String() method still shows about 60 allocations being made. This is because we don't pre-size the strings.Builder and we also call escape at least 5 times for different URL components.

I think this could be optimized by pre-sizing the builder with the known lengths (scheme, opaque, rawquery) and passing the builder to a new escape function that writes into it. The existing escape function could be modified to create a builder and call the new function, or the existing callers of escape (QueryEscape, PathEscape, etc.) could do it themselves.

I can work on this if there is interest in this approach.

bradfitz commented 6 years ago

@iand, sounds reasonable.

iand commented 6 years ago

After spending several more minutes examining the String benchmark I realised that the reported timings and allocations are the cumulative totals of all 20 test cases.

I investigated modifying the various escape functions to write to a shared strings.Builder, but couldn't reduce the allocations and timings by enough to justify the extra complexity.

I also looked at pre-sizing the builder. It's easy enough to calculate the capacity needed for the fixed components and estimate the length of most of the encoded parts, but path is a little more tricky. Performing these calculations adds timing overhead which makes the function slower overall, even with the one or two allocations it saves.

I found I could get just as good a benefit by simply growing the builder to a fixed 64 bytes. Some timings for variants on this approach are below. However, it's likely that I'm just fitting to the test cases.

I'm happy to send a CL for a one line change setting the initial builder size to 64 bytes, but I'm not convinced it adds much value.

Current master:

BenchmarkString-8        500000       3393 ns/op        1664 B/op   61 allocs/op

Pre-sized builder:

32 bytes:  BenchmarkString-8   500000   2966 ns/op      1632 B/op   43 allocs/op
64 bytes:  BenchmarkString-8   500000   2837 ns/op      1648 B/op   33 allocs/op
128 bytes: BenchmarkString-8   500000   2891 ns/op      2800 B/op   32 allocs/op
gopherbot commented 5 years ago

Change https://golang.org/cl/174998 mentions this issue: net/url: rework shouldEscape func to bitmask flag for performance boost

cristaloleg commented 3 years ago

What is the state of this issue? Looks like it stalled after the 1.13 freeze, but no other problems are mentioned. Kindly ping @iand

ahfuzhang commented 1 year ago

I wrote a new version, FastQueryEscape(); it's 7.33 times faster than url.QueryEscape(). Of course, I based it on your code.

The detail is here: https://www.cnblogs.com/ahfuzhang/p/17473178.html

gopherbot commented 1 year ago

Change https://go.dev/cl/514335 mentions this issue: net/url: optimize QueryEscape and PathEscape