golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

net/textproto: reduce allocations in ReadMIMEHeader #37809

Open rs opened 4 years ago

rs commented 4 years ago

What version of Go are you using (go version)?

1.14

What did you do?

For our use case (an HTTP proxy), textproto.ReadMIMEHeader is in the top 3 in terms of allocations. To ease pressure on the GC, we could pre-allocate 512 bytes to be used for small header values. This cuts the number of allocations in this function roughly in half.

name                              old time/op    new time/op    delta
ReadMIMEHeader/client_headers-16    2.30µs ± 5%    2.11µs ± 2%   -8.21%  (p=0.000 n=10+10)
ReadMIMEHeader/server_headers-16    1.94µs ± 3%    1.85µs ± 3%   -4.56%  (p=0.000 n=10+10)

name                              old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-16    1.53kB ± 0%    1.69kB ± 0%  +10.48%  (p=0.000 n=10+10)
ReadMIMEHeader/server_headers-16    1.09kB ± 0%    1.44kB ± 0%  +32.32%  (p=0.000 n=10+10)

name                              old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-16      14.0 ± 0%       6.0 ± 0%  -57.14%  (p=0.000 n=10+10)
ReadMIMEHeader/server_headers-16      14.0 ± 0%       6.0 ± 0%  -57.14%  (p=0.000 n=10+10)

The patch is pretty small:

index d26e981ae4..6126a6685c 100644
--- a/src/net/textproto/reader.go
+++ b/src/net/textproto/reader.go
@@ -13,6 +13,7 @@ import (
        "strconv"
        "strings"
        "sync"
+       "unsafe"
 )

 // A Reader implements convenience methods for reading requests
@@ -502,6 +503,12 @@ func (r *Reader) ReadMIMEHeader() (MIMEHeader, error) {
                return m, ProtocolError("malformed MIME header initial line: " + string(line))
        }

+       // Create a pre-allocated byte slice for all header values to save
+       // allocations for the first 512 bytes of values, a size that fits
+       // typical small header values; larger ones get their own allocation.
+       const valuesPreAllocSize = 1 << 9
+       valuesPreAlloc := make([]byte, 0, valuesPreAllocSize)
+
        for {
                kv, err := r.readContinuedLineSlice(mustHaveFieldNameColon)
                if len(kv) == 0 {
@@ -527,7 +534,17 @@ func (r *Reader) ReadMIMEHeader() (MIMEHeader, error) {
                for i < len(kv) && (kv[i] == ' ' || kv[i] == '\t') {
                        i++
                }
-               value := string(kv[i:])
+
+               // Try to fit the value in the pre-allocated buffer to save allocations.
+               var value string
+               if len(kv[i:]) <= valuesPreAllocSize-len(valuesPreAlloc) {
+                       off := len(valuesPreAlloc)
+                       valuesPreAlloc = append(valuesPreAlloc, kv[i:]...)
+                       v := valuesPreAlloc[off:]
+                       value = *(*string)(unsafe.Pointer(&v))
+               } else {
+                       value = string(kv[i:])
+               }

                vv := m[key]
                if vv == nil && len(strs) > 0 {

Please tell me if it's worth submitting a PR with this change.

toothrot commented 4 years ago

/cc @bradfitz @rsc

smasher164 commented 4 years ago

I wonder if it's possible to use strings.Builder, pre-allocating with Grow(valuesPreAllocSize), instead of depending on unsafe and only using the buffer when the value fits inside it.

bradfitz commented 4 years ago

We're not adding unsafe usage to textproto.

rs commented 4 years ago

With or without strings.Builder, this trick won't work if you grow the slice, because growing copies the backing array, which would be a waste: the strings created before the grow still point into the old backing array, so the data is copied for nothing and the first part of the newly grown backing array is wasted.
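To make the failure mode concrete, here is a tiny self-contained toy (not the textproto patch): once append grows past the original capacity, slices taken earlier keep referencing the old backing array.

```
package main

import "fmt"

func main() {
	buf := make([]byte, 0, 4) // shared pre-allocated buffer
	buf = append(buf, "abcd"...)
	first := buf[:4] // "value" carved out of the current backing array

	// This append exceeds the capacity, so a new backing array is
	// allocated and the old contents are copied into it; `first` still
	// references the old array, making that copy useless for it.
	buf = append(buf, "efgh"...)
	buf[0] = 'X' // only mutates the new backing array

	fmt.Println(string(first), string(buf)) // prints "abcd Xbcdefgh"
}
```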

I tried a slightly more complex implementation that would cover requests with an arbitrary number of header values, but it showed no gain on my training set of real-life requests/responses.

We could still use strings.Builder to hide the use of unsafe, but I'm not sure I see the point.

rs commented 4 years ago

We're not adding unsafe usage to textproto.

@bradfitz would it be accepted if hidden by the use of strings.Builder?

EDIT: scratch that, this can't be implemented with strings.Builder. So I guess it's a no-go.

rs commented 4 years ago

@bradfitz would you be open to adding a Slice(off, len int) string method to strings.Builder to allow implementing such tricks without the need for unsafe?

bcmills commented 4 years ago

@rs, could you write all of the values into the strings.Builder and record the offsets, then convert all of the offsets to slices of the final string result?
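A rough sketch of that idea, with made-up names (not an actual patch): write every value into one strings.Builder, remember the offsets, and slice the single resulting string at the end, paying one string allocation for all values combined.

```
package headersketch

import "strings"

// span records where one header value lives inside the combined string.
type span struct{ start, end int }

// buildValues copies every raw value into a single strings.Builder and
// then returns substrings of the one resulting string, so all values
// share a single backing allocation. Illustrative only.
func buildValues(raw [][]byte) []string {
	var b strings.Builder
	spans := make([]span, 0, len(raw))
	for _, v := range raw {
		start := b.Len()
		b.Write(v)
		spans = append(spans, span{start, b.Len()})
	}
	all := b.String() // one allocation covering every value
	out := make([]string, len(raw))
	for i, s := range spans {
		out[i] = all[s.start:s.end] // substrings share all's storage
	}
	return out
}
```

As rs notes below, this still pays one extra copy of the value bytes into the builder.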

rs commented 4 years ago

That could work, yes, but it would make the code slightly more complex. I would personally prefer the Slice method on strings.Builder, but I guess it will require some more convincing :)

josharian commented 4 years ago

Another option (which would also complicate the code) is to have a fixed-size, stack-allocated buffer for the pending values to record and for the bytes involved. When a new kv pair is available, append the bytes to the byte buffer and the indices into the byte buffer to the k/v pairs buffer. When either buffer is too full, convert the byte buffer to a string and add slices of that string to the map. This could all be wrapped up in methods on a separate type, for sanity (sketched below). (This is similar to the strings.Builder idea, but without the extra strings.Builder layer, since a plain ol' byte slice should suffice.)

Also, if the keys are commonly re-used, string interning might help.
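A hedged sketch of the helper type described above (sizes, names, and the map shape are made up, and a final flush call would be needed once header parsing finishes):

```
package headersketch

// valueBatch buffers small header values so that several of them share a
// single string allocation. Purely illustrative.
type valueBatch struct {
	buf   [512]byte  // pending value bytes
	n     int        // bytes used in buf
	keys  [16]string // pending keys
	spans [16][2]int // (start, end) offsets into buf for pending values
	m     int        // number of pending entries
}

// add records one key/value pair, flushing the pending batch first if
// either buffer would overflow. Oversized values get their own string.
func (vb *valueBatch) add(h map[string][]string, key string, val []byte) {
	if len(val) > len(vb.buf) {
		h[key] = append(h[key], string(val))
		return
	}
	if vb.n+len(val) > len(vb.buf) || vb.m == len(vb.spans) {
		vb.flush(h)
	}
	start := vb.n
	vb.n += copy(vb.buf[vb.n:], val)
	vb.spans[vb.m] = [2]int{start, vb.n}
	vb.keys[vb.m] = key
	vb.m++
}

// flush converts the pending bytes into one string, adds substrings of it
// to the map, and resets the batch.
func (vb *valueBatch) flush(h map[string][]string) {
	all := string(vb.buf[:vb.n])
	for i := 0; i < vb.m; i++ {
		s := vb.spans[i]
		h[vb.keys[i]] = append(h[vb.keys[i]], all[s[0]:s[1]])
	}
	vb.n, vb.m = 0, 0
}
```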

rs commented 4 years ago

Good point. You would still pay for one extra copy, but it's better than nothing. String interning might work for some very common header values; I guess it could save a couple of allocations per request for common content-type and accept header values, for instance.

Before spending more time on this, I'd love to hear from @bradfitz about the strings.Builder.Slice idea and, failing that, how likely it is that a more complex implementation like the one suggested above would be accepted.

bcmills commented 4 years ago

@rs, strings.Builder.Slice seems too prone to misuse: much too easy to accidentally create multiple underlying strings (due to growth) when the user really only intended to retain slices of a single big string.

rs commented 4 years ago

Agreed, but it would still be safe, and you would still save allocations by creating those substrings from a common buffer.

kokizzu commented 3 years ago

Possibly related: this allocation grows to 200MB over a day or a week and cannot be reclaimed by the GC.

davecheney commented 3 years ago

Possibly related, this allocation grows to 200MB over a week and cannot be reclaimed by GC:

How are you asserting that this allocation cannot be garbage collected?

kokizzu commented 3 years ago

I run github.com/google/gops on the service:

$ gops pprof-heap 1
Profile dump saved to: /tmp/heap_profile363561508
$ gops gc 1 # force GC
$ gops pprof-heap 1
Profile dump saved to: /tmp/heap_profile269530696

Comparing both heap profile dumps, the usage doesn't decrease '__')


File: phoenix
Type: inuse_space
Time: Jan 29, 2021 at 4:10pm (WIB)
Showing nodes accounting for 296.06MB, 96.41% of 307.09MB total
Dropped 81 nodes (cum <= 1.54MB)
      flat  flat%   sum%        cum   cum%
  216.36MB 70.46% 70.46%   236.86MB 77.13%  net/textproto.(*Reader).ReadMIMEHeader
   20.50MB  6.68% 77.13%    20.50MB  6.68%  net/textproto.canonicalMIMEHeaderKey
   16.63MB  5.42% 82.55%    16.63MB  5.42%  github.com/gorilla/context.Set
    9.50MB  3.09% 85.64%       13MB  4.23%  net/http.(*Request).WithContext
    6.50MB  2.12% 87.76%     7.50MB  2.44%  context.WithCancel
       5MB  1.63% 89.39%        5MB  1.63%  context.(*cancelCtx).Done
       5MB  1.63% 91.02%        5MB  1.63%  context.WithValue
       4MB  1.30% 92.32%        4MB  1.30%  syscall.anyToSockaddr
    3.50MB  1.14% 93.46%     3.50MB  1.14%  net/http.cloneURL
    2.50MB  0.81% 94.27%     2.50MB  0.81%  net/http.(*Server).newConn
    2.50MB  0.81% 95.09%     2.50MB  0.81%  net.newFD
    2.01MB  0.65% 95.74%     2.01MB  0.65%  bufio.NewReaderSize
    1.55MB   0.5% 96.24%     1.55MB   0.5%  regexp.(*bitState).reset
    0.50MB  0.16% 96.41%   294.09MB 95.77%  net/http.(*conn).serve
         0     0% 96.41%     2.01MB  0.65%  bufio.NewReader
         0     0% 96.41%     5.03MB  1.64%  github.com/kokizzu/phoenix/handler.Exclusive.Get
         0     0% 96.41%     5.03MB  1.64%  github.com/kokizzu/phoenix/handler.GetHandler
         0     0% 96.41%     2.55MB  0.83%  github.com/kokizzu/phoenix/router/authentication.JWT
         0     0% 96.41%    25.22MB  8.21%  github.com/kokizzu/phoenix/router/middleware.Exclusive.func1
         0     0% 96.41%    25.22MB  8.21%  github.com/go-chi/chi.(*Mux).Mount.func1
         0     0% 96.41%    41.22MB 13.42%  github.com/go-chi/chi.(*Mux).ServeHTTP
         0     0% 96.41%    25.22MB  8.21%  github.com/go-chi/chi.(*Mux).routeHTTP
         0     0% 96.41%    39.72MB 12.93%  github.com/go-chi/chi/middleware.Recoverer.func1
         0     0% 96.41%    39.72MB 12.93%  github.com/go-chi/chi/middleware.RequestID.func1
         0     0% 96.41%     2.55MB  0.83%  go.kokizzu.io/apinizer/v8/authorization.NewJWTAuthorization
         0     0% 96.41%     2.55MB  0.83%  go.kokizzu.io/apinizer/v8/authorization.getToken
         0     0% 96.41%       10MB  3.26%  main.main.func1
         0     0% 96.41%     7.50MB  2.44%  net.(*TCPListener).Accept
         0     0% 96.41%     7.50MB  2.44%  net.(*TCPListener).accept
         0     0% 96.41%     7.50MB  2.44%  net.(*netFD).accept
         0     0% 96.41%       10MB  3.26%  net/http.(*Server).ListenAndServe
         0     0% 96.41%       10MB  3.26%  net/http.(*Server).Serve
         0     0% 96.41%   243.36MB 79.25%  net/http.(*conn).readRequest
         0     0% 96.41%    39.72MB 12.93%  net/http.HandlerFunc.ServeHTTP
         0     0% 96.41%     2.01MB  0.65%  net/http.newBufioReader
         0     0% 96.41%   239.36MB 77.94%  net/http.readRequest
         0     0% 96.41%    41.22MB 13.42%  net/http.serverHandler.ServeHTTP
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).MatchString
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).backtrack
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).doExecute
         0     0% 96.41%     1.55MB   0.5%  regexp.(*Regexp).doMatch
         0     0% 96.41%     1.55MB   0.5%  regexp.MatchString
         0     0% 96.41%        2MB  0.65%  sync.(*Once).Do
         0     0% 96.41%        2MB  0.65%  sync.(*Once).doSlow
         0     0% 96.41%     2.50MB  0.81%  syscall.Getsockname

Or is this caused by something else?

kokizzu commented 3 years ago

My friend found it: https://rover.rocks/golang-memory-leak '__') It's probably gin-gonic/gin's context.Set doing the cache thrashing that holds on to these allocations.

s-tokutake commented 3 years ago

We are facing the same problem as above. However, we do not use gin-gonic/gin.


josharian commented 3 years ago

Do you have any statistics about what the header values are? If there are a handful of very common values, perhaps we could hard-code some strings to re-use instead of allocate. (Like string interning, but hard-coded.)
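For instance, something along these lines (the value list and function name are purely illustrative):

```
// commonHeaderValue returns an existing string for a few very common
// header values so they are not re-allocated on every request.
// The set of values here is made up for illustration.
func commonHeaderValue(v []byte) (string, bool) {
	// The string(v) conversion in a switch/comparison context typically
	// does not allocate; the returned literals are constants.
	switch string(v) {
	case "gzip":
		return "gzip", true
	case "keep-alive":
		return "keep-alive", true
	case "no-cache":
		return "no-cache", true
	case "application/json":
		return "application/json", true
	}
	return "", false
}
```

ReadMIMEHeader could consult such a table before falling back to string(kv[i:]); whether that pays off depends on how often real traffic hits the hard-coded values.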

jonaz commented 2 years ago

We are also seeing the majority of the memory being used in ReadMIMEHeader in a simple proxy application. I have yet to investigate whether clients are sending unusually large headers (for example, many cookies).

Note that this was compiled with Go 1.13. I've recompiled with Go 1.17 now and will check whether there are any improvements.

CAFxX commented 2 years ago

Spent a bit of time on this and, unfortunately, there does not seem to be a lot that can be done. The only approach I found that worked[^alt], given the constraints above, was interning (under the assumption that headers are not mostly unique), as it reduced both allocations and CPU usage, so this may be a good candidate for some form of string interning (not exposed to users):

benchmarks note: machine was noisy, need to run again on a dedicated one

```
net/textproto:
name                               old time/op    new time/op    delta
ReadMIMEHeader/client_headers-4    2.38µs ± 9%    2.26µs ±11%    -5.35%   (p=0.000 n=26+29)
ReadMIMEHeader/server_headers-4    2.10µs ± 8%    1.95µs ± 9%    -7.28%   (p=0.000 n=28+26)
Uncommon-4                         549ns ± 7%     522ns ±10%     -4.99%   (p=0.000 n=29+28)

name                               old alloc/op   new alloc/op   delta
ReadMIMEHeader/client_headers-4    1.53kB ± 0%    0.95kB ± 1%    -38.41%  (p=0.000 n=29+28)
ReadMIMEHeader/server_headers-4    1.10kB ± 0%    0.92kB ± 0%    -16.51%  (p=0.000 n=30+30)
Uncommon-4                         468B ± 0%      435B ± 0%      -7.05%   (p=0.000 n=30+27)

name                               old allocs/op  new allocs/op  delta
ReadMIMEHeader/client_headers-4    14.0 ± 0%      3.0 ± 0%       -78.57%  (p=0.000 n=30+30)
ReadMIMEHeader/server_headers-4    14.0 ± 0%      3.0 ± 0%       -78.57%  (p=0.000 n=30+30)
Uncommon-4                         5.00 ± 0%      3.00 ± 0%      -40.00%  (p=0.000 n=30+30)

net/http:
name                               old time/op    new time/op    delta
CookieString-4                     653ns ± 9%     819ns ±35%     +25.42%  (p=0.000 n=26+28)
ReadSetCookies-4                   2.40µs ± 9%    2.96µs ±27%    +23.31%  (p=0.000 n=26+28)
ReadCookies-4                      2.85µs ± 9%    3.71µs ±52%    +30.13%  (p=0.000 n=25+29)
HeaderWriteSubset-4                418ns ± 2%     431ns ±10%     +3.17%   (p=0.000 n=27+25)
CopyValues-4                       1.22µs ± 5%    1.19µs ±11%    -2.48%   (p=0.000 n=25+28)
ServerMatch-4                      18.7ns ± 2%    18.8ns ± 8%    ~        (p=0.113 n=28+26)
ReadRequestChrome-4                2.81µs ±19%    2.45µs ±14%    -12.95%  (p=0.000 n=28+26)
ReadRequestCurl-4                  1.36µs ± 6%    1.26µs ± 7%    -6.72%   (p=0.000 n=26+27)
ReadRequestApachebench-4           1.38µs ±12%    1.30µs ±18%    -5.32%   (p=0.000 n=27+30)
ReadRequestSiege-4                 1.77µs ± 8%    1.62µs ± 6%    -8.67%   (p=0.000 n=26+28)
ReadRequestWrk-4                   1.05µs ±13%    0.96µs ± 5%    -8.53%   (p=0.000 n=27+24)
FileAndServer_1KB/NoTLS-4          133µs ± 6%     126µs ±10%     -5.04%   (p=0.000 n=29+29)
FileAndServer_1KB/TLS-4            142µs ± 7%     135µs ± 6%     -4.87%   (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4         11.0ms ± 4%    10.5ms ± 6%    -4.82%   (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4           21.7ms ±10%    21.2ms ±10%    -2.48%   (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4         41.7ms ± 5%    41.5ms ± 5%    ~        (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4           82.9ms ±12%    81.0ms ±12%    ~        (p=0.096 n=29+29)
ServeMux-4                         50.6µs ± 7%    50.1µs ± 7%    ~        (p=0.118 n=26+28)
ServeMux_SkipServe-4               23.0µs ± 4%    23.1µs ± 5%    ~        (p=0.796 n=27+26)
ClientServer-4                     95.7µs ± 7%    92.9µs ± 7%    -2.91%   (p=0.001 n=30+28)
ClientServerParallel4-4            37.5µs ±24%    36.5µs ±47%    ~        (p=0.124 n=29+29)
ClientServerParallel64-4           122µs ±49%     108µs ±58%     -11.15%  (p=0.005 n=26+29)
Server-4                           169µs ±43%     165µs ±48%     ~        (p=0.618 n=29+25)
Client-4                           147µs ±25%     136µs ±17%     -7.57%   (p=0.005 n=30+24)
ServerFakeConnNoKeepAlive-4        23.0µs ±61%    20.8µs ±38%    ~        (p=0.109 n=30+30)
ServerFakeConnWithKeepAlive-4      11.6µs ±21%    11.7µs ±39%    ~        (p=0.686 n=27+30)
ServerFakeConnWithKeepAliveLite-4  7.74µs ±30%    7.71µs ±23%    ~        (p=0.879 n=26+29)
ServerHandlerTypeLen-4             8.96µs ±44%    9.52µs ±46%    ~        (p=0.138 n=30+30)
ServerHandlerNoLen-4               7.92µs ±33%    7.56µs ±29%    ~        (p=0.376 n=25+26)
ServerHandlerNoType-4              7.70µs ±33%    8.70µs ±51%    +13.01%  (p=0.036 n=28+30)
ServerHandlerNoHeader-4            5.52µs ±22%    6.10µs ±29%    +10.49%  (p=0.009 n=28+30)
ServerHijack-4                     16.2µs ± 4%    16.4µs ± 7%    ~        (p=0.580 n=26+28)
CloseNotifier-4                    137µs ± 9%     143µs ± 9%     +4.34%   (p=0.000 n=26+27)
ResponseStatusLine-4               24.5ns ±13%    24.7ns ± 5%    +0.70%   (p=0.033 n=28+27)

name                               old alloc/op   new alloc/op   delta
CookieString-4                     176B ± 0%      176B ± 0%      ~        (all equal)
ReadSetCookies-4                   1.01kB ± 0%    1.01kB ± 0%    ~        (all equal)
ReadCookies-4                      1.84kB ± 0%    1.84kB ± 0%    ~        (all equal)
HeaderWriteSubset-4                0.00B          0.00B          ~        (all equal)
CopyValues-4                       736B ± 0%      736B ± 0%      ~        (all equal)
ReadRequestChrome-4                1.83kB ± 0%    1.33kB ± 1%    -27.52%  (p=0.000 n=30+30)
ReadRequestCurl-4                  912B ± 0%      887B ± 0%      -2.73%   (p=0.000 n=30+29)
ReadRequestApachebench-4           916B ± 0%      887B ± 0%      -3.14%   (p=0.000 n=30+29)
ReadRequestSiege-4                 1.00kB ± 0%    0.92kB ± 0%    -8.05%   (p=0.000 n=30+27)
ReadRequestWrk-4                   864B ± 0%      855B ± 0%      -1.04%   (p=0.000 n=30+22)
ServeMux-4                         17.3kB ± 0%    17.3kB ± 0%    ~        (all equal)
ServeMux_SkipServe-4               0.00B          0.00B          ~        (all equal)
ClientServer-4                     5.01kB ± 0%    5.01kB ± 1%    ~        (p=0.194 n=29+30)
ClientServerParallel4-4            6.74kB ±18%    6.69kB ±17%    ~        (p=0.976 n=30+30)
ClientServerParallel64-4           16.4kB ±24%    16.1kB ±22%    ~        (p=0.640 n=29+28)
Server-4                           2.32kB ± 1%    2.36kB ± 3%    +1.82%   (p=0.000 n=29+28)
Client-4                           3.50kB ± 0%    3.53kB ± 2%    +0.83%   (p=0.000 n=26+30)
ServerFakeConnNoKeepAlive-4        4.75kB ± 0%    4.56kB ± 1%    -4.18%   (p=0.000 n=26+30)
ServerFakeConnWithKeepAlive-4      2.48kB ± 0%    2.24kB ± 0%    -9.59%   (p=0.000 n=30+29)
ServerFakeConnWithKeepAliveLite-4  1.35kB ± 0%    1.36kB ± 0%    +0.73%   (p=0.000 n=30+27)
ServerHandlerTypeLen-4             2.16kB ± 0%    2.19kB ± 0%    +1.31%   (p=0.000 n=30+30)
ServerHandlerNoLen-4               2.13kB ± 0%    2.15kB ± 0%    +1.09%   (p=0.000 n=30+29)
ServerHandlerNoType-4              2.13kB ± 0%    2.16kB ± 0%    +1.28%   (p=0.000 n=30+29)
ServerHandlerNoHeader-4            1.35kB ± 0%    1.36kB ± 0%    +0.93%   (p=0.000 n=27+28)
ServerHijack-4                     16.2kB ± 0%    16.4kB ± 0%    +0.93%   (p=0.000 n=28+30)
CloseNotifier-4                    3.36kB ± 1%    3.48kB ± 2%    +3.40%   (p=0.000 n=28+30)
ResponseStatusLine-4               0.00B          0.00B          ~        (all equal)

name                               old allocs/op  new allocs/op  delta
CookieString-4                     1.00 ± 0%      1.00 ± 0%      ~        (all equal)
ReadSetCookies-4                   15.0 ± 0%      15.0 ± 0%      ~        (all equal)
ReadCookies-4                      11.0 ± 0%      11.0 ± 0%      ~        (all equal)
HeaderWriteSubset-4                0.00           0.00           ~        (all equal)
CopyValues-4                       11.0 ± 0%      11.0 ± 0%      ~        (all equal)
ReadRequestChrome-4                14.0 ± 0%      6.0 ± 0%       -57.14%  (p=0.000 n=30+30)
ReadRequestCurl-4                  9.00 ± 0%      6.00 ± 0%      -33.33%  (p=0.000 n=30+30)
ReadRequestApachebench-4           9.00 ± 0%      6.00 ± 0%      -33.33%  (p=0.000 n=30+30)
ReadRequestSiege-4                 11.0 ± 0%      6.0 ± 0%       -45.45%  (p=0.000 n=30+30)
ReadRequestWrk-4                   7.00 ± 0%      6.00 ± 0%      -14.29%  (p=0.000 n=30+30)
ServeMux-4                         360 ± 0%       360 ± 0%       ~        (all equal)
ServeMux_SkipServe-4               0.00           0.00           ~        (all equal)
ClientServer-4                     59.0 ± 0%      53.0 ± 0%      -10.17%  (p=0.000 n=30+30)
ClientServerParallel4-4            65.4 ± 9%      59.2 ± 9%      -9.48%   (p=0.000 n=30+30)
ClientServerParallel64-4           100 ±19%       96 ± 9%        -4.49%   (p=0.003 n=30+28)
Server-4                           21.0 ± 0%      18.0 ± 0%      -14.29%  (p=0.000 n=30+30)
Client-4                           42.0 ± 0%      39.0 ± 0%      -7.14%   (p=0.000 n=30+30)
ServerFakeConnNoKeepAlive-4        53.0 ± 0%      47.0 ± 0%      -11.32%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAlive-4      23.0 ± 0%      17.0 ± 0%      -26.09%  (p=0.000 n=30+30)
ServerFakeConnWithKeepAliveLite-4  13.0 ± 0%      12.0 ± 0%      -7.69%   (p=0.000 n=30+30)
ServerHandlerTypeLen-4             20.0 ± 0%      19.0 ± 0%      -5.00%   (p=0.000 n=30+30)
ServerHandlerNoLen-4               18.0 ± 0%      17.0 ± 0%      -5.56%   (p=0.000 n=30+30)
ServerHandlerNoType-4              19.0 ± 0%      18.0 ± 0%      -5.26%   (p=0.000 n=30+30)
ServerHandlerNoHeader-4            13.0 ± 0%      12.0 ± 0%      -7.69%   (p=0.000 n=30+30)
ServerHijack-4                     51.0 ± 0%      50.0 ± 0%      -1.96%   (p=0.000 n=30+30)
CloseNotifier-4                    51.0 ± 0%      49.0 ± 0%      -3.92%   (p=0.000 n=30+30)
ResponseStatusLine-4               0.00           0.00           ~        (all equal)

name                               old speed      new speed      delta
ReadRequestChrome-4                218MB/s ±16%   249MB/s ±13%   +14.15%  (p=0.000 n=28+27)
ReadRequestCurl-4                  57.6MB/s ± 5%  61.4MB/s ±13%  +6.72%   (p=0.000 n=26+28)
ReadRequestApachebench-4           59.7MB/s ±11%  63.1MB/s ±15%  +5.79%   (p=0.000 n=27+30)
ReadRequestSiege-4                 85.1MB/s ± 8%  93.5MB/s ± 5%  +9.82%   (p=0.000 n=27+28)
ReadRequestWrk-4                   38.2MB/s ±11%  41.6MB/s ± 5%  +8.89%   (p=0.000 n=27+25)
FileAndServer_1KB/NoTLS-4          7.71MB/s ± 6%  8.10MB/s ± 9%  +5.10%   (p=0.000 n=29+30)
FileAndServer_1KB/TLS-4            7.21MB/s ± 8%  7.57MB/s ± 5%  +5.05%   (p=0.000 n=28+26)
FileAndServer_16MB/NoTLS-4         1.53GB/s ± 4%  1.61GB/s ± 6%  +5.11%   (p=0.000 n=27+28)
FileAndServer_16MB/TLS-4           775MB/s ± 9%   794MB/s ± 9%   +2.52%   (p=0.025 n=28+29)
FileAndServer_64MB/NoTLS-4         1.61GB/s ± 5%  1.62GB/s ± 5%  ~        (p=0.423 n=28+29)
FileAndServer_64MB/TLS-4           808MB/s ±13%   826MB/s ±10%   ~        (p=0.098 n=30+28)
```

I have pushed the PoC here. If we think this may be useful, I can try to refine it properly (the current interning logic is really crude).
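For readers who just want the general shape of the idea, here is a deliberately naive value-interning sketch (this is not the PoC above; names are hypothetical, and a real version would need bounds or eviction so attacker-controlled values cannot grow the table without limit):

```
// internTable deduplicates header values: a repeated value is returned
// as the same string instead of being allocated again. Illustrative only.
type internTable struct {
	m map[string]string
}

func (t *internTable) value(b []byte) string {
	// A map lookup keyed by string(b) typically does not allocate.
	if s, ok := t.m[string(b)]; ok {
		return s
	}
	if t.m == nil {
		t.m = make(map[string]string)
	}
	s := string(b)
	t.m[s] = s
	return s
}
```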

If you have internal benchmarks, it would be useful to run them against that PoC to get a better sense of whether there are some workloads where this approach falls short.

While I haven't tested this on our production workloads yet, I can share that on one of our highest-traffic services ReadMIMEHeader accounts for ~2% of CPU time and ~2.5% of total allocations by bytes.

[^alt]: other attempts included adding more headers to the list of known headers, some form of single string allocation, and other similar micro-optimizations