Open rsc opened 5 years ago
A while ago I wrote some code to test this exact same thing with the json decoder, but could not detect any significant difference in memory usage between the decoder with memoized strings and the regular json decoder, which made me think maybe this was already done somewhere. Apparently not. I can try to dig up my test code...
Here's a link to my test code that does this for decoding json into interface{}:
https://github.com/bserdar/jsdecodertest
My initial comment was wrong, it does make a difference to store strings in a map. My test code stores only field keys in a map because those are the strings that are likely to repeat.
The json/ package is a copy of the stdlib encoding/json package modified to include a map[string]string in the decoder state.
Nit: I assume the proposal meant to use a map[string]string with map[string([]byte)] lookup, as map[[]byte]string is not a supported map type.
Also cc @josharian, who had a very similar idea a while ago. That was briefly discussed in #5160.
As a thought, we probably don't want to reuse all strings. For example, in a map the keys probably tend to repeat much more often than the values.
Also, do we want to make this configurable? Someone who writes a program likely knows whether they'll read somewhat static/repeated data or not.
Change https://golang.org/cl/188500 mentions this issue: encoding/json: memoize keys during decode
@mvdan, are you concerned about the overhead of the map lookup in your Jun 26 comment? From the outside there can be no possible downside to reusing all strings that are constructed.
The CL mentioned above has an implementation of this for the JSON decoder. It introduces some overhead when decoding into an interface{}, because that's the only case where the keys read during decoding are kept after decoding is done.
However, I think there is a better way to do this. What I propose is a LiteralStore type in encoding (a map[string]string) with a Get(string) string method that retrieves the shared instance of a string. This could then be used from the decoders in encoding/xml and encoding/json. To use the LiteralStore, the caller assigns an instance to the JSON/XML decoder. This way, the caller has the option to use a literal store for large input where it makes a difference, reuse the store across multiple docs, or avoid it completely.
What do you think?
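To make the shape concrete, here is a minimal sketch of such a LiteralStore (names and placement are illustrative, not a final stdlib API):

```go
package main

import "fmt"

// LiteralStore maps each string it has seen to its canonical instance.
type LiteralStore map[string]string

// Get returns the shared instance of s, storing s on first sight.
func (ls LiteralStore) Get(s string) string {
	if c, ok := ls[s]; ok {
		return c
	}
	ls[s] = s
	return s
}

func main() {
	ls := LiteralStore{}
	a := ls.Get("name")
	b := ls.Get("name")
	// Both lookups return the same canonical entry.
	fmt.Println(a == b, len(ls))
}
```

A decoder holding a non-nil store would route every decoded key through Get, so repeated keys share one allocation, and the same store could be assigned to several decoders to share keys across documents (sharing across goroutines would need locking, which is part of why opt-in seems attractive).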
This is one of the conventional use-cases for weak references in GC'd languages.
Ideally, we should reuse strings that remain reachable from the last Decode
invocation, and let the collector clean up any strings that have since become completely unreachable. That way, the frequently-used strings will remain in the map, but the infrequent strings can still be collected (so that the map doesn't grow without bound if the same Decoder is used on many different inputs over time).
Alternatives that do not require weak references include a Decoder method (perhaps func (*Decoder) InternStrings(func([]byte) bool)).
@bcmills, I propose to make the interning behavior an explicit opt-in using a Decoder field instead of a method. After reading #32593, I believe one way to make this acceptable in terms of performance and usage is this: add a Decoder.KeyLiterals field of type LiteralStore to both the json and xml decoders, and use KeyLiterals during decoding to store keys if it is non-nil. This does not change the existing behavior and makes the use of a literal store an explicit opt-in.
A literal store like this is only useful for larger json/xml input, and only if it is manually decoded, or unmarshaled into an interface{}. If the input is unmarshaled/decoded into a struct, then there is no need to keep the keys as they will be lost.
Re: weak references, afaik there is no way to implement this using finalizers, am I wrong?
@bserdar I don't think there is any way to do this with finalizers because existing users will be using string values, not pointers, so there is nothing to hang a finalizer on. For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.
In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.
@ianlancetaylor thanks for the clarification.
If there is interest in this, I can modify https://golang.org/cl/188500 to implement this. I already have some wrapper code around json/xml decoder to do this because reusing just the key strings makes a lot of difference for large docs.
cc @mvdan
For that matter weak references wouldn't work either: you would have to change all the users to use the weak reference type rather than string.
No? We would need some sort of language-supported "weak string" type, but then only the json.Decoder would hold a weak reference — the decoded messages would use a regular string so that the data doesn't disappear out from beneath them.
(But that assumes a language change that seems pretty unlikely to ever happen.)
In general finalizers and weak references are equivalent in power, so anything you can do with one you can do with the other. But you have to be able to control the types that people will use.
Ah, I think I see what you're saying. If we could control the types in use, then we could add a pointer indirection, and require users to keep the pointer alive in order to retain the string value in the decoder's cache.
But then you can't use those pointers in the same way as ordinary strings: they can't be map keys, they won't work with the normal built-in string operations, and they can't be passed to the strings package without falling back to manual memory management (via runtime.KeepAlive).
It may be true that they are technically equivalent in power, but they are not equivalently usable.
@bserdar, strings.LiteralStore seems unfortunate given that we're also discussing adding generics to the language, because there is a much more general "interning table" API that can be used with any type that supports hashing and equality.
The general form could have an API like:
package intern
type Table(type T HashEqualer) […]
func (t Table(T)) Get(x T) (canonicalX T) {
[…]
}
Such an API is also useful for building things like abstract syntax trees, where it can be useful to compress trees by coalescing occurrences of a common subexpression to the same canonical instance.
(A long time ago I maintained a server that evaluated large numbers of boolean expressions for ad targeting, and it used a similar approach — to great effect — to compress the targeting expressions, including in our wire protocol.)
@bcmills, an interning table would be nice, but it won't be available any time soon. Until then, we can keep interning internal to the Decoder but expose an InternKeys(bool) method for explicit opt-in. Or, implement strings.LiteralStore and change it to use the intern package once generics are done.
xml.Decoder can benefit from interning strings as well. What is the opinion about that? Decoder.InternKeys() would be the explicit opt-in for interning keys during decode. Something similar to this: https://go-review.googlesource.com/c/go/+/188500
There's been a lot of discussion here but I haven't seen any objections to simply putting the map I suggested into Decoder and not trying to reuse it any more than that.
I've seen no rationale for adding new API. Just do it all the time. Am I missing something?
There is a small performance hit if it is used all the time. Measurements are here: https://go-review.googlesource.com/c/go/+/188500:
CodeDecoder-8                      9.87ms ± 2%
CodeDecodeToInterfaceNoIntern-8    10.9ms ± 3%
CodeDecodeToInterfaceIntern-8      11.6ms ± 4%
If there is interest in this, I can fix that CL
I'd gladly take that performance hit over new API. Most JSON has many repeated strings so as long as we can show a lowered memory usage I think that's a clear win.
It uses less memory with interning. These are from the same unit test (CodeDecode*), with explicit calls to gc:
No interning:
Alloc:10708232 TotalAlloc:26603840 Sys:71762168 Lookups:0 Mallocs:364454 Frees:196341 HeapAlloc:10708232 HeapSys:66519040 HeapIdle:54239232 HeapInuse:12279808 HeapReleased:49496064 HeapObjects:168113
With interning:
Alloc:10487800 TotalAlloc:38961736 Sys:71762168 Lookups:0 Mallocs:635980 Frees:481644 HeapAlloc:10487800 HeapSys:66617344 HeapIdle:52895744 HeapInuse:13721600 HeapReleased:44720128 HeapObjects:154336
This seems worth doing, and it seems like a minor optimization, not a visible API change. It's unclear this really needs to be in the proposal process at all, but since it's here, based on the discussion above and how little is affected, it seems like a likely accept.
I'd say that the memory comments are the other way around?
@mvrhov, sorry, I don't quite understand what you mean.
Well, the numbers for interned strings are larger... TotalAlloc by 30%, the number of Mallocs by 50%, Frees by 60%. Alloc is smaller by 2%.
Allocs are smaller meaning there's less memory allocated in the end. TotalAlloc is higher because of the interning. Frees are larger, because most of the interned string copies that would've been in the unmarshaled object are collected. Overall, with interning when everything is said and done, there's less memory allocated.
@mvdan, are you concerned about the overhead of the map lookup in your Jun 26 comment? From the outside there can be no possible downside to reusing all strings that are constructed.
@rsc sorry that I missed your comment for this long. I could have sworn I had replied.
My concern is about keeping strings alive for longer than needed. That's very much related to one of your original points:
It's unclear to me whether the map in a Decoder should persist between Decode operations. Probably?
For example, imagine if I have a server that receives JSON objects via a network stream, unmarshals them, writes them to a database, and discards them. Right now, I would write this program in a way that the "incoming objects" goroutine keeps a single decoder alive forever.
If the decoder starts memoizing strings forever (since the decoder persists forever), my memory usage could increase over time if I keep seeing new/unique strings. For example, it seems pretty common to represent UUIDs and hashes as strings in JSON, so I'm a bit worried that users could easily run into this kind of "memory leak".
@mvdan, is this still a concern if only the JSON keys are memoized? Keys are repetitive. The memory numbers I pasted a while ago are from memoizing only the JSON keys, not the field values.
Also, memoizing keys is only meaningful when decoding to a map.
It's unclear to me if the current proposal is to memoize all strings, or just keys when decoding into map[string]T. I imagine most cases of unique/changing strings don't happen in map keys, but I still find that a possible scenario. For example, a map where the keys are unique IDs or hashes seems pretty reasonable, if one wants to then do quick lookups after unmarshaling.
So, my concern is probably smaller if we only memoize map keys, but even then, I'm still a bit concerned.
The proposal is to only memoize in a single json.Decoder, not globally. If you want to avoid the memoization, you could use multiple json.Decoders. Note that each call to json.Unmarshal uses its own Decoder, so all those would be independent. (If you decoded an object list into a []map[string]int, then all the map keys in one decoding could be shared, but not across decodings.)
Another possibility is to do the memoization by default but allow a hook in the Decoder (a method SetSaveString(func([]byte) string) or something like that) that would let you override the policy to have a global pool, or a pool only for small strings.
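A sketch of how such a hook might hang together (Decoder here is a stand-in for json.Decoder, and the small-strings policy in main is just one example of overriding the hypothetical default):

```go
package main

import "fmt"

// Decoder stands in for json.Decoder; saveString is the hypothetical
// override installed via SetSaveString, and strs is the default
// per-Decoder pool.
type Decoder struct {
	saveString func([]byte) string
	strs       map[string]string
}

// SetSaveString overrides the default string-memoization policy.
func (d *Decoder) SetSaveString(f func([]byte) string) { d.saveString = f }

// save converts a decoded token to a string under the current policy.
func (d *Decoder) save(b []byte) string {
	if d.saveString != nil {
		return d.saveString(b)
	}
	if d.strs == nil {
		d.strs = make(map[string]string)
	}
	if s, ok := d.strs[string(b)]; ok {
		return s
	}
	s := string(b)
	d.strs[s] = s
	return s
}

func main() {
	var d Decoder
	// Override the default: pool only small strings.
	pool := map[string]string{}
	d.SetSaveString(func(b []byte) string {
		if len(b) > 16 {
			return string(b)
		}
		if s, ok := pool[string(b)]; ok {
			return s
		}
		s := string(b)
		pool[s] = s
		return s
	})
	d.save([]byte("id"))
	d.save([]byte("this string is longer than sixteen bytes"))
	fmt.Println(len(pool)) // only the small string was pooled
}
```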
But it seems like if there's a win-win here then we should take that win by default.
Does anyone object to marking this accepted assuming that there is a win-win? Obviously if the benchmarks say there isn't a win then we wouldn't take the CL.
I think this still seems like a likely accept, but I'm leaving it for another week to make sure I'm reading the thread right.
If you want to avoid the memoization, you could use multiple json.Decoders.
My point is precisely that a program keeping a long-lived json.Decoder isn't doing anything wrong, but could silently gain a "memory leak" of sorts when upgrading its Go version. That's what I tried to explain in my earlier comment. I guess the changelog could tell them to redo their code to not reuse a decoder forever, but that seems a bit weird.
Also, even though this proposal is marked as a likely accept, I'm still not clear on this detail which I think is pretty important:
It's unclear to me if the current proposal is to memoize all strings, or just keys when decoding into map[string]T.
I don't object to the proposal being accepted per se, but I don't support it either, given that for now it's just a "maybe, we'll see how the CL turns out" :)
Would adding an API to json.Decoder to disable (or enable) interning help? There's already a CL (though quite old) with optional interning.
Just as a data point, I actually included an experience report of when we did this via josharian/intern in https://github.com/golang/go/issues/5160#issuecomment-424988628 (at the time it was just about reducing memory usage and not memory allocations, since allocations happened during unmarshaling - i.e. before we could dedup the strings)
In the same comment I actually proposed as well to make it configurable on a per-field basis using the json field tag, so +1 from me for making it configurable:
type Collection struct {
Objects []Object `json:"objects"`
}
type Object struct {
GUID string `json:"guid"`
ParentGUID string `json:"parent_guid,intern"` // perform interning during unmarshaling
// ...
}
FWIW I'm implementing proper support for this - with the syntax above - in https://github.com/mailru/easyjson/pull/279.
No change in consensus here, so accepting.
Let's start with the simple per-Decoder interning on by default and go from there. But let's plan to land this at the start of a cycle, so not for Go 1.15. That will give time to soak and maybe discover reasons it's a bad idea.
I can rebase this old CL unless there are other plans: https://go-review.googlesource.com/c/go/+/188500
@rsc I think we should answer the two questions I raised again in https://github.com/golang/go/issues/32779#issuecomment-609479347 before getting into a CL. To repeat my current understanding:
1) Would this only affect map keys, or all strings? This is unclear, and IMO should be decided before we jump to an implementation.
2) The change, as proposed right now, would silently introduce memory leaks in programs that currently keep a Decoder alive for a long time. You say:
If you want to avoid the memoization, you could use multiple json.Decoders.
In my opinion, this is a subtle and dangerous breaking change. Even if we strongly marked this change in behavior in the release notes, discouraging the use of global or long-lived decoders, it still seems like a change that could bring pain to our users since it's easy not to notice a leak until a service runs out of memory a week later.
Maybe this is a dumb question, but would the strings being memoized come from the types, or from strings that just flew by? If it's the types themselves I would be comfortable running this in prod, but if it's strings that happened to be passed from a user (like the values in a map or in a list, or "unknown" keys to a struct) this sounds like a possible DoS.
I don't know how you could keep an encoder around for a long time in a web context other than maybe web sockets???
@mvdan, what if we added some sort of decay behavior or LRU caching, so that infrequently used-strings would expire instead of being memoized forever?
That would at least prevent unbounded memory leaks for infrequent keys, while still providing most of the memory reduction of memoization.
Agreed, this is similar in spirit to what you get with josharian/intern, and it worked well in practice for us. Worth noting that we were interning only the field values, not the field keys (we didn't have maps in the decoded struct, but we had quite some string fields whose values were likely to repeat).
Would this only affect map keys, or all strings?
Given what I wrote above, I think we should do the latter. If the worry is keeping too many strings alive it should be easy enough to cap the size of the interning map (even just random replacement would likely be enough, and trivial to implement; as the current proposal is to have an interning map per Decoder, it will likely not make much of a difference to implement a better eviction policy since multiple Decoders will allocate the same strings anyway).
Some sort of LRU, or mechanism to limit the amount of memory we hold onto at once, seems like an OK tradeoff to me. I guess we can work this out on the CL itself, but it seemed clearly backwards incompatible to me to make this entirely unbounded by design.
I've been experimenting with several possible implementations of this.
Here are some of my thoughts:

- As a string gets longer, the probability of a cache hit drops and the cache lookup cost grows. This makes sense since longer strings have an exponentially larger number of possible representations and also take longer to compare or hash. Thus, we should only do caching for small strings (e.g., <= 16B).
- The Go memory allocator is already fairly fast at allocating small strings. For example, a 16B string takes 35ns on my machine. Granted, this is the up-front CPU cost, and doesn't include the asynchronous costs associated with GC heap scanning and memory freeing. Any cache lookup cost should be comparable to or faster than 35ns.
- For most realistic JSON data, strings in JSON object names have a high degree of locality. That is, seeing a specific object name means a high probability of seeing it again within the next few dozen object names. This is especially true for a JSON array containing homogeneous JSON objects, but also for tree-like representations of JSON objects where each node has a similar set of fields. Thus, a small cache can be highly beneficial for JSON object names.

Here's a realistic JSON data set:
https://storage.googleapis.com/synthea-public/synthea_sample_data_fhir_r4_sep2019.zip
This data set contains many large JSON documents of generated health data. These are the kind of documents where caching keys can make a real difference.

Several observations:

- For most realistic JSON data, identical strings in general JSON values exhibit poor locality. Repeated strings in data tend to occur either for enum-like values (e.g., "SUCCESS") or for what are functionally "pointers" that create a graph out of JSON data (e.g., an ID reference to some other part of the JSON data). In order to detect repeated usages, we would need a relatively large cache. Also, since most strings will be a cache miss, there is the wasted cost of looking up a string in the common case where there is no benefit.
- A naive map[string]string is not a good data structure for a string cache since it's 1) hard to prevent unbounded growth, 2) doubles memory use (one slot for the key and one for the value, even though they hold the same value), and 3) holds a reference to a string value that may no longer be used by the application.

Here's a simple data structure that I tried that does seem sufficiently effective to use by default:
type stringCache [64]string
func (c *stringCache) make(b []byte) string {
if len(b) < 2 || len(b) > 16 {
return string(b)
}
h := hashBytes(b) % uint64(len(*c))
if s := (*c)[h]; s == string(b) {
return s
}
s := string(b)
(*c)[h] = s
return s
}
It's a small, fixed-size cache. Thus it avoids any concerns about unbounded memory growth or overtly wasting too much space holding onto old, unused strings. At worst, the memory usage for this is len(stringCache)*(unsafe.Sizeof(string) + maxStringSize), where maxStringSize is the maximum string length we try caching (16B in the above implementation). The effectiveness of this is highly dependent on the speed of hashBytes. The size of stringCache should be determined based on how often we expect to see the same strings repeat themselves and according to the birthday problem.
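To put numbers on the birthday-problem point (my own back-of-the-envelope, not part of the CL): the chance that n distinct hot strings land in n distinct buckets of a 64-entry cache falls off quickly.

```go
package main

import "fmt"

// collisionFree returns the probability that n distinct strings hash to
// n distinct buckets out of size buckets, assuming a uniform hash
// (the classic birthday-problem calculation).
func collisionFree(n, size int) float64 {
	p := 1.0
	for i := 0; i < n; i++ {
		p *= float64(size-i) / float64(size)
	}
	return p
}

func main() {
	for _, n := range []int{5, 10, 20} {
		fmt.Printf("n=%2d: P(no collision in 64 buckets) = %.2f\n", n, collisionFree(n, 64))
	}
}
```

With only 10 distinct strings, the collision-free probability is already below one half, which is why the cache needs to be sized at a comfortable multiple of the expected number of hot strings.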
Using runtime.memhash as the implementation of hashBytes (see #42710), I get the following numbers:
BenchmarkMalloc/FrequentNames 48031 25150 ns/op 9600 B/op 1000 allocs/op
BenchmarkCache/FrequentNames 66622 18019 ns/op 96 B/op 10 allocs/op // 0.7x slower than Malloc
BenchmarkMap/FrequentNames 40252 31818 ns/op 678 B/op 11 allocs/op // 1.3x slower than Malloc
BenchmarkMalloc/PathalogicalNames 76096 15348 ns/op 8000 B/op 500 allocs/op
BenchmarkCache/PathalogicalNames 48894 24289 ns/op 8000 B/op 500 allocs/op // 1.6x slower than Malloc
BenchmarkMap/PathalogicalNames 10000 123864 ns/op 91250 B/op 524 allocs/op // 8.1x slower than Malloc
This tests three different approaches:
- Malloc is using a naive []byte-to-string conversion.
- Cache is using stringCache.
- Map is using a map[string]string as the cache.

This tests two extreme sets of data:
- FrequentNames represents the best-case scenario, where the same 10 strings occur repeatedly.
- Pathological represents the worst-case scenario, where every string is exactly 16B long but all different, resulting in a 100% cache miss. This is pathological for several reasons: 1) every entry in the cache will be populated, 2) the string in every entry is also 16B, so we must do a byte-for-byte comparison, 3) only the last few bytes of the string differ, and 4) every entry results in a miss. Thus, for every string we do O(n) work to hash the string and O(n) work to compare it (where n is the string length).

Some notes:
- A map[string]string rarely out-performs malloc and seems to be a net loss.
- Assuming that most data look more like FrequentNames than PathologicalNames, the use of stringCache is probably a net win.
- Even the pathological case for stringCache is only 1.6x slower than malloc (versus a surprising 8.1x slower for map[string]string).
).@dsnet nice analysis. One more thing to try: use a fixed size array of strings, iterate instead of hashing for lookups, and do random replacement. Something like https://github.com/josharian/go/commit/ab074c9fa67a227ad89565941c501fda18979a88. (There are many variations on this theme available.)
Move-to-front is also a good caching scheme when there are a few very common keys.
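A sketch of that idea (mtfCache and its size are made up for illustration): a tiny array searched linearly, where a hit rotates to the front and a miss evicts the last slot.

```go
package main

import "fmt"

// mtfCache keeps recently interned strings in a small array searched
// linearly. A hit is moved to the front, so a handful of very common
// keys stay cheap; a miss evicts the last (least recently used) slot.
type mtfCache [8]string

func (c *mtfCache) make(b []byte) string {
	if len(b) < 2 || len(b) > 16 {
		return string(b) // don't cache trivial or long strings
	}
	for i, s := range c {
		if s == string(b) { // comparing against string(b) doesn't allocate
			copy(c[1:i+1], c[:i]) // shift earlier entries back one slot...
			c[0] = s              // ...and move the hit to the front
			return s
		}
	}
	s := string(b)
	copy(c[1:], c[:len(c)-1]) // miss: drop the last entry
	c[0] = s
	return s
}

func main() {
	var c mtfCache
	c.make([]byte("id"))
	c.make([]byte("name"))
	s := c.make([]byte("id")) // hit: "id" rotates to the front
	fmt.Println(c[0] == s)
}
```

Misses pay the full linear scan, which matches the concern above that a never-matching workload makes the miss cost proportional to the cache size.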
@josharian and @randall77. I tried a simple linear search in some of my earlier approaches, and it does decently when there is a hit. However, I felt it performed poorly when there were never any matches, since the miss cost is proportional to the maximum number of elements we linearly search over.
One advantage of a linear search (with move-to-front) is that it provides a heuristic for which entry to evict. This avoids the collision problem that my naive hashing approach exhibits (due to the birthday problem). In order to have a decent hit rate on a list of strings with only N unique entries, the cache size needs to be approximately 2*N.
We could consider any of the frequent-item algorithms described here: http://dimacs.rutgers.edu/~graham/pubs/papers/freq.pdf
It might be worth dividing the set of keys up in to buckets, hashing to the bucket, and performing one of the above algorithms on a single bucket.
...with the cheapest possible “hash” being string length, since we are only considering lengths 2..16.
@randall77 and @josharian. Thanks for the idea. I haven't yet applied any of the specific algorithms in the paper you referenced, but I suspect they won't perform well in this specific application since allocation of small strings is already so dang fast that any book-keeping is often just as expensive as malloc itself (defeating the point of string interning).
I incorporated some of the ideas above and some new ideas. The first histogram below is keyed by string length (in bytes); the second groups lengths that share the same lower 3 bits (and thus the same bucket):
1 |> 0
2 |=> 1
3 |======> 6
4 |==========> 10
5 |===========> 11
6 |====================> 20
7 |==============> 14
8 |================> 16
9 |===============> 15
10 |=========> 9
11 |========> 8
12 |========> 8
13 |=====> 5
14 |=======> 7
15 |=======> 7
16 |===> 3
1, 9 |===============> 15
2,10 |==========> 10
3,11 |==============> 14
4,12 |==================> 18
5,13 |================> 16
6,14 |===========================> 27
7,15 |=====================> 21
8,16 |===================> 19
Bucketizing based on length has the advantage of isolating the effects of a given string length from affecting another bucket. However, it has the disadvantage that certain lengths may be under-utilized (i.e, the 2B,10B bucket for the specific data I'm playing with), leading to wasted space.
The current algorithm I came up with is:
type stringCache [256]string
func (c *stringCache) make(b []byte) string {
if len(b) < 2 || len(b) > 16 {
return string(b)
}
var h uint64
{
// Read the string in two parts using wide-integer loads.
// The prefix and suffix may overlap, which is fine.
switch {
case len(b) >= 8:
h ^= uint64(binary.LittleEndian.Uint64(b[:8]))
h *= 0x00000100000001B3 // inspired by FNV-64
h ^= uint64(binary.LittleEndian.Uint64(b[len(b)-8:]))
case len(b) >= 4:
h ^= uint64(binary.LittleEndian.Uint32(b[:4]))
h *= 0x01000193 // inspired by FNV-32
h ^= uint64(binary.LittleEndian.Uint32(b[len(b)-4:]))
default:
h ^= uint64(binary.LittleEndian.Uint16(b[:2]))
h *= 0x010f // inspired by hypothetical FNV-16
h ^= uint64(binary.LittleEndian.Uint16(b[len(b)-2:]))
}
// Collapse a 64-bit, 32-bit, or 16-bit hash into an 8-bit hash.
h ^= h >> 32
h ^= h >> 16
h ^= h >> 8
// The cache has 8 buckets based on the lower 3-bits of the length.
h = ((h << 3) | uint64(len(b)&0b111)) % uint64(len(*c))
}
if s := (*c)[h]; s == string(b) {
return s
}
s := string(b)
(*c)[h] = s
return s
}
Relative to my earlier algorithm, this is decently faster and results in fewer misses:
BenchmarkMalloc/Best 41395 26997 ns/op 9600 B/op 1000 allocs/op
BenchmarkCache1/Best 66477 18359 ns/op 96 B/op 10 allocs/op // 1.47x faster
BenchmarkCache2/Best 88276 13699 ns/op 96 B/op 10 allocs/op // 1.97x faster
BenchmarkMap/Best 39099 30543 ns/op 678 B/op 11 allocs/op // 0.88x faster
BenchmarkMalloc/Real 2503 505818 ns/op 161456 B/op 18529 allocs/op
BenchmarkCache1/Real 2769 440636 ns/op 28960 B/op 2475 allocs/op // 1.15x faster
BenchmarkCache2/Real 4138 302386 ns/op 17712 B/op 1139 allocs/op // 1.67x faster
BenchmarkMap/Real 1668 741258 ns/op 31706 B/op 467 allocs/op // 0.68x faster
BenchmarkMalloc/Worst 72957 16689 ns/op 8000 B/op 500 allocs/op
BenchmarkCache1/Worst 46144 26646 ns/op 8000 B/op 500 allocs/op // 1.60x slower
BenchmarkCache2/Worst 49250 23897 ns/op 8000 B/op 500 allocs/op // 1.43x slower
BenchmarkMap/Worst 10000 135081 ns/op 91233 B/op 524 allocs/op // 8.09x slower
where Cache2 is the algorithm in this post and Cache1 is the earlier algorithm. The Real benchmark was switched to using object names from the dataset provided by @bserdar.
Change https://golang.org/cl/277376 mentions this issue: WIP: runtime: add string interning profile
Part of the motivation for #32593 is the observation in json-iterator/go#376 that when you json decode, many of the same strings appear over and over in the JSON and get copied and reconstructed in the result over and over as well.
We do not want to make the JSON coalesce all the allocated strings into one giant buffer, because then holding on to just one of them holds onto the entire thing.
But it might make sense to add to the decoder state a simple map[[]byte]string and remember which specific byte slices we've already converted to string and reuse those allocated strings instead of doing the same conversions again in a particular Decode or Unmarshal operation.
It's unclear to me whether the map in a Decoder should persist between Decode operations. Probably? That will affect users who put Decoders in a pool or something like that, though. (Solution: don't put decoders in pools.)
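As the nit near the top points out, map[[]byte]string is not a supported map type, but a map[string]string looked up via a string([]byte) conversion achieves the same thing, and the compiler avoids allocating for that lookup. A minimal sketch of the intended decode-state pattern (decodeState and literal are illustrative names, not the actual encoding/json internals):

```go
package main

import "fmt"

type decodeState struct {
	strs map[string]string // remembers []byte -> string conversions
}

// literal returns the string for b, reusing a previously allocated
// copy when the same bytes have been seen before.
func (d *decodeState) literal(b []byte) string {
	// The compiler optimizes m[string(b)] so the lookup itself
	// does not allocate a temporary string.
	if s, ok := d.strs[string(b)]; ok {
		return s
	}
	s := string(b) // allocate once per distinct value
	if d.strs == nil {
		d.strs = make(map[string]string)
	}
	d.strs[s] = s
	return s
}

func main() {
	d := &decodeState{}
	a := d.literal([]byte("user_id"))
	b := d.literal([]byte("user_id"))
	fmt.Println(a == b, len(d.strs)) // repeated keys share one allocation
}
```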