paddlesteamer closed this pull request 4 years ago.
I just did a quick test, but it looks like this change is quite a big performance regression:
benchmark old ns/op new ns/op delta
BenchmarkCacheGetExpiring-2 121 154 +27.27%
BenchmarkCacheGetNotExpiring-2 35.6 71.1 +99.72%
BenchmarkRWMutexMapGet-2 32.3 31.9 -1.24%
BenchmarkRWMutexInterfaceMapGetStruct-2 82.2 75.7 -7.91%
BenchmarkRWMutexInterfaceMapGetString-2 65.5 64.3 -1.83%
BenchmarkCacheGetConcurrentExpiring-2 117 223 +90.60%
BenchmarkCacheGetConcurrentNotExpiring-2 85.5 167 +95.32%
BenchmarkRWMutexMapGetConcurrent-2 71.9 55.7 -22.53%
BenchmarkCacheGetManyConcurrentExpiring-2 117 128 +9.40%
BenchmarkCacheGetManyConcurrentNotExpiring-2 102 161 +57.84%
BenchmarkCacheSetExpiring-2 177 260 +46.89%
BenchmarkCacheSetNotExpiring-2 75.3 160 +112.48%
BenchmarkRWMutexMapSet-2 69.7 66.6 -4.45%
BenchmarkCacheSetDelete-2 163 243 +49.08%
BenchmarkRWMutexMapSetDelete-2 149 146 -2.01%
BenchmarkCacheSetDeleteSingleLock-2 114 199 +74.56%
BenchmarkRWMutexMapSetDeleteSingleLock-2 101 99.4 -1.58%
BenchmarkIncrementInt-2 120 139 +15.83%
BenchmarkDeleteExpiredLoop-2 2880788 5902806 +104.90%
BenchmarkShardedCacheGetExpiring-2 154 180 +16.88%
BenchmarkShardedCacheGetNotExpiring-2 57.0 89.8 +57.54%
BenchmarkShardedCacheGetManyConcurrentExpiring-2 88.0 99.4 +12.95%
BenchmarkShardedCacheGetManyConcurrentNotExpiring-2 36.5 54.7 +49.86%
The pointers also make the GC work harder, so there's probably some indirect performance regression there too (which may already be accounted for in the above benchmarks, didn't look in-depth).
A simpler way to fix this would be to just lock everything for the entire function instead of switching from RLock() to Lock(). This would also cause some performance regression, but only when you call GetWithExpirationUpdate(), rather than affecting everything (even if you don't use OnEvict).
Hi again,

I removed the defers I put in the previous commits and the benchmark results are better now. Can you run the benchmark tests again?

defer got a big performance boost in Go 1.14, so it doesn't make that much of a difference any more. By the way, to be clear: I'm not the maintainer of this library; I've just been going through some PRs and issues to include in my fork. I already did some benchmarks on defer this morning, and it made very little difference. I included new benchmarks at the bottom; as you can see, it doesn't make that much of a difference.
The basic issue is that locks are not free, for example take this simple benchmark:
// BenchmarkMain-2   46259625   24.3 ns/op
func BenchmarkMain(b *testing.B) {
	var s struct{ sync.Mutex }
	for n := 0; n < b.N; n++ {
		s.Lock()
		s.Unlock()
	}
}
So that's the minimum amount of performance regression that can be expected for operations that add a new lock.
The other issue is that pointers aren't free either; I included a benchmark below which changes just map[string]Item to pointers, and as you can see that causes quite a big regression too.
For reference, here's what I did in my fork:
// Touch replaces the expiry of a key and returns the current value.
func (c *cache) Touch(k string, d time.Duration) (interface{}, bool) {
	if d == DefaultExpiration {
		d = c.defaultExpiration
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	item, ok := c.items[k]
	if !ok {
		return nil, false
	}
	item.Expiration = time.Now().Add(d).UnixNano()
	c.items[k] = item
	return item.Object, true
}
And the benchmark for this compared to your version:
// BenchmarkTouch-2   5750259   197 ns/op   My Touch
// BenchmarkTouch-2   4155210   284 ns/op   Your GetWithExpirationUpdate
func BenchmarkTouch(b *testing.B) {
	tc := New(DefaultExpiration, 0)
	d := 5 * time.Second
	tc.Set("foo", 0, DefaultExpiration)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tc.GetWithExpirationUpdate("foo", d)
	}
}
This is a simple benchmark of course; I tried to make a benchmark which made your version look better in cases where you're also doing other stuff in goroutines, for example:
// BenchmarkTouch-2   6014732   200 ns/op   My Touch
// BenchmarkTouch-2   4213928   284 ns/op   Your GetWithExpirationUpdate
func BenchmarkTouch(b *testing.B) {
	tc := New(DefaultExpiration, 0)
	d := 5 * time.Second
	tc.Set("foo", 0, DefaultExpiration)
	go func() {
		for {
			tc.Get("foo")
			time.Sleep(1 * time.Millisecond)
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tc.GetWithExpirationUpdate("foo", d)
	}
}
I tried a bunch of different variations of this, but I couldn't come up with anything where your version had better performance.
I don't know ... maybe I missed something – benchmarking realistic workloads is hard – but from what I can see, just doing c.mu.Lock() for the duration of Touch() or GetWithExpirationUpdate() seems to be better in almost all cases?
For your latest commit, here's the comparison with the original PR:
benchmark old ns/op new ns/op delta
BenchmarkCacheGetExpiring-2 154 152 -1.30%
BenchmarkCacheGetNotExpiring-2 71.1 68.1 -4.22%
BenchmarkRWMutexMapGet-2 31.9 31.9 +0.00%
BenchmarkRWMutexInterfaceMapGetStruct-2 75.7 76.0 +0.40%
BenchmarkRWMutexInterfaceMapGetString-2 64.3 63.3 -1.56%
BenchmarkCacheGetConcurrentExpiring-2 223 201 -9.87%
BenchmarkCacheGetConcurrentNotExpiring-2 167 183 +9.58%
BenchmarkRWMutexMapGetConcurrent-2 55.7 59.0 +5.92%
BenchmarkCacheGetManyConcurrentExpiring-2 128 181 +41.41%
BenchmarkCacheGetManyConcurrentNotExpiring-2 161 153 -4.97%
BenchmarkCacheSetExpiring-2 260 266 +2.31%
BenchmarkCacheSetNotExpiring-2 160 163 +1.88%
BenchmarkRWMutexMapSet-2 66.6 66.5 -0.15%
BenchmarkCacheSetDelete-2 243 247 +1.65%
BenchmarkRWMutexMapSetDelete-2 146 146 +0.00%
BenchmarkCacheSetDeleteSingleLock-2 199 203 +2.01%
BenchmarkRWMutexMapSetDeleteSingleLock-2 99.4 99.2 -0.20%
BenchmarkIncrementInt-2 139 136 -2.16%
BenchmarkDeleteExpiredLoop-2 5902806 5908243 +0.09%
BenchmarkShardedCacheGetExpiring-2 180 188 +4.44%
BenchmarkShardedCacheGetNotExpiring-2 89.8 84.0 -6.46%
BenchmarkShardedCacheGetManyConcurrentExpiring-2 99.4 100 +0.60%
BenchmarkShardedCacheGetManyConcurrentNotExpiring-2 54.7 51.2 -6.40%
And here's the comparison with master:
benchmark old ns/op new ns/op delta
BenchmarkCacheGetExpiring-2 121 152 +25.62%
BenchmarkCacheGetNotExpiring-2 35.6 68.1 +91.29%
BenchmarkRWMutexMapGet-2 32.3 31.9 -1.24%
BenchmarkRWMutexInterfaceMapGetStruct-2 82.2 76.0 -7.54%
BenchmarkRWMutexInterfaceMapGetString-2 65.5 63.3 -3.36%
BenchmarkCacheGetConcurrentExpiring-2 117 201 +71.79%
BenchmarkCacheGetConcurrentNotExpiring-2 85.5 183 +114.04%
BenchmarkRWMutexMapGetConcurrent-2 71.9 59.0 -17.94%
BenchmarkCacheGetManyConcurrentExpiring-2 117 181 +54.70%
BenchmarkCacheGetManyConcurrentNotExpiring-2 102 153 +50.00%
BenchmarkCacheSetExpiring-2 177 266 +50.28%
BenchmarkCacheSetNotExpiring-2 75.3 163 +116.47%
BenchmarkRWMutexMapSet-2 69.7 66.5 -4.59%
BenchmarkCacheSetDelete-2 163 247 +51.53%
BenchmarkRWMutexMapSetDelete-2 149 146 -2.01%
BenchmarkCacheSetDeleteSingleLock-2 114 203 +78.07%
BenchmarkRWMutexMapSetDeleteSingleLock-2 101 99.2 -1.78%
BenchmarkIncrementInt-2 120 136 +13.33%
BenchmarkDeleteExpiredLoop-2 2880788 5908243 +105.09%
BenchmarkShardedCacheGetExpiring-2 154 188 +22.08%
BenchmarkShardedCacheGetNotExpiring-2 57.0 84.0 +47.37%
BenchmarkShardedCacheGetManyConcurrentExpiring-2 88.0 100 +13.64%
BenchmarkShardedCacheGetManyConcurrentNotExpiring-2 36.5 51.2 +40.27%
master vs. just changing map[string]Item to map[string]*Item:
benchmark old ns/op new ns/op delta
BenchmarkCacheGetExpiring-2 121 119 -1.65%
BenchmarkCacheGetNotExpiring-2 35.6 34.3 -3.65%
BenchmarkRWMutexMapGet-2 32.3 31.9 -1.24%
BenchmarkRWMutexInterfaceMapGetStruct-2 82.2 77.9 -5.23%
BenchmarkRWMutexInterfaceMapGetString-2 65.5 63.3 -3.36%
BenchmarkCacheGetConcurrentExpiring-2 117 109 -6.84%
BenchmarkCacheGetConcurrentNotExpiring-2 85.5 101 +18.13%
BenchmarkRWMutexMapGetConcurrent-2 71.9 58.6 -18.50%
BenchmarkCacheGetManyConcurrentExpiring-2 117 110 -5.98%
BenchmarkCacheGetManyConcurrentNotExpiring-2 102 109 +6.86%
BenchmarkCacheSetExpiring-2 177 253 +42.94%
BenchmarkCacheSetNotExpiring-2 75.3 150 +99.20%
BenchmarkRWMutexMapSet-2 69.7 66.5 -4.59%
BenchmarkCacheSetDelete-2 163 238 +46.01%
BenchmarkRWMutexMapSetDelete-2 149 146 -2.01%
BenchmarkCacheSetDeleteSingleLock-2 114 186 +63.16%
BenchmarkRWMutexMapSetDeleteSingleLock-2 101 99.2 -1.78%
BenchmarkIncrementInt-2 120 119 -0.83%
BenchmarkDeleteExpiredLoop-2 2880788 5611693 +94.80%
BenchmarkShardedCacheGetExpiring-2 154 157 +1.95%
BenchmarkShardedCacheGetNotExpiring-2 57.0 54.9 -3.68%
BenchmarkShardedCacheGetManyConcurrentExpiring-2 88.0 87.3 -0.80%
BenchmarkShardedCacheGetManyConcurrentNotExpiring-2 36.5 35.9 -1.64%
Thank you for your long, detailed response. I just had time to do benchmarks myself and you're right: it's obvious that using a single mutex is faster. Converting the map value type to a pointer was also a bad idea and caused a regression.
So, I will switch to your version of Touch/GetWithExpirationUpdate in my fork, and I'm closing this PR since your version is better. I know this repository has gone stale, but I encourage you to open a PR for your Touch implementation since it may be useful for those looking for that functionality, even though it will probably never be merged.

Thanks again and have a good day!
You're welcome @paddlesteamer; I mostly just wanted to know this for my own reasons :-)
I put up the fork over here the other day, by the way: https://github.com/zgoat/zcache – I'm not 100% decided on some of the things I added, so I haven't released a v1 yet, but all the existing stuff should be compatible, FYI :-)
This PR is a fixed version of #96. Main changes are:

- cache.items is converted from map[string]Item to map[string]*Item. I needed to do this because, in GetWithExpirationUpdate, it is the only way to modify the Expiration field of an Item. The other way around (re-setting the item) needs a write lock and therefore blocks all reads/writes to items. Not convenient for 'cache-get's.
- Item has its own RWLock. This way, we don't need a write lock in GetWithExpirationUpdate.

Supersedes #125