Closed · paskal closed this issue 3 years ago
Hey team! The failures of the test are caused by the lack of synchronization when updating the cache, which is not necessarily a bad thing: this way the cache is more performant in a multithreaded environment and still eventually consistent.

I mean the `LruCache.Get()` function, and in particular this part:
```go
c.backend.Add(key, data)
if s, ok := data.(Sizer); ok {
	atomic.AddInt64(&c.currentSize, int64(s.Size()))
	if c.maxCacheSize > 0 && atomic.LoadInt64(&c.currentSize) > c.maxCacheSize {
		for atomic.LoadInt64(&c.currentSize) > c.maxCacheSize {
			c.backend.RemoveOldest()
		}
	}
}
```
If the requirement is that none of the client goroutines should ever observe `c.currentSize` greater than `c.maxCacheSize`, then this entire code block must be a single atomic operation, which would be much slower for the clients. As it is not atomic, fluctuations of `currentSize` above `maxCacheSize` are expected, especially with a high number of parallel goroutines doing the updates.
A possible scenario could be:

1. A goroutine reads `size := c.size()` in the test (line #228).
2. Meanwhile, other goroutines keep adding entries faster than `c.backend.RemoveOldest()` evicts them.
3. The goroutine observes a `c.currentSize` that is way above the `c.maxCacheSize`, and the test fails at line #229.

Eventually, when execution of all the goroutines is complete, the cache will converge to the correct state, but during the execution it can fluctuate pretty significantly.
I ran the tests locally, and in all cases the test fails at line #229, which is inside the goroutine. As the goroutines can observe the fluctuations, you probably should not assume any upper limit (the `size < 200` check).
If you are ok with the fluctuations of the actual cache size, then I would suggest removing the `require.True(t, size < 200 ...)` assertion in the goroutine here and keeping only the one after the loop here (the `assert.True(t, c.size() < 123 && c.size() >= 0)`).
What do you think?
@chychkan, thanks a lot for the investigation!
@umputun what do you think about removing the problematic test line and acknowledging that the size limit for the cache is a loose guarantee? I don't see a reason to sacrifice speed to enforce the limit, and I would vote for altering the test, as proposed by Oleksandr.
With probability higher than 20% locally and higher than 50% on GitHub Actions, `TestCache_MaxCacheSizeParallel/LruCache` fails with a race condition. It shows up as `LruCache.currentSize` getting wrong values under concurrent changes, despite the fact that all changes are done using `atomic.AddInt64` and `atomic.LoadInt64`. Example in GitHub Actions.

If somebody would be able to fix the code in a way that it won't fail, that would be great.