Closed · paskal closed this issue 3 years ago
Hey team! The failures of the test are caused by the lack of synchronization when updating the cache, which is not necessarily a bad thing: this way the cache is more performant in a multithreaded environment and still eventually consistent.

I mean the `LruCache.Get()` function, and in particular this part:
```go
c.backend.Add(key, data)
if s, ok := data.(Sizer); ok {
	atomic.AddInt64(&c.currentSize, int64(s.Size()))
	if c.maxCacheSize > 0 && atomic.LoadInt64(&c.currentSize) > c.maxCacheSize {
		for atomic.LoadInt64(&c.currentSize) > c.maxCacheSize {
			c.backend.RemoveOldest()
		}
	}
}
```
If the requirement is that none of the client goroutines should ever observe `c.currentSize` greater than `c.maxCacheSize`, then this entire code block must be a single atomic operation, which would be much slower for the clients. As it is not atomic, fluctuations of `currentSize` above `maxCacheSize` are expected, especially with a high number of parallel goroutines doing the updates.
A possible scenario could be:

1. A goroutine reads `size := c.size()` in the test (line #228).
2. Meanwhile, other goroutines keep adding entries faster than `c.backend.RemoveOldest()` evicts them.
3. The goroutine observes a `c.currentSize` that is way above the `c.maxCacheSize`, and the test fails at line #229.

Eventually, when execution of all the goroutines is complete, the cache will converge to the correct state, but during the execution it can fluctuate pretty significantly.
I ran the tests locally, and in all cases the test fails at line #229, which is inside the goroutine. As the goroutines can observe the fluctuations, you probably should not assume any upper limit (the `size < 200` check).
If you are ok with the fluctuations of the actual cache size, then I would suggest removing the `require.True(t, size < 200 ...)` assertion in the goroutine here and keeping only the one after the loop here (the `assert.True(t, c.size() < 123 && c.size() >= 0)`).
What do you think?
@chychkan, thanks a lot for the investigation!
@umputun what do you think about removing the problematic test line and acknowledging that the size limit for the cache is a loose guarantee? I don't see a reason to sacrifice speed to enforce the limit, and I would vote for altering the test, as proposed by Oleksandr.
With probability higher than 20% locally and higher than 50% on GitHub Actions, `TestCache_MaxCacheSizeParallel/LruCache` fails with a race condition. It shows up as `LruCache.currentSize` getting wrong values under concurrent changes, despite the fact that all changes are done using `atomic.AddInt64` and `atomic.LoadInt64`. Example in GitHub Actions.

If somebody would be able to fix the code in a way that it won't fail, that would be great.