scalalang2 / golang-fifo

Modern efficient cache design with simple FIFO queue only in Golang
MIT License
142 stars 7 forks source link

Data race? #21

Closed richardartoul closed 10 months ago

richardartoul commented 10 months ago

My test suite caught this:

Write at 0x00c02f86dee0 by goroutine 3037:
  github.com/scalalang2/golang-fifo/sieve.(*Sieve[go.shape.struct { BucketURL string; FileName string; Offset int; Length int },go.shape.interface { GoAsync(func() (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error)); GoSync(func() (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error)); IsDone() bool; Reject(error); Resolve(github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue]); ResolveOrReject(github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error); Wait() (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error); WaitCtx(context.Context) (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error) }]).Get()
      /home/runner/go/pkg/mod/github.com/scalalang2/golang-fifo@v0.1.4/sieve/sieve.go:54 +0x1de
...

Previous write at 0x00c02f86dee0 by goroutine 3007:
  github.com/scalalang2/golang-fifo/sieve.(*Sieve[go.shape.struct { BucketURL string; FileName string; Offset int; Length int },go.shape.interface { GoAsync(func() (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error)); GoSync(func() (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error)); IsDone() bool; Reject(error); Resolve(github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue]); ResolveOrReject(github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error); Wait() (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error); WaitCtx(context.Context) (github.com/warpstreamlabs/warpstream/pkg/cache.TWrapper[github.com/warpstreamlabs/warpstream/pkg/filecache.FooterCacheValue], error) }]).Get()
      /home/runner/go/pkg/mod/github.com/scalalang2/golang-fifo@v0.1.4/sieve/sieve.go:54 +0x1de
func (s *Sieve[K, V]) Get(key K) (value V, ok bool) {
    s.lock.RLock()
    defer s.lock.RUnlock()
    if e, ok := s.items[key]; ok {
        e.Value.(*entry[K, V]).visited = true
        return e.Value.(*entry[K, V]).value, true
    }

    return
}

Looks like concurrent writes while holding an RLock?

richardartoul commented 10 months ago

I could see this being an intentional performance thing if setting a boolean field compiles down to a single instruction 🤔 but it makes using this library in any codebase that runs tests with the race detector enabled impossible

scalalang2 commented 10 months ago

@richardartoul Thank you for reporting.

I've researched one thing that can cause this.

  1. Go map is not addressable which means the pointed value may change during performing operations
  2. Changes in pointed value can occur when go map rebalances itself to ensure O(1) access.

I think that there are two possible solutions here

Hot Fixes

This is a critical bug, I'll handle it with a hotfix and conduct further researchs for this.

scalalang2 commented 10 months ago

I also want to hear @richardartoul opinions about this.

richardartoul commented 10 months ago

Your hot fix looks good to me, but I don’t understand the two points you made above. The issue is just that there are two concurrent goroutines loading the same cache value at the same time and they both end up modifying the visited field because they’re not holding an exclusive lock.

another alternative would be to add a sub-lock inside the entry struct and acquire that before marking visited as true. For now though I’d probably merge your existing hot fix since the more complex solutions probably require detailed concurrent benchmarking.