go-graphite / carbonzipper

proxy to transparently merge graphite carbon backends
Other
104 stars 29 forks source link

Concurrent requests to same metric crash the program #66

Closed gunnihinn closed 6 years ago

gunnihinn commented 6 years ago

Run current HEAD, make it talk to any store, and make concurrent requests to any single metric with cache disabled:

target='something'
while true; do
    curl -s 'http://127.0.0.1:8081/render/?target=$target&noCache=1&format=json' > /dev/null &
done

This will crash the carbonzipper process with a close of a closed channel; see [1].

The stack trace points to a manual call of a method that had been defer-called above, and which does close a channel. This is a red herring, as the issue is still reproducible after removing the manual call.

The real issue could be that the QueryItem cache keys are constructed just from the request data, so two requests for the same target that come in close enough to each other may be able to double-write a *QueryItem to the same cache key before one of them is locked. Once the cache key gets filled in, two or more goroutines then close the same channel.

Running with cache disabled only serves to exhibit the issue more quickly; we see this in production at Booking.com with the cache enabled.

[1] Stack trace:

panic: close of closed channel
goroutine 8297 [running]:
panic(0xf05d80, 0x107ab90)
/usr/lib64/go/src/runtime/panic.go:551 +0x3d9 fp=0xc429405110 sp=0xc429405070 pc=0x431629
runtime.closechan(0xc432f70840)
/usr/lib64/go/src/runtime/chan.go:333 +0x2b3 fp=0xc429405170 sp=0xc429405110 pc=0x405e23
github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/cache.(*QueryItem).StoreAbort(0xc421176b40)
/home/redacted/go/src/github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/cache/query.go:59 +0xb8 fp=0xc4294051b8 sp=0xc429405170 pc=0xce9608
github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).Fetch(0xc4201f8960, 0x1084500, 0xc428615ac0, 0xc4211027c0, 0x0, 0x0, 0x0)
/home/redacted/go/src/github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:258 +0xcc2 fp=0xc429405918 sp=0xc4294051b8 pc=0xceda72
github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).doSingleFetch(0xc4204750e0, 0x1084500, 0xc4222d7500, 0xc4222100c0, 0x10894a0, 0xc4201f8960, 0xc4211027c0, 0xc422f29bc0, 0xc422f29b00)
/home/redacted/go/src/github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:182 +0x8ff fp=0xc429405f98 sp=0xc429405918 pc=0xcebe1f
runtime.goexit()
/usr/lib64/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc429405fa0 sp=0xc429405f98 pc=0x463051
created by github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).Fetch
/home/redacted/go/src/github.com/go-graphite/carbonapi/vendor/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:214 +0x920
Civil commented 6 years ago

Cache key is intentionally the same, because otherwise there is no point in caching :)

I think there might be another broken thing: https://github.com/go-graphite/carbonzipper/blob/master/zipper/cache/query.go#L37-L44

I've used "nil, false" before to detect that "We need to fetch data!". But after refactor we are also returning "nil, false" in case of "ctx.Done()" - and that's actually wrong - it should be "nil, true" and check should be also modified here: https://github.com/go-graphite/carbonzipper/blob/master/zipper/broadcast/broadcast_group.go#L199

Civil commented 6 years ago

I've wrote a test that can reproduce that crash 20% of times (other 80% it just locks up).

I just cancel request and issue dozen of fetches in parallel. Will commit test later today (even if won't be able to figure out what's the cause of lock up)

Civil commented 6 years ago
==================
WARNING: DATA RACE
Write at 0x00c420466368 by goroutine 11:
  github.com/go-graphite/carbonzipper/zipper/cache.(*QueryItem).StoreAbort()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/cache/query.go:58 +0xa6
  github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).Find()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:386 +0x1177
  github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).doSingleFetch()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:141 +0xcf4

Previous read at 0x00c420466368 by goroutine 14:
  github.com/go-graphite/carbonzipper/zipper/cache.(*QueryItem).FetchOrLock()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/cache/query.go:45 +0xdb
  github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).Find()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:316 +0x461
  github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).doSingleFetch()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:141 +0xcf4

Goroutine 11 (running) created at:
  github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).Fetch()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:214 +0x95d
  github.com/go-graphite/carbonzipper/zipper/broadcast.TestFetchRequestsWithTimeout.func1.1()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group_test.go:1302 +0xc0

Goroutine 14 (running) created at:
  github.com/go-graphite/carbonzipper/zipper/broadcast.(*BroadcastGroup).Fetch()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group.go:214 +0x95d
  github.com/go-graphite/carbonzipper/zipper/broadcast.TestFetchRequestsWithTimeout.func1.1()
      /home/civil/go/src/github.com/go-graphite/carbonzipper/zipper/broadcast/broadcast_group_test.go:1302 +0xc0
==================

https://github.com/go-graphite/carbonzipper/commit/cfa47cd8a8de0ec23a29c5500a153a524b80d5db

But that might be not related to this issue.

Civil commented 6 years ago

On my system current test always panics with "panic: close of closed channel" (required approx. 10 runs to actually panic)

Civil commented 6 years ago

Feel free to reopen if problem appears again.