thanos-io / promql-engine

Multi-threaded PromQL engine implementation based on the Volcano paper.
Apache License 2.0
142 stars 55 forks source link

goroutine leak in drainBufferOnCancel #93

Closed GiedriusS closed 2 years ago

GiedriusS commented 2 years ago

image

Here's how memory consumption looks like with the engine on the latest commit. 40k+ (and growing) goroutines are stuck on:

goroutine profile: total 58657
47854 @ 0x43cd76 0x406ab0 0x406898 0x1cb6453 0x46e2a1
#   0x1cb6452   github.com/thanos-community/promql-engine/execution/exchange.(*concurrencyOperator).drainBufferOnCancel+0x32    /home/giedrius/go/pkg/mod/github.com/vinted/promql-engine@v0.0.0-20221024105757-4c9896d26166/execution/exchange/concurrent.go:88

10700 @ 0x43cd76 0x44cc3c 0x1ca48cb 0x46e2a1
#   0x1ca48ca   github.com/thanos-community/promql-engine/worker.(*Worker).start+0x20a  /home/giedrius/go/pkg/mod/github.com/vinted/promql-engine@v0.0.0-20221024105757-4c9896d26166/worker/worker.go:61

Full goroutine dump: promql_engine.txt

GiedriusS commented 2 years ago

https://github.com/thanos-community/promql-engine/blob/8e5169a06e7c1012fd2e3cf8785ac0850452c02f/engine/engine.go#L190 sounds like we should always cancel this context but we don't for some reason?

fpetkovski commented 2 years ago

I also think this is the culprit. We used to have it here: https://github.com/thanos-community/promql-engine/pull/83/files#diff-943d2c24d27405c73a2fe762aa3a28660ef3157100e46ab70defc5267fdb796bL44 but I think that PR removed it.

cc @bwplotka

GiedriusS commented 2 years ago

https://github.com/thanos-community/promql-engine/pull/94 I added a fix here, at least it fixes this particular case :smile: I see no reason not to always cancel the context at the end of execution.