To Reproduce
Steps to reproduce the behavior:
Run Loki with -target=all.
Load it with some data (I used the canary program).
Make sure some chunks get flushed to disk. The bug will not occur if no chunks have been flushed to the store.
Run the query: {stream="stdout",name="loki-canary"} |= "p" | json |= "p"
Expected behavior
The query should run normally, without runaway memory growth.
Environment:
Infrastructure: repro'd on my laptop and in a k8s environment
The loki config that I was using locally to repro:
Describe the bug
A dramatic memory leak occurs with a specific kind of query when running Loki in all-in-one mode (-target=all). v3 is affected; the 2.x line is fine. We had pods spike up to 100GB in a matter of minutes.
A query that causes the leak to happen:
{stream="stdout",name="loki-canary"} |= "p" | json |= "p"
These queries work fine,
The leak seems to be specific to having line filters both before and after the parser expression. (Note: I didn't test much around this.)
I did some digging and I think the issue is here:
https://github.com/grafana/loki/blob/f8977587476169197d6da4d7055b97b189808344/pkg/logql/syntax/ast.go#L144
pkg/logql/syntax/ast.go, in the reorderStages() function, specifically with the combineFilters() function.
The function seems to have a side effect: it modifies the original request's AST, and with multiple queries all running in parallel, the AST grows very large. There's a heap dump below.
I was able to get it working by changing this line:
for _, s := range m {
    switch f := s.(type) {
    case *LineFilterExpr:
        filters = append(filters, f)
to
for _, s := range m {
    switch f := s.(type) {
    case *LineFilterExpr:
        filters = append(filters, MustClone(f))
Cloning the filter expressions means combineFilters() works on copies and can no longer change the original request's AST.
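To illustrate the kind of side effect I mean, here is a small self-contained sketch. It is not Loki's code: lineFilter, combine, and cloneAll are hypothetical stand-ins for *LineFilterExpr, combineFilters(), and the MustClone call, and the linking logic is my simplified guess at the general shape of the problem.

```go
package main

import "fmt"

// lineFilter is a simplified, hypothetical stand-in for a line filter AST
// node: each filter can chain to an earlier one through Left.
type lineFilter struct {
	Match string
	Left  *lineFilter
}

// clone returns a deep copy of the filter and everything reachable via Left.
func (f *lineFilter) clone() *lineFilter {
	if f == nil {
		return nil
	}
	return &lineFilter{Match: f.Match, Left: f.Left.clone()}
}

// cloneAll deep-copies every filter in the slice.
func cloneAll(in []*lineFilter) []*lineFilter {
	out := make([]*lineFilter, len(in))
	for i, f := range in {
		out[i] = f.clone()
	}
	return out
}

// combine links the filters into a single chain by mutating them in place:
// each earlier filter is attached below the leftmost leaf of the last one.
func combine(in []*lineFilter) *lineFilter {
	result := in[len(in)-1]
	for i := len(in) - 2; i >= 0; i-- {
		leaf := result
		for leaf.Left != nil {
			leaf = leaf.Left
		}
		leaf.Left = in[i]
	}
	return result
}

func main() {
	// Two nodes standing in for the two `|= "p"` stages of the parsed query.
	before, after := &lineFilter{Match: "p"}, &lineFilter{Match: "p"}
	stages := []*lineFilter{before, after}

	// With clones, the shared nodes stay untouched however often this runs.
	for i := 0; i < 3; i++ {
		combine(cloneAll(stages))
	}
	fmt.Println("original mutated after cloned combines:", after.Left != nil) // false

	// Without clones, a single call already rewires the shared nodes.
	combine(stages)
	fmt.Println("original mutated after in-place combine:", after.Left != nil) // true
}
```

In this toy model, once the shared nodes have been linked, every later in-place call starts from an already modified structure, which is the sort of compounding I suspect is behind the growth. Cloning first keeps each call independent of the others.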
Heap dump while the leak was occurring:
Thanks, Mark.