grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0
23.93k stars 3.45k forks source link

Promtail: go_routine leak. #9726

Open tristanmorgan opened 1 year ago

tristanmorgan commented 1 year ago

Describe the bug There seems to be a go_routine leak dealing with a SYSLOG collector. possibly related to #8054.

To Reproduce Steps to reproduce the behavior:

  1. Started Loki (2.8.2)
  2. Started Promtail (2.8.2 arm64)
  3. Configure a syslog collector (UDP)
  4. wait for go_routines to build up.

Expected behavior Expected behaviour is for go_routines to be created and removed over time but only a few to be long lived.

Environment:

Screenshots, Promtail config, or terminal output below is a partial stack trace showing one of the many go_routines sitting idle.

goroutine 667 [select, 8428 minutes]:
io.(*pipe).read(0x40017190e0, {0x4000acc000, 0x1000, 0x4000477560?})
        /usr/local/go/src/io/pipe.go:57 +0x84
io.(*PipeReader).Read(...)
        /usr/local/go/src/io/pipe.go:136
bufio.(*Reader).Read(0x4001719140, {0x4000acc000, 0x1000, 0x0?})
        /usr/local/go/src/bufio/bufio.go:223 +0x118
bufio.(*Reader).fill(0x40009b0d80)
        /usr/local/go/src/bufio/bufio.go:106 +0xfc
bufio.(*Reader).ReadSlice(0x40009b0d80, 0x5c?)
        /usr/local/go/src/bufio/bufio.go:372 +0x30
bufio.(*Reader).collectFragments(0x4001708780?, 0x0?)
        /usr/local/go/src/bufio/bufio.go:447 +0x60
bufio.(*Reader).ReadBytes(0x5723080?, 0x68?)
        /usr/local/go/src/bufio/bufio.go:474 +0x1c
github.com/leodido/ragel-machinery/parser.(*DelimitedReader).Read(0x400011fd40)
        /src/loki/vendor/github.com/leodido/ragel-machinery/parser/arbitrary_reader.go:63 +0xc0
github.com/leodido/ragel-machinery/parser.(*Parser).Parse(0x40009ac060)
        /src/loki/vendor/github.com/leodido/ragel-machinery/parser/parser.go:71 +0x50
github.com/influxdata/go-syslog/v3/nontransparent.(*machine).Parse(0x400089d650, {0x3b6c860?, 0x4001719140?})
        /src/loki/vendor/github.com/influxdata/go-syslog/v3/nontransparent/parser.go:259 +0x84
github.com/grafana/loki/clients/pkg/promtail/targets/syslog/syslogparser.ParseStream({0x3b6fd00?, 0x40004318a0?}, 0x40009f6870, 0x2000)
        /src/loki/clients/pkg/promtail/targets/syslog/syslogparser/syslogparser.go:27 +0x308
github.com/grafana/loki/clients/pkg/promtail/targets/syslog.(*UDPTransport).handleRcv(0x4000164eb0, 0x40004318a0)
        /src/loki/clients/pkg/promtail/targets/syslog/transport.go:386 +0x148
created by github.com/grafana/loki/clients/pkg/promtail/targets/syslog.(*UDPTransport).acceptPackets
        /src/loki/clients/pkg/promtail/targets/syslog/transport.go:374 +0x4b4
chaudum commented 1 year ago

Hi @tristanmorgan Thanks for the report.

I think I already found the issue with the goroutine leak. The UDP transport of the syslog target creates a new goroutine for each remote address, see https://github.com/grafana/loki/blob/74a4dca54e4077b65a39c18b8390e76870b01526/clients/pkg/promtail/targets/syslog/transport.go#L366C9-L372

        stream, ok := streams[addr.String()]
        if !ok {
            stream = NewConnPipe(addr)
            streams[addr.String()] = stream
            t.openConnections.Add(1)
            go t.handleRcv(stream)
        }

These are only cleaned up when Promtail shuts down.

The fix would be to clean up these streams regularly in case they had been idle for some time and haven't seen any activity.

tristanmorgan commented 1 year ago

Thanks for responding so quickly. Would a timeout be possible (probably an upstream change)? Given this is UDP transport there is no FIN type message to close the connection. I wish I was more capable in Go to be able to try implement it.

chaudum commented 1 year ago

Thanks for responding so quickly. Would a timeout be possible (probably an upstream change)? Given this is UDP transport there is no FIN type message to close the connection.

Correct. Promtail would need to handle the idle timeout for each "connection" itself by tracking the timestamp of the last packet received.

I wish I was more capable in Go to be able to try implement it.

No worries. I will take a look now.