rs / zerolog

Zero Allocation JSON Logger
MIT License
10.39k stars 566 forks source link

Log call blocked goroutine for 5 minutes #619

Closed ViRb3 closed 9 months ago

ViRb3 commented 10 months ago

I'm not sure if this is due to zerolog itself, the Go runtime/GC, or overall system resource starvation, but I was running a rather complex program and noticed that a critical goroutine blocked for about 5 minutes. When I looked at the logs produced by zerolog, it seems like it was stuck between two log print lines for the entire blocking duration.

Main function that initializes zerolog. I fork to both console and file:

func main() {
    var logWriters []io.Writer
    logFile, err := os.OpenFile(CLI.LogFile, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
    if err != nil {
        log.Fatal().Err(err).Str("path", CLI.LogFile).Msg("open log file")
    }
    logWriters = append(logWriters, NewLevelWriter(logFile, zerolog.DebugLevel))
    var consoleWriter = zerolog.ConsoleWriter{Out: os.Stdout}
    logWriters = append(logWriters, NewLevelWriter(consoleWriter, zerolog.InfoLevel))
    log.Logger = log.Output(zerolog.MultiLevelWriter(logWriters...))
}

The goroutine that got blocked:

func StartPullLogs() {
    cfg := config.Get()
    ticker := time.NewTicker(time.Duration(cfg.PullLogsInterval) * time.Second)
    go func() {
        for ; true; <-ticker.C {
            ticker.Reset(time.Duration(cfg.PullLogsInterval) * time.Second)
            log.Debug().Msg("running pull logs loop")
            if err := doLogs(); err != nil {
                log.Err(err).Msg("pull logs loop") // <-- this line was printed
            }
            log.Debug().Msg("done pull logs loop") // <-- and this one, after 5 minutes!!
        }
    }()
}

The logs I found:

{"level":"error","error":"27_000 characters long error message here","time":"2023-11-18T18:09:58Z","message":"pull logs loop"}
// a lot of other logs here, so the rest of the program was running fine in this interval
{"level":"debug","time":"2023-11-18T18:15:29Z","message":"done pull logs loop"}

Any ideas are greatly appreciated, thank you!

rs commented 9 months ago

Your log output is probably blocking the write.

ViRb3 commented 9 months ago

Could you please elaborate? A ton of other logs were successfully written to disk in the 5 minute interval, so the file was not locked.

rs commented 9 months ago

If the output was a file, it is possible that some hardware storage issue or IO contention blocked your write on that file. The write on your output is the only thing that could block zerolog.

If you are using zerolog in the critical path, I would suggest to use a diode writer to make sure your log output performance does not impact your service.

ViRb3 commented 9 months ago

This makes a lot of sense, I have applied your recommendation and will keep an eye out. Thanks a lot for the help!