Promtail: Use `inotify` based file watcher instead of `polling` in Linux

tldr;

Promtail uses tail package as dependency to read file and watch for changes. The tail package supports two ways of watching the file changes (say when newline appended, file removed, file rotated etc)

Polling
Inotify

And Promtail uses polling all the time (even hard coded).

The problem is, with polling tailer performs poorly. Particularly when a log file is rotated very frequently. This leads to skipping some log lines during such cases.. The better approach is to use inotify based watcher.

NOTE: Linux tail command uses inotify based watcher

Experiment to justify the problem

NOTE: Experiment is performed on Linux machine running MANJARO with kernel 5.15.81

Gist is we generate logs that is being logrotated frequently (when size exceeds 1M) and we create our custom tailer with two different modes (polling and inotify) to understand the difference. Finally we compare total volume of logs with 4 things

How much log volume originally generated
How much log volume read by tailer in polling mode
How much log volume read by tailer in inotify mode
How much log volume read by Linux command tail.

We use (1) and (4) to justify if any logs are missed by our tailer in different modes.

Tailer in polling mode

We use following snippet with exact config promtail currently uses

func main() {
    t, err := tail.TailFile("/home/kavirajk/generated.log", tail.Config{
        Follow:    true,
        Poll:      true, // use polling instead of inotify
        ReOpen:    true,
        MustExist: true,
        Location: &tail.SeekInfo{
            Offset: 0,
            Whence: 0,
        },
    })
    if err != nil {
        panic(err)
    }

    for line := range t.Lines {
        fmt.Println(line.Text)
    }
}

Tailer in inotify mode

Same as above with Poll: False.

func main() {
    t, err := tail.TailFile("/home/kavirajk/generated.log", tail.Config{
        Follow:    true,
        Poll:      false, // use inotify instead
        ReOpen:    true,
        MustExist: true,
        Location: &tail.SeekInfo{
            Offset: 0,
            Whence: 0,
        },
    })
    if err != nil {
        panic(err)
    }

    for line := range t.Lines {
        fmt.Println(line.Text)
    }
}

Logrotate config

We use following logrotate config to rotate the generated log (/home/kavirajk/generated.log) when size exceeds 1M

/home/kavirajk/generated.log {
  size 1M
  copytruncate
}

NOTE: Here when file reaches1M, original file is just truncated without moving old contents to any new files (say generated.log.1)

Steps

Create original log file that is source of all the tailer here and this is the file we logrotate frequently
```
touch /home/kavirajk/generated.log
```
Start tailer in polling mode. All the log lines it reads goes into tailer-polling-generated.log
```
./tailer-polling | tee tailer-polling-generated.log
```

2. Start tailer in inotify mode. All the log lines it reads goes into `tailer-inotify-generated.log`

./tailer-inotify | tee /home/kavirajk/tailer-inotify-generated.log

3. Start Linux `tail` on the same generated log file. All the log lines it reads goes into `fromcmdtail-generated.log`

tail -f /home/kavirajk/generated.log | tee fromcmdtail-generated.log

4. Also start `logrotate` to rotate the original log file `generated.log`

watch -n 1 logrotate generated.conf -s .state.logrotate.status -v

5. While all the tailers are waiting for the input, finally generate the log via `flog`

time flog -f json -d 1ms -w -l | tee /home/kavirajk/untouched-generated.log >> /home/kavirajk/generated.log

This generates JSON log line every 1ms forever. We write the log lines to two files 
1. `generated.log` which will be input for all the tailer and will be logroated when reaches size of 1M
2. `untouched-generated.log` which is non-logrotated that contains all the log lines that are generated. We use it to compare the total log volume.

After running it for a while, here are the results.

~$ ls -l *-generated.log .rw-r--r-- 116M kavirajk 19 dec 08:55 fromcmdtail-generated.log .rw-r--r-- 116M kavirajk 19 dec 08:55 tailer-inotify-generated.log .rw-r--r-- 113M kavirajk 19 dec 08:55 tailer-polling-generated.log .rw-r--r-- 116M kavirajk 19 dec 08:55 untouched-generated.log



Here `tailer-polling-generated.log` skipped more than 3MB of log lines. The difference will keep increasing as we generate more logs.

Where as `tailer-inotify-generated.log` produced same amount of logs compared to original `untouched-generated.log`. Same with log generated by Linux command `tail` (`fromcmdtail-generated.log`)

## Proposal
Use `inotify` based file watcher in Promtail tailer instead of `polling` in Linux.

I think this log pending issue with promtail + logrotation is also related to this issue I explained here.

Hey @kavirajk . I managed to reproduce the issue also. However, all ways that we used in test have the gaps...

❯ ls -la | grep generated.log
-rw-r--r--   1 vdiachenko  admin  78999088 Dec 22 10:18 fromcmdtail-generated.log
-rw-r--r--   1 vdiachenko  admin  78999121 Dec 22 10:18 tailer-inotify-generated.log
-rw-r--r--   1 vdiachenko  admin  77134964 Dec 22 10:18 tailer-polling-generated.log
-rw-r--r--   1 vdiachenko  admin  78999636 Dec 22 10:18 untouched-generated.log

As you can see, untouched-generated.log has a size 78999636, but the rest of the files have different sizes, even the file from tail -f (fromcmdtail-generated.log). inotify tail is most close to the original size but not exact...

With any solution, we might get the gaps if we just truncate the original log file... I think if we tune the polling tailer it could also work for us...

However, the root cause of the issue is the way how the log files are rotated... I think we need to support only the mode when we move the current log file (rename) and create a new (empty) file for the logs... In this case, the tailer can still continue reading the data from renamed file(because it has opened connection to the renamed file) and after this tailer will reopen the newly created file... https://github.com/grafana/tail/blob/master/tail.go#L211 But even in this case it might be some gaps not on tailer level but on log rotation level, when the old file is moved but a new one is not created ... this case described more here https://www.datadoghq.com/blog/log-file-control-with-logrotate/#create-or-copy-log-files-to-manage-rotation

Thanks @vlad-diachenko :)

Agree that there is no solution to fix the problem 100%, given the logs are rotated frequently just by droping the old log files without copying it. This particular proposal was just to show how inotify performed better than polling w.r.t to single file log rotation (even in your experiment, the log loss is greater with polling than with inotify if you notice).

But then as a bigger picture I started reading about the unstable nature of inotify (looks like there is hard limit on how much events it can be notified about, which can easily break at scale) plus resource conception can be huge. So integrating it with tool like promtail with goal to scrape lot of log files, can lead to worse behaviour.

NOTE: Totally agree to have some recommendation on "how to logrotate" effectively. However the thing that lead to have this experimentation in the first place is, with exact same log rotation setup (replacing old log file without copy), FileBeat able to fetch the logs where as Promtail couldn't.

I see it as two different action items here.

Document how to log rotate effectively (just immediate truncate without copy is bad)
Do our best with promtail polling to avoid skipping logs as much as possible, even if logs are rotated by just truncating it frequently

For (2) I want to do

Have the batch size of promtail configurable, and have higher throughput and also lesser impact on polling to keep up.
Sping up tailer per file (similar to what Filebeat does

Thanks @vlad-diachenko :)

Agree that there is no solution to fix the problem 100%, given the logs are rotated frequently just by droping the old log files without copying it. This particular proposal was just to show how inotify performed better than polling w.r.t to single file log rotation (even in your experiment, the log loss is greater with polling than with inotify if you notice).

But then as a bigger picture I started reading about the unstable nature of inotify (looks like there is hard limit on how much events it can be notified about, which can easily break at scale) plus resource conception can be huge. So integrating it with tool like promtail with goal to scrape lot of log files, can lead to worse behaviour.

NOTE: Totally agree to have some recommendation on "how to logrotate" effectively. However the thing that lead to have this experimentation in the first place is, with exact same log rotation setup (replacing old log file without copy), FileBeat able to fetch the logs where as Promtail couldn't.

I see it as two different action items here.

Document how to log rotate effectively (just immediate truncate without copy is bad)

Do our best with promtail polling to avoid skipping logs as much as possible, even if logs are rotated by just truncating it frequently

For (2) I want to do

Have the batch size of promtail configurable, and have higher throughput and also lesser impact on polling to keep up.

Sping up tailer per file (similar to what Filebeat does

Sounds great ;) let me know If I can help with anything ) I find it interesting so we could make it together if you wish ;)

grafana / loki