grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0
23.85k stars 3.44k forks source link

Promtail error "fsnotify queue overflow" in case of a large amount of log lines per second #2134

Open shootkin opened 4 years ago

shootkin commented 4 years ago

Describe the bug If I create multiple .log files, say 4, and append log lines to them with high rate per second, say 50 000 log lines per second to each file, I see such errors in promtail:

level=error ts=2020-05-26T10:47:55.406210161Z caller=filetarget.go:186 msg="error from fswatch" error="fsnotify queue overflow"

Also I see slowness in appearing new log lines from these files in loki.

To Reproduce Steps to reproduce the behavior:

  1. Started Loki version 1.5.0 (branch: HEAD, revision: 12c7eab8b)
  2. Started Promtail version 1.5.0 (branch: HEAD, revision: 12c7eab8b) and tail /any/path/*.log
  3. Create heavy load to log files here: /any/path/*.log

Expected behavior I don't expect any error, and I expect fast appearing logs' lines in loki.

Environment:

Screenshots, Promtail config, or terminal output I've tried to increase fs.inotify.max_queued_events kernel parameter to 1048576. After that, I see the same error, but a bit later than earlier (after 2-3 minutes of tailing log files)

cyriltovena commented 4 years ago

This is a limitation from the fsnotify we are using. https://github.com/fsnotify/fsnotify

Seems like we need an option to allow for polling instead of using this library. That sounds like a lot of work.

cyriltovena commented 4 years ago

Hey @shootkin,

I've done some investigations and this should not affect the tailing of your 4 files, it should still work even though this errors show up.

Basically we use fsnotify to pick up quickly new files, this is to solve the use case where files are showing up and disappearing quickly. But we always actually sync all files manually every 10 seconds by default. (you can change that if you want https://github.com/grafana/loki/blob/master/docs/clients/promtail/configuration.md#target_config)

Regarding tailing files, we always poll every 100ms.

I'm going to work on a PR to make the fsnotify optional so that you can deactivate it since in your use case you don't need this.

Also you should definitively increase your batch size with that volume. (https://github.com/grafana/loki/blob/master/docs/clients/promtail/configuration.md#client_config)

Is this log noisy ? I was thinking may be we should also throttle that error.

Let me know if that works for you.

shootkin commented 4 years ago

Hi @cyriltovena, Thank you for investigation! So it means that the slowness is not due to this error. I also think it is better to remove this error from logs, cause it can mislead.

I increased batch size to 20 megabytes, but the slowness still persists. It matters for me, because I use logrotate, and if promtail is slow, then it will miss some log files.

Promtail tries to read log file to its end despite of the fact that this file was rotated (and it is good behaviour). But while promtail is processing old log file to its end, newly created log file is rotated too. Then promtail gets only the third file, and missed the second.

For now, promtail works for me only if I create separate instance of promtail for each log file. And the rate limit which I see is about 30 000 lines per second. Above this limit, promtail can miss some log files because of logrotate.

cyriltovena commented 4 years ago

How does promtail misses the second file? Does it gets deleted?

shootkin commented 4 years ago

Here is what actually happens:

  1. I start promtail
  2. I start logrotate with 100 megabytes threshold
  3. I start generating log with rate 50 000 lines per second.
  4. After 150 seconds log file has 7 500 000 lines in it, and 100mb threshold is reached. At this moment, I have only 3 000 000 lines in loki. Then logrotate moves this file to another path, and then new file is created.
  5. Promtail reads original (first) file to the end. When it reaches the end, it starts reading newly created file. At this moment, we already have about 6 000 000 log lines in new file.
  6. Promtail reads the new (second) file to the end. After the end reached, logrotate already rotated next (third) file and the fourth file is in place.
  7. Promtail starts reading the fourth file. But 3rd file was completely missed by promtail: this 3rd file was rotated while promtail was reading 2nd file.

So, I have 1, 2, and 4 files on loki, but 3rd file is missed.

frittentheke commented 1 year ago

May I bring up the suggestion to add inode numbers to the tracking of files to read and ship to make rotations of files a lot less error prone? See my comment at https://github.com/grafana/loki/issues/6503#issuecomment-1191957798

DanielCastronovo commented 1 year ago

Same problem here, did you have a workaround ? Because containerD does not support logrotation

cstyan commented 1 year ago

While we haven't made a formal decision yet, we expect in the near future that all new feature work will be done in the Agent's log collection pipelines rather than in Promtail. That being said, potentially this could be considered a bug fix and could still go into promtail.

Is https://github.com/grafana/loki/issues/6503 still a relevant fix?