grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

tail channel closed, stopping tailer #6503

Open yangvipguang opened 2 years ago

yangvipguang commented 2 years ago

Describe the bug: Promtail often fails to read logs and logs a "tail channel closed, stopping tailer" error. Restarting Promtail returns it to normal.

Version: 2.5.0 Platform: Centos7

dannykopping commented 2 years ago

Hi @yangvipguang. I think this occurs when your file is rotated.

jeschkies commented 2 years ago

This might be a known issue. I'll take a look.

yangvipguang commented 2 years ago

@dannykopping yes, the node log files are rotated. app: node, logger: log4js-node

dannykopping commented 2 years ago

@yangvipguang are you actually missing logs when this occurs, or are you just asking about the error in the logs?

This looks like expected behaviour: when a file is rotated it is renamed (while keeping the same inode) and then eventually deleted once its contents are compressed. I would expect these log lines to show up in those situations.

yangvipguang commented 2 years ago

@dannykopping yes, we are actually missing logs when the file is rotated.

frittentheke commented 1 year ago

It appears Promtail really only tracks positions by filename https://github.com/grafana/loki/blob/fb9a1989d856c78c5aaa43c56ece6f215669e066/clients/pkg/promtail/positions/positions.go#L111 and only recognizes the rotation of a file if it is actively running and watching while the rotation happens:
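To illustrate the problem, here is a deliberately simplified sketch (not the actual Loki code) of a positions store keyed by file path. The type and method names are hypothetical; the point is only that a rename makes the saved offset unreachable under the new name, matching the `Seeked ... Offset:0` line in the log excerpt below:

```go
package main

import "fmt"

// Positions is a hypothetical simplification of Promtail's positions file:
// byte offsets keyed by file path, not by inode.
type Positions struct {
	offsets map[string]int64 // path -> last read byte offset
}

func (p *Positions) Put(path string, off int64) { p.offsets[path] = off }

// Get returns 0 for an unknown path, i.e. "start from the beginning".
func (p *Positions) Get(path string) int64 { return p.offsets[path] }

func main() {
	p := &Positions{offsets: map[string]int64{}}
	p.Put("/tmp/logs/log1.log", 12)

	// After rotation the same inode lives at a new path. Looking the file up
	// by its new name finds no saved position, so reading restarts at 0.
	fmt.Println(p.Get("/tmp/logs/log1.log"))   // 12: offset for the original name
	fmt.Println(p.Get("/tmp/logs/log1.log.1")) // 0: offset lost for the rotated name
}
```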

This is how this is logged:

level=info ts=2022-07-21T21:06:59.436123136Z caller=tailer.go:143 component=tailer msg="tail routine: started" path=/tmp/logs/log1.log
ts=2022-07-21T21:06:59.436160443Z caller=log.go:168 level=info msg="Seeked /tmp/logs/log1.log - &{Offset:12 Whence:0}"
level=info ts=2022-07-21T21:07:29.263830553Z caller=filetargetmanager.go:179 msg="received file watcher event" name=/tmp/logs/log1.log.1 op=CREATE
ts=2022-07-21T21:07:29.293354294Z caller=log.go:168 level=info msg="Re-opening moved/deleted file /tmp/logs/log1.log ..."
ts=2022-07-21T21:07:29.293476537Z caller=log.go:168 level=info msg="Waiting for /tmp/logs/log1.log to appear..."
level=info ts=2022-07-21T21:07:29.436953384Z caller=tailer.go:202 component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/log1.log
level=info ts=2022-07-21T21:07:29.437157557Z caller=filetarget.go:291 msg="removing directory from watcher" directory=/tmp/logs
level=info ts=2022-07-21T21:07:29.437240091Z caller=tailer.go:116 component=tailer msg="position timer: exited" path=/tmp/logs/log1.log
level=info ts=2022-07-21T21:07:29.437294963Z caller=tailer.go:202 component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/log1.log
level=info ts=2022-07-21T21:07:29.43736309Z caller=tailer.go:159 component=tailer msg="tail routine: tail channel closed, stopping tailer" path=/tmp/logs/log1.log reason=null
level=info ts=2022-07-21T21:07:29.437418767Z caller=tailer.go:152 component=tailer msg="tail routine: exited" path=/tmp/logs/log1.log
level=info ts=2022-07-21T21:07:29.437451053Z caller=tailer.go:240 component=tailer msg="stopped tailing file" path=/tmp/logs/log1.log
level=info ts=2022-07-21T21:08:29.436230999Z caller=filetarget.go:278 msg="watching new directory" directory=/tmp/logs
level=info ts=2022-07-21T21:08:29.436281994Z caller=tailer.go:143 component=tailer msg="tail routine: started" path=/tmp/logs/log1.log
ts=2022-07-21T21:08:29.436300892Z caller=log.go:168 level=info msg="Seeked /tmp/logs/log1.log - &{Offset:0 Whence:0}"

Certainly Promtail tries to read the remaining data from the just "deleted" (actually moved) file, but what happens when that read fails or does not finish before Promtail is stopped?

There is documentation on how Promtail behaves when a file is truncated (or replaced by a new file of the same name during log rotation): https://github.com/grafana/loki/blob/main/docs/sources/clients/promtail/troubleshooting.md#a-tailed-file-is-truncated-while-promtail-is-not-running=

There it is clearly stated that if the new file is already bigger than the last position saved for that file (by name), there is data loss (as in "not shipped" to Loki), because there is no way to reconcile the two. There really seems to be no way around introducing the inode into the positions tracking to abstract away from the file name: the name changes in most log rotation implementations. The positions and filewatcher logic could also be much simpler if finding candidates were just a matter of matching by pattern (read: name plus potential wildcards), with all position tracking keyed by inode. Adding some optimistic sorting on the creation date of the inode would even allow backfilling logs in (likely) the proper order.

Last but not least, using inodes would also remove the current issue of a wildcard file pattern causing double (or multiple) ingestion of the "same" file / data, as described in https://github.com/grafana/loki/pull/3655, and turn it into a feature:

If the inode were used to identify each file, reading a rotated-away file could also be resumed, just like Elastic documents for their Filebeat shipper: https://www.elastic.co/guide/en/beats/filebeat/current/file-log-rotation.html#file-log-rotation

frittentheke commented 1 year ago

@dannykopping would you potentially be open to discussing adding the inode as a key for Promtail file tailing, to avoid the known and documented issues with log rotation?

cstyan commented 8 months ago

The following may be relevant: https://github.com/grafana/loki/issues/8039#issuecomment-1424204802