Promtail is now considered “feature complete” and will be in maintenance mode. Requests for new features should be made against Grafana Alloy, Grafana Labs’ distribution of the OpenTelemetry Collector.
Is your feature request related to a problem? Please describe.
This feature is related to the issue I have raised:
We have found that if the latency of pushing logs from Promtail is high enough, and the throughput of logs is high enough, entire log files on Kubernetes are missed due to log rotation and all of the logs within them are lost. If latency and throughput stay consistently high, so that the rate at which logs are written exceeds the rate at which they are shipped, we are not sure anything can be done. However, we would like a solution that helps avoid log loss when latency spikes for brief periods, by ensuring lines are still being read and are sent as soon as possible, so that data in log files is not missed.
Any feedback is appreciated; we are happy to contribute the change but would like input/buy-in.
Describe the solution you'd like
Promtail works by reading lines and sending them down the Tail.Lines channel; from there they are picked up and sent down the channel returned by api.EntryHandler.Chan(), where they are picked up again and pushed to the backend by the client package. This whole process happens synchronously, meaning that while we are waiting to send a batch via the client and receive the response, the channels are blocked and no more lines are read.
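To make the blocking concrete, here is a minimal, self-contained sketch using simplified stand-ins rather than the actual Promtail types: once the client's push is slow, sends back up through both channels and the reader stops making progress.

```go
package main

import (
	"fmt"
	"time"
)

// entry is a simplified stand-in for a tailed log line.
type entry struct {
	line string
}

func main() {
	lines := make(chan entry)   // stands in for Tail.Lines
	handler := make(chan entry) // stands in for api.EntryHandler.Chan()

	// Reader: forwards tailed lines to the handler channel.
	go func() {
		for e := range lines {
			handler <- e // blocks while a push to the backend is in flight
		}
		close(handler)
	}()

	// Client: pushes each entry to the backend synchronously.
	go func() {
		for e := range handler {
			time.Sleep(2 * time.Second) // simulated high-latency push
			fmt.Println("sent:", e.line)
		}
	}()

	// Producer (the tailer): while a push is in flight these sends back up,
	// so further lines are not read from the file.
	start := time.Now()
	for i := 0; i < 3; i++ {
		lines <- entry{line: fmt.Sprintf("log line %d", i)}
		fmt.Printf("read line %d after %v\n", i, time.Since(start).Round(time.Second))
	}
	close(lines)
	time.Sleep(10 * time.Second) // let the client drain before exiting
}
```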
Instead, if we pushed the read lines onto a queue, we could continue reading lines while waiting for the previous batch to be sent, and once latency comes back down we could push the lines waiting in the queue. It would make sense to cap the size of the queue, by number of lines or by byte volume, to avoid exceeding memory limits where applicable. If latency does not come back down, the channel would once again become blocked and we could still miss log files, which is the same situation we are in now.
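As a rough sketch of the idea (reusing the entry stand-in from the sketch above; bufferEntries and maxQueued are illustrative names, not existing Promtail identifiers), a goroutine could sit between the two channels, holding up to a capped number of lines in memory and falling back to today's blocking behaviour once the cap is hit:

```go
// bufferEntries decouples reading from sending: lines accumulate in an
// in-memory queue (capped by line count here; a byte cap would work the
// same way) while the downstream send is slow, and drain once it recovers.
func bufferEntries(in <-chan entry, out chan<- entry, maxQueued int) {
	var queue []entry
	defer close(out)
	for {
		// Only offer to send downstream when something is queued; a nil
		// channel makes that select case block forever.
		var sendCh chan<- entry
		var next entry
		if len(queue) > 0 {
			sendCh = out
			next = queue[0]
		}
		select {
		case e, ok := <-in:
			if !ok {
				// Input closed: flush whatever is still queued, then stop.
				for _, q := range queue {
					out <- q
				}
				return
			}
			if len(queue) >= maxQueued {
				// Cap reached: block on the downstream send, which is the
				// same behaviour we have today.
				out <- queue[0]
				queue = queue[1:]
			}
			queue = append(queue, e)
		case sendCh <- next:
			queue = queue[1:]
		}
	}
}
```

The reader would write into `in` exactly as it writes into the handler channel today, so only the wiring between the two sides changes.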
Describe alternatives you've considered
The other approach I considered was to start a new tailer every time a log file is rotated and a new one is created. This would help ensure lines continue to be read, although it would be a much larger and more complex change to the implementation of Promtail.
In addition, when a log file is rotated its name changes but the file descriptor remains the same, so the tailer still thinks it is tailing the original log file. State is tracked in the agent in a few places using the log file's name, for example the readers and the positions, so I think this would get pretty complicated. A hypothetical sketch of the rotation-detection part is below.
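For completeness, here is a hypothetical sketch (not how Promtail is structured today) of what the rotation-triggered part of that alternative could look like, using the fsnotify library to watch the log directory; startTailer and the watched path are placeholders, not existing Promtail identifiers.

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// startTailer is a placeholder: in Promtail this would create a new tailer
// keyed to the newly created file.
func startTailer(path string) {
	log.Printf("would start a new tailer for %s", path)
}

func main() {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// Watch the directory that holds the pod's log files (example path).
	if err := watcher.Add("/var/log/pods/example"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			// A Create event after rotation means there is a new file to
			// tail, while the old tailer keeps draining the rotated
			// descriptor it already holds open.
			if event.Op&fsnotify.Create == fsnotify.Create {
				startTailer(event.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}
```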