Open andrejshapal opened 1 month ago
I would very much appreciate it if the Alloy maintainers could suggest the best approach to resolving this issue. It would also be great to understand whether there is any protection against the same thing happening (`kubectl -f` failing due to some circumstances) on k8s > 1.29.1, because it looks like using the k8s API to tail logs is much more dangerous from a log-delivery SLA point of view compared to the pod_logs approach.
Hello! Apologies for the late response. I'm not very familiar with this, but I don't quite see why restarting the tailer once every hour is a problem if the volume of logs is low. Wouldn't Alloy just resume tailing from the previous log file?
IIUC, the issue with log rotations in k8s < 1.29.1 is that we might miss a log rotation event. But if the volume of the logs is low, presumably no log rotation happened and it doesn't matter when Alloy restarts the tailer?
In your comment on the other issue, you mentioned that at some point even new logs aren't being sent. Did the status of the container change? Maybe Alloy concluded that there were no more logs from that container and decided not to start new tailers.
What's wrong?
Hello,
I have described initial issue here: https://github.com/grafana/alloy/issues/281#issuecomment-2366746157
But I will write the summary: we moved from Promtail to Alloy, and on k8s 1.27.9 we noticed missing logs.
We have added some additional logging to alloy and deployed custom image:
Here are the logs:
The tailer restarts when `time_since_last > rolling_average`. Everything was fine before the first restart at 2024-09-23 15:27:19.859. After the restart, the `rolling_average` was reset to the default 1h (https://github.com/andrejshapal/alloy/blob/ed2822643060bccd7b7a66d8ecbb764cdc19f589/internal/component/loki/source/kubernetes/kubetail/tail_utils.go#L74). The pod produced 3 logs, and at 2024-09-23 15:27:45.893 something happened on the k8s side again. But the tailer does not calculate a new `rolling_average` because the log count is < 100. This means that if log rotation occurs while fewer than 100 logs have been produced since the last tailer restart, the tailer will only be restarted after 1h.

I think 1h is a very high value for both k8s < 1.29.1 and k8s > 1.29.1. For k8s > 1.29.1 the tailer reset should be done every 30m; for k8s < 1.29.1, every 5m when the difference between logs cannot be calculated. The k8s API should be able to take such load.
Steps to reproduce
System information
No response
Software version
No response
Configuration
No response
Logs
No response