grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

The tailer logic for k8s < 1.29.1 fails when log rotation occurs within a short period of time #1728

Open andrejshapal opened 1 month ago

andrejshapal commented 1 month ago

What's wrong?

Hello,

I have described the initial issue here: https://github.com/grafana/alloy/issues/281#issuecomment-2366746157

But here is a summary: we moved from Promtail to Alloy, and on k8s 1.27.9 we noticed missing logs.

We have added some additional logging to Alloy and deployed a custom image.

Here are the logs:

```
2024-09-23 15:27:44.894 { "ts": "2024-09-23T15:27:44.894138108Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 534 }
2024-09-23 15:27:43.894 { "ts": "2024-09-23T15:27:43.894268288Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 547 }
2024-09-23 15:27:42.894 { "ts": "2024-09-23T15:27:42.893820096Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 7560 }
2024-09-23 15:27:41.894 { "ts": "2024-09-23T15:27:41.893908432Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 6560 }
2024-09-23 15:27:40.894 { "ts": "2024-09-23T15:27:40.8937875Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 5560 }
2024-09-23 15:27:39.893 { "ts": "2024-09-23T15:27:39.89368671Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 4560 }
2024-09-23 15:27:38.894 { "ts": "2024-09-23T15:27:38.893825227Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 3560 }
2024-09-23 15:27:37.894 { "ts": "2024-09-23T15:27:37.893840295Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 2560 }
2024-09-23 15:27:36.893 { "ts": "2024-09-23T15:27:36.893501471Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 1560 }
2024-09-23 15:27:35.894 { "ts": "2024-09-23T15:27:35.893912521Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 560 }
2024-09-23 15:27:34.894 { "ts": "2024-09-23T15:27:34.894203107Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 564 }
2024-09-23 15:27:33.893 { "ts": "2024-09-23T15:27:33.893419212Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 569 }
2024-09-23 15:27:32.894 { "ts": "2024-09-23T15:27:32.893823596Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 584 }
2024-09-23 15:27:31.893 { "ts": "2024-09-23T15:27:31.893457684Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 7596 }
2024-09-23 15:27:30.894 { "ts": "2024-09-23T15:27:30.893917913Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 6597 }
2024-09-23 15:27:29.894 { "ts": "2024-09-23T15:27:29.893786022Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 5597 }
2024-09-23 15:27:28.894 { "ts": "2024-09-23T15:27:28.893908238Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 4597 }
2024-09-23 15:27:27.894 { "ts": "2024-09-23T15:27:27.894179472Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 3597 }
2024-09-23 15:27:26.894 { "ts": "2024-09-23T15:27:26.89372454Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 2596 }
2024-09-23 15:27:25.894 { "ts": "2024-09-23T15:27:25.893717623Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 1596 }
2024-09-23 15:27:24.894 { "ts": "2024-09-23T15:27:24.893689075Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 596 }
2024-09-23 15:27:23.894 { "ts": "2024-09-23T15:27:23.894213388Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 597 }
2024-09-23 15:27:22.894 { "ts": "2024-09-23T15:27:22.894372921Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 601 }
2024-09-23 15:27:21.894 { "ts": "2024-09-23T15:27:21.89371982Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 0 }
2024-09-23 15:27:20.893 { "ts": "2024-09-23T15:27:20.893379645Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 3600000, "time_since_last": 0 }
2024-09-23 15:27:19.893 { "ts": "2024-09-23T15:27:19.893231414Z", "level": "info", "msg": "opened log stream", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "start time": "2024-09-23T15:27:19.858953096Z" }
2024-09-23 15:27:19.859 { "ts": "2024-09-23T15:27:19.858957491Z", "level": "warn", "msg": "tailer stopped; will retry", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "err": "http2: response body closed" }
2024-09-23 15:27:19.859 { "ts": "2024-09-23T15:27:19.858838042Z", "level": "info", "msg": "have not seen a log line in 3x average time between lines, closing and re-opening tailer", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 2000000000, "time_since_last": 6598969922 }
2024-09-23 15:27:19.858 { "ts": "2024-09-23T15:27:19.858783811Z", "level": "info", "msg": "Checking for gap", "target": "pomerium/pomerium-proxy-769fff946d-q5dhh:pomerium", "component_path": "/default_config.default_config.default", "component_id": "loki.source.kubernetes.pod_logs", "rolling_average": 2000, "time_since_last": 6598 }
```

The tailer restarts when time_since_last exceeds the rolling_average (the log message above shows the threshold is 3x the average time between lines). Everything was fine before the first restart at 2024-09-23 15:27:19.859. After the restart, the rolling_average was reset to the default of 1h (https://github.com/andrejshapal/alloy/blob/ed2822643060bccd7b7a66d8ecbb764cdc19f589/internal/component/loki/source/kubernetes/kubetail/tail_utils.go#L74). The pod produced 3 logs, and at 2024-09-23 15:27:45.893 something happened on the k8s side again. But the tailer does not calculate a new rolling_average because the log count is < 100. This means that if log rotation occurs while fewer than 100 logs have been produced since the last tailer restart, the tailer will not be restarted for up to 1h.
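
To illustrate the failure mode, here is a minimal Go sketch of the behaviour described above. It is not the actual kubetail implementation; the type and field names are made up, and only the 1h default, the ~100-sample requirement, and the 3x multiplier come from the code link and log messages above:

```go
package main

import (
	"fmt"
	"time"
)

// rollingAverageCalculator is a simplified stand-in for the calculator in
// kubetail/tail_utils.go; names and structure here are illustrative only.
type rollingAverageCalculator struct {
	timestamps []time.Time
	minEntries int           // assumed: below this count the default is returned
	defaultAvg time.Duration // the 1h default used right after a tailer restart
}

func (c *rollingAverageCalculator) record(t time.Time) {
	c.timestamps = append(c.timestamps, t)
}

// average returns the mean gap between consecutive log lines, falling back
// to the default while too few lines have been seen since the last restart.
func (c *rollingAverageCalculator) average() time.Duration {
	if len(c.timestamps) < c.minEntries {
		return c.defaultAvg
	}
	var total time.Duration
	for i := 1; i < len(c.timestamps); i++ {
		total += c.timestamps[i].Sub(c.timestamps[i-1])
	}
	return total / time.Duration(len(c.timestamps)-1)
}

func main() {
	c := &rollingAverageCalculator{minEntries: 100, defaultAvg: time.Hour}
	// Only 3 lines arrive after the restart, one second apart.
	now := time.Now()
	for i := 0; i < 3; i++ {
		c.record(now.Add(time.Duration(i) * time.Second))
	}
	// With fewer than minEntries samples the gap check still compares
	// against the 1h default, so a broken stream goes unnoticed for ~1h.
	fmt.Println(c.average()) // 1h0m0s
}
```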

I think 1h is a very high value for both k8s < 1.29.1 and k8s > 1.29.1. For k8s > 1.29.1 the tailer reset should happen every 30m; for k8s < 1.29.1, where the difference between logs cannot be calculated, every 5m.

The k8s API should be able to take such load.
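
As a rough sketch of that proposal (the function name and signature are hypothetical; only the 3x multiplier comes from the log message above), the restart threshold could be capped by a hard maximum that depends on the k8s version:

```go
package tailsketch

import "time"

// capGapThreshold bounds the gap threshold that triggers a tailer restart.
// maxReset would be e.g. 30m for k8s > 1.29.1 and 5m for older clusters,
// where the gap caused by a log rotation cannot be detected reliably.
func capGapThreshold(rollingAverage, maxReset time.Duration) time.Duration {
	threshold := 3 * rollingAverage // the tailer is re-opened after ~3x the average gap
	if threshold > maxReset {
		return maxReset
	}
	return threshold
}
```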

Steps to reproduce

  1. Start Alloy.
  2. Produce a few logs in the service.
  3. Rotate the logs (or otherwise break the log flow).
  4. Observe that the Alloy tailer for this service is restarted only after 1h.

System information

No response

Software version

No response

Configuration

No response

Logs

No response

andrejshapal commented 1 month ago

I would very much appreciate it if the Alloy maintainers could suggest the best approach to resolve this issue. Also, it would be great to understand whether there is any protection against the same thing happening (kubectl -f failing due to some circumstances) on k8s > 1.29.1, because it looks like using the k8s API for tailing logs is much more dangerous from a log delivery SLA point of view compared to the pod_logs approach.

ptodev commented 1 month ago

Hello! Apologies for the late response. I'm not very familiar with this, but I don't quite see why restarting the tailer once every hour is a problem if the volume of logs is low. Wouldn't Alloy just resume tailing from the previous log file?

IIUC, the issue with log rotations in k8s < 1.29.1 is that we might miss a log rotation event. But if the volume of the logs is low, presumably no log rotation happened and it doesn't matter when Alloy restarts the tailer?

In your comment on the other issue, you mentioned that at some point even new logs aren't being sent. Did the status of the container change? Maybe Alloy decided that there are no more logs from that container and decided not to start new tailers.