liam-howe-maersk opened this issue 1 month ago
can you share your loki config file please? also how many pods are running on your cluster?
@TheRealNoob Sure,
```
➜ ~ kubectl get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
distributor        12/12   12           12          354d
envoy              5/5     5            5           354d
querier            28/28   28           28          354d
query-frontend     2/2     2            2           354d
query-scheduler    2/2     2            2           354d
rollout-operator   1/1     1            1           354d

➜ ~ kubectl get statefulsets
NAME                      READY   AGE
compactor                 3/3     353d
index-gateway             3/3     352d
ingester-zone-a           15/15   213d
ingester-zone-b           15/15   354d
ingester-zone-c           15/15   354d
memcached                 15/15   354d
memcached-frontend        2/2     354d
memcached-index-queries   4/4     354d
ruler                     10/10   346d
```
We found this document on log rotation: https://grafana.com/docs/loki/latest/send-data/promtail/logrotation/#configure-promtail. Following the advice at the bottom, we set `batchsize` in Promtail to 8M, and during our load tests it seems we are now getting all logs. Push latency has roughly halved since increasing `batchsize`, which I guess explains why we no longer see logs being dropped (the client settings are sketched after the quote below). This is good, but as mentioned in the docs it is only a short-term workaround; the docs also say:
> For a long-term solution, we strongly recommend changing the log rotation strategy to rename and create.
However, we are hosted on AKS using the containerd CRI, which according to that same document means we are already using the rename-and-create strategy, so this does not seem to be a permanent solution for us.
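For reference, the client-side change amounts to roughly the following; this is a sketch rather than our exact config, and `batchwait` is shown at its default purely for context:

```yaml
clients:
  - url: http://loki-instance/loki/api/v1/push
    # Default batchsize is 1 MiB; raising it means fewer, larger pushes,
    # which roughly halved our push latency under load.
    batchsize: 8388608  # 8 MiB
    batchwait: 1s       # default, shown for context only
```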
My question is why the otel-collector using the filelog receiver does not seem to have this same issue. Is it because it uses a higher batch size by default, so we simply don't see the problem there, or because it uses a different strategy for tailing logs than Promtail, one that doesn't lead to this log loss? I don't have enough insight into this at the moment, so I'd appreciate feedback from anyone who does.
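For context, the collector side of our comparison was a fairly standard Kubernetes filelog setup; roughly the shape below, with paths and parsing illustrative rather than our exact collector config:

```yaml
receivers:
  filelog:
    # Tail the same pod log files on the node that Promtail tails.
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    # CRI/containerd parsing operators omitted for brevity.
```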
I'm still concerned that `batchsize` is a workaround; are there any other suggestions for more permanent fixes that would ensure we don't lose logs when push latency is high?
I believe I understand the issue better now. Our application is hosted on AKS, which means a log file is rotated every 50MB: the file currently being tailed is renamed and a new file is created for subsequent logs. If the Promtail tailing process is still working through a log file and, in the meantime, that file is rotated twice, then all logs written between the first and second rotation are never picked up by Promtail and are lost. Once Promtail is finished with the original file it picks up the latest log file again, meaning the one that was rotated out in between is never tailed (the kubelet settings that drive this rotation are sketched below).
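For anyone checking their own cluster, the rotation is controlled by the kubelet; on AKS with containerd the relevant settings default to roughly the following, quoted as I understand the defaults rather than anything we set explicitly:

```yaml
# KubeletConfiguration fragment (defaults as I understand them on AKS/containerd)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 50Mi   # rename + create once the live file reaches 50 MiB
containerLogMaxFiles: 5     # keep at most 5 files per container
```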
I'm wondering whether the Promtail process could be changed so that, if a log file is rotated while the previous file is still being tailed, that data is not missed even if it is rotated out again. Failing that, I'm also wondering whether logs or metrics could be added to Promtail that would tell us when a log file has been missed entirely.
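Until something like that exists, the closest thing we have is alerting on the condition that precedes the loss, i.e. high push latency. A rough Prometheus rule sketch, assuming Promtail's /metrics endpoint is scraped; metric and label names are as we understand Promtail's output, and the threshold is illustrative:

```yaml
groups:
  - name: promtail-push-latency
    rules:
      - alert: PromtailPushLatencyHigh
        # 99th percentile latency of successful pushes to Loki over 5 minutes.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(promtail_request_duration_seconds_bucket{status_code=~"2.."}[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Promtail push latency is high; rotated log files may be skipped before they are read"
```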
FYI, Promtail is now considered “feature complete” and will be in maintenance mode. New users should use Grafana Alloy, Grafana Labs’ distribution of the OpenTelemetry Collector.
**Describe the bug**
We have observed that when the latency of pushing logs from Promtail to Loki is high and Promtail is hosted on Kubernetes, logs can be dropped: log files on the node appear to be skipped entirely, leading to gaps in the logs. When running an OpenTelemetry Collector on the same cluster instead, we observe that all expected logs are pushed to Loki.
**To Reproduce**
Steps to reproduce the behavior:
- `kubernetes_sd_configs`
**Expected behavior**
All logs should be pushed to Loki.
**Environment:**
**Screenshots, Promtail config, or terminal output**
Click here for Promtail config
```yaml
server:
  log_level: debug
  log_format: logfmt
  http_listen_port: 3101

clients:
  - url: http://loki-instance/loki/api/v1/push
    oauth2:
      client_id: ${OAUTH2_CLIENT_ID}
      client_secret: ${OAUTH2_CLIENT_SECRET}
      token_url: https://login.microsoftonline.com/.../oauth2/v2.0/token
      scopes:
        - "scope.default"
    external_labels:
      k8s_cluster: my-cluster
      env: prod
      provider: azure
    backoff_config:
      min_period: 1s
      max_period: 2m
      max_retries: 10

positions:
  filename: /run/promtail/positions.yaml

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - app-namespace
            - platform-monitoring
    pipeline_stages:
      - cri: {}
      - labeldrop:
          - filename
          - stream
          - level
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_pod_name
        regex: (app-namespace.+|platform-monitoring;promtail-liam-test-logs.+)
        action: keep
      - source_labels:
          - __meta_kubernetes_pod_controller_name
        regex: ([0-9a-z-.]+?)(-[0-9a-f]{8,10})?
        action: replace
        target_label: __tmp_controller_name
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
          - __meta_kubernetes_pod_label_app
          - __tmp_controller_name
          - __meta_kubernetes_namespace
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: app
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_component
          - __meta_kubernetes_pod_label_component
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: component
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: node_name
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - action: replace
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        regex: true/(.*)
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash
          - __meta_kubernetes_pod_annotation_kubernetes_io_config_hash
          - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_label_env
          - __meta_kubernetes_pod_label_environment
        regex: ^;*([^;]+)(;.*)?$
        target_label: env

tracing:
  enabled: false
```

We have tested this by deploying an application to the cluster that prints a log line and increases a counter metric every time it receives an HTTP request. We would therefore expect the count of log lines and the increase in the counter metric to be roughly equal. When we run a load test, however, the count of logs via Promtail is far lower. For comparison we have also deployed OpenTelemetry Collectors to the cluster; below shows, during a load test, the count of HTTP requests as measured by the counter increase, the Promtail log count, and the OpenTelemetry Collector log count.
As you can see, the count for Promtail is far lower, and at points it flatlines. I believe this is when log files on the Kubernetes node are being missed: push latency is high and Promtail is still trying to process previous log files. The OpenTelemetry Collector does not seem to have this problem; I would expect Promtail to also buffer the newer log files while it struggles to push the older ones.
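For anyone who wants to reproduce the comparison, the two sides of the chart boil down to queries of roughly the shape below. The label and metric names (`load-test-app`, `http_requests_total`) are illustrative stand-ins for our test setup; the first rule is for the Loki ruler (LogQL), the second for Prometheus (PromQL):

```yaml
# Loki ruler recording rule: log lines ingested per minute for the test app.
groups:
  - name: log-loss-comparison-loki
    rules:
      - record: app:log_lines:count1m
        expr: sum(count_over_time({app="load-test-app"}[1m]))
---
# Prometheus recording rule: HTTP requests handled per minute by the same app.
groups:
  - name: log-loss-comparison-prometheus
    rules:
      - record: app:http_requests:count1m
        expr: sum(increase(http_requests_total{app="load-test-app"}[1m]))
```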
For further evidence that high latency is the issue: we have another cluster where latency is generally low, and we do not observe log loss there. Recently, however, we had some instability with our Loki ingestion and saw push latency increase; exactly when latency spikes, you can see a gap in the application's logs.
In order to identify which log lines were missing after a load test, I gained access to the Kubernetes node to look at the log files directly. There I could see a pattern in the rotated files that looked like the following:
None of the log lines in `0.log.20241011-075031.gz` seemed to be available in Loki, while I could see log lines from the other files. This further indicates that entire log files are being missed during these high-latency periods.