grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Promtail missing logs when push latency is high on kubernetes #14462

Open liam-howe-maersk opened 1 month ago

liam-howe-maersk commented 1 month ago

Describe the bug
We have observed that when the latency of pushing logs from Promtail to Loki is high and Promtail is hosted on Kubernetes, logs can be dropped; it appears that entire log files on the Kubernetes node are skipped, leading to gaps in the logs. When we run an OpenTelemetry Collector on the same cluster instead, we observe that all expected logs are pushed to Loki.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (3.1.1)
  2. Started Promtail (3.1.1) DaemonSet on a k8s cluster to tail k8s pods using kubernetes_sd_configs
  3. Push latency must be high; in our case it averages about 1 second per push because pushes cross regions, calculated using the following metrics (see also the per-pod sketch after this list)
    sum(rate(promtail_request_duration_seconds_sum[$__rate_interval]))
    / 
    sum(rate(promtail_request_duration_seconds_count[$__rate_interval]))
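
For completeness, a hedged sketch of a per-pod view over the same histogram, to check whether some Promtail pods are hit harder than the average suggests; the `pod` label is an assumption about how the DaemonSet is scraped:

```
# p99 push latency per Promtail pod (sketch; uses the same
# promtail_request_duration_seconds histogram as the average above,
# the "pod" label is assumed to come from the scrape config)
histogram_quantile(0.99,
  sum by (pod, le) (rate(promtail_request_duration_seconds_bucket[$__rate_interval]))
)
```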

Expected behavior
All logs should be pushed to Loki.

Environment:

Screenshots, Promtail config, or terminal output

Promtail config:

```yaml
server:
  log_level: debug
  log_format: logfmt
  http_listen_port: 3101
clients:
  - url: http://loki-instance/loki/api/v1/push
    oauth2:
      client_id: ${OAUTH2_CLIENT_ID}
      client_secret: ${OAUTH2_CLIENT_SECRET}
      token_url: https://login.microsoftonline.com/.../oauth2/v2.0/token
      scopes:
        - "scope.default"
    external_labels:
      k8s_cluster: my-cluster
      env: prod
      provider: azure
    backoff_config:
      min_period: 1s
      max_period: 2m
      max_retries: 10
positions:
  filename: /run/promtail/positions.yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - app-namespace
            - platform-monitoring
    pipeline_stages:
      - cri: {}
      - labeldrop:
          - filename
          - stream
          - level
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_pod_name
        regex: (app-namespace.+|platform-monitoring;promtail-liam-test-logs.+)
        action: keep
      - source_labels:
          - __meta_kubernetes_pod_controller_name
        regex: ([0-9a-z-.]+?)(-[0-9a-f]{8,10})?
        action: replace
        target_label: __tmp_controller_name
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
          - __meta_kubernetes_pod_label_app
          - __tmp_controller_name
          - __meta_kubernetes_namespace
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: app
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_component
          - __meta_kubernetes_pod_label_component
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: component
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_node_name
        target_label: node_name
      - action: replace
        source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      - action: replace
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        regex: true/(.*)
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
          - __meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash
          - __meta_kubernetes_pod_annotation_kubernetes_io_config_hash
          - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        source_labels:
          - __meta_kubernetes_pod_label_env
          - __meta_kubernetes_pod_label_environment
        regex: ^;*([^;]+)(;.*)?$
        target_label: env
tracing:
  enabled: false
```

We tested this by deploying an application to the cluster that prints a log line and increments a counter metric every time it receives an HTTP request. We would therefore expect the count of log lines and the increase of the counter metric to be roughly equal. However, when we run a load test, the count of logs ingested via Promtail is far lower. For comparison we also deployed OpenTelemetry Collectors to the cluster. The chart below shows, during a load test, the count of HTTP requests as measured by the counter increase, the Promtail log count, and the OpenTelemetry Collector log count.

Image
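
For reference, a minimal sketch of the comparison queries behind the chart above, assuming the test application exposes a counter named http_requests_total and its log stream carries an app="load-test-app" label (both names are placeholders, not the actual ones used):

```
# Expected: requests served during the window (PromQL; counter name assumed)
sum(increase(http_requests_total{app="load-test-app"}[5m]))

# Observed: log lines present in Loki (LogQL; stream selector assumed)
sum(count_over_time({app="load-test-app"}[5m]))
```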

As you can see, the count for Promtail is far lower. At points it flatlines, which is when I believe log files on the k8s node are being missed: push latency is high and Promtail is still trying to process previous log files. The OpenTelemetry Collector does not seem to have this problem; I would expect Promtail to also buffer the newer log files while it struggles to push the older ones.

For further evidence that high latency is the issue: we have another cluster where latency is generally low, and we do not observe log loss there. However, we recently had some instability in our Loki ingestion and push latency increased; below you can see that exactly when latency spikes there is a gap in the application's logs.

Image Image

In order to identify which log lines were missing after a load test, I gained access to the Kubernetes node to look at the log files directly. There I could see a pattern that looked like this:

-rw-r----- 1 root root 15314989 Oct 11 09:20 0.log
-rw-r--r-- 1 root root  3388094 Oct 11 07:50 0.log.20241011-075001.gz
-rw-r--r-- 1 root root  3612170 Oct 11 07:51 0.log.20241011-075031.gz
-rw-r--r-- 1 root root  3342465 Oct 11 07:51 0.log.20241011-075102.gz
-rw-r----- 1 root root 62686779 Oct 11 07:51 0.log.20241011-075132

None of the log lines in 0.log.20241011-075031.gz were available in Loki, while I could see log lines from the other files. This further indicates that entire log files are getting missed during these high-latency periods.

TheRealNoob commented 1 month ago

Can you share your Loki config file please? Also, how many pods are running on your cluster?

liam-howe-maersk commented 1 month ago

> Can you share your Loki config file please? Also, how many pods are running on your cluster?

@TheRealNoob Sure,

Loki config:

```yaml
analytics:
  reporting_enabled: false
auth_enabled: true
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 1000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    memcached:
      batch_size: 30
      parallelism: 2
    memcached_client:
      consistent_hash: true
      host: memcached.loki.svc.cluster.local
      service: memcached-client
      timeout: 60s
common:
  compactor_address: http://compactor.loki.svc.cluster.local.:3100
  ring:
    kvstore:
      store: memberlist
compactor:
  compactor_ring:
    kvstore:
      store: memberlist
  delete_request_store: azure
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
  working_directory: /data/compactor
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  max_outstanding_per_tenant: 10000
  scheduler_address: query-scheduler-discovery.loki.svc.cluster.local.:9095
  tail_proxy_url: http://querier:3100
frontend_worker:
  grpc_client_config:
    max_send_msg_size: 1.048576e+08
  scheduler_address: query-scheduler-discovery.loki.svc.cluster.local.:9095
ingester:
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 1h
  lifecycler:
    heartbeat_period: 5s
    interface_names:
      - eth0
    join_after: 30s
    num_tokens: 512
    ring:
      heartbeat_timeout: 1m
      kvstore:
        store: memberlist
      replication_factor: 3
  max_chunk_age: 2h
  wal:
    dir: /loki/wal
    enabled: true
    replay_memory_ceiling: 5GB
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 6.7108864e+07
  remote_timeout: 1s
limits_config:
  allow_structured_metadata: false
  deletion_mode: filter-and-delete
  discover_service_name: []
  ingestion_burst_size_mb: 900
  ingestion_rate_mb: 650
  ingestion_rate_strategy: global
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 180000
  max_concurrent_tail_requests: 300
  max_global_streams_per_user: 0
  max_line_size: 1MB
  max_query_bytes_read: 200GB
  max_query_length: 745h
  max_query_parallelism: 256
  max_query_series: 500
  max_streams_per_user: 0
  per_stream_rate_limit: 50MB
  per_stream_rate_limit_burst: 200MB
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 8w
  split_queries_by_interval: 30m
memberlist:
  abort_if_cluster_join_fails: false
  bind_addr:
    - $${POD_IP}
  bind_port: 7946
  cluster_label: loki
  cluster_label_verification_disabled: false
  join_members:
    - dns+gossip-ring.loki.svc.cluster.local:7946
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  randomize_node_name: false
pattern_ingester:
  client_config:
    grpc_client_config:
      max_recv_msg_size: 6.7108864e+07
    remote_timeout: 1s
  enabled: false
  lifecycler:
    heartbeat_period: 5s
    interface_names:
      - eth0
    join_after: 30s
    num_tokens: 512
    ring:
      heartbeat_timeout: 1m
      kvstore: {}
      replication_factor: 1
querier:
  engine:
    max_look_back_period: 30s
  max_concurrent: 16
  multi_tenant_queries_enabled: true
  query_ingesters_within: 2h
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  parallelise_shardable_queries: true
  results_cache:
    cache:
      memcached_client:
        consistent_hash: true
        host: memcached-frontend.loki.svc.cluster.local
        max_idle_conns: 16
        service: memcached-client
        timeout: 500ms
        update_interval: 1m
query_scheduler:
  max_outstanding_requests_per_tenant: 65536
ruler:
  alertmanager_client:
    basic_auth_password: $${ALERTMANAGER_API_PASSWORD}
    basic_auth_username: $${ALERTMANAGER_API_USERNAME}
  alertmanager_url: https://alertmanager-url
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  external_labels:
    pensieve_alert_datasource: loki
  remote_write:
    client:
      url: http://mimir-url
    enabled: true
  ring:
    kvstore:
      store: memberlist
  rule_path: /data/rules
  storage:
    azure:
      account_key: $${ACCOUNT_KEY}
      account_name: accountname
      container_name: loki-ruler
    local:
      directory: /tmp/rules
    type: azure
  wal:
    dir: /data/loki/wal
schema_config:
  configs:
    - from: "2023-08-03"
      index:
        period: 24h
        prefix: loki_index_
      object_store: azure
      schema: v12
      store: tsdb
    - from: "2024-05-15"
      index:
        period: 24h
        prefix: loki_index_
      object_store: azure
      schema: v13
      store: tsdb
server:
  graceful_shutdown_timeout: 5s
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_recv_msg_size: 1.048576e+08
  grpc_server_max_send_msg_size: 1.048576e+08
  grpc_server_min_time_between_pings: 10s
  grpc_server_ping_without_stream_allowed: true
  http_listen_port: 3100
  http_server_idle_timeout: 120s
  http_server_read_timeout: 60s
  http_server_write_timeout: 1m
storage_config:
  azure:
    account_key: $${ACCOUNT_KEY}
    account_name: accountname
    container_name: loki
    max_retry_delay: 1500ms
    min_retry_delay: 30ms
    request_timeout: 30s
  boltdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/boltdb-cache
    index_gateway_client:
      server_address: dns+index-gateway-headless.loki.svc.cluster.local:9095
  hedging:
    at: 400ms
    up_to: 2
  index_queries_cache_config:
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: memcached-index-queries.loki.svc.cluster.local
      service: memcached-client
  tsdb_shipper:
    active_index_directory: /data/tsdb-index
    cache_location: /data/tsdb-cache
    index_gateway_client:
      server_address: dns+index-gateway-headless.loki.svc.cluster.local:9095
table_manager: {}
```

➜  ~ kubectl get deployments
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
distributor        12/12   12           12          354d
envoy              5/5     5            5           354d
querier            28/28   28           28          354d
query-frontend     2/2     2            2           354d
query-scheduler    2/2     2            2           354d
rollout-operator   1/1     1            1           354d
➜  ~ kubectl get statefulsets
NAME                      READY   AGE
compactor                 3/3     353d
index-gateway             3/3     352d
ingester-zone-a           15/15   213d
ingester-zone-b           15/15   354d
ingester-zone-c           15/15   354d
memcached                 15/15   354d
memcached-frontend        2/2     354d
memcached-index-queries   4/4     354d
ruler                     10/10   346d

liam-howe-maersk commented 1 month ago

We found this document about log rotation: https://grafana.com/docs/loki/latest/send-data/promtail/logrotation/#configure-promtail. Following the advice at the bottom, we set the batch size in Promtail to 8M (see the config sketch below), and during our load tests it seems we now get all logs. Looking at the push latency after increasing the batch size, it has roughly halved, which I guess explains why we no longer see logs dropped. This is good, but as the docs say it is only a short-term workaround; the document also mentions

for a long-term solution, we strongly recommend changing the log rotation strategy to rename and create

However, we are hosted on AKS using the containerd CRI, which according to that same document means we are already using the rename-and-create strategy, so this does not seem to be a permanent solution.
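
For reference, a minimal sketch of the change we applied to the clients section of the Promtail config above; "8M" is assumed to mean 8 MiB expressed in bytes, and batchwait is shown at its default purely for context:

```yaml
clients:
  - url: http://loki-instance/loki/api/v1/push
    # Larger batches mean fewer round trips, so a slow push blocks tailing for less time.
    # 8388608 bytes = 8 MiB (Promtail's default batchsize is 1048576, i.e. 1 MiB).
    batchsize: 8388608
    # Maximum time to wait before flushing a batch that is not yet full (default value).
    batchwait: 1s
```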

My question is why the OpenTelemetry Collector, using the filelog receiver, did not seem to have the same issue. Is it because it uses a higher batch size by default, so we simply don't see the problem there, or because it uses a different strategy for tailing logs than Promtail, one that doesn't lead to this log loss? I don't have enough insight into this at the moment, so I'd appreciate feedback if anyone else does.

I'm still concerned that batchsize is only a workaround. Are there any other suggestions for a more permanent fix that would ensure we don't lose logs when latency is high?

liam-howe-maersk commented 1 month ago

I believe I understand the issue better now. Our application is hosted on AKS, which means a log file is rotated every 50MB. Rotation renames the current log file that is being tailed and recreates it to append new logs to. If the Promtail tailing process is still working through a log file and in the meantime that file has been rotated twice, then all logs written between the first and second rotation are never picked up by Promtail and are lost. Once Promtail finishes tailing the original file, it picks up the latest log file again, so the file that was rotated out in between is never tailed.
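
For context, the 50MB rotation described above corresponds to the kubelet's container log rotation settings. A minimal sketch of the relevant KubeletConfiguration fields, with values that are assumptions about this AKS cluster rather than read from it:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# A container's 0.log is renamed (rotated) and recreated once it reaches this size.
containerLogMaxSize: 50Mi
# Number of rotated files kept per container; older rotations are compressed,
# and a tailer that falls more than one rotation behind can miss a whole file.
containerLogMaxFiles: 5
```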

I'm wondering whether the Promtail process could be changed so that if a log file is rotated while the previous file is still being tailed, that data is not missed even if it is rotated out again. Failing that, could any logs or metrics be added to Promtail that would tell us a log file has been missed entirely?

JStickler commented 1 month ago

FYI, Promtail is now considered “feature complete” and will be in maintenance mode. New users should use Grafana Alloy, Grafana Labs' distribution of the OpenTelemetry Collector.