grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Exceptionally large log lines on Kubernetes lead to log corruption in neighbouring workloads #11634

Open verejoel opened 9 months ago

verejoel commented 9 months ago

Describe the bug Very long (~1.5 MiB) log lines produced by one container somehow end up concatenated to log lines in other containers' log streams.

To Reproduce This is not trivial to reproduce. We see this running the ArgoCD ApplicationSet Controller, which periodically produces a log line of the form

time="2024-01-10T12:54:16Z" level=info msg="applications result from the repo service" allPaths="[ ... ]"

Here the allPaths attribute is a slice of strings enumerating all the applications in the application set. In our case, this leads to a log line that is over 1.5 MB in size.

We then notice that this log line appears concatenated with logs from other containers. The log is indexed exactly as if it had been emitted by the other container, but it is prefixed by the oversized log line, with the desired log line appended at the end of the block.

Furthermore, the rogue log line appears to be chunked into multiple lines, each 16384 bytes in length. Schematically, what we get is the following:

<16384 bytes of huge log line from wrong container>\n
<16384 bytes of huge log line from wrong container>\n
<16384 bytes of huge log line from wrong container>\n
...
<16384 bytes of huge log line from wrong container>\n
<the actual log line from the desired container>
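
For context, here is a minimal sketch of how we assume that line looks on disk, if the container runtime writes containerd-style CRI log lines and splits anything longer than 16384 bytes into partial (P) chunks followed by a final (F) piece; the timestamps, stream, and content are purely illustrative:

2024-01-10T12:54:16.000000001Z stderr P <first 16384 bytes of the huge line>
2024-01-10T12:54:16.000000002Z stderr P <next 16384 bytes of the huge line>
...
2024-01-10T12:54:16.000000009Z stderr F <final piece of the huge line>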

We have attempted to use pipeline stages to drop logs from this namespace (sketched below), but the logs appear to be corrupted already by the time Promtail indexes them, which suggests a bug in the file watcher component. Only logs sharing a node with the noisy neighbour are affected.
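
For illustration, the drop attempt looked roughly like the following match stage (the selector is illustrative; the config we actually run is included below and only has the cri stage):

pipeline_stages:
  - cri: {}
  - match:
      selector: '{namespace="argocd"}'
      action: drop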

Expected behavior The rogue log line should not appear in logs from other containers.

Environment:

Screenshots, Promtail config, or terminal output The affected indexes are restricted to the same node, i.e. the node where the container producing the large log lines is running. This leads us to believe this is a bug in the Promtail file watcher component. [screenshot]

The log files for this container are rotated fairly rapidly by Kubernetes. The Promtail logs contain many events like the following as the logs get continuously rotated:

level=info ts=2024-01-10T14:26:29.842964692Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142559.tmp op=CREATE
level=info ts=2024-01-10T14:26:31.631376931Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142559.gz op=CREATE
level=info ts=2024-01-10T14:26:31.651464078Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142631 op=CREATE
level=info ts=2024-01-10T14:26:31.652022887Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log op=CREATE
ts=2024-01-10T14:26:31.794769849Z caller=log.go:168 level=info msg="Re-opening moved/deleted file /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log ..."
ts=2024-01-10T14:26:31.794885451Z caller=log.go:168 level=info msg="Successfully reopened /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log"
level=info ts=2024-01-10T14:26:51.696981614Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142631.tmp op=CREATE
level=info ts=2024-01-10T14:26:53.043511119Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142631.gz op=CREATE
level=info ts=2024-01-10T14:26:53.05620024Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142653 op=CREATE
level=info ts=2024-01-10T14:26:53.056847751Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log op=CREATE
ts=2024-01-10T14:26:53.10511109Z caller=log.go:168 level=info msg="Re-opening moved/deleted file /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log ..."
ts=2024-01-10T14:26:53.105210892Z caller=log.go:168 level=info msg="Successfully reopened /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log"
level=info ts=2024-01-10T14:27:23.140510858Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142653.tmp op=CREATE
level=info ts=2024-01-10T14:27:24.344375209Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142653.gz op=CREATE
level=info ts=2024-01-10T14:27:24.357194836Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142724 op=CREATE
level=info ts=2024-01-10T14:27:24.357799547Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log op=CREATE
ts=2024-01-10T14:27:24.438376976Z caller=log.go:168 level=info msg="Re-opening moved/deleted file /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log ..."
ts=2024-01-10T14:27:24.438504778Z caller=log.go:168 level=info msg="Successfully reopened /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log"
level=info ts=2024-01-10T14:27:44.404487641Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142724.tmp op=CREATE
level=info ts=2024-01-10T14:27:46.245077454Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142724.gz op=CREATE
level=info ts=2024-01-10T14:27:46.263209773Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log.20240110-142746 op=CREATE
level=info ts=2024-01-10T14:27:46.263714382Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log op=CREATE
ts=2024-01-10T14:27:46.494825252Z caller=log.go:168 level=info msg="Re-opening moved/deleted file /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log ..."
ts=2024-01-10T14:27:46.494907054Z caller=log.go:168 level=info msg="Successfully reopened /var/log/pods/argocd_argocd-applicationset-controller-6c6d5dfb96-4jrl9_73deb7ea-7a49-423e-bcce-05b31ef07f1c/applicationset-controller/0.log"

Promtail Config

server:
  log_level: info
  http_listen_port: 3101

clients:
  - backoff_config:
      max_period: 5m
      max_retries: 22
      min_period: 500ms
    url: http://otel-ingress-central-collector.otel.svc.cluster.local:3500/loki/api/v1/push

positions:
  filename: /run/promtail/positions.yaml

scrape_configs:
  # See also https://github.com/grafana/loki/blob/master/production/ksonnet/promtail/scrape_config.libsonnet for reference
  - job_name: kubernetes-pods
    pipeline_stages:
      - cri: {}
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          - __meta_kubernetes_pod_controller_name
        regex: ([0-9a-z-.]+?)(-[0-9a-f]{8,10})?
        action: replace
        target_label: __tmp_controller_name
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_name
          - __meta_kubernetes_pod_label_app
          - __tmp_controller_name
          - __meta_kubernetes_pod_name
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: app
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_instance
          - __meta_kubernetes_pod_label_release
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: instance
      - source_labels:
          - __meta_kubernetes_pod_label_app_kubernetes_io_component
          - __meta_kubernetes_pod_label_component
        regex: ^;*([^;]+)(;.*)?$
        action: replace
        target_label: component
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_node_name
        target_label: node_name
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        replacement: $1
        separator: /
        source_labels:
        - namespace
        - app
        target_label: job
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: pod
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_container_name
        target_label: container
      - action: replace
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
        - __meta_kubernetes_pod_uid
        - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        regex: true/(.*)
        replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
        - __meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash
        - __meta_kubernetes_pod_annotation_kubernetes_io_config_hash
        - __meta_kubernetes_pod_container_name
        target_label: __path__
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_ip
        target_label: pod_ip
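
As an aside, one mitigation we are considering, assuming our Promtail version supports the limits_config block with line-size limits, is capping the line size Promtail will read, e.g.:

limits_config:
  max_line_size: 256KB
  max_line_size_truncate: true

That would only truncate the oversized line, though; it would not explain why neighbouring containers' streams get corrupted.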

Loki Config

analytics:
  reporting_enabled: false
auth_enabled: true
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 1000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    embedded_cache:
      enabled: false
    memcached:
      batch_size: 6
      parallelism: 2
    memcached_client:
      addresses: dnssrv+_memcached-client._tcp.loki-memcached-chunks.loki.svc.cluster.local
      consistent_hash: true
      timeout: 60s
  max_look_back_period: 0s
  write_dedupe_cache_config:
    memcached:
      batch_size: 256
      parallelism: 10
    memcached_client:
      addresses: dnssrv+_memcached-client._tcp.loki-memcached-index-writes.loki.svc.cluster.local
      consistent_hash: true
common:
  compactor_address: ""
  compactor_grpc_address: 'loki-compactor:9095'
  path_prefix: /var/loki
  ring:
    kvstore:
      store: memberlist
compactor:
  retention_enabled: true
  shared_store: azure
distributor:
  rate_store:
    ingester_request_timeout: 1000ms
    stream_rate_update_interval: 1500ms
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  query_stats_enabled: true
  scheduler_address: loki-query-scheduler:9095
  tail_proxy_url: http://loki-querier:3100
frontend_worker:
  match_max_concurrent: true
  scheduler_address: loki-query-scheduler:9095
index_gateway:
  mode: simple
ingester:
  autoforget_unhealthy: true
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 30m
  chunk_retain_period: 0s
  chunk_target_size: 1572864
  concurrent_flushes: 16
  lifecycler:
    join_after: 10s
    observe_period: 10s
    readiness_check_ring_health: false
    ring:
      heartbeat_timeout: 10m
      kvstore:
        store: memberlist
      replication_factor: 3
    tokens_file_path: /var/loki/tokens
  max_transfer_retries: 0
  sync_min_utilization: 0.2
  sync_period: 60m
  wal:
    dir: /var/loki/wal
ingester_client:
  grpc_client_config:
    grpc_compression: gzip
    max_recv_msg_size: 104857600
limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 24
  ingestion_rate_mb: 16
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 10000
  max_query_length: 0
  query_timeout: 20s
  reject_old_samples: true
  reject_old_samples_max_age: 7d
  retention_period: 7d
  shard_streams:
    desired_rate: 1MB
    enabled: false
    logging_enabled: true
  split_queries_by_interval: 15m
  tsdb_max_bytes_per_shard: 600MB
  tsdb_max_query_parallelism: 512
memberlist:
  join_members:
  - loki-memberlist
  - ruler-memberlist
querier:
  max_concurrent: 8
  multi_tenant_queries_enabled: true
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      memcached_client:
        addresses: dnssrv+_memcached-client._tcp.loki-memcached-frontend.loki.svc.cluster.local
        consistent_hash: true
query_scheduler:
  max_outstanding_requests_per_tenant: 32768
  use_scheduler_ring: false
ruler:
  alertmanager_url: http://_tcp-web._tcp.alertmanager.monitoring.svc.cluster.local
  enable_alertmanager_discovery: true
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  evaluation:
    mode: remote
    query_frontend:
      address: dns:///loki-query-frontend-headless:9095
  external_url: http://alertmanager.monitoring.svc.cluster.local:9093
  query_stats_enabled: true
  remote_write:
    clients:
      thanos:
        url: http://thanos-gateway.thanos.svc.cluster.local:8080/api/v1/receive
    enabled: true
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  storage:
    local:
      directory: /etc/loki/rules
    type: local
  wal:
    dir: /var/loki/wal
runtime_config:
  file: /var/loki-runtime/runtime.yaml
schema_config:
  configs:
  - from: "2023-03-08"
    index:
      period: 24h
      prefix: loki_tsdb_index_
    object_store: azure
    schema: v12
    store: tsdb
server:
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_listen_port: 3100
  http_server_read_timeout: 310s
  http_server_write_timeout: 310s
storage_config:
  azure:
    account_key: ${STORAGE_ACCOUNT_KEY}
    account_name: 'loki'
    container_name: loki-default-gen1
    download_buffer_size: 268435456
    request_timeout: 2s
    upload_buffer_size: 268435456
  boltdb_shipper:
    active_index_directory: /var/loki/index
    build_per_tenant_index: true
    cache_location: /var/loki/cache
    cache_ttl: 168h
    index_gateway_client:
      log_gateway_requests: true
      server_address: dns:///loki-index-gateway:9095
    shared_store: azure
  filesystem:
    directory: /var/loki/chunks
  tsdb_shipper:
    active_index_directory: /var/loki/tsdb-index
    cache_location: /var/loki/tsdb-cache
    index_gateway_client:
      log_gateway_requests: true
      server_address: dns:///loki-index-gateway:9095
    query_ready_num_days: 7
    shared_store: azure
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
verejoel commented 9 months ago

Here is a link to a screen recording showing the effect on a neighbouring container's logs, as a screenshot doesn't really do it justice.

dannykopping commented 9 months ago

Thanks for the detailed report @verejoel. Can you please include your Promtail & Loki configs? (Remember to redact sensitive information.)

verejoel commented 9 months ago

@dannykopping done, thanks for taking a look. I can provide the exact log files if needed (although I don't think the actual content should matter in this case).