grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Promtail received file watcher event but positions do not increase after a few hours #9258

Open nbisson opened 1 year ago

nbisson commented 1 year ago

Hello,

Describe the bug: Promtail receives file watcher events, but the positions do not increase after a few hours of running. In Grafana we are seeing periods with no logs.

Promtail logs :

ts=2023-04-24T13:36:36.775383804Z caller=log.go:168 level=info msg="Waiting for /var/log/pods/traefik_traefik-fb898f676-4gq7r_877b2ddc-a0d6-4f2d-b14b-d61bcfce3a67/traefik/0.log to appear..."
[...]
level=info ts=2023-04-24T13:36:36.783186447Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/traefik_traefik-fb898f676-4gq7r_877b2ddc-a0d6-4f2d-b14b-d61bcfce3a67/traefik/0.log op=CREATE
level=info ts=2023-04-24T13:36:47.509882124Z caller=filetargetmanager.go:175 msg="received file watcher event" name=/var/log/pods/traefik_traefik-fb898f676-4gq7r_877b2ddc-a0d6-4f2d-b14b-d61bcfce3a67/traefik/0.log op=CREATE
[...]

Promtail position file :

root@promtail-2pvkg:/# more /run/promtail/positions.yaml | grep traefik
  /var/log/pods/traefik_traefik-fb898f676-4gq7r_877b2ddc-a0d6-4f2d-b14b-d61bcfce3a67/traefik/0.log: "0"
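
One way to confirm whether Promtail is still reading the file at all, independent of the positions file, is to compare its per-file counters on the metrics endpoint (promtail_file_bytes_total and promtail_read_bytes_total are standard Promtail metrics). A rough sketch, assuming the chart's default HTTP port 3101 and the pod name above; <namespace> is a placeholder:

# Forward Promtail's HTTP port, then compare the file size with the bytes read so far.
kubectl -n <namespace> port-forward promtail-2pvkg 3101:3101 &
curl -s http://localhost:3101/metrics | grep -E 'promtail_(file|read)_bytes_total' | grep traefik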

To Reproduce

resources:
  requests:
    cpu: 100m
    memory: 128Mi

defaultVolumes:

defaultVolumeMounts:

extraVolumes:

extraVolumeMounts:

Expected behavior: Logs should be sent in real time and the position file should be updated accordingly.

Environment:

Screenshots, Promtail config, or terminal output

Docker is installed on another partition (/data/docker).

I cannot tell whether this is a problem with limits, log rotation, throughput, or something else. The stack works for a few hours after the pod is restarted, but then stops.
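
To rule out the limits/throughput hypothesis, Promtail's own logs can be checked for push failures and rate-limit responses from Loki (HTTP 429). A rough check, assuming the pod name above; <namespace> is a placeholder:

# Loki rejects rate-limited pushes with HTTP 429; Promtail logs these as failed batch sends.
kubectl -n <namespace> logs promtail-2pvkg | grep -Ei '429|rate.?limit|error sending batch'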

Thanks for your help.


schakko commented 1 year ago

We are experiencing a similar issue (the file watcher event is logged by Promtail for large log files, the pointer in positions.yaml stays at 0 for those files, and Grafana does not show anything).

Environment:

Loki's values.yaml:

loki:
  auth_enabled: false
  analytics:
    reporting_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: 'filesystem'
  server:
    log_level: warn
    grpc_server_max_concurrent_streams: 0
    grpc_server_max_recv_msg_size: 8388608
  rulerConfig:
    storage:
      type: 'local'
      local:
        directory: /rules
    rule_path: /tmp/
    alertmanager_url: http://prometheus-alertmanager.telemetry.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    enable_api: true
  structuredConfig:
    compactor: #https://grafana.com/docs/loki/latest/upgrading/#240
      retention_enabled: true
    limits_config:
      retention_period: 14d
      ingestion_rate_mb: 10
      ingestion_burst_size_mb: 25
      max_streams_per_user: 0
      max_global_streams_per_user: 0
      per_stream_rate_limit: 10MB
      per_stream_rate_limit_burst: 50MB
    chunk_store_config:
      max_look_back_period: 336h
monitoring:
  dashboards:
    enabled: false
    namespace: grafana
  rules:
    enabled: false
    alerting: false
  alerts:
    enabled: false
  serviceMonitor:
    enabled: false
  selfMonitoring:
    enabled: true
    grafanaAgent:
      installOperator: true
    lokiCanary:
      enabled: false

test:
  enabled: false

Promtail's values.yaml:

serviceMonitor:
  enabled: false
extraVolumes:
  - name: journal
    hostPath:
      path: /var/log/journal
extraVolumeMounts:
  - name: journal
    mountPath: /var/log/journal
    readOnly: true
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/etcd
    operator: Exists
    effect: NoExecute
config:
  clients:
    - url: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push
      tenant_id: 1
      external_labels:
        cluster: cl1
  snippets:
    extraScrapeConfigs: |
      - job_name: journal
        journal:
          max_age: 12h
          labels:
            job: systemd-journal
          path: /var/log/journal
        relabel_configs:
          - source_labels:
              - __journal__systemd_unit
            target_label: systemd_unit
          - source_labels:
              - __journal__hostname
            target_label: node
          - source_labels:
              - __journal_syslog_identifier
            target_label: syslog_identifier

    pipelineStages:
      - docker: { }
      # ...

Additional clues:

  1. Querying Promtail's Prometheus API endpoint returns the large log file.
  2. Querying Loki's /loki/api/v1/series endpoint no longer shows the series belonging to the large log file (checked roughly as sketched below).
  3. Smaller logs from other containers in the same pod are still scraped by Promtail.
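
For reference, clue 2 can be reproduced with a plain series query against Loki; a sketch, assuming the in-cluster service URL from the Promtail values above and an illustrative label selector (the actual labels depend on the scrape config):

# Ask Loki which series it currently knows for the affected container.
curl -G -s http://loki.loki.svc.cluster.local:3100/loki/api/v1/series \
  --data-urlencode 'match[]={container="tomcat"}'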

Suspects:

  1. We are collecting Java/Tomcat logs with exceptions, so we may be affected by #2281 and that could be related to this issue.
  2. There might be high load in the cluster, so Promtail cannot ship the files due to Loki's default policy of denying out-of-order writes (see the sketch below).
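
Regarding suspect 2: whether Loki rejects out-of-order writes is controlled by unordered_writes in limits_config (it defaults to true since Loki 2.4, i.e. out-of-order writes are accepted). A sketch of where the flag would go in the structuredConfig above, in case an older default is in effect:

loki:
  structuredConfig:
    limits_config:
      # Accept out-of-order writes per stream; default is true since Loki 2.4.
      unordered_writes: true

If Loki is dropping samples for this or any other reason, the loki_discarded_samples_total metric (labelled by reason) on Loki's /metrics endpoint should show it.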

/cc @thomashoell

liangjunmo commented 1 year ago

I also encountered the same issue.

loki and promtail values.yaml:

loki:
  config:
    auth_enabled: false
    compactor:
      shared_store: filesystem
      working_directory: /data/loki/boltdb-shipper-compactor
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      wal:
        dir: /data/loki/wal
    limits_config:
      enforce_metric_name: false
      ingestion_burst_size_mb: 100
      ingestion_rate_mb: 50
      per_stream_rate_limit: 5MB
      per_stream_rate_limit_burst: 20MB
      reject_old_samples: true
      reject_old_samples_max_age: 168h
    schema_config:
      configs:
      - from: "2020-10-24"
        index:
          period: 24h
          prefix: index_
        object_store: filesystem
        schema: v11
        store: boltdb-shipper
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
    storage_config:
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /data/loki/chunks
promtail:
  config:
    file: |
      server:
        http_listen_port: 3101
        log_level: info
      clients:
        - url: http://loki:3100/loki/api/v1/push
      positions:
        filename: /run/promtail/positions.yaml
      scrape_configs:
      - job_name: kubernetes-pod-label-app
        pipeline_stages:
          - docker: {}
        kubernetes_sd_configs:
        - role: pod
        relabel_configs:
        - source_labels:
          - __meta_kubernetes_pod_label_app
          regex: ''
          action: drop
        - source_labels:
          - __meta_kubernetes_pod_node_name
          target_label: __host__
        - source_labels:
          - __meta_kubernetes_namespace
          target_label: namespace
        - source_labels:
          - __meta_kubernetes_pod_name
          target_label: pod
        - source_labels:
          - __meta_kubernetes_pod_container_name
          target_label: container
        - source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_pod_label_app
          separator: /
          target_label: job
        - regex: __meta_kubernetes_pod_label_(.+)
          action: labelmap
        - source_labels:
          - __meta_kubernetes_pod_uid
          - __meta_kubernetes_pod_container_name
          separator: /
          replacement: /var/log/pods/*$1/*.log
          target_label: __path__

docker daemon.json:

{
    "registry-mirrors": ["https://docker.mirrors.ustc.edu.cn"]
}

positions.yaml:

> cat /run/promtail/positions.yaml | grep renderelf | grep nginx

/var/log/pods/renderelf_renderelf-nginx-65cc66c854-dz7xn_627486a5-f548-4349-8f3b-887e3bb6d1b2/renderelf-nginx/0.log: "0"

promtail pod logs:

kubectl -n monitoring logs -f loki-promtail-689xv | grep -B 20 -A 20 "Re-open"

level=info ts=2023-07-03T06:15:41.007774517Z caller=filetargetmanager.go:179 msg="received file watcher event" name=/var/log/pods/renderelf_renderelf-frontend-bf89b465c-qwg79_6379d476-a439-499f-99d8-4a1c442349e9/renderelf-frontend/0.log op=CREATE
level=info ts=2023-07-03T06:15:41.028558282Z caller=filetargetmanager.go:179 msg="received file watcher event" name=/var/log/pods/renderelf_renderelf-nginx-65cc66c854-dz7xn_627486a5-f548-4349-8f3b-887e3bb6d1b2/renderelf-nginx/0.log.20230703-141541 op=CREATE
ts=2023-07-03T06:15:41.028686051Z caller=log.go:168 level=info msg="Re-opening moved/deleted file /var/log/pods/renderelf_renderelf-nginx-65cc66c854-dz7xn_627486a5-f548-4349-8f3b-887e3bb6d1b2/renderelf-nginx/0.log ..."
level=info ts=2023-07-03T06:15:41.028695308Z caller=filetargetmanager.go:179 msg="received file watcher event" name=/var/log/pods/renderelf_renderelf-nginx-65cc66c854-dz7xn_627486a5-f548-4349-8f3b-887e3bb6d1b2/renderelf-nginx/0.log op=CREATE
level=info ts=2023-07-03T06:15:51.038185803Z caller=filetargetmanager.go:179 msg="received file watcher event" name=/var/log/pods/renderelf_renderelf-server-worker1-854bb99fdd-x9g9w_d4b461ca-cf6a-4bfa-b442-697a2ada2db9/renderelf-server-worker1/0.log.20230703-141551 op=CREATE
level=info ts=2023-07-03T06:15:51.038362113Z caller=filetargetmanager.go:179 msg="received file watcher event" name=/var/log/pods/renderelf_renderelf-server-worker1-854bb99fdd-x9g9w_d4b461ca-cf6a-4bfa-b442-697a2ada2db9/renderelf-server-worker1/0.log op=CREATE
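
The rotated file name (0.log.20230703-141541) shows the original path was moved, which matches the "Re-opening moved/deleted file" message above. A quick way to see which file positions.yaml still points at after a rotation, run from inside the Promtail pod (paths as in this setup; just a sketch):

# List the current and rotated log files for the container, then compare with the
# entry Promtail keeps in its positions file.
ls -l /var/log/pods/renderelf_renderelf-nginx-65cc66c854-dz7xn_627486a5-f548-4349-8f3b-887e3bb6d1b2/renderelf-nginx/
grep renderelf-nginx /run/promtail/positions.yaml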

liangjunmo commented 1 year ago

I solved my problem; see https://github.com/grafana/loki/issues/9864#issuecomment-1631894467.