grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

"no space left on device" errors when all devices have plenty of space and inodes #6588

Closed lozbrown closed 6 months ago

lozbrown commented 7 months ago

What's wrong?

We run Grafana Agent on an EC2 instance and monitor quite a few log files. We keep getting the following error:

level=error component=logs logs_config=default msg="error adding directory to watcher" error="no space left on device"

df -hi and df -h show that all disks have plenty of free inodes and space:

(venv) [ec2-user@REDACTED 2024-03-04_10:33:27 UTC ~]$ df -hi
Filesystem     Inodes IUsed IFree IUse% Mounted on
devtmpfs         7.8M   320  7.8M    1% /dev
tmpfs            7.8M     2  7.8M    1% /dev/shm
tmpfs            7.8M   708  7.8M    1% /run
tmpfs            7.8M    16  7.8M    1% /sys/fs/cgroup
/dev/nvme0n1p1    40M  311K   40M    1% /
127.0.0.1:/         0     0     0     - /efs
tmpfs            7.8M    10  7.8M    1% /run/user/1000
You have new mail in /var/spool/mail/ec2-user

(venv) [ec2-user@REDACTED 2024-03-04_10:38:48 UTC ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G  1.6M   32G   1% /run
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/nvme0n1p1   80G   28G   53G  35% /
127.0.0.1:/     8.0E   19T  8.0E   1% /efs
tmpfs           6.3G     0  6.3G   0% /run/user/1000
You have new mail in /var/spool/mail/ec2-user

Steps to reproduce

this happens frequently whe

System information

Amazon Linux 2

Software version

agent, version v0.36.0

Configuration

integrations:
  agent:
    enabled: true
  node_exporter:
    disable_collectors:
    - arp
    - bonding
    - fibrechannel
    - hwmon
    - infiniband
    - ipvs
    - mdadm
    - nfs
    - nfsd
    - powersupplyclass
    enabled: true
    include_exporter_metrics: true
    textfile_directory: /var/lib/grafana-agent/textfile
  statsd_exporter:
    enabled: true
    listen_tcp: :8126
    listen_udp: :8126
    scrape_integration: true
logs:
  configs:
  - clients:
    - backoff_config:
        max_period: 10m
        max_retries: 200
        min_period: 1m
      external_labels:
        applicationName: someapp
        applicationService: myapp
        availabilityZone: eu-west-2a
        componentName: master1
        componentType: auto-heal
        environmentAccount: Non
        environmentRegion: lhr
        instanceType: r5a.2xlarge
        instanceid: i-05c470996b0d1cf84
        localIPv4: REDACTED
        pipelineBranch: dev2
        pipelineBuildNumber: '755'
        pipelineDeployAWSRegion: eu-west-2
        user_label_count: 0
      tenant_id: myapp
      url: https://infraos-logs-unix-lhr.myemployer/loki/api/v1/push
    name: default
    positions:
      filename: /var/lib/grafana-agent/positions.yaml
    scrape_configs:
    - job_name: system
      static_configs:
      - labels:
          __path__: /var/log/messages
          job: varlogmessages
        targets:
        - localhost
    - job_name: somestring_daglogs
      static_configs:
      - labels:
          __path__: /myapp/airflow-logs/dag_id*/**/*.log
          job: somestring_daglogs
        targets:
        - localhost
    - job_name: somestring_airflowlogtop
      static_configs:
      - labels:
          __path__: /myapp/airflow-logs/*.log
          job: somestring_airflowlogtop
        targets:
        - localhost
    - job_name: somestring_airflowlogschd
      static_configs:
      - labels:
          __path__: /myapp/airflow-logs/scheduler/latest/*.log
          job: somestring_airflowlogschd
        targets:
        - localhost
    - job_name: somestring_airflowlogaudit
      static_configs:
      - labels:
          __path__: /home/ec2-user/splunklogs/airflowlogaudit/*.log
          job: somestring_airflowlogaudit
        targets:
        - localhost
    - job_name: somestring_cronlogs
      static_configs:
      - labels:
          __path__: /home/ec2-user/applications/cron_logs/*.log
          job: somestring_cronlogs
        targets:
        - localhost
    - job_name: somestring_trth
      static_configs:
      - labels:
          __path__: /home/ec2-user/TRTH/log/*.log
          job: somestring_trth
        targets:
        - localhost
metrics:
  configs:
  - max_wal_time: 24h
    name: default
    remote_write:
    - headers:
        X-Scope-OrgID: myapp
      queue_config:
        max_backoff: 30s
        min_backoff: 5s
      url: https://infraos-metrics-unix-lhr.myemployer/api/v1/push
    scrape_configs:
    - job_name: grafana-agent
      relabel_configs:
      - replacement: myapp-someapp-dev2-755
        target_label: instance
      static_configs:
      - targets:
        - localhost:9100
    - job_name: node_exporter
      metrics_path: /integrations/node_exporter/metrics
      relabel_configs:
      - replacement: myapp-someapp-dev2-755
        target_label: instance
      static_configs:
      - targets:
        - localhost:9100
    - job_name: statsd_exporter
      metrics_path: /integrations/statsd_exporter/metrics
      static_configs:
      - targets:
        - localhost:9100
  global:
    external_labels:
      applicationName: someapp
      applicationService: myapp
      availabilityZone: eu-west-2a
      componentName: master1
      componentType: auto-heal
      environmentAccount: Non
      environmentRegion: lhr
      instanceType: r5a.2xlarge
      instanceid: i-05c470996b0d1cf84
      localIPv4: REDACTED
      pipelineBranch: dev2
      pipelineBuildNumber: '755'
      pipelineDeployAWSRegion: eu-west-2
      user_label_count: 0
    scrape_interval: 30s
  wal_cleanup_age: 96h
  wal_directory: /var/lib/grafana-agent/wal
server:
  log_level: info

Logs

Mar 04 10:20:24 ip-10-167-255-238.lhr.non.c1.somecompany.com grafana-agent[24022]: ts=2024-03-04T10:20:24.372699657Z caller=filetarget.go:313 level=info component=logs logs_config=default msg="watching new directory" directory="/myapp/airflow-logs/dag_id=mydag_s3poll_dynamic_keplerreport/run_id=scheduled__2024-03-04T10:15:00+00:00/task_id=POLL_AND_PREPROCESS_keplt"
Mar 04 10:20:24 ip-10-167-255-238.lhr.non.c1.somecompany.com grafana-agent[24022]: ts=2024-03-04T10:20:24.373553213Z caller=filetargetmanager.go:158 level=error component=logs logs_config=default msg="error adding directory to watcher" error="no space left on device"

mattdurham commented 7 months ago

Hitting the maximum number of inotify watches can trigger that error. Can you check the fsnotify docs here https://github.com/fsnotify/fsnotify/blob/c94b93b0602779989a9af8c023505e99055c8fe5/README.md?plain=1#L153 and adjust the limits to see if the problem goes away?
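
For context on why the message talks about disk space: on Linux, fsnotify is built on inotify, and inotify_add_watch(2) reports a reached watch limit with the errno ENOSPC, the same code a full filesystem produces. A minimal Python sketch (the `describe` helper is hypothetical and only for illustration) suggests that the text comes from the errno itself and not from any disk check:

```python
import errno
import os


def describe(err: int) -> str:
    """Return the symbolic name and the OS message for an errno value."""
    return f"{errno.errorcode[err]}: {os.strerror(err)}"


# ENOSPC is what inotify_add_watch(2) returns when the per-user watch limit
# is reached; the OS message for it is the generic "no space" text.
print(describe(errno.ENOSPC))
```

So the error can appear even when df reports plenty of free space and inodes, because the exhausted resource is the watch table and not a filesystem.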

lozbrown commented 6 months ago

This worked and the issue went away.
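
For anyone hitting the same error, a rough way to see how close a host is to the limit is to compare the inotify sysctl values with the number of directories the agent may need to watch. The sketch below is not part of the agent; `read_inotify_limit` and `count_directories` are hypothetical helper names, and it assumes roughly one watch per directory (the log above shows the agent adding watches per directory) and uses /myapp/airflow-logs from the configuration as the root. Other processes running as the same user also consume watches, so treat the result as a lower bound.

```python
import os
from pathlib import Path


def read_inotify_limit(name: str, proc_dir: str = "/proc/sys/fs/inotify") -> int:
    """Read an inotify sysctl such as max_user_watches or max_user_instances."""
    return int(Path(proc_dir, name).read_text().strip())


def count_directories(root: str) -> int:
    """Count root and every directory below it (os.walk yields one entry per directory)."""
    return sum(1 for _ in os.walk(root))


if __name__ == "__main__":
    try:
        limit = read_inotify_limit("max_user_watches")
    except FileNotFoundError:
        limit = None  # not Linux, or /proc is not mounted
    # Assumption: one inotify watch per directory under the log root.
    needed = count_directories("/myapp/airflow-logs")
    print(f"directories under root: {needed}, max_user_watches: {limit}")
```

If the directory count approaches or exceeds the limit, raising fs.inotify.max_user_watches with sysctl is the kind of adjustment the linked fsnotify README describes. A layout that creates a new directory per run and per task, as in the log line above, keeps growing until old directories are cleaned up.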