grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

agent does not use correct log level when writing to journal #429

Open jan-kaufmann opened 1 year ago

jan-kaufmann commented 1 year ago

What's wrong?

I noticed that most logs of my grafana-agent instance appear in the journal with level=info, but quite a few warnings and errors are misclassified as info. (The attached screenshot only shows errors, but the same happens for warnings.)

I see the same results when I filter the local journal for log lines with priority info versus warning/error, so this has nothing to do with the scraping or the pipeline. Filtering for priority 6 (info) still shows a few warnings:

sudo journalctl -u grafana-agent.service -r -p6


But once I filter for warnings and errors (priority 3..4), I only get messages emitted by the systemd service itself, not from the grafana-agent application running inside it.

sudo journalctl -u grafana-agent.service -r -p3..4


Steps to reproduce

Generate an error message by modifying the config file (e.g. try to set a field that does not exist because of a typo) and restart the grafana-agent service.

Filter the journal for priority 6 (info) and search for lines containing level=warn:

sudo journalctl -u grafana-agent.service -r -p6 | grep level=warn
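To double-check that the mismatch happens on the journald side, you can dump the raw entries and compare the stored PRIORITY field against the level= token in the message. A minimal sketch, assuming jq is installed (journalctl -o json emits one JSON object per entry, with MESSAGE and PRIORITY fields):

sudo journalctl -u grafana-agent.service -o json | jq -r 'select((.MESSAGE // "") | test("level=warn")) | "\(.PRIORITY) \(.MESSAGE)"'

Every line this prints starts with 6 (info), even though the message itself says level=warn.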

System information

Fedora Linux 35 x86_64

Software version

grafana-agent-0.33.1-1.src.rpm

Configuration

# Configures the server of the Agent used to enable self-scraping of its own metrics.
#server:
#  http_listen_port: 12345

# Configures integrations for the Agent.
#
integrations:
  node_exporter:
    enabled: true
    disable_collectors:
      - ipvs #high cardinality on kubelet
      - btrfs
      - infiniband
      - xfs
      - zfs
    # exclude dynamic interfaces
    netclass_ignored_devices: "^(veth.*|cali.*|[a-f0-9]{15})$"
    netdev_device_exclude: "^(veth.*|cali.*|[a-f0-9]{15})$"
    # disable tmpfs
    filesystem_fs_types_exclude: "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
    # drop extensive scrape statistics
    metric_relabel_configs:
    - action: drop
      regex: node_scrape_collector_.+
      source_labels: [__name__]
    relabel_configs:
      - source_labels: [agent_hostname]
        target_label: instance
  agent:
    enabled: true
    relabel_configs:
      - source_labels: [agent_hostname]
        target_label: instance
    metric_relabel_configs:
    - action: keep
      regex: (prometheus_target_.*|prometheus_sd_discovered_targets|agent_wal_samples_appended_total|process_start_time_seconds)
      source_labels:
      - __name__
  prometheus_remote_write:
  - basic_auth:
      username: zzz
      password: zzz
    url: https://prometheus-prod-24-prod-eu-west-2.grafana.net/api/prom/push

# Configures metric collection.
metrics:
  wal_directory: /tmp/grafana-agent-wal
  configs:
  - name: integrations
    remote_write:
    - basic_auth:
        username: zzz
        password: zzz
      url: https://prometheus-prod-24-prod-eu-west-2.grafana.net/api/prom/push
    scrape_configs:

# Configures log collection.
logs:
  positions_directory: /tmp/grafana-agent-pos
  configs:
  - name: integrations
    clients:
    - basic_auth:
        password: zzz
        username: zzz
      url: https://logs-prod-012.grafana.net/loki/api/v1/push

    target_config:
      sync_period: 10s

    scrape_configs:
    - job_name: integrations/node_exporter_journal_scrape
      journal:
        labels:
          instance: logcollector-app-01
          job: integrations/node_exporter
        max_age: 24h
      relabel_configs:
      - source_labels:
        - __journal__systemd_unit
        target_label: systemd_unit
      - source_labels:
        - __journal__boot_id
        target_label: boot_id
      - source_labels:
        - __journal__transport
        target_label: transport
      #- source_labels:
      #  - __journal_priority_keyword
      #  target_label: level

      pipeline_stages:
      - timestamp:
          # Name from extracted data to use for the timestamp.
          source: Time

          # Determines how to parse the time string. Can use
          # pre-defined formats by name: [ANSIC UnixDate RubyDate RFC822
          # RFC822Z RFC850 RFC1123 RFC1123Z RFC3339 RFC3339Nano Unix
          # UnixMs UnixUs UnixNs].
          format: UnixMs

          # Fallback formats to try if the format fails to parse the value
          fallback_formats: [RFC3339, RFC3339Nano, Unix, UnixUs, UnixNs, ANSIC, UnixDate, RubyDate, RFC822, RFC822Z, RFC850, RFC1123, RFC1123Z]

          # IANA Timezone Database string.
          location:

    - job_name: syslog-relay
      syslog:
        listen_address: 0.0.0.0:1515
        labels:
          collector: logcollector-app-01
          job: remote-syslog
        use_incoming_timestamp: true
        label_structured_data: true
      relabel_configs:
        - source_labels: ['__syslog_message_sd_1pw_mc_country']
          target_label: 'country'
        - source_labels: ['__syslog_message_hostname']
          target_label: 'instance'
        - source_labels: ['__syslog_message_app_name']
          target_label: 'application'
        - source_labels: ['__syslog_message_severity']
          target_label: 'level'

Logs

Jun 12 16:00:26 logcollector-audit-01 grafana-agent[369718]: 2023/06/12 16:00:26 error loading config file /etc/grafana-agent.yaml: yaml: unmarshal errors:
Jun 12 16:00:26 logcollector-audit-01 grafana-agent[369706]:   line 71: field positios_directory not found in type logs.config

May 26 16:41:44 logcollector-audit-01 grafana-agent[230568]: ts=2023-05-26T14:41:44.278181006Z caller=main.go:71 level=error msg="error creating the agent server entrypoint" err="unable to apply config for integrations: unable to create logs instance: failed to make syslog target manager: invalid match stage config: invalid selector syntax for match stage: parse error at line 1, col 5: syntax error: unexpected IDENTIFIER, expecting = or != or =~ or !~"
May 26 16:41:43 logcollector-audit-01 grafana-agent[230556]: ts=2023-05-26T14:41:43.513177839Z caller=main.go:71 level=error msg="error creating the agent server entrypoint" err="unable to apply config for integrations: unable to create logs instance: failed to make syslog target manager: invalid match stage config: invalid selector syntax for match stage: parse error at line 1, col 5: syntax error: unexpected IDENTIFIER, expecting = or != or =~ or !~"
tpaschalis commented 1 year ago

I'm removing the bug label for now in favor of enhancement. The Agent logs everything to stderr, which is then picked up by systemd at the same priority.

It would take a lot of effort for the Agent to log directly to journald with the correct level; on the other hand, you can use pipeline_stages to parse the Agent's log lines and set the level label from them, instead of depending on the __journal_priority_keyword key.
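For reference, a minimal sketch of that workaround for the journal scrape job above. The match selector keys off the systemd_unit label created by the job's relabel_configs, and the regex assumes the Agent's logfmt output shown in the logs; adjust both as needed:

      pipeline_stages:
      - match:
          # Only rewrite entries from the grafana-agent unit. The selector
          # must be a full matcher ({name="value"}); the parse errors in the
          # logs above are what an incomplete selector produces.
          selector: '{systemd_unit="grafana-agent.service"}'
          stages:
          - regex:
              # Pull the level=... token out of the Agent's logfmt line.
              expression: 'level=(?P<level>\w+)'
          - labels:
              # Promote the extracted value to a "level" label on the entry.
              level: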