grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Official EC2 promtail tutorial doesn't scrape any EC2 instances: relabeling __host__ issue #7958

Open life5ign opened 1 year ago

life5ign commented 1 year ago

Describe the bug

problem 1: ports

The tutorial: https://grafana.com/docs/loki/latest/clients/aws/ec2/

The instructions do not say to set the port: key in ec2_sd_configs to the value of server: -> http_listen_port in promtail.yml. As a result, the default port 80 is used to scrape, and that port is rarely, if ever, the one in use. Even the tutorial's own example uses 3100 for http_listen_port, so I don't know how the authors got it to work, unless I'm missing something.
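
A minimal sketch of the port fix described above, assuming a single-file promtail.yml. The job name and region are placeholders rather than values from the tutorial, and 3100 is used only because the tutorial's example uses it for http_listen_port.

```yaml
# Sketch only: job name and region are hypothetical placeholders.
server:
  # Port that Promtail's own HTTP server listens on.
  http_listen_port: 3100

scrape_configs:
  - job_name: ec2-logs          # hypothetical name
    ec2_sd_configs:
      - region: us-east-2       # placeholder region
        # ec2_sd_configs defaults to port 80 when this key is omitted,
        # so set it explicitly to match http_listen_port above.
        port: 3100
```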

I have fixed this, but it would be nice to have it in the docs. My real issue is below:

problem 2: the tutorial drops ALL hosts

Another problem: For the step in the tutorial

Ensuring discovered targets are only for the machine Promtail currently runs on. This is achieved by adding the label __host__ using the incoming metadata __meta_ec2_private_dns_name. If it doesn’t match the current HOSTNAME environment variable, the target will be dropped.

The local env var $HOSTNAME on an EC2 instance is not the same as the metadata label __meta_ec2_private_dns_name. It is equivalent to __meta_ec2_private_ip prefixed by ip- and with dashes instead of periods; or, if you like, the private DNS name with only the hostname part and none of the additional dot-separated domain names (at least on my instances). So this item in the tutorial's relabel_configs

- source_labels: [__meta_ec2_private_dns_name]
  regex: "(.*)"
  target_label: __host__

Drops all hosts.

ubuntu@ip-172-31...snip:~/src/prometheus-grafana$ echo $HOSTNAME
ip-172-31....snip

To Reproduce

I'm running in a container. I have made sure the env var HOSTNAME is set in the running container, and I can scrape all my instances, but the relabel_configs entry that rewrites __host__ just doesn't pick out only this host. Also worth noting: HOSTNAME on my instances isn't the same as __meta_ec2_private_dns_name but is of the form ip-nnn-nnn-nnn-nnn, so I don't think the official docs work in my scenario: https://grafana.com/docs/loki/latest/clients/aws/ec2/

I've been working on this for several hours and am totally stuck. One problem is that I can't get debugging on what capture group I'm effectively grabbing, even with log.level=trace, and am using this config for promtail:

  # it's the job_name that shows up in /targets URI on the promtail web server tab
  # not the label name(s), as headings
  - job_name: ec2_syslog_varlogs
    ec2_sd_configs:
      - region: "${AWS_REGION}"
        access_key: "${AWS_KEY}"
        secret_key: "${AWS_SECRET_KEY}"
        # this needs to be set to the http_listen_port in the server: key
        # otherwise the default port 80 will be used, and nothing will be scraped
        port: 9080
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
      - source_labels: [__meta_ec2_instance_state]
        target_label: instance_state
      # make sure this scrape only occurs on the local host; drop all others
      # the first capture group needs to match the value of the target's local
      # environment variable HOSTNAME, which is not the private dns name by default
      # on my instances; this was not explained in the tutorial
      - source_labels: [__meta_ec2_private_dns_name]
        regex: '^(ip-[0-9]+-[0-9]+-[0-9]+-[0-9]+).*'
        # the first capture group
        replacement: '${1}'
        target_label: __host__
      - action: replace
        replacement: /var/log/syslog
        target_label: __path__
    # append a static label
    pipeline_stages:
      - static_labels:
          job: ec2_syslog_varlogs

Expected behavior

I want just the host that is running promtail to be scraped by ec2_sd_configs.

Environment: Docker compose stack with promtail, loki, and grafana in containers, on an ec2 instance.

Screenshots, Promtail config, or terminal output

I'm getting all my hosts: image ...snip...

life5ign commented 1 year ago

This is the official customer facing tutorial and there's no reply?

vishwa-trulioo commented 1 year ago

Is there any update on this? I'm also a paying customer. Would it be possible to provide an answer? I suppose the fix should be easy.

rgroothuijsen commented 1 year ago

I've been working on this for several hours and am totally stuck. One problem is that I can't get debugging on what capture group I'm effectively grabbing, even with log.level=trace, and am using this config for promtail:

It did appear to be almost correct. I tested your config by temporarily replacing the __host__ target with something like fake_host so that it would show up in the logs in any case, which gave me the IP part without the rest of the hostname. To get the full hostname, place the final .* within your capture group instead of after it: regex: '^(ip-[0-9]+-[0-9]+-[0-9]+-[0-9]+.*)'. Come to think of it, that would be equivalent to (.*), since you're capturing the entire label that way.
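
The debugging trick described above might look roughly like the following. This is a sketch, not a tested configuration, and fake_host is just an arbitrary label name. It relies on the fact that labels beginning with a double underscore are removed once relabeling finishes, so a temporary name without that prefix stays visible on the target.

```yaml
relabel_configs:
  - source_labels: [__meta_ec2_private_dns_name]
    # Regex under test; with the default replacement, only the first
    # capture group is written to the target label.
    regex: '^(ip-[0-9]+-[0-9]+-[0-9]+-[0-9]+).*'
    # Temporary debug label (arbitrary name). Switch back to __host__
    # once the captured value looks right.
    target_label: fake_host
```

The captured value can then be compared with the output of echo $HOSTNAME on the instance.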

life5ign commented 1 year ago

@rgroothuijsen good idea re. using a dummy label. I did the same, and found that my regex was working as intended. echo $HOSTNAME on the instance showed the same thing. For some reason, the problem was explicitly stating replacement; I removed that line, and it worked. Below is the config that works:

- source_labels: [__meta_ec2_private_dns_name]
  regex: '^(ip-[0-9]+-[0-9]+-[0-9]+-[0-9]+).*'
  # the first capture group      
  # adding this explicitly prevented the relabeling from working
  #replacement: "${1}"
  target_label: __host__

Using (.*) wouldn't work, because it would capture something like ip-xxx-xxx-xxx-xxx.<region>.compute.internal, which wouldn't match $HOSTNAME on the instance; $HOSTNAME holds only the hostname part of that (private) domain name.

I don't have time to explore why this is happening, but I'm guessing that non-capturing groups need to be used in sub expressions in the regex, or there is a bug with promtail.

Thanks for your inspiration. Glad it's working now!

Ayatallah commented 1 year ago

@life5ign how could you validate that it's working from Grafana? I added the same scrape job as yours, except that under ec2_sd_configs: I added only the port, because I need syslogs from the hosting nodes, not remote ones, so the default values can apply safely. I still can't confirm that I'm getting syslog. When I check the logs of the promtail pods, I see that promtail adds the target, but it doesn't watch it or scrape logs from it.

dm-canteen commented 1 year ago

Just ran into this issue. Very disappointing, as I remember ec2_sd_configs being easy to set up and get working last time I played around with it. I still couldn't get the labels working even with the snippet above.

I note that $HOSTNAME on my instances (Amazon Linux 2) is set to the entire ip-xxx-xxx-xxx-xxx.<region>.compute.internal value. Commenting out the regex: line doesn't help with this.

My relevant YAML:

ec2_sd_configs:
  # This allows promtail to connect to EC2 and gather the following source labels
  - region: ${AWS_DEFAULT_REGION}
    access_key: ${AWS_ACCESS_KEY_ID}
    secret_key: ${AWS_SECRET_ACCESS_KEY}
    port: 9080

relabel_configs:
  # Use the ec2 labels as target loki labels
  - source_labels: [__meta_ec2_tag_Name]
    target_label: name
    action: replace
  - source_labels: [__meta_ec2_tag_AppName]
    target_label: app_name
    action: replace
  - source_labels: [__meta_ec2_instance_id]
    target_label: instance
    action: replace
  - source_labels: [__meta_ec2_private_dns_name]
    # ip-xxx-xxx-xxx-xxx.<region>.compute.internal
    # regex: '^(ip-[0-9]+-[0-9]+-[0-9]+-[0-9]+).*'
    target_label: __host__
  - target_label: source
    replacement: grafana-agent
    action: replace

Fixed this today by realising I was missing a __path__ relabel option, even though __path__ is defined later in the static_configs. Adding the following fixed the issue for me:

- action: replace
  replacement: /var/log/*
  target_label: __path__