Open makeittotop opened 6 months ago
From whatever I can tell with my limited knowledge of Go and channels, it appears that there are 2 goroutines (in this case) — one for localhost:13100, the other for logs.my-loki-instance.net — in the grafana-agent process. Both of them read from the same channel (api.Entry), which is populated by the readLines() function in the promtail package (grafana/clients/pkg/promtail/targets/file/tailer.go). As the localhost:13100 goroutine gets blocked by falling into retries and exponential backoffs, it delays the other (my-loki) goroutine from receiving data too; at least my tests confirm this. Is this because the underlying api.Entry channel is "full" while one of the 2 receivers is tied up elsewhere? My tests show that as soon as the failing goroutine unblocks after exhausting its retries, both receivers receive data pretty much immediately.
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
What's wrong?
I've noticed that, in a setup with multiple Loki clients to forward logs to, if one of the Loki clients starts failing for some reason (e.g. no process listening on the specified port), it starves the other, working Loki endpoints of data until the failing client exhausts all of its max_retries (default = 10). Once that loop resets, the same issue repeats. In the end, the working clients only receive data every ~6 minutes or so, depending on what max_period is set to (default = 5m). This also leads to "gaps" in the Grafana dashboard when looking at the data for those clients.
Steps to reproduce
Take a look at this nominal config -
./agent-local-config.yaml
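The config file itself is not reproduced here; as an illustration only, a two-client Grafana Agent (static mode) logs config of the kind described might look roughly like the following (paths, names, and the scrape target are assumptions; the two client URLs are the endpoints mentioned in this report):

```yaml
logs:
  configs:
    - name: default
      positions:
        filename: /tmp/positions.yaml
      scrape_configs:
        - job_name: varlogs
          static_configs:
            - targets: [localhost]
              labels:
                job: varlogs
                __path__: /var/log/*.log
      clients:
        # Both clients receive the same entries; if one endpoint is down,
        # its retries can stall delivery to the other.
        - url: http://localhost:13100/loki/api/v1/push
        - url: https://logs.my-loki-instance.net/loki/api/v1/push
```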
Start the agent as
Now, let's assume that the localhost:13100 instance is missing for some reason. In that case, I expected the other endpoint (logs.my-loki-instance) to still receive data at the configured scrape interval (60s), but that doesn't happen, as explained above.
System information
Linux 6.5.0-15-generic
Software version
Grafana Agent 0.35.0 and master atm
Configuration