grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Promtail stops shipping logs when an error is encountered in journald #4426

Open tifling85 opened 3 years ago

tifling85 commented 3 years ago

Describe the bug: This problem is very similar to #2812. We have configured Promtail, running as a systemd unit, to forward logs from journald. At some point Promtail starts writing errors to journald and stops shipping logs to Loki.

To Reproduce: Steps to reproduce the behavior:

  1. Run Promtail as a systemd unit, streaming and forwarding journald logs.
  2. An error is received during sdjournal follow (we are still trying to figure out what might be causing this error; "bad message" seems to indicate the stream encountered a non-text/string character).
  3. New log entries will not appear in Loki until the Promtail systemd unit is restarted.

Expected behavior: Promtail continues to stream and forward logs, or exits so it can be restarted by systemd.
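
For reference, a minimal sketch of the restart policy we would expect systemd to apply if Promtail did exit on this error; the unit name promtail.service and the drop-in path are assumptions about a typical install, not taken from this report:

# /etc/systemd/system/promtail.service.d/restart.conf
# Hypothetical drop-in: only takes effect if Promtail actually exits on error,
# which it currently does not do when the sdjournal follow fails.
[Service]
Restart=on-failure
RestartSec=5s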

Environment: Infrastructure:

Screenshots, Promtail config, or terminal output: Log from Promtail:

promtail[7497]: level=error ts=2021-10-07T04:48:15.338399632Z caller=journaltarget.go:186 msg="received error during sdjournal follow" err="failed to iterate journal: bad message"

Promtail emits a huge number of similar log lines.

Promtail config:

server:
  disable: true
  http_listen_address: 192.168.199.173
  http_listen_port: 9080
  grpc_listen_address: 192.168.199.173

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://192.168.199.173:3100/loki/api/v1/push

scrape_configs:

- job_name: journal
  journal:
    max_age: 12h
    labels:
      job: systemd-journal
      host: 192.168.199.173
  relabel_configs:
    - source_labels: ['__journal__systemd_unit']
      target_label: 'unit'
    - source_labels: ['__journal__hostname']
      target_label: 'hostname'
wang1219 commented 3 years ago

Same problem here:

level=error ts=2021-10-25T02:58:46.562434288Z caller=journaltarget.go:187 msg="received error during sdjournal follow" err="Timeout expired"
level=error ts=2021-10-25T02:58:46.562514578Z caller=journaltarget.go:190 msg="unable to follow journal" err="Timeout expired"

Config:
- job_name: journal
  journal:
    max_age: 1h
    labels:
      _app_: journal
      _env_: dev
  relabel_configs:
  - source_labels:
    - __journal__systemd_unit
    target_label: unit
tomjohnburton commented 2 years ago

Same here

ecusnir commented 2 years ago

Also facing this problem using v2.4.1

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

tifling85 commented 2 years ago

Bump!

tifling85 commented 2 years ago

bump please

zilahu commented 2 years ago

bump.

fluential commented 2 years ago

bump.

rmalenko commented 2 years ago

up

yurividal commented 2 years ago

same problem here...

cervajs commented 2 years ago

+1. Promtail 2.6.1 binary release (now running in a console, without a systemd unit). Host: CentOS 8 Stream, systemd-239-58.el8.x86_64.

level=error ts=2022-09-05T11:31:13.498145588Z caller=journaltarget.go:192 msg="received error during sdjournal follow" err="Timeout expired"
level=error ts=2022-09-05T11:31:13.498291736Z caller=journaltarget.go:195 msg="unable to follow journal" err="Timeout expired"

jmcfarlane commented 2 years ago

same problem here (fc36)

jmcfarlane commented 2 years ago

Actually mine was not the same problem, I was missing:

--volume=/etc/machine-id:/etc/machine-id:ro

:+1:
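
For anyone else running the container image, a rough sketch of where that flag goes; the image tag, config path, and journal mounts below are assumptions, not taken from the thread:

docker run -d --name promtail \
  --volume=/etc/machine-id:/etc/machine-id:ro \
  --volume=/var/log/journal:/var/log/journal:ro \
  --volume=/run/log/journal:/run/log/journal:ro \
  --volume=$(pwd)/promtail-config.yaml:/etc/promtail/config.yml:ro \
  grafana/promtail:2.6.1 \
  -config.file=/etc/promtail/config.yml

Per the comment above, without the /etc/machine-id mount the journal could not be read from inside the container.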

Pikw commented 2 years ago

UP

esthon commented 2 years ago

Same for me. Does anyone have a workaround for it?

ashdavid commented 2 years ago

Hey,

The original issue posted seems to be different from some of the discussion in this thread, but maybe this will help those who see this error: msg="unable to follow journal" err="Timeout expired"

I had the same error; the reason was that I do not have persistent storage enabled for the journal (I have a /run/log/journal path, but no /var/log/journal path).

The docs suggest Promtail will check both paths, but in my case I had to explicitly set the path option and limit it to the runtime journal, e.g.:

  - job_name: journal
    journal:
      path: /run/log/journal
      max_age: 1h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'

As soon as I added that path, everything started working. I'm running Promtail v2.6.1 on CentOS 7.

Schmoho commented 1 year ago

I am also having the "unable to follow journal" / "Timeout expired" issue. It applies to all my journald scrapers and thereby fully undermines my confidence in journald scraping.

Here is the line of code where this error is thrown.

ipHeaders commented 1 year ago

I had the same issue but fixed it by replacing path: /run/log/journal with path: /var/log/journal.

In my Ubuntu 22.04 VM the journal logs are under /var/log/journal.
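
In other words, a sketch of the relevant job with the path swapped; the labels and max_age here are illustrative, borrowed from the configs earlier in the thread:

- job_name: journal
  journal:
    path: /var/log/journal
    max_age: 1h
    labels:
      job: systemd-journal
  relabel_configs:
    - source_labels: ['__journal__systemd_unit']
      target_label: 'unit'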

Logicwax commented 1 year ago

I've noticed this same problem sometimes occurs if the network connection between Promtail and Loki is severed. I'm not able to reproduce this, but it's worth noting as a data point here.

Logicwax commented 1 year ago

Actually, I'm able to reliably reproduce this now. The issue is that I have Promtail connecting to a remote host via an SSH tunnel. If I take down the tunnel, or sever it in any way, and bring it back up, Promtail never reconnects properly. The same issue persists until I restart Promtail.
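
To make the setup concrete, a rough sketch of such a tunnel; the hostname, user, and ports are placeholders, and the Promtail client url is assumed to point at the local end of the tunnel (http://127.0.0.1:3100/loki/api/v1/push):

# forward local port 3100 to Loki listening on port 3100 on the remote host
ssh -N -L 3100:localhost:3100 user@remote-loki-host

According to the report above, killing that ssh process and bringing it back up is enough to leave Promtail stuck until it is restarted.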

remram44 commented 1 year ago

Running into this too:

level=error ts=2023-04-25T19:01:05.140074112Z caller=journaltarget.go:205 msg="received error during sdjournal follow" err="failed to iterate journal: bad message"

Is there a workaround, short of deleting the entire systemd journal?

Logicwax commented 1 year ago

@remram44 yes, just restart Promtail (sudo systemctl restart promtail) and it should sync back up.
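
If restarting by hand gets tedious, a crude band-aid is to watch for the error and restart automatically; the unit name, check interval, and exact error string below are assumptions, and this is a workaround sketch rather than a fix:

#!/bin/sh
# restart-promtail-on-journal-error.sh: run periodically (as root) from cron or a systemd timer.
# Restarts Promtail if the journal-follow error appeared in the last minute.
if journalctl -u promtail --since "1 min ago" | grep -q "failed to iterate journal: bad message"; then
    systemctl restart promtail
fi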

farodin91 commented 1 year ago

https://github.com/grafana/loki/pull/9155 this should fix the issue by letting promtail restart automatically

frittentheke commented 1 year ago

> #9155 this should fix the issue by letting promtail restart automatically

I just observed a constant stream of these errors:

promtail-linux-amd64[3829221]: level=error ts=2023-07-19T08:51:50.694001179Z caller=journaltarget.go:205 msg="received error during sdjournal follow" err="failed to iterate journal: bad message"

with Promtail 2.8.2.

A restart of Promtail then "fixed" it. Isn't this supposed to be covered by #9155, or is this likely a new issue, @dannykopping @farodin91?

remram44 commented 1 year ago

As far as I can tell, the fix was backported on May 5 (#9403) and 2.8.2 was released on May 3, so it will be in the next release.

cstyan commented 11 months ago

Is anyone still experiencing this issue on the latest 2.8.x versions?

tifling85 commented 11 months ago

The problem is still there in version 2.9.2.

aabramov-ionos commented 8 months ago

Can confirm that 2.9.2 is affected. I have reports that some 2.9.2 Promtails are experiencing this behaviour more often than they did on 2.6.4.

acedrew commented 5 months ago

Can confirm 2.9.7 is still affected.

simonknittel commented 1 month ago

Still happening on 3.1.1

My config:

- job_name: journal
  journal:
    max_age: 12h
    labels:
      job: systemd-journal
    path: /tmp/host_logs/journal
  relabel_configs:
    - source_labels: ['__journal__systemd_unit']
      target_label: 'unit'
    - source_labels: ['__journal__hostname']
      target_label: 'host'
  pipeline_stages:
    - drop:
        expression: '.*mount: Succeeded.*'