grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0
23.37k stars 3.39k forks source link

promtail: journald/systemd 251>252 seems to break log-scraping #8163

Open DennisGlindhart opened 1 year ago

DennisGlindhart commented 1 year ago

Describe the bug After system updated to systemd 252 from 251 (OpenSuSE MicroOS/Tumbleweed in my case) promtail will no longer read journald logs.

level=info ts=2023-01-16T09:14:47.958685214Z caller=main.go:171 msg="Starting Promtail" version="(version=2.7.1, branch=HEAD, revision=e0af1cc8a)"
level=warn ts=2023-01-16T09:14:47.958805407Z caller=promtail.go:217 msg="disable watchConfig" reason="promtailServer cast fail"

and will not send any logs - no more output.

To Reproduce

  1. System with journald/systemd version 252 or higher
  2. Start promtail with following -config.file=promtail.yaml

promtail.yaml config

server:
  disable: true

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: https://<lokiserver>/loki/api/v1/push

scrape_configs:
  - job_name: journal
    journal:
     max_age: 12h
     labels:
       job: systemd-journal
      path: /var/log/journal
   relabel_configs:
     - source_labels: ['__journal__systemd_unit']
       target_label: 'unit'
   pipeline_stages:

Expected behavior Promtail being able to read jounrnald log and send to Loki

Environment:

Screenshots, Promtail config, or terminal output I've tried appending -log.level=debug to command, but no extra output is printed.

DennisGlindhart commented 1 year ago

Seems actually to work when running the binary directly from the host without Docker. Maybe a discrepancy between systemd-libraries in Docker image and on host not being compatible somehow.

DennisGlindhart commented 1 year ago

Probaly a "newer" duplicate of issue #2792

DennisGlindhart commented 1 year ago

I tested/reproduced on different servers, got a little confused along the way, with not errors and then errrors etc, but here we are :)

So:

After upgrade to systemd 252 promtail will print on startup (If it has been running before and positions-file contains some position/offset)

level=error ts=2023-01-16T09:14:47.948379349Z caller=journaltarget.go:261 msg="received error reading saved journal position" msg="received error reading saved journal position" err="failed to get realtime timestamp: cannot assign requested address"

That can be resolved by removing the positions-file, restarting promtail and the error is gone. But still no logs being send to Loki etc. This indicates for me that some timestamp-related format or so changed in journald.

Logs is being send when running the promtail-binary directly on the host-system (which has systemd 252), but not when using Dockerhub-image, which reveals that it will work as long as the runtime-systemd library is the right version.

Creating an image with the following docker-file hotfixes the problem. It uses the newest tumbleweed image (which contains systemd 252 libs) and simply copies the binary from the original promtail-docker image.

FROM docker.io/grafana/promtail:2.7.1 as orig
FROM registry.opensuse.org/opensuse/tumbleweed:latest
COPY --from=orig /usr/bin/promtail /usr/bin/promtail
ENTRYPOINT ["/usr/bin/promtail"]
chaudum commented 1 year ago

Hey @DennisGlindhart thanks for reporting and posting the workaround.

This change in 252 may be responsible for the breaking of the journal target:

        * Journal files gained a new compatibility flag
          'HEADER_INCOMPATIBLE_COMPACT'. Files with this flag implement changes
          to the storage format that allow reducing size on disk. As with other
          compatibility flags, older journalctl versions will not be able to
          read journal files using this new format. The environment variable
          'SYSTEMD_JOURNAL_COMPACT=0' can be passed to systemd-journald to
          disable this functionality. It is enabled by default.

Source: https://lwn.net/Articles/913287/

That said, it's just a guess. I will try to verify this by setting the env variable SYSTEMD_JOURNAL_COMPACT=0 ~in the Docker image.~

Update: The env variable of course needs to be set on the host's systemd-journald.

chaudum commented 1 year ago

Update: The env variable of course needs to be set on the host's systemd-journald.

I've set the env variable in the systemd-journald service file, executed systemctl daemon-reload and then restarted the service. However, this did not solve the problem.

srclosson commented 1 year ago

Seeing this on:

systemd 247 (247.3-7+deb11u1)
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified

Command:

promtail -config.file /etc/promtail/config.yaml -log.level=debug
level=debug ts=2023-04-10T21:43:59.926223768Z caller=promtail.go:115 msg="Reloading configuration file"
level=info ts=2023-04-10T21:43:59.926700163Z caller=promtail.go:123 msg="Reloading configuration file" md5sum=b2ac30f717e75c5b157ee17a74fa6e6e
level=info ts=2023-04-10T21:43:59.927765209Z caller=main.go:171 msg="Starting Promtail" version="(version=2.7.5, branch=HEAD, revision=fd33f6a18)"
level=warn ts=2023-04-10T21:43:59.927816456Z caller=promtail.go:217 msg="disable watchConfig" reason="promtailServer cast fail"
Histalek commented 1 year ago

I'm also running into this problem. In my case i upgraded from Fedora 37 to Fedora 38. Which translates to a systemd version jump from 251 -> 253

systemd 253 (253.2-1.fc38)
+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified

After stopping the container -> deleting the positions.yml -> starting the container again the sending of journal logs now fails completely silent.

If this is happening for systemd version 252+ then it should currently only affect the following distros:

As the workaround of using a newer baseimage seems to work, this seems to be a missing runtime dependency.

The following 2 Workarounds worked for me:

Workaround 1

Workaround 2

RUN apt-get update && apt-get install -t bullseye-backports -qy systemd


- Build the image with podman/docker e.g. `podman build -t promtail-new-systemd .`
- Run that image instead of the 'official' one
- Loki shows logs of that host again

## Proposed Solution
As the bullseye-backports are already used in the official image installing a newer systemd from there seems easy enough to do and would prevent this issue from exploding once more distros update to systemd v252 and above.
growse commented 7 months ago

Building promtail on Debian bookworm-slim (with libsystemd0 and libsystemd-dev 252), I still get the msg="disable watchConfig" reason="promtailServer cast fail error and no logs streaming when running it on another bookworm system.

The SYSTEMD_JOURNAL_COMPACT workaround makes no difference.

edit

Scratch that, I had a completely different problem. I was looking for journals in /var/log/journal because Storage=Persistent, but logs are actually being written to /run/log/journal. No idea why. Not a promtail issue though.