scottmuc / infrastructure

Documentation / Automation for personal third-party infrastructure
The Unlicense
11 stars 2 forks source link

Migrate Log Forwarding to Better Stack #60

Closed scottmuc closed 1 year ago

scottmuc commented 1 year ago

Currently, logs are being forwarded to Papertrail.

Recently, the application I use to monitor and alert when my websites go down announced a bit of a rebranding and a direction towards building a platform. I've been impressed with the experience with Better Uptime (now becoming Uptime as part of the Better Stack), and consolidating these tools makes sense to me.

Here's the current logging configuration.

As an exercise, I'd like to experiment with having logs going to both services. I believe this will provide some useful real-world experience and allow me to do a side-by-side review before doing the switch-over.

If logs cannot be forwarded with TLS, I won't do the migration.

scottmuc commented 1 year ago

Doing some initial research and found some good resources. Important context is that I'm using rsyslog.

Forwarding to 2 servers seems to be relatively straightforward according to this Serverfault answer.

The Betterstack logging guide looks to be really well done. That being said, that referenced guide doesn't include TLS so reading the docs is also required.

Tinkering with this is reminding me of the rfc5424 work I did at Pivotal. I may work that into this task as well.

scottmuc commented 1 year ago

Created a log source with Betterstack and followed the Serverfault guidance on forwarding to 2 servers. Worked on the first try:

image

hint the "testing" string is what I was looking for after running logger "testing" on the PI.

At the moment, I'm trusting the configuration that the traffic is in fact encrypted.

Also, it appears rsyslog is complaining about duplicated defaultnetstreamdrivercafile so I'll see if I can specify that on the specific action or if a list can be provided.

scottmuc commented 1 year ago

To resolve the separate CA files per log server destination this answer hinted at concatenating all CAs into 1 file.

What's really interesting is that this was posted as an issue (just 3 months ago) to rsyslog (https://github.com/rsyslog/rsyslog/issues/5150). Some noteworthy take-aways from this is that this feature is supported if running a modern version. Looking at my current system, I found that I'm running a very old version (8.2102.0 released 2 years ago). @mrdavidlaing, you may get a kick out of the maintainer of rsyslog being pretty direct with the message of "your version is too old, so no support for you".

With this discovery, I'd rather upgrade rsyslog and use the new feature rather than concatenating CA bundles.

scottmuc commented 1 year ago

It looks like the bullseye repo is far behind. Fortunately, new versions are in the bullseye-backports repo. This is something I'm not super familiar with but it looks like it's a matter of adding the backports repo to the sources (source). This will be a good exercise to keep a more up-to-date old version of Raspian.

Firstly, I'll commit and push my work in progress. It still seems to be forwarding logs despite the warning message, but I'd prefer to silence the warnings by having a better configured and up-to-date system. Plus, it'll be a good exercise in some linux admin fundamentals.

scottmuc commented 1 year ago

Upgrading rsyslog using bullseye-backports was surprisingly easy! The above commit was all that was necessary.

scottmuc commented 1 year ago

The above commit resolves the CA file warnings, but now logs are not forwarding to Better Stack. Funny enough, through the partial implementation of this task, I can see the issue in the logs that are reaching Papertrail:

Jun 25 19:38:23 raspberrypi not permitted to talk to peer, certificate invalid: signer not found [v8.2302.0]
Jun 25 19:38:23 raspberrypi rsyslogd invalid cert info: peer provided 3 certificate(s). Certificate 1 info: certificate valid from Tue May 23 08:10:58 2023 to Mon Aug 21 08:10:57 2023; Certificate public key: EC/ECDSA; DN: CN=*.logs.betterstack.com; Issuer DN: C=US,O=Let's Encrypt,CN=R3; SAN:DNSname: *.logs.betterstack.com;  [v8.2302.0]

I'm puzzled as to how logs got to Better Stack in the first place. I'm curious if the bundle from Papertrail has the relevant signing certs included in it.

scottmuc commented 1 year ago

Resolved above errors with a bit of a workaround in the above commit. I'm seeing logs going to bother services now. I'm also seeing the following message a few times:

Jun 25 20:30:30 raspberrypi rsyslogd unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2302.0 try https://www.rsyslog.com/e/2078 ] Jun 25 20:30:30 raspberrypi rsyslogd omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ] Jun 25 20:30:30 raspberrypi rsyslogd action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]

scottmuc commented 1 year ago

TIL that the way to specify the RFC5424 template is to use RSYSLOG_SyslogProtocol23Format (source, docs)

scottmuc commented 1 year ago

After reading this issue https://github.com/rsyslog/rsyslog/issues/4447, this could be because the folks at Better Stack have stricter timeouts on the TCP connections. I will observe this a bit more. If these messages occur after some pause between sending logs, that might be a hint. I might be try using TCP keepalive settings or chat with the folks at Better Stack.

I'm now seeing warnings that the certificate and key file are not specified. According to this issue https://github.com/rsyslog/rsyslog/issues/3413, it looks like it's ok for them to be left out. They are needed if client authentication needs to be done on the server side. In my situation, I'm just wanting to ensure that I verify the identity of where I'm sending the logs. A random person could flood my log collectors if they need a couple private details for either Papertrail (host and port) or Better Stack (token).

scottmuc commented 1 year ago

Feedback video recorded and published (20m): https://www.youtube.com/watch?v=PDDnvp-M5mg

I've rotated the exposed token and have pinged BetterStack (also sent an email since the Twitter account generally redirects to either DMs or hello@betterstack.com) for a contact. I couldn't find a direct way to messaging them on their website without navigating a maze of menu options.

Now I'll continue with a bit of configuration consistency changes. If it fixes things, I'll report here.

scottmuc commented 1 year ago

The above commits result in a more consistent configuration. Now I see duplicate warnings like the following:

warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]

Which leads me to believe that setting StreamDriverAuthMod="x509/name" (rather than anon) sends rsyslog down a configuration validation path that checks for these config options. Though after reading the docs, I'm not sure if x509/certvalid means client cert or server cert. Maybe that's the setting one would use on a receiving server to require client cert authentication. shrug

Also... now after a bit of periodic usage, I do find Logtail a bit sluggish from a UI perspective compared to Papertrail. The page refresh speed is a fraction of a second for Papertrail while it's seconds for Logtail. At this moment I do not plan on removing Papertrail until Logtail improves a few things.

scottmuc commented 1 year ago

Experiment 1 - TCP_Framing

Tried experimenting with setting TCP_Framing to traditional. This was only a hunch because I recall an experience where I worked on a platform team and something was sending logs to our fluentd instance that couldn't handle octet-streaming. I can't recall what was sending the logs but I do remember seeing the frame size via tcpdump.

Anyways... experiment didn't resolve it as per these logs (I manually logged a message post configuration change):

image

scottmuc commented 1 year ago

Experitment 2 - KeepAlive

Setting KeepAlive="on". Didn't resolve the issue.

image

scottmuc commented 1 year ago

Got a response to my e-mail from Filip at BetterStack. Happy that the feedback will be shared with the team.

scottmuc commented 1 year ago

Summary

At the moment, I cannot commit to switching to BetterStack Logtail. I'll continue to forward logs to both locations. Current blockers to turning off Papertrail are:

I'm guessing the gtls issue might require chatting with the rsyslog team. For now I can dissect https://github.com/rsyslog/rsyslog/issues/4447 further and learn how to setup the debugging so I can get the relevant logs around this issue. I'll leave this task open for the time being to track the learning and progress.

scottmuc commented 1 year ago

Analysis

The Problem

Logs are dropped when forwarding to BetterStack Logtail and ERR logs report some form of TCP connection issue bubbling up from the TLS/TCP dependencies of rsyslog.

Replication

I'm happy that I have a consistent way of replicating the error and supressing it in an artificial way. My Raspberry PI doesn't emit many logs so this analysis might not be relevant for most environments. Still, this error could come up in testing or in product comparisons. Given this fact, I emit logs with the following command running on the PI:

while sleep 31; do logger "slept 31s"; done

Logs received by Papertrail

Jul 01 12:54:30 raspberrypi ansible slept 31s
Jul 01 12:55:01 raspberrypi ansible slept 31s
Jul 01 12:55:01 raspberrypi rsyslogd unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function.  [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:55:01 raspberrypi rsyslogd omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:55:01 raspberrypi rsyslogd action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
Jul 01 12:55:01 raspberrypi rsyslogd warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
Jul 01 12:55:01 raspberrypi rsyslogd warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
Jul 01 12:55:01 raspberrypi rsyslogd action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]
Jul 01 12:55:32 raspberrypi ansible slept 31s
Jul 01 12:56:03 raspberrypi ansible slept 31s
Jul 01 12:56:03 raspberrypi rsyslogd unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function.  [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:56:03 raspberrypi rsyslogd omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:56:03 raspberrypi rsyslogd action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
Jul 01 12:56:03 raspberrypi rsyslogd warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
Jul 01 12:56:03 raspberrypi rsyslogd warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
Jul 01 12:56:03 raspberrypi rsyslogd action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]

Logs received by BetterStack Logtail

2023-07-01 12:55:01.340 [raspberry_pi] Syslog [NOTICE] slept 31s
2023-07-01 12:55:01.340 [raspberry_pi] Syslog [ERR] unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function.  [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:55:01.340 [raspberry_pi] Syslog [ERR] omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:55:01.348 [raspberry_pi] Syslog [WARNING] action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
2023-07-01 12:55:01.400 [raspberry_pi] Syslog [WARNING] warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
2023-07-01 12:55:01.400 [raspberry_pi] Syslog [WARNING] warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
2023-07-01 12:55:01.552 [raspberry_pi] Syslog [INFO] action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]
2023-07-01 12:56:03.383 [raspberry_pi] Syslog [NOTICE] slept 31s
2023-07-01 12:56:03.383 [raspberry_pi] Syslog [ERR] unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function.  [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:56:03.383 [raspberry_pi] Syslog [ERR] omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:56:03.391 [raspberry_pi] Syslog [WARNING] action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
2023-07-01 12:56:03.443 [raspberry_pi] Syslog [WARNING] warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
2023-07-01 12:56:03.443 [raspberry_pi] Syslog [WARNING] warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
2023-07-01 12:56:03.591 [raspberry_pi] Syslog [INFO] action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]

note The logs at 12:54:30 and 12:55:32 did not appear in BetterStack Logtail. What preceded (guessing it preceded) are ERR logs bubbling up from Rsyslog -> GnuTLS -> Glibc(?).

Work Around

By changing the frequency of my log emitions I get the following results:

while sleep 29; do logger "slept 29s"; done

Logs received by Papertrail

Jul 01 13:06:35 raspberrypi ansible slept 29s
Jul 01 13:07:04 raspberrypi ansible slept 29s
Jul 01 13:07:33 raspberrypi ansible slept 29s
Jul 01 13:08:02 raspberrypi ansible slept 29s
Jul 01 13:08:32 raspberrypi ansible slept 29s

Logs received by BetterStack Logtail

2023-07-01 13:06:35.804 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:07:04.823 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:07:33.837 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:08:02.851 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:08:31.874 [raspberry_pi] Syslog [NOTICE] slept 29s

Environment

ansible@raspberrypi:~ $ rsyslogd -v
rsyslogd  8.2302.0 (aka 2023.02) compiled with:
        PLATFORM:                               aarch64-unknown-linux-gnu
        PLATFORM (lsb_release -d):
        FEATURE_REGEXP:                         Yes
        GSSAPI Kerberos 5 support:              Yes
        FEATURE_DEBUG (debug build, slow code): No
        32bit Atomic operations supported:      Yes
        64bit Atomic operations supported:      Yes
        memory allocator:                       system default
        Runtime Instrumentation (slow code):    No
        uuid support:                           Yes
        systemd support:                        Yes
        Config file:                            /etc/rsyslog.conf
        PID file:                               /run/rsyslogd.pid
        Number of Bits in RainerScript integers: 64

See https://www.rsyslog.com for more information.
ansible@raspberrypi:~ $ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye
ansible@raspberrypi:~ $ uname -a
Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr  3 17:24:16 BST 2023 aarch64 GNU/Linux
ansible@raspberrypi:~ $ dpkg -l | grep gnutls
ii  libcurl3-gnutls:arm64                7.74.0-1.3+deb11u7               arm64        easy-to-use client-side URL transfer library (GnuTLS flavour)
ii  libgnutls30:arm64                    3.7.1-5+deb11u3                  arm64        GNU TLS library - main runtime library
ii  rsyslog-gnutls                       8.2302.0-1~bpo11+1               arm64        TLS protocol support for rsyslog (GnuTLS)
ansible@raspberrypi:~ $ ldd --version
ldd (Debian GLIBC 2.31-13+rpt2+rpi1+deb11u5) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

Analysis

The GnuTLS docs doesn't have much to add around: GnuTLS error -53

Searching for "TCPSendBuf+error+-2078" in a few search engines results in rsyslog issues such as the following:

These all seemed to signal towards more fundamental TCP/IP issues not TLS. I can't find TCPSendBuf error -2078 maps to in the glibc codebase. I'm a little out of my deptch when trying to understand C. I found what I hoped to be a useful GNU libc error codes document, but it doesn't show what integers that codes match with.

Looking further, it looks like my version of glibc is old. I have to give kudos to perror's stackoverflow answer on upgrading glibc. It provided some background and yet more maintainer politics. I also appreciated their sense of humour with step 4 being Pray. Still... it looks like eglibc stopped updates in 2017.

I was also curious as to what type of TLS connections were created for both:

Papertrail

ansible@raspberrypi:~ $ gnutls-cli -d 1 logs.papertrailapp.com -p <redacted>
Processed 129 CA certificate(s).
Resolving 'logs.papertrailapp.com:<redacted>'...
Connecting to '169.46.82.165:<redacted>'...
- Certificate type: X.509
- Got a certificate list of 4 certificates.
- Certificate[0] info:
 - subject `CN=*.papertrailapp.com', issuer `CN=Sectigo RSA Domain Validation Secure Server CA,O=Sectigo L
imited,L=Salford,ST=Greater Manchester,C=GB', serial 0x00b468131d4e02b0e4f7640793347b94e3, RSA key 2048 bi
ts, signed using RSA-SHA256, activated `2022-08-25 00:00:00 UTC', expires `2023-08-25 23:59:59 UTC', pin-s
ha256="FYui1R7uT22TT2nmoniwAI5UNTaBXsuSKCcxUusX4kM="
        Public Key ID:
                sha1:d84d168a641513ec6f750a0f19fc5a109d770a36
                sha256:158ba2d51eee4f6d934f69e6a278b0008e543536815ecb9228273152eb17e243
        Public Key PIN:
                pin-sha256:FYui1R7uT22TT2nmoniwAI5UNTaBXsuSKCcxUusX4kM=

- Certificate[1] info:
 - subject `CN=Sectigo RSA Domain Validation Secure Server CA,O=Sectigo Limited,L=Salford,ST=Greater Manch
ester,C=GB', issuer `CN=USERTrust RSA Certification Authority,O=The USERTRUST Network,L=Jersey City,ST=New
 Jersey,C=US', serial 0x7d5b5126b476ba11db74160bbc530da7, RSA key 2048 bits, signed using RSA-SHA384, acti
vated `2018-11-02 00:00:00 UTC', expires `2030-12-31 23:59:59 UTC', pin-sha256="4a6cPehI7OG6cuDZka5NDZ7FR8
a60d3auda+sKfg4Ng="
- Certificate[2] info:
 - subject `CN=USERTrust RSA Certification Authority,O=The USERTRUST Network,L=Jersey City,ST=New Jersey,C
=US', issuer `CN=AAA Certificate Services,O=Comodo CA Limited,L=Salford,ST=Greater Manchester,C=GB', seria
l 0x3972443af922b751d7d36c10dd313595, RSA key 4096 bits, signed using RSA-SHA384, activated `2019-03-12 00
:00:00 UTC', expires `2028-12-31 23:59:59 UTC', pin-sha256="x4QzPSC810K5/cMjb05Qm4k3Bw5zBn4lTdO/nEW/Td4="
- Certificate[3] info:
 - subject `CN=AAA Certificate Services,O=Comodo CA Limited,L=Salford,ST=Greater Manchester,C=GB', issuer
`CN=AAA Certificate Services,O=Comodo CA Limited,L=Salford,ST=Greater Manchester,C=GB', serial 0x01, RSA k
ey 2048 bits, signed using RSA-SHA1 (broken!), activated `2004-01-01 00:00:00 UTC', expires `2028-12-31 23
:59:59 UTC', pin-sha256="vRU+17BDT2iGsXvOi76E7TQMcTLXAqj0+jGPdW7L1vM="
- Status: The certificate is trusted.
- Description: (TLS1.2-X.509)-(ECDHE-X25519)-(RSA-PSS-RSAE-SHA256)-(AES-128-CBC)-(SHA1)
- Session ID: BF:EE:26:C5:6F:B8:94:E1:1F:A1:23:39:B7:89:F0:72:13:66:0D:73:DA:20:05:19:D5:4A:32:3A:DA:94:69
:98
- Options: extended master secret, safe renegotiation,
- Handshake was completed

- Simple Client Mode:

BetterStack Logtail

ansible@raspberrypi:~ $ gnutls-cli -d 1 in.logs.betterstack.com -p 6514
Processed 129 CA certificate(s).
Resolving 'in.logs.betterstack.com:6514'...
Connecting to '65.108.132.43:6514'...
- Certificate type: X.509
- Got a certificate list of 3 certificates.
- Certificate[0] info:
 - subject `CN=*.logs.betterstack.com', issuer `CN=R3,O=Let's Encrypt,C=US', serial 0x03780622f2b6932da177
a04a000fc26234a4, EC/ECDSA key 256 bits, signed using RSA-SHA256, activated `2023-05-23 07:10:58 UTC', exp
ires `2023-08-21 07:10:57 UTC', pin-sha256="RYhEZU9LPm42uFSu/21kM2hFmawIOjFbG2xNGCT1rDQ="
        Public Key ID:
                sha1:7e03bca5cb61f21fd5bfc6e390b6e2068a2d0ee6
                sha256:458844654f4b3e6e36b854aeff6d6433684599ac083a315b1b6c4d1824f5ac34
        Public Key PIN:
                pin-sha256:RYhEZU9LPm42uFSu/21kM2hFmawIOjFbG2xNGCT1rDQ=

- Certificate[1] info:
 - subject `CN=R3,O=Let's Encrypt,C=US', issuer `CN=ISRG Root X1,O=Internet Security Research Group,C=US',
 serial 0x00912b084acf0c18a753f6d62e25a75f5a, RSA key 2048 bits, signed using RSA-SHA256, activated `2020-
09-04 00:00:00 UTC', expires `2025-09-15 16:00:00 UTC', pin-sha256="jQJTbIh0grw0/1TkHSumWb+Fs0Ggogr621gT3P
vPKG0="
- Certificate[2] info:
 - subject `CN=ISRG Root X1,O=Internet Security Research Group,C=US', issuer `CN=DST Root CA X3,O=Digital
Signature Trust Co.', serial 0x4001772137d4e942b8ee76aa3c640ab7, RSA key 4096 bits, signed using RSA-SHA25
6, activated `2021-01-20 19:14:03 UTC', expires `2024-09-30 18:14:03 UTC', pin-sha256="C5+lpZ7tcVwmwQIMcRt
PbsQtWLABXhQzejna0wHFr8M="
- Status: The certificate is trusted.
- Description: (TLS1.3-X.509)-(ECDHE-SECP256R1)-(ECDSA-SECP256R1-SHA256)-(AES-256-GCM)
- Session ID: DC:04:49:0B:7C:F7:59:37:60:B8:4F:93:E5:8E:67:D5:50:A0:7F:48:1F:3F:27:AC:0F:91:E1:2B:9D:25:98
:09
- Options:
- Handshake was completed

- Simple Client Mode:

The few things I noticed between these connections are as follow:

Summary

Given the replication and remediation steps are around changing the length of time of absence of logs, I'm leaning towards investigating what sort of timeouts are implemented at BetterStack. Not sure of the right terminology, but something like read timeout or idle timeout seems about right.

Next Steps

scottmuc commented 1 year ago

Tried changing the TLD driver to ossl as per Rainer's suggestion. Interesting to see the following error message. Looks like the peer name resolved with openssl is different than with gnutls

Jul 01 14:57:45 raspberrypi rsyslogd nsd_ossl:TLS session terminated with remote syslog server: peer name not authorized, not permitted to talk to name: /CN=*.logtail.com;  [v8.2302.0 try https://www.rsyslog.com/e/2088 ]

By updating the configuration to use StreamDriverPermittedPeers="*.logtail.com", the previous error stopped appearing.

As Rainer mentioned, ossl does have a bit more descriptive messages.

Jul 01 15:02:04 raspberrypi rsyslogd SSL_ERROR_SYSCALL Error in 'Send': 'error:00000005:lib(0):func(0):DH lib(5)' with ret=-1, errno=32, sslapi='SSL_write'  [v8.2302.0]
Jul 01 15:02:04 raspberrypi rsyslogd omfwd: TCPSendBuf error -1, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0]
Jul 01 15:02:04 raspberrypi rsyslogd nsd_ossl: TLS session terminated successfully to remote syslog server 'in.logs.betterstack.com' with SSL Error '-1': End Session [v8.2302.0]

What I notice with this one is it mentions in.logs.betterstack.com. I'll add that as a permitted peer and see if that changes things.

Unfortunately it did not change things, but did see errno=32 which means a broken pipe.

update the same 29s vs 31s replication and resolution steps apply to the ossl TLS driver.

scottmuc commented 1 year ago

Looked at trying unencrypted TCP but the documentation shows UDP configuration for that. So can't really test the TCP connectivity in isolation of TLS.

scottmuc commented 1 year ago

Forgot that I already tried turning on TCP KeepAlive, but given that this only sends packets with out content body, it wasn't enough to prevent the connection from dropping. I'm starting to believe that some form of ingress device on the BetterStack side is hanging up.

I followed David Lang's suggestion in (https://github.com/rsyslog/rsyslog/issues/5173#issuecomment-1616161699) to use omprog to emit a keep alive like log. You can see the logging results below:

Jul 02 13:53:10 raspberrypi ansible slept 31s
Jul 02 13:53:35 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:53:41 raspberrypi ansible slept 31s
Jul 02 13:54:06 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:54:12 raspberrypi ansible slept 31s
Jul 02 13:54:37 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:54:43 raspberrypi ansible slept 31s
Jul 02 13:55:08 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:55:18 raspberrypi ansible slept 15s
Jul 02 13:55:33 raspberrypi ansible slept 15s
Jul 02 13:55:48 raspberrypi ansible slept 15s
Jul 02 13:56:03 raspberrypi ansible slept 15s
Jul 02 13:56:18 raspberrypi ansible slept 15s

What we see is that the GnuTLS errors no longer occur when I'm doing the 31s sleep/log/loop. Instead we see a log saying that nothing has been logged for 25s. This appears to be sufficient to keep the connection alive. When I switch the loop to 15s, the keepalive observes that logs have been sent so it doesn't write a keepalive log.

Rather than doing the rest of the experiments, I'm going to wait to hear from BetterStack to confirm whether or not they have some idle timeout settings on their ingress.

The keepalive script and configuration will be in the next commit.

scottmuc commented 1 year ago

I'm back from my trip. When I looked at my logs is was fully of the omprog logs. This seemed peculiar because I know that there should be more logs. Things like cron, systemd, grafana, navidrome etc... weren't showing up. I rebooted the PI and now they do show up! This is unrelated to this issue. I did some journald and other service restarts while troubleshooting so I must have done something that broke the log funnel. This signals to me that I need to better understand how the logging flows work when services are being supervised by systemd.

Future reading: https://askubuntu.com/questions/1190144/where-do-log-messages-go-with-journald-and-rsyslog

Still waiting to hear from BetterStack to confirm if there's some idle timeout configured on their end.

scottmuc commented 1 year ago

How A Log Logs

A short aside to describe my mental model on how systemd service logs stopped appearing in /var/log/syslog (and thus stopped appearing in Papertrail and BetterStack).

I'll start working backwards from rsyslog

rsyslog

The following input module configuration is how rsyslog receives logs as inputs:

module(load="imuxsock") # provides support for local system logging
module(load="imklog")   # provides kernel logging support

imuxsock enables the pulling of logs from the /dev/log socket (docs). This is the one relevant for the logs coming from the systemd service units.

And here is the rule that says to write things to /var/log/syslog

#
# Log anything besides private authentication messages to a single log file
#
*.*;auth,authpriv.none      -/var/log/syslog

journald + systemd

My journald.conf has the default of ForwardtoSyslog=yes. I'm not 100% sure but I believe this enables the communication of logs to rsyslog via /dev/log which is symlinked to /run/systemd/journal/dev-log (docs)

This "understanding logging in linux" stackoverflow question/answer is an excellent resource to piece this together.

My mental model of this system isn't very clear, but I do feel confident that /dev/log is the bridge between this system and rsyslog.

Systemd Services

Since the services I am running log to stdout, these kept picked up by the journald service somehow. Guessing that systemd provides a socket to ingest these.

I could bypass journald though with unit config as follows (source):

StandardOutput=syslog
StandardError=syslog

Loss of service logs cause hypothesis

At some point I did a reload or restart of the systemd-journald service, and I believe that's when my service's logs weren't reaching rsyslog. I'm not going to dissect this further since this isn't the main issue I'm dealing with, but hope this comment will come in handy in the future. The logger "foo" worked because it bypassed the systemd/journald system as did a few of the other logs (e.g.: auth logs).

Thanks @ohm for the rubber ducking!

scottmuc commented 1 year ago

Closing Decision

Decided to remove the BetterStack logging. I don't like the workaround to output a line every 30s silence the errors. With the workaround, I have logs full of the keep alive messages like below:

with_logalive

vs less full logs but with ERR and reconnection messages occupying the space instead:

image

With Papertrail, I don't have either of these issues.