Closed scottmuc closed 1 year ago
Doing some initial research and found some good resources. Important context is that I'm using rsyslog
.
Forwarding to 2 servers seems to be relatively straightforward according to this Serverfault answer.
The Betterstack logging guide looks to be really well done. That being said, that referenced guide doesn't include TLS so reading the docs is also required.
Tinkering with this is reminding me of the rfc5424 work I did at Pivotal. I may work that into this task as well.
Created a log source with Betterstack and followed the Serverfault guidance on forwarding to 2 servers. Worked on the first try:
hint the "testing" string is what I was looking for after running logger "testing"
on the PI.
At the moment, I'm trusting the configuration that the traffic is in fact encrypted.
Also, it appears rsyslog
is complaining about duplicated defaultnetstreamdrivercafile
so I'll see if I can specify that on the specific action
or if a list can be provided.
To resolve the separate CA files per log server destination this answer hinted at concatenating all CAs into 1 file.
What's really interesting is that this was posted as an issue (just 3 months ago) to rsyslog (https://github.com/rsyslog/rsyslog/issues/5150). Some noteworthy take-aways from this is that this feature is supported if running a modern version. Looking at my current system, I found that I'm running a very old version (8.2102.0
released 2 years ago). @mrdavidlaing, you may get a kick out of the maintainer of rsyslog
being pretty direct with the message of "your version is too old, so no support for you".
With this discovery, I'd rather upgrade rsyslog
and use the new feature rather than concatenating CA bundles.
It looks like the bullseye repo is far behind. Fortunately, new versions are in the bullseye-backports repo. This is something I'm not super familiar with but it looks like it's a matter of adding the backports repo to the sources (source). This will be a good exercise to keep a more up-to-date old version of Raspian.
Firstly, I'll commit and push my work in progress. It still seems to be forwarding logs despite the warning message, but I'd prefer to silence the warnings by having a better configured and up-to-date system. Plus, it'll be a good exercise in some linux admin fundamentals.
Upgrading rsyslog
using bullseye-backports
was surprisingly easy! The above commit was all that was necessary.
The above commit resolves the CA file warnings, but now logs are not forwarding to Better Stack. Funny enough, through the partial implementation of this task, I can see the issue in the logs that are reaching Papertrail:
Jun 25 19:38:23 raspberrypi not permitted to talk to peer, certificate invalid: signer not found [v8.2302.0]
Jun 25 19:38:23 raspberrypi rsyslogd invalid cert info: peer provided 3 certificate(s). Certificate 1 info: certificate valid from Tue May 23 08:10:58 2023 to Mon Aug 21 08:10:57 2023; Certificate public key: EC/ECDSA; DN: CN=*.logs.betterstack.com; Issuer DN: C=US,O=Let's Encrypt,CN=R3; SAN:DNSname: *.logs.betterstack.com; [v8.2302.0]
I'm puzzled as to how logs got to Better Stack in the first place. I'm curious if the bundle from Papertrail has the relevant signing certs included in it.
Resolved above errors with a bit of a workaround in the above commit. I'm seeing logs going to bother services now. I'm also seeing the following message a few times:
Jun 25 20:30:30 raspberrypi rsyslogd unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2302.0 try https://www.rsyslog.com/e/2078 ] Jun 25 20:30:30 raspberrypi rsyslogd omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ] Jun 25 20:30:30 raspberrypi rsyslogd action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
After reading this issue https://github.com/rsyslog/rsyslog/issues/4447, this could be because the folks at Better Stack have stricter timeouts on the TCP connections. I will observe this a bit more. If these messages occur after some pause between sending logs, that might be a hint. I might be try using TCP keepalive settings or chat with the folks at Better Stack.
I'm now seeing warnings that the certificate and key file are not specified. According to this issue https://github.com/rsyslog/rsyslog/issues/3413, it looks like it's ok for them to be left out. They are needed if client authentication needs to be done on the server side. In my situation, I'm just wanting to ensure that I verify the identity of where I'm sending the logs. A random person could flood my log collectors if they need a couple private details for either Papertrail (host and port) or Better Stack (token).
Feedback video recorded and published (20m): https://www.youtube.com/watch?v=PDDnvp-M5mg
I've rotated the exposed token and have pinged BetterStack (also sent an email since the Twitter account generally redirects to either DMs or hello@betterstack.com) for a contact. I couldn't find a direct way to messaging them on their website without navigating a maze of menu options.
Now I'll continue with a bit of configuration consistency changes. If it fixes things, I'll report here.
The above commits result in a more consistent configuration. Now I see duplicate warnings like the following:
warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
Which leads me to believe that setting StreamDriverAuthMod="x509/name"
(rather than anon
) sends rsyslog
down a configuration validation path that checks for these config options. Though after reading the docs, I'm not sure if x509/certvalid
means client cert or server cert. Maybe that's the setting one would use on a receiving server to require client cert authentication. shrug
Also... now after a bit of periodic usage, I do find Logtail a bit sluggish from a UI perspective compared to Papertrail. The page refresh speed is a fraction of a second for Papertrail while it's seconds for Logtail. At this moment I do not plan on removing Papertrail until Logtail improves a few things.
Tried experimenting with setting TCP_Framing
to traditional
. This was only a hunch because I recall an experience where I worked on a platform team and something was sending logs to our fluentd
instance that couldn't handle octet-streaming
. I can't recall what was sending the logs but I do remember seeing the frame size via tcpdump
.
Anyways... experiment didn't resolve it as per these logs (I manually logged a message post configuration change):
Setting KeepAlive="on"
. Didn't resolve the issue.
Got a response to my e-mail from Filip at BetterStack. Happy that the feedback will be shared with the team.
At the moment, I cannot commit to switching to BetterStack Logtail. I'll continue to forward logs to both locations. Current blockers to turning off Papertrail are:
unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2302.0 try https://www.rsyslog.com/e/2078
omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
I'm guessing the gtls
issue might require chatting with the rsyslog
team. For now I can dissect https://github.com/rsyslog/rsyslog/issues/4447 further and learn how to setup the debugging so I can get the relevant logs around this issue. I'll leave this task open for the time being to track the learning and progress.
Logs are dropped when forwarding to BetterStack Logtail and ERR
logs report some form of TCP connection issue bubbling up from the TLS/TCP dependencies of rsyslog
.
I'm happy that I have a consistent way of replicating the error and supressing it in an artificial way. My Raspberry PI doesn't emit many logs so this analysis might not be relevant for most environments. Still, this error could come up in testing or in product comparisons. Given this fact, I emit logs with the following command running on the PI:
while sleep 31; do logger "slept 31s"; done
Logs received by Papertrail
Jul 01 12:54:30 raspberrypi ansible slept 31s
Jul 01 12:55:01 raspberrypi ansible slept 31s
Jul 01 12:55:01 raspberrypi rsyslogd unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:55:01 raspberrypi rsyslogd omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:55:01 raspberrypi rsyslogd action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
Jul 01 12:55:01 raspberrypi rsyslogd warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
Jul 01 12:55:01 raspberrypi rsyslogd warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
Jul 01 12:55:01 raspberrypi rsyslogd action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]
Jul 01 12:55:32 raspberrypi ansible slept 31s
Jul 01 12:56:03 raspberrypi ansible slept 31s
Jul 01 12:56:03 raspberrypi rsyslogd unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:56:03 raspberrypi rsyslogd omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
Jul 01 12:56:03 raspberrypi rsyslogd action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
Jul 01 12:56:03 raspberrypi rsyslogd warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
Jul 01 12:56:03 raspberrypi rsyslogd warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
Jul 01 12:56:03 raspberrypi rsyslogd action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]
Logs received by BetterStack Logtail
2023-07-01 12:55:01.340 [raspberry_pi] Syslog [NOTICE] slept 31s
2023-07-01 12:55:01.340 [raspberry_pi] Syslog [ERR] unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:55:01.340 [raspberry_pi] Syslog [ERR] omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:55:01.348 [raspberry_pi] Syslog [WARNING] action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
2023-07-01 12:55:01.400 [raspberry_pi] Syslog [WARNING] warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
2023-07-01 12:55:01.400 [raspberry_pi] Syslog [WARNING] warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
2023-07-01 12:55:01.552 [raspberry_pi] Syslog [INFO] action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]
2023-07-01 12:56:03.383 [raspberry_pi] Syslog [NOTICE] slept 31s
2023-07-01 12:56:03.383 [raspberry_pi] Syslog [ERR] unexpected GnuTLS error -53, wantsWriteData=1 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:56:03.383 [raspberry_pi] Syslog [ERR] omfwd: TCPSendBuf error -2078, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0 try https://www.rsyslog.com/e/2078 ]
2023-07-01 12:56:03.391 [raspberry_pi] Syslog [WARNING] action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2302.0 try https://www.rsyslog.com/e/2007 ]
2023-07-01 12:56:03.443 [raspberry_pi] Syslog [WARNING] warning: certificate file is not set [v8.2302.0 try https://www.rsyslog.com/e/2330 ]
2023-07-01 12:56:03.443 [raspberry_pi] Syslog [WARNING] warning: key file is not set [v8.2302.0 try https://www.rsyslog.com/e/2331 ]
2023-07-01 12:56:03.591 [raspberry_pi] Syslog [INFO] action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2302.0 try https://www.rsyslog.com/e/2359 ]
note The logs at 12:54:30
and 12:55:32
did not appear in BetterStack Logtail. What preceded (guessing it preceded) are ERR logs bubbling up from Rsyslog -> GnuTLS -> Glibc(?).
By changing the frequency of my log emitions I get the following results:
while sleep 29; do logger "slept 29s"; done
Logs received by Papertrail
Jul 01 13:06:35 raspberrypi ansible slept 29s
Jul 01 13:07:04 raspberrypi ansible slept 29s
Jul 01 13:07:33 raspberrypi ansible slept 29s
Jul 01 13:08:02 raspberrypi ansible slept 29s
Jul 01 13:08:32 raspberrypi ansible slept 29s
Logs received by BetterStack Logtail
2023-07-01 13:06:35.804 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:07:04.823 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:07:33.837 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:08:02.851 [raspberry_pi] Syslog [NOTICE] slept 29s
2023-07-01 13:08:31.874 [raspberry_pi] Syslog [NOTICE] slept 29s
ansible@raspberrypi:~ $ rsyslogd -v
rsyslogd 8.2302.0 (aka 2023.02) compiled with:
PLATFORM: aarch64-unknown-linux-gnu
PLATFORM (lsb_release -d):
FEATURE_REGEXP: Yes
GSSAPI Kerberos 5 support: Yes
FEATURE_DEBUG (debug build, slow code): No
32bit Atomic operations supported: Yes
64bit Atomic operations supported: Yes
memory allocator: system default
Runtime Instrumentation (slow code): No
uuid support: Yes
systemd support: Yes
Config file: /etc/rsyslog.conf
PID file: /run/rsyslogd.pid
Number of Bits in RainerScript integers: 64
See https://www.rsyslog.com for more information.
ansible@raspberrypi:~ $ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 11 (bullseye)
Release: 11
Codename: bullseye
ansible@raspberrypi:~ $ uname -a
Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
ansible@raspberrypi:~ $ dpkg -l | grep gnutls
ii libcurl3-gnutls:arm64 7.74.0-1.3+deb11u7 arm64 easy-to-use client-side URL transfer library (GnuTLS flavour)
ii libgnutls30:arm64 3.7.1-5+deb11u3 arm64 GNU TLS library - main runtime library
ii rsyslog-gnutls 8.2302.0-1~bpo11+1 arm64 TLS protocol support for rsyslog (GnuTLS)
ansible@raspberrypi:~ $ ldd --version
ldd (Debian GLIBC 2.31-13+rpt2+rpi1+deb11u5) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
The GnuTLS docs doesn't have much to add around: GnuTLS error -53
Searching for "TCPSendBuf+error+-2078"
in a few search engines results in rsyslog
issues such as the following:
These all seemed to signal towards more fundamental TCP/IP issues not TLS. I can't find TCPSendBuf error -2078
maps to in the glibc
codebase. I'm a little out of my deptch when trying to understand C. I found what I hoped to be a useful GNU libc error codes document, but it doesn't show what integers that codes match with.
Looking further, it looks like my version of glibc
is old. I have to give kudos to perror's stackoverflow answer on upgrading glibc
. It provided some background and yet more maintainer politics. I also appreciated their sense of humour with step 4 being Pray. Still... it looks like eglibc
stopped updates in 2017.
I was also curious as to what type of TLS connections were created for both:
Papertrail
ansible@raspberrypi:~ $ gnutls-cli -d 1 logs.papertrailapp.com -p <redacted>
Processed 129 CA certificate(s).
Resolving 'logs.papertrailapp.com:<redacted>'...
Connecting to '169.46.82.165:<redacted>'...
- Certificate type: X.509
- Got a certificate list of 4 certificates.
- Certificate[0] info:
- subject `CN=*.papertrailapp.com', issuer `CN=Sectigo RSA Domain Validation Secure Server CA,O=Sectigo L
imited,L=Salford,ST=Greater Manchester,C=GB', serial 0x00b468131d4e02b0e4f7640793347b94e3, RSA key 2048 bi
ts, signed using RSA-SHA256, activated `2022-08-25 00:00:00 UTC', expires `2023-08-25 23:59:59 UTC', pin-s
ha256="FYui1R7uT22TT2nmoniwAI5UNTaBXsuSKCcxUusX4kM="
Public Key ID:
sha1:d84d168a641513ec6f750a0f19fc5a109d770a36
sha256:158ba2d51eee4f6d934f69e6a278b0008e543536815ecb9228273152eb17e243
Public Key PIN:
pin-sha256:FYui1R7uT22TT2nmoniwAI5UNTaBXsuSKCcxUusX4kM=
- Certificate[1] info:
- subject `CN=Sectigo RSA Domain Validation Secure Server CA,O=Sectigo Limited,L=Salford,ST=Greater Manch
ester,C=GB', issuer `CN=USERTrust RSA Certification Authority,O=The USERTRUST Network,L=Jersey City,ST=New
Jersey,C=US', serial 0x7d5b5126b476ba11db74160bbc530da7, RSA key 2048 bits, signed using RSA-SHA384, acti
vated `2018-11-02 00:00:00 UTC', expires `2030-12-31 23:59:59 UTC', pin-sha256="4a6cPehI7OG6cuDZka5NDZ7FR8
a60d3auda+sKfg4Ng="
- Certificate[2] info:
- subject `CN=USERTrust RSA Certification Authority,O=The USERTRUST Network,L=Jersey City,ST=New Jersey,C
=US', issuer `CN=AAA Certificate Services,O=Comodo CA Limited,L=Salford,ST=Greater Manchester,C=GB', seria
l 0x3972443af922b751d7d36c10dd313595, RSA key 4096 bits, signed using RSA-SHA384, activated `2019-03-12 00
:00:00 UTC', expires `2028-12-31 23:59:59 UTC', pin-sha256="x4QzPSC810K5/cMjb05Qm4k3Bw5zBn4lTdO/nEW/Td4="
- Certificate[3] info:
- subject `CN=AAA Certificate Services,O=Comodo CA Limited,L=Salford,ST=Greater Manchester,C=GB', issuer
`CN=AAA Certificate Services,O=Comodo CA Limited,L=Salford,ST=Greater Manchester,C=GB', serial 0x01, RSA k
ey 2048 bits, signed using RSA-SHA1 (broken!), activated `2004-01-01 00:00:00 UTC', expires `2028-12-31 23
:59:59 UTC', pin-sha256="vRU+17BDT2iGsXvOi76E7TQMcTLXAqj0+jGPdW7L1vM="
- Status: The certificate is trusted.
- Description: (TLS1.2-X.509)-(ECDHE-X25519)-(RSA-PSS-RSAE-SHA256)-(AES-128-CBC)-(SHA1)
- Session ID: BF:EE:26:C5:6F:B8:94:E1:1F:A1:23:39:B7:89:F0:72:13:66:0D:73:DA:20:05:19:D5:4A:32:3A:DA:94:69
:98
- Options: extended master secret, safe renegotiation,
- Handshake was completed
- Simple Client Mode:
BetterStack Logtail
ansible@raspberrypi:~ $ gnutls-cli -d 1 in.logs.betterstack.com -p 6514
Processed 129 CA certificate(s).
Resolving 'in.logs.betterstack.com:6514'...
Connecting to '65.108.132.43:6514'...
- Certificate type: X.509
- Got a certificate list of 3 certificates.
- Certificate[0] info:
- subject `CN=*.logs.betterstack.com', issuer `CN=R3,O=Let's Encrypt,C=US', serial 0x03780622f2b6932da177
a04a000fc26234a4, EC/ECDSA key 256 bits, signed using RSA-SHA256, activated `2023-05-23 07:10:58 UTC', exp
ires `2023-08-21 07:10:57 UTC', pin-sha256="RYhEZU9LPm42uFSu/21kM2hFmawIOjFbG2xNGCT1rDQ="
Public Key ID:
sha1:7e03bca5cb61f21fd5bfc6e390b6e2068a2d0ee6
sha256:458844654f4b3e6e36b854aeff6d6433684599ac083a315b1b6c4d1824f5ac34
Public Key PIN:
pin-sha256:RYhEZU9LPm42uFSu/21kM2hFmawIOjFbG2xNGCT1rDQ=
- Certificate[1] info:
- subject `CN=R3,O=Let's Encrypt,C=US', issuer `CN=ISRG Root X1,O=Internet Security Research Group,C=US',
serial 0x00912b084acf0c18a753f6d62e25a75f5a, RSA key 2048 bits, signed using RSA-SHA256, activated `2020-
09-04 00:00:00 UTC', expires `2025-09-15 16:00:00 UTC', pin-sha256="jQJTbIh0grw0/1TkHSumWb+Fs0Ggogr621gT3P
vPKG0="
- Certificate[2] info:
- subject `CN=ISRG Root X1,O=Internet Security Research Group,C=US', issuer `CN=DST Root CA X3,O=Digital
Signature Trust Co.', serial 0x4001772137d4e942b8ee76aa3c640ab7, RSA key 4096 bits, signed using RSA-SHA25
6, activated `2021-01-20 19:14:03 UTC', expires `2024-09-30 18:14:03 UTC', pin-sha256="C5+lpZ7tcVwmwQIMcRt
PbsQtWLABXhQzejna0wHFr8M="
- Status: The certificate is trusted.
- Description: (TLS1.3-X.509)-(ECDHE-SECP256R1)-(ECDSA-SECP256R1-SHA256)-(AES-256-GCM)
- Session ID: DC:04:49:0B:7C:F7:59:37:60:B8:4F:93:E5:8E:67:D5:50:A0:7F:48:1F:3F:27:AC:0F:91:E1:2B:9D:25:98
:09
- Options:
- Handshake was completed
- Simple Client Mode:
The few things I noticed between these connections are as follow:
safe renegotiation
in the Options:
section (maybe this isn't allowed in TLS1.3?).Given the replication and remediation steps are around changing the length of time of absence of logs, I'm leaning towards investigating what sort of timeouts are implemented at BetterStack. Not sure of the right terminology, but something like read timeout
or idle timeout
seems about right.
rsyslog
, though I do want to post an issue there a reference this one to provide more context rich details that may help others.GnuTLS
is the issue, but there are experiments I can do to test this. If either of these fail, I would be more inclined to believe the issue is in TCP/IP connection handling.
ossl
(this https://github.com/rsyslog/rsyslog/issues/3915 issue show evidence)libc6
or try some tests with a container with a newer libc6
(more info)Raspian
will be based on bookworm
or how I can install a more up-to-date OS in general.Tried changing the TLD driver to ossl
as per Rainer's suggestion. Interesting to see the following error message. Looks like the peer name resolved with openssl
is different than with gnutls
Jul 01 14:57:45 raspberrypi rsyslogd nsd_ossl:TLS session terminated with remote syslog server: peer name not authorized, not permitted to talk to name: /CN=*.logtail.com; [v8.2302.0 try https://www.rsyslog.com/e/2088 ]
By updating the configuration to use StreamDriverPermittedPeers="*.logtail.com"
, the previous error stopped appearing.
As Rainer mentioned, ossl
does have a bit more descriptive messages.
Jul 01 15:02:04 raspberrypi rsyslogd SSL_ERROR_SYSCALL Error in 'Send': 'error:00000005:lib(0):func(0):DH lib(5)' with ret=-1, errno=32, sslapi='SSL_write' [v8.2302.0]
Jul 01 15:02:04 raspberrypi rsyslogd omfwd: TCPSendBuf error -1, destruct TCP Connection to in.logs.betterstack.com:6514 [v8.2302.0]
Jul 01 15:02:04 raspberrypi rsyslogd nsd_ossl: TLS session terminated successfully to remote syslog server 'in.logs.betterstack.com' with SSL Error '-1': End Session [v8.2302.0]
What I notice with this one is it mentions in.logs.betterstack.com
. I'll add that as a permitted peer and see if that changes things.
Unfortunately it did not change things, but did see errno=32
which means a broken pipe.
update the same 29s vs 31s replication and resolution steps apply to the ossl
TLS driver.
Looked at trying unencrypted TCP but the documentation shows UDP configuration for that. So can't really test the TCP connectivity in isolation of TLS.
Forgot that I already tried turning on TCP KeepAlive, but given that this only sends packets with out content body, it wasn't enough to prevent the connection from dropping. I'm starting to believe that some form of ingress device on the BetterStack side is hanging up.
I followed David Lang's suggestion in (https://github.com/rsyslog/rsyslog/issues/5173#issuecomment-1616161699) to use omprog
to emit a keep alive like log. You can see the logging results below:
Jul 02 13:53:10 raspberrypi ansible slept 31s
Jul 02 13:53:35 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:53:41 raspberrypi ansible slept 31s
Jul 02 13:54:06 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:54:12 raspberrypi ansible slept 31s
Jul 02 13:54:37 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:54:43 raspberrypi ansible slept 31s
Jul 02 13:55:08 raspberrypi root logalive.sh: haven't read a log in 25s, staying alive
Jul 02 13:55:18 raspberrypi ansible slept 15s
Jul 02 13:55:33 raspberrypi ansible slept 15s
Jul 02 13:55:48 raspberrypi ansible slept 15s
Jul 02 13:56:03 raspberrypi ansible slept 15s
Jul 02 13:56:18 raspberrypi ansible slept 15s
What we see is that the GnuTLS errors no longer occur when I'm doing the 31s sleep/log/loop. Instead we see a log saying that nothing has been logged for 25s. This appears to be sufficient to keep the connection alive. When I switch the loop to 15s, the keepalive observes that logs have been sent so it doesn't write a keepalive log.
Rather than doing the rest of the experiments, I'm going to wait to hear from BetterStack to confirm whether or not they have some idle timeout settings on their ingress.
The keepalive script and configuration will be in the next commit.
I'm back from my trip. When I looked at my logs is was fully of the omprog
logs. This seemed peculiar because I know that there should be more logs. Things like cron
, systemd
, grafana
, navidrome
etc... weren't showing up. I rebooted the PI and now they do show up! This is unrelated to this issue. I did some journald
and other service restarts while troubleshooting so I must have done something that broke the log funnel. This signals to me that I need to better understand how the logging flows work when services are being supervised by systemd
.
Future reading: https://askubuntu.com/questions/1190144/where-do-log-messages-go-with-journald-and-rsyslog
Still waiting to hear from BetterStack to confirm if there's some idle timeout configured on their end.
A short aside to describe my mental model on how systemd
service logs stopped appearing in /var/log/syslog
(and thus stopped appearing in Papertrail and BetterStack).
I'll start working backwards from rsyslog
The following input module configuration is how rsyslog
receives logs as inputs:
module(load="imuxsock") # provides support for local system logging
module(load="imklog") # provides kernel logging support
imuxsock
enables the pulling of logs from the /dev/log
socket (docs). This is the one relevant for the logs coming from the systemd
service units.
And here is the rule that says to write things to /var/log/syslog
#
# Log anything besides private authentication messages to a single log file
#
*.*;auth,authpriv.none -/var/log/syslog
My journald.conf
has the default of ForwardtoSyslog=yes
. I'm not 100% sure but I believe this enables the communication of logs to rsyslog
via /dev/log
which is symlinked to /run/systemd/journal/dev-log
(docs)
This "understanding logging in linux" stackoverflow question/answer is an excellent resource to piece this together.
My mental model of this system isn't very clear, but I do feel confident that /dev/log
is the bridge between this system and rsyslog
.
Since the services I am running log to stdout
, these kept picked up by the journald
service somehow. Guessing that systemd
provides a socket to ingest these.
I could bypass journald
though with unit config as follows (source):
StandardOutput=syslog
StandardError=syslog
At some point I did a reload or restart of the systemd-journald
service, and I believe that's when my service's logs weren't reaching rsyslog
. I'm not going to dissect this further since this isn't the main issue I'm dealing with, but hope this comment will come in handy in the future. The logger "foo"
worked because it bypassed the systemd
/journald
system as did a few of the other logs (e.g.: auth
logs).
Thanks @ohm for the rubber ducking!
Decided to remove the BetterStack logging. I don't like the workaround to output a line every 30s silence the errors. With the workaround, I have logs full of the keep alive messages like below:
vs less full logs but with ERR and reconnection messages occupying the space instead:
With Papertrail, I don't have either of these issues.
Currently, logs are being forwarded to Papertrail.
Recently, the application I use to monitor and alert when my websites go down announced a bit of a rebranding and a direction towards building a platform. I've been impressed with the experience with Better Uptime (now becoming Uptime as part of the Better Stack), and consolidating these tools makes sense to me.
Here's the current logging configuration.
As an exercise, I'd like to experiment with having logs going to both services. I believe this will provide some useful real-world experience and allow me to do a side-by-side review before doing the switch-over.
If logs cannot be forwarded with TLS, I won't do the migration.