troglobit / sysklogd

BSD syslog daemon with syslog()/syslogp() API replacement for Linux, RFC3164 + RFC5424
https://troglobit.com/sysklogd.html
Other
86 stars 19 forks source link

Upgrade from 2.2.2 to 2.2.3: First mark following daemon restart until after 30 minutes #37

Closed opty77 closed 11 months ago

opty77 commented 3 years ago

The title says it all. :-)

Cold boot today with 2.2.3 seems fine.

troglobit commented 3 years ago

I think you need to explain a bit more here, because I cannot reproduce this. The MARK messages don't come every 20 mins, they come after 20 mins of inactivity per log target. The first MARK after a restart comes after half the configured interval. So if anything is logged before that you'll get an extended delay before the first MARK is logged.

Also, is this really a high priority issue for you? I'm leaning towards won't fix, unless you can provide a reliable way to reproduce.

troglobit commented 2 years ago

Closing due to lack of response for almost a month.

opty77 commented 1 year ago

I should have found this earlier:

    /*
     * Set up timer callbacks for -- MARK -- et al
     */
    if (MarkInterval > 0) {
        int interval = MarkInterval / 2;

        if (interval < 30)
            interval = 30;
        timer_add(interval, domark, NULL);
    }

Maybe mention it in the syslogd(8) manual page?

troglobit commented 1 year ago

Would you mind spelling it out, what's missing from the man page? It really isn't obvious from your comment. What do you mean it is you have found? The -m interval command line option is documented here. As I mentioned before, the default is to log a MARK after 20 mins of inactivity.

Just to make sure, I double checked it with a Qemu image from the latest master of Buildroot, modifying the default args (which disables MARK with -m 0) to enable the default behavior:

Jun  6 13:42:17 buildroot kernel: random: crng init done
Jun  6 14:01:58 buildroot syslogd[89]: -- MARK --

Approximately 20 mins after the last activity, but nowhere near 30 minutes.

opty77 commented 1 year ago

Maybe I got it wrong after all? Let me show you the case I mentioned on IRC instead.

Configuration:

opty@vodopnik:~$ grep -vE '^(#|$)' tmp/backup/qeporkak/etc/syslog.conf
*.info;*.!warn;\
        authpriv.none;cron.none;mail.none;news.none     -/var/log/messages      ;RFC5424
*.warn;\
        authpriv.none;cron.none;mail.none;news.none     -/var/log/syslog        ;RFC5424
*.=debug                                                -/var/log/debug         ;RFC5424
authpriv.*                                              -/var/log/secure        ;RFC5424
cron.*                                                  -/var/log/cron          ;RFC5424
mail.*                                                  -/var/log/maillog       ;RFC5424
*.emerg                                                 *                       ;RFC5424
include /etc/syslog.d/*.conf
opty@vodopnik:~$ ls --full-time tmp/backup/qeporkak/var/log/maillog*
-rw-r----- 1 root root    0 2022-03-13 10:50:07.012753860 +0100 tmp/backup/qeporkak/var/log/maillog
-rw-r----- 1 root root 1188 2021-02-16 08:44:10.560000000 +0100 tmp/backup/qeporkak/var/log/maillog-20210216
-rw-r----- 1 root root 4586 2021-02-17 20:41:55.546567566 +0100 tmp/backup/qeporkak/var/log/maillog-20210221
-rw-r----- 1 root root  690 2021-05-03 12:39:02.624709587 +0200 tmp/backup/qeporkak/var/log/maillog-20210504
-rw-r----- 1 root root  690 2021-05-08 14:51:34.493007618 +0200 tmp/backup/qeporkak/var/log/maillog-20210509
-rw-r----- 1 root root  345 2021-06-19 17:25:57.920640757 +0200 tmp/backup/qeporkak/var/log/maillog-20210620
-rw-r----- 1 root root  402 2021-07-31 15:35:02.940822350 +0200 tmp/backup/qeporkak/var/log/maillog-20210801
-rw-r----- 1 root root  402 2021-12-05 13:47:17.464726328 +0100 tmp/backup/qeporkak/var/log/maillog-20211211
-rw-r----- 1 root root  399 2022-03-12 19:13:33.681645529 +0100 tmp/backup/qeporkak/var/log/maillog-20220313
opty@vodopnik:~$ ls --full-time tmp/backup/qeporkak/etc/syslog.d/*.conf
/bin/ls: cannot access 'tmp/backup/qeporkak/etc/syslog.d/*.conf': No such file or directory

messages-20230604:

2023-06-03T13:57:23.000000+02:00 qeporkak ntpd 6664 - - 0.0.0.0 c615 05 clock_sync
2023-06-03T14:28:57.674231+02:00 qeporkak syslogd 819 - - -- MARK --

syslog-20230604:

2023-06-03T13:57:23.000000+02:00 qeporkak ntpd 6664 - - ntpd: daemon failed to notify parent!
2023-06-03T16:53:10.000000+02:00 qeporkak elogind-daemon 1107 - - Received signal 15 [TERM]

debug-20230604:

2023-06-03T13:08:46.872892+02:00 qeporkak kernel - - -     BOOT_IMAGE=/system/boot/vmlinuz-generic-5.15.94
2023-06-04T09:48:17.001812+02:00 qeporkak kernel - - - e820: update [mem 0x00000000-0x00000fff] usable ==> reserved

secure-20230604:

2023-06-03T13:58:00.000000+02:00 qeporkak sudo 6671 - - pam_unix(sudo:session): session opened for user root(uid=0) by opty(uid=1000)
2023-06-03T14:32:37.000000+02:00 qeporkak sudo 6671 - - pam_unix(sudo:session): session closed for user root

cron-20230604:

2023-06-03T13:47:01.000000+02:00 qeporkak crond 1002 - - FILE /var/spool/cron/crontabs/root USER root PID 4367 /usr/bin/run-parts /etc/cron.hourly 1> /dev/null
2023-06-03T14:47:01.000000+02:00 qeporkak crond 1002 - - FILE /var/spool/cron/crontabs/root USER root PID 7745 /usr/bin/run-parts /etc/cron.hourly 1> /dev/null

Edit: But with 2.3.0 so I guess you will want me to try 2.5.0?

troglobit commented 1 year ago

What can I tell you, it works for me? What I need is a way to easily reproduce the problem in an environment where I can observe and attach a debugger to it.

Edit: But with 2.3.0 so I guess you will want me to try 2.5.0?

That would help, yes. But I've got to be honest with you, I've really tried to reproduce your issue and have not been able to in any of the environments/systems I've tried it on. So any more information/detail would be useful.

opty77 commented 1 year ago

I upgraded from 2.3.0 to 2.5.0, restarted the daemon and /var/log/messages shows:

2023-06-10T16:58:11.478377+02:00 qeporkak rc.syslog 17775 - - syslogd[17813] v2.5.0: restarted.
2023-06-10T17:26:44.441336+02:00 qeporkak smartd 1004 - - Device: /dev/sda [SAT], SMART Usage Attribute: 9 Power_On_Hours changed from 100 [Raw 2889] to 100 [Raw 2890]

I'd expect a -- MARK -- between those two messages. (No activity elsewhere.)

opty77 commented 1 year ago

Just to make sure, I double checked it with a Qemu image from the latest master of Buildroot, modifying the default args (which disables MARK with -m 0) to enable the default behavior:

Jun  6 13:42:17 buildroot kernel: random: crng init done
Jun  6 14:01:58 buildroot syslogd[89]: -- MARK --

I got an idea: Could you try again with ;RFC5424?

troglobit commented 1 year ago

Interesting, will do.

troglobit commented 1 year ago

Sorry, same result:

2023-06-13T07:01:23.462642+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T07:21:23.093715+00:00 buildroot syslogd 118 - - -- MARK --

Don't really know what to do here. Not that I want to point blame elsewhere, but are you sure this isn't caused by sleep cycles, or some misconfiguration of timers or scheduling in your kernel?

opty77 commented 1 year ago

What about the message before the first -- MARK --?

troglobit commented 1 year ago

You mean in my log? It's b0rked because I suspended my laptop to go to the office, but it looks like this:

Jun 13 06:07:18 buildroot syslogd[89]: exiting on signal 15
2023-06-13T06:07:18.080121+00:00 buildroot syslogd 118 - - syslogd v2.5.0: restart.
2023-06-13T06:27:18.211101+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T07:01:23.462642+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T07:21:23.093715+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T07:41:23.221621+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T08:01:23.345838+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T08:21:23.501651+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T08:41:23.645672+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T09:01:24.246295+00:00 buildroot syslogd 118 - - -- MARK --
2023-06-13T09:21:24.353886+00:00 buildroot syslogd 118 - - -- MARK --

Hence my question about sleep cycles above. Since time skips like that, e.g. sleep or ntp, are some of the things I can think of that can cause such delays. Other things are extreme system overload or misconfigured kernel, in order of most to least likely.

opty77 commented 1 year ago

I don't use sleep and NTP daemon stepped just a bit during boot:

2023-06-10T13:21:45.000000+02:00 qeporkak ntpd 914 - - 0.0.0.0 c61c 0c clock_step -0.534089 s
2023-06-10T13:21:44.000000+02:00 qeporkak ntpd 914 - - 0.0.0.0 c615 05 clock_sync

I can't recall any extreme system overload and I use distribution kernel.

I will try to do my own test(s) but the machine doesn't run continuously or unattended so it will take some time.

troglobit commented 1 year ago

That's fine, if you find anything of interest, let me know. If it's tangible and I can reproduce, then we can reopen the issue. OK?

opty77 commented 1 year ago

debug-syslogd basically contains:

/usr/sbin/syslogd -dsF -f $CONF -p $SOCK -C $CACHE -P $PID -K &

I should redirect the output to a file next time:

opty@qeporkak:~$ debug-syslogd
mkdir: created directory '/home/opty/tmp/debug-syslogd-mark/'
Starting.
cfline[*.* -/home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log ;RFC5424]
symbolic name: * ==> 255
leading char in action: /
filename: '/home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log'
RFC5424 format enabled
Created UNIX socket 5 ...
nslookup 'none:syslog'
Created inet socket 7 for (null):syslog ...
Created inet socket 8 for (null):syslog ...
FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FILE: /home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log       ;RFC5424
flog<46>: syslogd v2.5.0: restart.
logmsg: syslog.info<46>, flags 0, from qeporkak, app-name syslogd procid 6843 msgid nil sd nil msg syslogd v2.5.0: restart.
Called fprintlog_first(), fmt5424(46, 0x00, qeporkak, syslogd, 6843, -, -, syslogd v2.5.0: restart.) logging to FILE /home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log
syslogd: restarted.
Debugging disabled, SIGUSR1 to turn on debugging.
opty@qeporkak:~$ Message from UNIX socket #5: <13>Jun 16 10:02:00 opty: test
logmsg: user.notice<13>, flags 0, from qeporkak, app-name opty procid nil msgid nil sd nil msg test
Called fprintlog_first(), fmt5424(13, 0x00, qeporkak, opty, -, -, -, test) logging to FILE /home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log
flog<198>: -- MARK --
logmsg: mark.info<198>, flags 8, from qeporkak, app-name syslogd procid 6843 msgid nil sd nil msg -- MARK --
flog<198>: -- MARK --
logmsg: mark.info<198>, flags 8, from qeporkak, app-name syslogd procid 6843 msgid nil sd nil msg -- MARK --
Called fprintlog_first(), fmt5424(198, 0x08, qeporkak, syslogd, 6843, -, -, -- MARK --) logging to FILE /home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log
flog<198>: -- MARK --
logmsg: mark.info<198>, flags 8, from qeporkak, app-name syslogd procid 6843 msgid nil sd nil msg -- MARK --
Called fprintlog_first(), fmt5424(198, 0x08, qeporkak, syslogd, 6843, -, -, -- MARK --) logging to FILE /home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log
syslogd: exiting on signal 15
flog<46>: exiting on signal 15
logmsg: syslog.info<46>, flags 0, from qeporkak, app-name syslogd procid 6843 msgid nil sd nil msg exiting on signal 15
Called fprintlog_first(), fmt5424(46, 0x00, qeporkak, syslogd, 6843, -, -, exiting on signal 15) logging to FILE /home/opty/tmp/debug-syslogd-mark/debug-syslogd-mark.log
Closing socket 5 ...
Closing socket 7 ...
Closing socket 8 ...

debug-syslogd-mark.log:

2023-06-16T09:57:00.947346+02:00 qeporkak syslogd 6843 - - syslogd v2.5.0: restart.
2023-06-16T10:02:00.401681+02:00 qeporkak opty - - - test
2023-06-16T10:27:00.973801+02:00 qeporkak syslogd 6843 - - -- MARK --
2023-06-16T10:47:00.988426+02:00 qeporkak syslogd 6843 - - -- MARK --
2023-06-16T10:49:39.922740+02:00 qeporkak syslogd 6843 - - exiting on signal 15
troglobit commented 1 year ago

OK, reproduced:

2023-06-16T17:29:06.845843+02:00 luthien syslogd 1591703 - - syslogd v2.4.4: restart.
2023-06-16T17:29:55.636916+02:00 luthien jocke - - - hej
2023-06-16T17:59:06.858416+02:00 luthien syslogd 1591703 - - -- MARK --
2023-06-16T18:05:44.071503+02:00 luthien syslogd 1591703 - - exiting on signal 2
2023-06-16T18:06:39.481698+02:00 luthien syslogd 1597307 - - syslogd v2.5.0: restart.
2023-06-16T18:07:13.102275+02:00 luthien jocke - - - hej
2023-06-16T18:36:39.493393+02:00 luthien syslogd 1597307 - - -- MARK --
2023-06-16T18:56:39.501102+02:00 luthien syslogd 1597307 - - -- MARK --
2023-06-16T19:16:39.510271+02:00 luthien syslogd 1597307 - - -- MARK --
2023-06-16T19:36:39.521658+02:00 luthien syslogd 1597307 - - -- MARK --
2023-06-16T19:56:39.531555+02:00 luthien syslogd 1597307 - - -- MARK --
2023-06-16T20:16:39.541021+02:00 luthien syslogd 1597307 - - -- MARK --
opty77 commented 1 year ago

32-bit Slackware 15.0 (another machine):

2023-06-20T12:33:00.878381+02:00 vodopnik syslogd 17236 - - syslogd v2.5.0: restart.
2023-06-20T12:43:00.888081+02:00 vodopnik syslogd 17236 - - -- MARK --
2023-06-20T13:03:00.907393+02:00 vodopnik syslogd 17236 - - -- MARK --
2023-06-20T13:23:00.924910+02:00 vodopnik syslogd 17236 - - -- MARK --
2023-06-20T13:23:39.769060+02:00 vodopnik syslogd 17236 - - exiting on signal 15
troglobit commented 11 months ago

Having looked into this now, the MARK function uses a global timer (as you might have seen already), which runs with the resolution of half the configured mark interval. E.g, if -m10 then sysklogd wakes up every 5 mins to check if there's been any activity in a log file within the last 10 min period. That's why it cannot log on exact 10 min intervals for each log file.

This also explains the 30 mins before initial MARK you've reported. It took 1.5 intervals to detect inactivity.

The reason for this implementation, I'm guessing, is legacy. To keep down the overhead and system wakeups needed for an admittedly low-priority function.

There are two ways to improve upon this:

  1. redesign to have a timer per logfile
  2. increase the resolution of the global timer

I'm leaning towards the second option since there is very little payoff in the added complexity of maintaining per-logfile timers. By increasing the resolution, e.g., by always running the timer at mininmal-mark-interval, i.e., 30 sec. we'd still be OK resource-wise and possibly also meet the expectations of this bug report.

Do you agree this would be sufficient?

opty77 commented 11 months ago

Yes, it seems like a good compromise at the moment.

troglobit commented 11 months ago

Should work better now.

Jul 26 07:21:41 anarchy syslogd[419]: exiting on signal 15
Jul 26 07:21:41 anarchy syslogd[481]: syslogd v2.5.1-dev: restart.
Jul 26 07:41:41 anarchy syslogd[481]: -- MARK --