logging: add comment about use of unconnected sockets

dfreese commented 1 year ago

This references some earlier discussion as to why connect is not used when setting up the socket used to write to the systemd journal. Looking at error reports, it seems as though most of the restarts to journald were due to watchdog timeouts when calling fsync under heavy I/O load. This appears to be fixed in v230, however, the same rationale still holds; defensively specify the address each time instead of relying on journald to stay running.

See: https://www.suse.com/support/kb/doc/?id=000019921

dfreese commented 1 year ago

Looks like connected_to_journal returns false on CI despite the socket being available. I've gone ahead and removed that check (and the nested Option it implied) for now.

swsnr commented 1 year ago

Are you looking for a Log implementation for the journal? In this case I've written one already: https://github.com/swsnr/systemd-journal-logger.rs

swsnr commented 1 year ago

@dfreese By your commit and your comment, I think you misunderstood what connected_to_journal does. You can't use this function to check that journald runs on the current system. To check that you'd just check that the socket exists, and–if you'd like to be extra sure–send an empty message over the socket (journald explicitly ignores this, so an empty message is the way to check that journald's listening on the other side).

Instead connected_to_journal checks whether the current process can safely upgrade its logging from unstructured text to stderr to structured messages for journald. A process can only do this of course, if its logging output ends up in the journal anyway, i.e. if the stderr of the process directly writes to the systemd journal (instead of, say, a tty or a log file), and that's what connected_to_journal checks. It's normally only ever true for processes directly spawned by systemd (or indirectly if they inherit their parents stderr).

That's a very specific use case, and shouldn't be done in general, because there are valid reasons to write to the journal in other circumstances as well: For instance, if a configuration file explicitly configures logging to journal, a process would not check connected_to_journal but would log to journald anyway.

Hence, adding a connected_to_journal to your new struct would have been quite wrong.

Edit: You can see how this works in our corresponding test. It explicitly starts the test executable under systemd-run to test the function.

lucab commented 1 year ago

I basically agree with @swsnr's review above, I wouldn't land this change.

This mostly touches on the same topic of https://github.com/lucab/libsystemd-rs/issues/37. I'm still of the idea that a log interface should be built on top of this, not directly baked in. This is what @swsnr did, but we should probably tweak the docs to advertise this a bit more prominently.

swsnr commented 1 year ago

However, one of the problems that I have with the current API is that no potential errors are checked until the first call to journal_send, which I feel like is a problem.

I understand the idea but as said in my previous comment, I don't think these changes check any relevant errors upfront. As said, I believe all you'll ever see with your changes is a process having exhausted its file descriptors, but you'll likely not see this as early in a process as logging setup (which is likely the very first thing you'll do in main()), and then I'd say it's the applications fault for not having raised its FD soft limits early on.

Any "real" error (i.e. journald not running) will still not surface before the first call to journal_send, because even with these changes you still create an unbound socket.

And the unbound socket is by design to allow application logging survive journald crashes, and it's also how upstream does it, so binding early to see if journald is running is not an option, in my opinion.

You could just check that the socket exists early on, e.g. by sending an empty payload (that's what tracing-journald does), but I don't think it's worth a major refactoring, because it doesn't guarantee that journald is still running by the time you try to send something 🤷🏿

We already discussed to wrap the socket in a struct in https://github.com/lucab/libsystemd-rs/issues/87, and didn't see a convincing use case back then. While I see that your change is kinda the "natural" way to design an API around the journald socket, I also believe that it's an API without a use case 🤷🏿

dfreese commented 1 year ago

@swsnr thanks for the context. I didn't realize that was the reason an unconnected socket was used. It does seem like something is being pessimized by not using a reconnect-on-failure approach, but it seems like most implementations disagree with me, and I haven't measured that.

Since it seems votes are typically against wrapping the socket in a struct I've updated this to just add a comment providing the context you provided here.

lucab / libsystemd-rs

logging: add comment about use of unconnected sockets #138