tokio-rs / tracing

Application level tracing for Rust.
https://tracing.rs
MIT License
5.53k stars 725 forks source link

`NonBlocking` doesn't flush writer if buffer never empties #1931

Open lilyball opened 2 years ago

lilyball commented 2 years ago

Bug Report

Version

current master of tracing-appender, via source inspection

Crates

tracing-appender

Description

tracing_appender::non_blocking::NonBlocking does not flush its inner writer until the message buffer is empty. This means if I'm logging messages fast enough I can keep it from flushing for arbitrarily long. Depending on what the underlying writer is, this may cause it to buffer too much, leading to delays in log ingestion (for anything tailing the log) or data loss if the process aborts or is killed.

Besides that, the arbitrarily delayed flush can make it harder to implement useful logic in smart writers such as rolling logs. If I'm rolling a log, I don't want to do so in the middle of a log record, so I need a way to identify when it's safe to roll. Just looking for newlines isn't enough as the log record may include embedded newlines. Right now it looks like I can key off of write_all() to signal that an entire record is being written, but this is an implementation detail of NonBlocking and not something that can reasonably be relied upon indefinitely.

Looking at tracing-subscriber right now it also appears to use write_all() with no flush(), so relying on flush would cause issues if my writer is not used with NonBlocking, but that's a separate issue and it's less obvious what the solution there is (since it's synchronous, so flushing would potentially be a performance issue). But for the moment I just care about using something with NonBlocking.

There's two potential solutions here:

  1. When there are buffered messages, flush after N records. I don't know what N should be though.
  2. Document that NonBlocking uses io::Write::write_all() to write each record (and write a test to enforce this).

The latter is probably better in that it gives my underlying writer the ability to be smarter about buffering (e.g. my writer can then internally flush after N records if it wants, or however it decides to handle this), and in general it allows my writer to recognize log records. The downside is just that it locks NonBlocking in to having the full buffer ready prior to each write.

lilyball commented 2 years ago

I suppose this also relies on tracing-subscriber's behavior, in that NonBlocking needs to identify an incoming record. Right now it just packages up every call to write() (or write_all()) as a message for the worker. Given tracing_subscriber::fmt::Subscriber's current behavior that works, but tracing_subscriber would need to make similar guarantees for this to be effective. Either that or NonBlocking would need to use make_writer() to identify each record, as tracing_subscriber appears to call that once per record (I haven't dug deeply into this part so I'm not sure if there's any scenario in which it reuses a writer for multiple records).

Edit: MakeWriter::make_writer() has an implementor's note that says it's called once per event, but this is framed as a "so don't lose state" rather than a guarantee of future behavior. If NonBlocking wants to rely on that for framing purposes then tracing-subscriber would need to guarantee that make_writer() is invoked once per event (and dropped immediately afterwards) as an API contract rather than just a caution.

hawkw commented 2 years ago

I think it's quite reasonable to make the make_writer() call a guaranteed boundary for the log record. It certainly has that semantics now, and I think we can safely make that the documented semantics of the API. That feels like a nicer solution than putting special behavior in write_all() that isn't present in write(), IMO.

lilyball commented 2 years ago

I like guaranteeing that, though it has a downside which is that write_all() provides a slice of data which at the moment comprises the entire record, whereas with make_writer() the record termination occurs outside of the lifetime of that slice.

For most writers this shouldn't matter, but NonBlocking wraps a Write rather than a MakeWriter and so it cannot currently propagate the same record framing to its inner Write. It could still guarantee that it uses write_all(), but this does mean it would have to buffer the data internally until the record termination before it can call that on the inner Write.

It could also synthesize some other indicator, such as having some sort of thread-local wrapping record counter that is incremented on each record termination. This way something like RollingFileAppender could consult that to figure out when it's being asked to start a new record (see #1932 for why this matters). But this approach, besides being awkward, also means the writer doesn't have any callback when the record ends, which means something like a size-based log roller would not be able to roll after writing the record but would have to wait until the next record begins (which means delaying writes instead of potentially rolling while the thread is otherwise idle). NonBlocking could (in a breaking API change) switch to wrapping MakeWriter, but that makes it harder to use arbitrary Writes[^1]

[^1]: They'd need to be wrapped in something. Right now the simplest wrapper is probably Arc<T>, but that moves the writer to the heap. A new wrapper could be made that just returns &'a W as its writer but that's still API friction in that arbitrary writers that aren't already MakeWriter won't work as-is with NonBlocking. Though from the perspective of "I want to take a writer and make it async", this requirement would actually mean that wrapping your writer in NonBlocking doesn't change semantics, so maybe that's a good thing.

lilyball commented 2 years ago

I'm going to file an issue about NonBlocking being changed to take a MakeWriter because the more I think about it, the more I think it's valuable to say "wrapping your writer in NonBlocking doesn't change semantics of how it works, it just makes it async". And this would solve the problem of propagating record termination indicators to the worker thread.