mbr opened this issue 6 years ago
I think these plans are fine but I can't help digressing just a bit...
Let me just reiterate my position that I think logging within a library (not including the library's tests!) should be done sparingly, if at all. I realize my stance seems a bit severe (though I'm not alone). The thing is that we library authors tend to be lazy. When some non-critical event occurs, our first instinct is simply to throw it somewhere so that we don't have to think about it or deal with it. Logging becomes a catch-all grab bag for these kinds of events.
It's much more important to minimize the amount of noise we generate and instead seriously consider the implications of the various exceptions that we would otherwise log, ideally doing something intelligent about them (handling them or redesigning something). Only if that is absolutely not possible should we pass them back to the caller to be handled (as we do with `FaultLog`). Any messages other than that should be eliminated.
I'm not saying let's remove logging from hbbft tomorrow but I am saying let's at least start considering moving in that direction. As we begin integration into Parity it's going to become more and more important that we reduce or eliminate our 'environmental impact'.
All that being said @mbr, logging properly within tests is important, so please don't interpret my post as a criticism of what you're planning to do. The only thing I'd say there is to keep it as simple as possible for others to write tests using whichever logging framework you decide on (though this is probably an argument against `slog`). The caching plan is great and I fully support that.
A path to overcoming our (mine in particular) dependence on excessive logging might be to embrace `proptest` regressions. The problem that logs easily "solve" (modulo your ability to construct succinct regexps) is the availability of intermediate states. It is easy to build tests from the outputs of `DistAlgorithm::handle_input` and `DistAlgorithm::handle_message`. However, if some internal bug is not reflected in that output, such a test would not find any useful regression. Tests don't have access to the `Step` while it is being constructed and cannot find bugs in that process of construction, which is the bulk of the inner workings of a `DistAlgorithm`: `step.extend` or `step.extend_with`. I still think regressions are applicable in this wider setting, but in order to construct useful regressions, we need to leave an execution trace in the functions that mutate steps. When those functions return their output and the trace to the test, the test can treat the trace as a kind of output and perform regressions on the trace as well.
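To make the idea concrete, a minimal sketch; the `Step` below is a toy stand-in for hbbft's real type, and `extend_traced` is a hypothetical helper, not existing API:

```rust
/// Toy stand-in for hbbft's `Step`; the real type is far more involved.
#[derive(Debug, Default)]
struct Step {
    output: Vec<String>,
    messages: Vec<String>,
}

impl Step {
    /// Merge another step into this one, as `step.extend` does in hbbft.
    fn extend(&mut self, other: Step) {
        self.output.extend(other.output);
        self.messages.extend(other.messages);
    }
}

/// One entry of the execution trace: which phase ran and a snapshot of the
/// step right after the mutation.
#[derive(Debug)]
struct TraceEntry {
    phase: &'static str,
    snapshot: String,
}

/// Hypothetical wrapper: mutate the step *and* record the intermediate state.
/// A test (e.g. a proptest regression) receives the trace alongside the final
/// output and can assert on how the step was constructed, not just the result.
fn extend_traced(step: &mut Step, other: Step, trace: &mut Vec<TraceEntry>) {
    step.extend(other);
    trace.push(TraceEntry {
        phase: "extend",
        snapshot: format!("{:?}", step),
    });
}
```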
That's not a bad idea. It may also be wise to consider redesigning things into small, independently testable components as much as possible, and to write tests that properly isolate those components (using test ~~harnesses~~ beds, etc.). Small tests could then determine more precisely and accurately if and where a regression has occurred.
In other words, if there is some potentially fragile process happening within a method or component, figuring out a way to break it down and expose that process through a more composable design is probably a design principle worth applying as much as possible.
When running tests to track down bugs, or in production, it would be very beneficial to collect traces, networking logs or simple status messages for analysis. Currently, a mix of logging techniques is employed: for example, new-style tests mainly use `println!` debugging and manual log output, while logging macros like `debug!` are already used by the crate itself.
## Unifying test logs

The `VirtualNet` class currently writes a copy of every networking message into a text file if `trace()` is enabled, aiding in the search for bugs in case of failure. Additional information, such as the random seed used to instantiate the test (for reproducibility) or the internal state of the HBBFT algorithm family being tested, is something we currently want to be able to record as well.

If we are to add additional logging, it should all be unified in one facility. Getting rid of the custom packet dump and instead logging each packet through the same facilities that status messages use will hopefully reduce the amount of custom implementation. To retain all the functionality currently in use, the resulting logfile should be easily greppable for the networking logs. As an added benefit, it would be possible to see a packet's path side-by-side with the node's information.
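As a rough sketch of that unified facility, assuming a `slog`-style logger (the function name, field names and tag here are made up):

```rust
use slog::trace;

/// Hypothetical replacement for the custom packet dump: every packet goes
/// through the same logger as status messages, tagged so that
/// `grep '\[PACKET\]'` recovers the old network-log view.
fn log_packet(net_log: &slog::Logger, from: usize, to: usize, payload: &str) {
    trace!(net_log, "[PACKET]"; "from" => from, "to" => to, "payload" => payload);
}
```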
## Complexity (println/log/slog)

Three approaches are available, which shall be named `println`, `log` and `slog`:
- **println** is the most common way of outputting "logs" in tests, using the `println!` and `eprintln!` macros. The advantage is sheer simplicity and the good integration into the unit testing framework that ships with Rust; by default, each test collects all output to stdout and stderr and only displays it if the test failed.
- **log** uses the `log` facade and a suitable logger; typically `env_logger` is employed in applications, but file-based logging solutions are also available. The crate uses global macros to record messages, e.g. `info!("This is a sample log message")`, and most implementations require a small setup function to be called for initialization.
- **slog** stands for "structured logging", using the `slog` crate. It mainly adds scope and structure to the model of `log`, resulting in different macros with the same naming convention being called: `info!(some_log, "Something happened in node {}", self.id; "node_id" => self.id)`. Here, `some_log` is a specific logger instance (as opposed to the `info!` macro of `log`, which knows no such context) and `node_id` is available as structured data in the logs.
Complexity increases with each: `println!` does not require any external crates or setup. The `log` crate requires one additional crate dependency for the library and another for the implementation using it, if it wants to process log messages in any way, as well as a setup function. `slog` is the most complex beast, typically requiring not only multiple additional crates and a setup function, but also code-level changes, since logger instances should be passed around to create context (as opposed to the `log` variant, which has a single global logger "instance"). A workaround exists in the form of `slog_scope`, but it is not advised for use in libraries.
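To make the comparison concrete, here is a minimal sketch of the same message in each of the three styles; the crate versions, function names and log contents are illustrative assumptions, not existing hbbft code:

```rust
// Assumed dev-dependencies: log = "0.4", env_logger = "0.5",
// slog = "2", slog-term = "2", slog-async = "2".

fn println_style(node_id: usize) {
    // No setup; the test harness captures stdout and shows it only on failure.
    println!("[node {}] delivered a batch", node_id);
}

fn log_style(node_id: usize) {
    use log::info;
    // Global one-time setup; ignore the error if another test already ran it.
    let _ = env_logger::try_init();
    info!("[node {}] delivered a batch", node_id);
}

fn slog_style(node_id: usize) {
    use slog::{info, o, Drain};
    // The usual slog pipeline: decorator -> formatter -> async -> root logger.
    let decorator = slog_term::TermDecorator::new().build();
    let drain = slog_term::FullFormat::new(decorator).build().fuse();
    let drain = slog_async::Async::new(drain).build().fuse();
    let root = slog::Logger::root(drain, o!());
    // Context is attached to a child logger instead of to each message:
    let node_log = root.new(o!("node_id" => node_id));
    info!(node_log, "delivered a batch");
}
```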
## Performance considerations

Right now, one successful `net_dynamic_honey_badger` test produces between 0.5 and 6 megabytes of network log output over a runtime of 0.5 to 2 minutes. This does not include log messages from the `dist_algorithm` itself, so an upper bound of 5 megabytes per minute is probably optimistic.

The change to mocktography will yield an up-to-200x speed-up, turning the log output into a potential 1 GB per core, i.e. roughly 5 MB per minute times 200 (it will finally be possible to run tests multiple times with different input parameters). Additionally, running on multiple threads (the average hardware thread count across all developers on the project is more than 20 threads per person) can multiply this figure.
All of the logging solutions are somewhat lightweight, although due to the low CPU complexity of mocktography the relative overhead of logging (or even `VirtualNet`) might be quite high, even with otherwise fast solutions. In the case of `slog`, some parts will need to be restructured to pass in, or at least cache, `Logger` instances, as creation otherwise involves multiple `.clone()`s.
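For illustration, one way that caching could look; the `Node` struct here is hypothetical, not hbbft's actual layout:

```rust
/// Hypothetical node struct that caches its child logger once at construction,
/// instead of deriving it (which clones the context) for every message.
struct Node {
    id: usize,
    log: slog::Logger,
}

impl Node {
    fn new(root: &slog::Logger, id: usize) -> Self {
        // `root.new` clones the parent's context once; afterwards, logging
        // through `self.log` carries the node ID without further clones.
        Node { id, log: root.new(slog::o!("node_id" => id)) }
    }

    fn on_message(&self) {
        slog::debug!(self.log, "node {} handling a message", self.id);
    }
}
```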
## Output format

While the output format can easily be changed, we have two options with `slog`: either outputting a traditional text-based format or outputting structured information. The latter might be handy if there are specialized tools to navigate/filter the output (suggestions welcome), but with JSON we are likely to lose readability and grep'ability, so "traditional" should be the way to go for now.
## Possibilities for optimization

Implementing custom drains should be simple and allow for some optimization. One core idea is using a "spillover" or spooled log drain, similar to Python's `SpooledTemporaryFile`. With a buffer size of maybe 500 to 1000 megabytes, this would put our RAM requirements at roughly 1 GB per core, but could eliminate almost all disk writes if only failed tests' logs are kept. Ideally, a very large `BufWriter` would suffice.

Data could also be compressed transparently with a fast algorithm like gzip through a simple stream decorator, likely saving a lot of space (compression on the current networking logs saves almost 97%). Compressing in memory as well would drastically reduce memory requirements.
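A minimal sketch of the spillover idea as a custom `slog` drain; disk spilling on overflow is omitted and the names are illustrative:

```rust
use std::io::{self, Write};
use std::sync::Mutex;

/// Keeps formatted log lines in RAM; nothing touches the disk unless the
/// test fails and `dump_to` is called.
struct SpooledDrain {
    buf: Mutex<Vec<u8>>,
}

impl slog::Drain for SpooledDrain {
    type Ok = ();
    type Err = io::Error;

    fn log(&self, record: &slog::Record, _values: &slog::OwnedKVList) -> io::Result<()> {
        let mut buf = self.buf.lock().unwrap();
        writeln!(buf, "{} {}", record.level(), record.msg())
    }
}

impl SpooledDrain {
    /// Called only for failed tests: spill the whole buffer to a file.
    fn dump_to(&self, path: &str) -> io::Result<()> {
        std::fs::write(path, &*self.buf.lock().unwrap())
    }
}
```

Wrapping the buffer in something like `flate2::write::GzEncoder` would additionally give the transparent in-memory compression mentioned above.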
## Log rotation
Log rotation, that is, filling up a log file until a certain size is reached, then renaming it and continuing in another file, is a method to limit disk usage, since older logs can periodically be removed.

It is likely not a good fit for us here. For one, we almost always want to look at log messages from a specific test only, so one file per test is much more reasonable. Rotation then only prevents an overflow, and a hard size limit would suffice as well. Rotation is much more beneficial when the logs for all tests are collected in a central place, but this would result in interleaved log files that quickly become hard to handle (tests generate upwards of a million entries).

Logs would therefore be removed automatically unless a test failed. Since compressed logs are a few megabytes each, I think it's reasonable to expect the user to delete them manually.
## Requirements & the plan
The following features are desirable (if there is something missing here, feel free to chime in):

- […] `VirtualNet` through tracing.
- […] `hbbft`, with context (e.g. node ID).
- […] `slog`.

For now, I would suggest implementing everything using a `slog`-based implementation, realizing the features as follows:

- […] `trace!` instead.
- […] `slog` support, mainly passing loggers around. This would make 4. obsolete and can be done earlier to save the implementation of the workaround, but it's also probably the most complicated step in the whole process.