Open arkpar opened 4 years ago
I'm not a devops so this is maybe a stupid question, but isn't it possible to pipe stdout/stderr to some utility that does the log rotation?
AFAIK that is already done.
AFAIK that is already done.
It is done rather inefficiently. By copying gigabytes of data and losing some data in the process. logrotate
copies all of the file and calls ftruncate
on it. See copytruncate
description here Simply split
ting the output would be more efficient indeed, but that would require using custom log file management, instead of relying on logrotate
.
logrotate
is quite standard in linux operating systems and modern service managers like systemd
also connect to stdout
resp stderr
to pipe it into their journal. Even further container based deployments also rely on log output to stdout/err
. The issue here that i see is that tracing produces too much data for that to handle.
The most easy solution would probably be to scale up the disk of the hosts (and use smaller log chunks to rotate) and be more specific in trace logging. Apart from that if one wants to deal with that on a host basis i suggest to make the logfile management at least optional, in larger setups it's probably easier to directly send the data off the host via rsyslog
or something.
The issue here that i see is that tracing produces too much data for that to handle.
No it is not. I've already described the actual issue twice here. Namely copying the log data needlessly, but more importantly losing log data. logrotate does not account for new log writes while the file is being copied. This is documented in man page that I've linked earlier. And it is already interfering with out ability to fix bugs. The important logs to fix https://github.com/paritytech/stakingops-issues/issues/15 or https://github.com/paritytech/stakingops-issues/issues/1 are missing exactly because of that.
I assume the is a correlation between the total amount of logs and the amount of logs being lost. my argument was refering to that systemd's journald cannot cope with the amount of data.
would it be possible for the client to respond to a signal e.g. SIGUSR1
for closing and reopening stdout/stderr
? not sure how that would pass through systemd here though.
I assume the is a correlation between the total amount of logs and the amount of logs being lost. my argument was refering to that systemd's journald cannot cope with the amount of data.
Sure, if you produce less logs you'll lose less of them. With zero logs there won't be any losses at all. But we do need these log to fight bugs.
would it be possible for the client to respond to a signal e.g.
SIGUSR1
for closing and reopeningstdout/stderr
? not sure how that would pass through systemd here though.
That's what I suggested in the issue description. Typically SIGHUP
is used for ths.
Would be best to use a logger that supports rotation out of the box, or at least allows reopening output file on SIGHUP
That's what I suggested in the issue description. Typically SIGHUP is used for ths.
if output file can also still be stdout/-err
then yes :)
I'm not sure I follow here: flexi_logger does support log rotation out of the box. Would the ideal setup be:
Note that this does not do anything special with SIGHUP / SIGUSRX. Also note that we could also add a feature with duplicate_to_stderr where the user could have logs written to stderr and to a log with rotation :)
edit: I see it's already been done here https://github.com/paritytech/substrate/blob/ffcce8538f08ad64656ac5eb8da783dce1a79eae/client/cli/src/lib.rs#L267-L270
cc @gabreal @arkpar
Reasigning to @pscott because he already started.
- Check if output is atty: 1.1 If it is, then do not log to a file, basically do nothing. 1.2 If it is NOT, then setup a log rotation (with what size limit / age limit?)
Looks about right. I'd also add a CLI option for configuring size/age limit. Unlimited by default.
So I've made some progress on this. I think the only way we can actually do log rotation with flexi_logger is if we specify from the beginning that our target is a file (i.e. not something like substrate 2>my_file
.
What I'm thinking here is that we should have --log-directory PATH
and one of (--log-size U64
/ --log-age U64
) flags. Last but not least I would add the --log-duplicate BOOL
flag, for the user to choose rather he'd like the log to be duplicated to stderr.
Does that sound good @arkpar ?
This avoid doing things like checking for isatty
and trying to interpret what the user actually wants. Also makes logging much clearer and simpler imo :)
In that case --log-file
should capture stderr indeed. We already have duplicating of INFO and WARN messages to stdout when stderr is redirected to a file. So that user could see default info in the console while extended log is written to a file. Would be ideal to preserve that.
So to make sure we're on the same page here: --log-duplicate
should be set to default to INFO level?
We'd have this UX:
--log-file
provided -> all logs are written to stderr--log-file
provided -> logs are written to file, and logs >= INFO level are written to stderrI would personally have added the --log-duplicate
, making the logging to stderr an opt-in rather than default.
PS: to work with flexi_logger, I had to remove the piece of code that automatically duplicated the INFO and WARM messages to stderr (this was located in the format function passed in to the log builder). I can still add it back with a simple call to .duplicate()
, so it's just a matter of how we want it to behave :)
More like this:
--log-file
provided -> all logs are written to stderr, INFO messages are duplicated to stdout but only if stderr is a file.--log-file provided
-> logs are written to file, and logs >= INFO level are duplicated to stdout. stderr is emptyBasically --log-file
would simply redirect stderr to a file preserving existing behaviour and making stderr empty. --log-duplicate
would duplicate all messages to stderr. But again, INFO messages should be additionally copied to stdout if stderr is not attached to a console.
Should be possible once https://github.com/emabee/flexi_logger/pull/51 gets merged :)
Reopening after #6627
So there were a couple of problems with #6564 :
init_logger()
was modified to error when it failed, instead of warning. This resulted in a big API break, which is not desirable..replace('-', '_')
on the filter we wish to add. My bad on this one!flexi_logger
uses a similar syntax to env_logger
, specific targets cannot be targeted. For example, using env_logger
, one can add a filter that says my_target=trace
, and have logs like info!(target: "my_target")
be properly logged. Unfortunately, this is not possible (yet) with flexi_logger (from what I know). This is a big deal, as this means that there is no way to run for example substrate --log sub-libp2p=trace
.Last but not least:
LogRotationOpt
in an Option
. We talked about it with @cecton and think LogRotationOpt should have public fields in order for it to be easy to construct without necessarily requiring a cli :) (no other constructor are implemented by LogRotationOpt
).So all and all this was a mess! As soon as targeting specific targets gets supported by flexi_logger, everything should be fine as these are all really small and easy tweaks to do! :)
As soon as targeting specific targets gets supported by flexi_logger,
What you mean by this?
As soon as targeting specific targets gets supported by flexi_logger,
What you mean by this?
I created an issue on flexi_logger repo and I'm currently trying to see if I can solve it myself and submit a PR :)
EDIT: another useful PR for log rotation is on its way :)
I'm still more in favor of switching to slog
. I know that we need to rewrite some parts of telemetry, but I started this here already: https://github.com/paritytech/substrate/commit/b91620bef6b7eaa3c35e08bf65f0ec3eb1561165
I'm still more in favor of switching to
slog
. I know that we need to rewrite some parts of telemetry, but I started this here already: b91620b
I'll let other participants jump in here :) @tomaka @arkpar @gabreal @marcio-diaz
from the devops perspective the particular implementation is not important. what is required is the ability to log everything to stdout/stderr in order to have systemd's journal or dockerized deployments easily aggregate the logs.
from the devops perspective the particular implementation is not important. what is required is the ability to log everything to stdout/stderr in order to have systemd's journal or dockerized deployments easily aggregate the logs.
If we use systemd with journald I don't see how we lose logs. If I understand right, we lose logs only because we use logrotate somewhere (please correct me if I'm wrong). Could we get rid of logrotate and use systemd instead everywhere?
that's correct yes (for loosing log entries with logrotate). afaik logrotate is not used anymore (as now systemd's journal is in use everywhere). maybe this could be useful for heavy trace logging. @ddorgan what do you think?
oh... ok :thinking: @arkpar do you know if we still lose logs somewhere?
Flushing the logs on SIGHUP
would be really great. We're currently working around the logrotate losing some logs by using journald anyway, but it would nice to have the option of using logrotate without data loss.
Indeed this issue was originally filed to tackle log issues caused by logrotate. This is includes loss of log data and node stalling because copying around huge log files choked disk IO. If logrotate is no longer used, this is less of an issue. Though custom log rotation would still be useful I think. Not all platforms have journald and for and end users it might be easier to use something build-in.
Actually it would be great to move away from journald logging if possible, there's quite a big performance overhead right now. So how far off is this? Would be very happy to test once ready.
@ddorgan A PR that adds log rotation out of the box will be ready as soon as this issue on flexi_logger's repo is fixed (this one basically).
This is, unless we decide to focus on slog
or to keep on using env_logger
:)
@pscott great, just ping me when it's ready to be tested.
Sorry but I'm going to play devil's advocate here.
it would nice to have the option of using logrotate without data loss.
Apparently log_rotate = data loss. The way the fix was designed is that substrate would play the role of log rotate. The next fix would be the same. Logrotate is out in all scenarios afaiu.
there's quite a big performance overhead right now.
I would really like to know more about that.
Also it's a sign that log rotation is not exactly easy to implement. Therefore a custom solution instead of a well known, battle tested and commonly used solution might not be the best idea.
My point is: handling logs might be best handled by the OS.
Not all platforms have journald and for and end users it might be easier to use something build-in.
I assume every platform has its own logging mechanism. On Linux it's mostly system.d. I'm not a fan of system.d but if it does it's job...
Overall I think our motivations are questionable. I'm really not against adding the feature of log rotation but I think it is important to not rush it. Don't forget that we just reverted our first attempt already. If there is no more urgency (since we use journald now), then we should consider taking our time and thinking this though.
Substrate logs to stderr and stdout. It is impossible to organize log rotation without copying previously logged data. E.g.
logrotate
has to be configured withcopytruncate
option that copies the file and truncates it, losing some data in the process.Would be best to use a logger that supports rotation out of the box, or at least allows reopening output file on
SIGHUP
https://github.com/emabee/flexi_logger looks promising, as long as there's no performance regression.