paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0
8.39k stars 2.65k forks source link

Log rotation #5926

Open arkpar opened 4 years ago

arkpar commented 4 years ago

Substrate logs to stderr and stdout. It is impossible to organize log rotation without copying previously logged data. E.g. logrotate has to be configured with copytruncate option that copies the file and truncates it, losing some data in the process.

Would be best to use a logger that supports rotation out of the box, or at least allows reopening output file on SIGHUP https://github.com/emabee/flexi_logger looks promising, as long as there's no performance regression.

tomaka commented 4 years ago

I'm not a devops so this is maybe a stupid question, but isn't it possible to pipe stdout/stderr to some utility that does the log rotation?

bkchr commented 4 years ago

AFAIK that is already done.

arkpar commented 4 years ago

AFAIK that is already done.

It is done rather inefficiently. By copying gigabytes of data and losing some data in the process. logrotate copies all of the file and calls ftruncate on it. See copytruncate description here Simply splitting the output would be more efficient indeed, but that would require using custom log file management, instead of relying on logrotate.

gabreal commented 4 years ago

logrotate is quite standard in linux operating systems and modern service managers like systemd also connect to stdout resp stderr to pipe it into their journal. Even further container based deployments also rely on log output to stdout/err. The issue here that i see is that tracing produces too much data for that to handle. The most easy solution would probably be to scale up the disk of the hosts (and use smaller log chunks to rotate) and be more specific in trace logging. Apart from that if one wants to deal with that on a host basis i suggest to make the logfile management at least optional, in larger setups it's probably easier to directly send the data off the host via rsyslog or something.

arkpar commented 4 years ago

The issue here that i see is that tracing produces too much data for that to handle.

No it is not. I've already described the actual issue twice here. Namely copying the log data needlessly, but more importantly losing log data. logrotate does not account for new log writes while the file is being copied. This is documented in man page that I've linked earlier. And it is already interfering with out ability to fix bugs. The important logs to fix https://github.com/paritytech/stakingops-issues/issues/15 or https://github.com/paritytech/stakingops-issues/issues/1 are missing exactly because of that.

gabreal commented 4 years ago

I assume the is a correlation between the total amount of logs and the amount of logs being lost. my argument was refering to that systemd's journald cannot cope with the amount of data.

would it be possible for the client to respond to a signal e.g. SIGUSR1 for closing and reopening stdout/stderr? not sure how that would pass through systemd here though.

arkpar commented 4 years ago

I assume the is a correlation between the total amount of logs and the amount of logs being lost. my argument was refering to that systemd's journald cannot cope with the amount of data.

Sure, if you produce less logs you'll lose less of them. With zero logs there won't be any losses at all. But we do need these log to fight bugs.

would it be possible for the client to respond to a signal e.g. SIGUSR1 for closing and reopening stdout/stderr? not sure how that would pass through systemd here though.

That's what I suggested in the issue description. Typically SIGHUP is used for ths.

gabreal commented 4 years ago

Would be best to use a logger that supports rotation out of the box, or at least allows reopening output file on SIGHUP

That's what I suggested in the issue description. Typically SIGHUP is used for ths.

if output file can also still be stdout/-err then yes :)

pscott commented 4 years ago

I'm not sure I follow here: flexi_logger does support log rotation out of the box. Would the ideal setup be:

  1. Check if output is atty: 1.1 If it is, then do not log to a file, basically do nothing. 1.2 If it is NOT, then setup a log rotation (with what size limit / age limit?)
  2. Profit?

Note that this does not do anything special with SIGHUP / SIGUSRX. Also note that we could also add a feature with duplicate_to_stderr where the user could have logs written to stderr and to a log with rotation :)

edit: I see it's already been done here https://github.com/paritytech/substrate/blob/ffcce8538f08ad64656ac5eb8da783dce1a79eae/client/cli/src/lib.rs#L267-L270

cc @gabreal @arkpar

marcio-diaz commented 4 years ago

Reasigning to @pscott because he already started.

arkpar commented 4 years ago
  1. Check if output is atty: 1.1 If it is, then do not log to a file, basically do nothing. 1.2 If it is NOT, then setup a log rotation (with what size limit / age limit?)

Looks about right. I'd also add a CLI option for configuring size/age limit. Unlimited by default.

pscott commented 4 years ago

So I've made some progress on this. I think the only way we can actually do log rotation with flexi_logger is if we specify from the beginning that our target is a file (i.e. not something like substrate 2>my_file.

What I'm thinking here is that we should have --log-directory PATH and one of (--log-size U64 / --log-age U64) flags. Last but not least I would add the --log-duplicate BOOL flag, for the user to choose rather he'd like the log to be duplicated to stderr.

Does that sound good @arkpar ?

pscott commented 4 years ago

This avoid doing things like checking for isatty and trying to interpret what the user actually wants. Also makes logging much clearer and simpler imo :)

arkpar commented 4 years ago

In that case --log-file should capture stderr indeed. We already have duplicating of INFO and WARN messages to stdout when stderr is redirected to a file. So that user could see default info in the console while extended log is written to a file. Would be ideal to preserve that.

pscott commented 4 years ago

So to make sure we're on the same page here: --log-duplicate should be set to default to INFO level? We'd have this UX:

I would personally have added the --log-duplicate, making the logging to stderr an opt-in rather than default.

PS: to work with flexi_logger, I had to remove the piece of code that automatically duplicated the INFO and WARM messages to stderr (this was located in the format function passed in to the log builder). I can still add it back with a simple call to .duplicate(), so it's just a matter of how we want it to behave :)

arkpar commented 4 years ago

More like this:

Basically --log-file would simply redirect stderr to a file preserving existing behaviour and making stderr empty. --log-duplicate would duplicate all messages to stderr. But again, INFO messages should be additionally copied to stdout if stderr is not attached to a console.

pscott commented 4 years ago

Should be possible once https://github.com/emabee/flexi_logger/pull/51 gets merged :)

tomaka commented 4 years ago

Reopening after #6627

pscott commented 4 years ago

So there were a couple of problems with #6564 :

  1. init_logger() was modified to error when it failed, instead of warning. This resulted in a big API break, which is not desirable.
  2. Add to that that it was very easy to break stuff because I did not notice that flexi_logger only accepts filters that do not contain hyphens. This is easily fixable by .replace('-', '_') on the filter we wish to add. My bad on this one!
  3. On top of that, even though flexi_logger uses a similar syntax to env_logger, specific targets cannot be targeted. For example, using env_logger, one can add a filter that says my_target=trace, and have logs like info!(target: "my_target") be properly logged. Unfortunately, this is not possible (yet) with flexi_logger (from what I know). This is a big deal, as this means that there is no way to run for example substrate --log sub-libp2p=trace.

Last but not least:

So all and all this was a mess! As soon as targeting specific targets gets supported by flexi_logger, everything should be fine as these are all really small and easy tweaks to do! :)

bkchr commented 4 years ago

As soon as targeting specific targets gets supported by flexi_logger,

What you mean by this?

pscott commented 4 years ago

As soon as targeting specific targets gets supported by flexi_logger,

What you mean by this?

I created an issue on flexi_logger repo and I'm currently trying to see if I can solve it myself and submit a PR :)

EDIT: another useful PR for log rotation is on its way :)

bkchr commented 4 years ago

I'm still more in favor of switching to slog. I know that we need to rewrite some parts of telemetry, but I started this here already: https://github.com/paritytech/substrate/commit/b91620bef6b7eaa3c35e08bf65f0ec3eb1561165

pscott commented 4 years ago

I'm still more in favor of switching to slog. I know that we need to rewrite some parts of telemetry, but I started this here already: b91620b

I'll let other participants jump in here :) @tomaka @arkpar @gabreal @marcio-diaz

gabreal commented 4 years ago

from the devops perspective the particular implementation is not important. what is required is the ability to log everything to stdout/stderr in order to have systemd's journal or dockerized deployments easily aggregate the logs.

cecton commented 4 years ago

from the devops perspective the particular implementation is not important. what is required is the ability to log everything to stdout/stderr in order to have systemd's journal or dockerized deployments easily aggregate the logs.

If we use systemd with journald I don't see how we lose logs. If I understand right, we lose logs only because we use logrotate somewhere (please correct me if I'm wrong). Could we get rid of logrotate and use systemd instead everywhere?

gabreal commented 4 years ago

that's correct yes (for loosing log entries with logrotate). afaik logrotate is not used anymore (as now systemd's journal is in use everywhere). maybe this could be useful for heavy trace logging. @ddorgan what do you think?

cecton commented 4 years ago

oh... ok :thinking: @arkpar do you know if we still lose logs somewhere?

ddorgan commented 4 years ago

Flushing the logs on SIGHUP would be really great. We're currently working around the logrotate losing some logs by using journald anyway, but it would nice to have the option of using logrotate without data loss.

arkpar commented 4 years ago

Indeed this issue was originally filed to tackle log issues caused by logrotate. This is includes loss of log data and node stalling because copying around huge log files choked disk IO. If logrotate is no longer used, this is less of an issue. Though custom log rotation would still be useful I think. Not all platforms have journald and for and end users it might be easier to use something build-in.

ddorgan commented 4 years ago

Actually it would be great to move away from journald logging if possible, there's quite a big performance overhead right now. So how far off is this? Would be very happy to test once ready.

pscott commented 4 years ago

@ddorgan A PR that adds log rotation out of the box will be ready as soon as this issue on flexi_logger's repo is fixed (this one basically). This is, unless we decide to focus on slog or to keep on using env_logger :)

ddorgan commented 4 years ago

@pscott great, just ping me when it's ready to be tested.

cecton commented 4 years ago

Sorry but I'm going to play devil's advocate here.

it would nice to have the option of using logrotate without data loss.

Apparently log_rotate = data loss. The way the fix was designed is that substrate would play the role of log rotate. The next fix would be the same. Logrotate is out in all scenarios afaiu.

there's quite a big performance overhead right now.

I would really like to know more about that.

Also it's a sign that log rotation is not exactly easy to implement. Therefore a custom solution instead of a well known, battle tested and commonly used solution might not be the best idea.

My point is: handling logs might be best handled by the OS.

Not all platforms have journald and for and end users it might be easier to use something build-in.

I assume every platform has its own logging mechanism. On Linux it's mostly system.d. I'm not a fan of system.d but if it does it's job...

Overall I think our motivations are questionable. I'm really not against adding the feature of log rotation but I think it is important to not rush it. Don't forget that we just reverted our first attempt already. If there is no more urgency (since we use journald now), then we should consider taking our time and thinking this though.