Open yosukehara opened 8 years ago
My thoughts.
Log only things that require admins to take action. Do NOT log an error such as a network timeout that happened only once. DO log when network timeouts to a specific node have kept happening over a long period, because in that case some network or machine trouble has likely occurred and admins have to dig into it with their favorite tools.
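A minimal sketch of this idea (illustrative only, not LeoFS code): count timeouts per node over a sliding window and only emit an admin-facing error once a threshold is crossed. The window size and threshold below are made-up values.

```python
import time
from collections import defaultdict, deque

WINDOW_SEC = 600   # look-back window (assumed value)
THRESHOLD = 10     # timeouts per node before alerting (assumed value)

_timeouts = defaultdict(deque)  # node -> timestamps of recent timeouts

def record_timeout(node, now=None):
    """Record one timeout; return True once the node crosses the alert threshold."""
    now = time.time() if now is None else now
    events = _timeouts[node]
    events.append(now)
    # Drop events that have fallen out of the window.
    while events and now - events[0] > WINDOW_SEC:
        events.popleft()
    return len(events) >= THRESHOLD

# A single timeout is ignored; repeated timeouts produce one actionable error.
if record_timeout("storage_1@192.168.0.5"):
    print("[E] repeated network timeouts to storage_1@192.168.0.5 over the last "
          "10 minutes; a network or machine problem is likely, please investigate")
```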
There is no reason for the log message to be formatted as JSON; the JSON below is just for pretty printing on GitHub.
{
"when": "when this error happend",
"what": "what happend (should be consisted of words general admins should know)",
"next action": "what to do next for admins. if it's difficult to explain with one line, guide them with `please see the below link for more details`",
"detail": "the detailed procedure for what to do next if needed"
}
{
"when": "2016-12-02 07:15:36.35493 +0000",
"what": "The disk usage is getting close to 90%",
"next action": "leofs-adm suspend storage_1@192.168.0.5;leofs-adm stop storage_1@192.168.0.5; then add disk capacity",
"detail": "http://github.com/leo-project/leofs/wiki/Disk_Nealy_Full"
}
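To make the proposal concrete, here is a small sketch (a hypothetical helper, not part of LeoFS) that builds the four fields above and renders them as a single, grep-friendly log line; the actual on-disk format stays open, as noted above.

```python
from datetime import datetime, timezone

def render_error(what, next_action, detail=""):
    """Render the proposed when/what/next action/detail fields as one log line."""
    record = {
        "when": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f %z"),
        "what": what,
        "next action": next_action,
        "detail": detail,
    }
    # One event per line, tab-separated, so it stays easy to grep and parse.
    return "\t".join(f"{key}={value}" for key, value in record.items())

print(render_error(
    what="The disk usage is getting close to 90%",
    next_action="leofs-adm suspend storage_1@192.168.0.5; "
                "leofs-adm stop storage_1@192.168.0.5; then add disk capacity",
    detail="http://github.com/leo-project/leofs/wiki/Disk_Nealy_Full",
))
```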
I think it would also be useful to provide the "where" information to admins.
We can first define a set of errors that can be easily recognized and therefore have a well-defined set of error logs.
Logs like
From: leo_storage_0@192.168.0.1 not found
From: leo_storage_1@192.168.0.2 unavailable
could be useful, as users can easily tell where the problem comes from and get a quick picture of it.
With a defined set of errors, we can write a documentation page for them about possible root causes, actions to take, etc.
Undefined errors could just be categorized as "internal trouble", with the details output to the dev logs.
We also need a standard format for error log messages, for example with tab-separated fields, so administrators can easily parse them and pass them to their monitoring systems.
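A sketch of how this could fit together (the catalogue contents, error keys, and wording are illustrative, not the real LeoFS error set): known errors get an admin-facing, tab-separated line with a suggested next action, while everything else falls back to "internal trouble" and the details go to the dev logs.

```python
# Illustrative catalogue of recognized errors: error key -> (what, next action).
KNOWN_ERRORS = {
    "node_not_found":   ("not found",   "check that the node is registered and reachable"),
    "node_unavailable": ("unavailable", "check the node's process and network status"),
}

def admin_log_line(error_key, node):
    """Return a tab-separated admin-facing line; unknown errors become 'internal trouble'."""
    if error_key in KNOWN_ERRORS:
        what, action = KNOWN_ERRORS[error_key]
        return "\t".join(["[E]", f"From: {node} {what}", f"next action: {action}"])
    # Undefined errors: keep the admin message generic, full details go to dev logs.
    return "\t".join(["[E]", f"From: {node}", "internal trouble (see dev logs)"])

print(admin_log_line("node_not_found", "leo_storage_0@192.168.0.1"))
print(admin_log_line("checksum_mismatch", "leo_storage_1@192.168.0.2"))
```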
Now we've reached the consensus that we will rely on third-party log analysis tools like Kibana, Logstash, etc. to analyze log files and provide user-friendly error messages, so we will document how to integrate LeoFS with third-party log analysis tools in our official documentation.
Just wanted to share a working example for fluentd (td-agent.conf) that ships logs to Elasticsearch (it relies on global paths to the log files):
<source>
  @type tail
  path /var/log/leofs/*/app/info,/var/log/leofs/*/app/error
  pos_file /var/log/leofs/leofs.log.pos
  tag leofs.app
  format /^\[(?<level>[^\t]*)\]\t(?<node>[^\t]*)\t(?<time>[^\t]*)\t(?<timestamp>[^\t]*)\t(?<method>[^\t]*)\t((?<line>[^\t]*)\t)?(?<message>[^\t]*)/
  time_format %Y-%m-%d %H:%M:%S.%L %z
</source>
<match **>
  @type forward
  require_ack_response false
  heartbeat_type tcp
  phi_failure_detector false
  expire_dns_cache 0
  <server>
    name fluentd-file
    host fluentd.lan
    port 5180
  </server>
</match>
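For anyone adapting this, a quick way to sanity-check the `format` regex before deploying is to run it against a log line locally. The sample line below is made up (module name and field contents are illustrative only), and note that Python's `re` uses `(?P<name>...)` where fluentd's regex uses `(?<name>...)`.

```python
import re

PATTERN = re.compile(
    r"^\[(?P<level>[^\t]*)\]\t(?P<node>[^\t]*)\t(?P<time>[^\t]*)\t"
    r"(?P<timestamp>[^\t]*)\t(?P<method>[^\t]*)\t((?P<line>[^\t]*)\t)?(?P<message>[^\t]*)"
)

# Illustrative log line following the expected tab-separated layout.
sample = "\t".join([
    "[W]", "storage_1@192.168.0.5", "2016-12-02 07:15:36.354930 +0000",
    "1480662936", "some_module:some_function/1", "123",
    "The disk usage is getting close to 90%",
])

match = PATTERN.match(sample)
print(match.groupdict())  # level, node, time, timestamp, method, line, message
```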
@vstax Thanks! That's really helpful to us. We are going to cite the above as a fluentd example.
We also need to revise the log format to be more readable, because LeoFS administrators cannot be expected to understand the current logs in detail.