Open yosukehara opened 8 years ago
My thoughts.
Log only things that require admins to take action. Do NOT log an error such as a network timeout that happened only once. DO log when network timeouts to a specific node have kept happening over a long period, because in that case some network or machine trouble has likely occurred and admins have to dig into it with their favorite tools.
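A minimal sketch of this idea (illustrative only, not LeoFS code): count timeouts per node over a sliding window and only emit an admin-facing error once a threshold is crossed. The window size and threshold below are made-up values.

```python
import time
from collections import defaultdict, deque

WINDOW_SEC = 600   # look-back window (assumed value)
THRESHOLD = 10     # timeouts per node before alerting (assumed value)

_timeouts = defaultdict(deque)  # node -> timestamps of recent timeouts

def record_timeout(node, now=None):
    """Record one timeout; return True once the node crosses the alert threshold."""
    now = time.time() if now is None else now
    events = _timeouts[node]
    events.append(now)
    # Drop events that have fallen out of the window.
    while events and now - events[0] > WINDOW_SEC:
        events.popleft()
    return len(events) >= THRESHOLD

# A single timeout is ignored; repeated timeouts produce one actionable error.
if record_timeout("storage_1@192.168.0.5"):
    print("[E] repeated network timeouts to storage_1@192.168.0.5 over the last "
          "10 minutes; a network or machine problem is likely, please investigate")
```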
There is no reason for the log message to be formatted as JSON; the JSON below is just for pretty printing on GitHub.
{
"when": "when this error happend",
"what": "what happend (should be consisted of words general admins should know)",
"next action": "what to do next for admins. if it's difficult to explain with one line, guide them with `please see the below link for more details`",
"detail": "the detailed procedure for what to do next if needed"
}
{
"when": "2016-12-02 07:15:36.35493 +0000",
"what": "The disk usage is getting close to 90%",
"next action": "leofs-adm suspend storage_1@192.168.0.5;leofs-adm stop storage_1@192.168.0.5; then add disk capacity",
"detail": "http://github.com/leo-project/leofs/wiki/Disk_Nealy_Full"
}
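To make the proposal concrete, here is a small sketch (a hypothetical helper, not part of LeoFS) that builds the four fields above and renders them as a single, grep-friendly log line; the actual on-disk format stays open, as noted above.

```python
from datetime import datetime, timezone

def render_error(what, next_action, detail=""):
    """Render the proposed when/what/next action/detail fields as one log line."""
    record = {
        "when": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S.%f %z"),
        "what": what,
        "next action": next_action,
        "detail": detail,
    }
    # One event per line, tab-separated, so it stays easy to grep and parse.
    return "\t".join(f"{key}={value}" for key, value in record.items())

print(render_error(
    what="The disk usage is getting close to 90%",
    next_action="leofs-adm suspend storage_1@192.168.0.5; "
                "leofs-adm stop storage_1@192.168.0.5; then add disk capacity",
    detail="http://github.com/leo-project/leofs/wiki/Disk_Nealy_Full",
))
```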
I think it would also be useful to provide the "where" information to admins.
We can first define a set of errors that can be easily recognized and therefore have a well-defined set of error logs.
Logs like
From: leo_storage_0@192.168.0.1 not found
From: leo_storage_1@192.168.0.2 unavailable
could be useful, as users can easily tell where the problem comes from and get a quick picture of it.
With a defined set of errors, we can write a documentation page for them about possible root causes, actions to take, etc.
Undefined errors could just be categorized as "internal trouble", with the details output to the dev logs.
We also need a standard format for error log messages, for example with tab-separated fields, so administrators can easily parse them and pass them to their monitoring systems.
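A sketch of how this could fit together (the catalogue contents, error keys, and wording are illustrative, not the real LeoFS error set): known errors get an admin-facing, tab-separated line with a suggested next action, while everything else falls back to "internal trouble" and the details go to the dev logs.

```python
# Illustrative catalogue of recognized errors: error key -> (what, next action).
KNOWN_ERRORS = {
    "node_not_found":   ("not found",   "check that the node is registered and reachable"),
    "node_unavailable": ("unavailable", "check the node's process and network status"),
}

def admin_log_line(error_key, node):
    """Return a tab-separated admin-facing line; unknown errors become 'internal trouble'."""
    if error_key in KNOWN_ERRORS:
        what, action = KNOWN_ERRORS[error_key]
        return "\t".join(["[E]", f"From: {node} {what}", f"next action: {action}"])
    # Undefined errors: keep the admin message generic, full details go to dev logs.
    return "\t".join(["[E]", f"From: {node}", "internal trouble (see dev logs)"])

print(admin_log_line("node_not_found", "leo_storage_0@192.168.0.1"))
print(admin_log_line("checksum_mismatch", "leo_storage_1@192.168.0.2"))
```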
Now we've reached the consensus that we will rely on third-party log analysis tools like Kibana, Logstash, etc. to analyze log files and provide user-friendly error messages, so we will document how to integrate LeoFS with third-party log analysis tools in our official documentation.
Just wanted to share a working example for fluentd (td-agent.conf) that ships logs to Elasticsearch (it relies on global paths to the log files):
<source>
  @type tail
  path /var/log/leofs/*/app/info,/var/log/leofs/*/app/error
  pos_file /var/log/leofs/leofs.log.pos
  tag leofs.app
  format /^\[(?<level>[^\t]*)\]\t(?<node>[^\t]*)\t(?<time>[^\t]*)\t(?<timestamp>[^\t]*)\t(?<method>[^\t]*)\t((?<line>[^\t]*)\t)?(?<message>[^\t]*)/
  time_format %Y-%m-%d %H:%M:%S.%L %z
</source>
<match **>
  @type forward
  require_ack_response false
  heartbeat_type tcp
  phi_failure_detector false
  expire_dns_cache 0
  <server>
    name fluentd-file
    host fluentd.lan
    port 5180
  </server>
</match>
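For anyone adapting this, a quick way to sanity-check the `format` regex before deploying is to run it against a log line locally. The sample line below is made up (module name and field contents are illustrative only), and note that Python's `re` uses `(?P<name>...)` where fluentd's regex uses `(?<name>...)`.

```python
import re

PATTERN = re.compile(
    r"^\[(?P<level>[^\t]*)\]\t(?P<node>[^\t]*)\t(?P<time>[^\t]*)\t"
    r"(?P<timestamp>[^\t]*)\t(?P<method>[^\t]*)\t((?P<line>[^\t]*)\t)?(?P<message>[^\t]*)"
)

# Illustrative log line following the expected tab-separated layout.
sample = "\t".join([
    "[W]", "storage_1@192.168.0.5", "2016-12-02 07:15:36.354930 +0000",
    "1480662936", "some_module:some_function/1", "123",
    "The disk usage is getting close to 90%",
])

match = PATTERN.match(sample)
print(match.groupdict())  # level, node, time, timestamp, method, line, message
```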
@vstax Thanks! That's really helpful to us. We are going to cite the above as a fluentd example.
We also need to revise the log format to be more readable, because LeoFS administrators cannot be expected to understand the current logs in detail.