gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Health Checker tool #313

Closed aravindavk closed 4 years ago

aravindavk commented 7 years ago

Gluster needs a health checker tool which can be scheduled to run on each node at regular intervals to check the status of the cluster/node.

A few ideas are listed here; feel free to add any other ideas or metrics.

Idea 1 - Check for errors and warnings in all Gluster log files and report

Check every log message whose timestamp is later than the timestamp of the previous run, and print the number of errors and warnings.

For example,

{
    "log": "glusterd.log",
    "errors": 0,
    "warnings": 10
}
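
A minimal Python sketch of Idea 1, not the tool's actual implementation. It assumes each Gluster log entry starts with a bracketed timestamp followed by a one-letter level (E for error, W for warning); the regular expression, the function name count_errors_and_warnings and the sample lines are illustrative assumptions.

```python
import re
from datetime import datetime

# Assumed entry layout: "[2017-10-03 12:00:01.123456] E [MSGID: 106004] ..."
# Fractional seconds are ignored, so comparisons are at one-second granularity.
LOG_LINE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.\d+)?\] ([A-Z]) "
)


def count_errors_and_warnings(lines, log_name, since):
    """Count E and W entries whose timestamp is later than `since`."""
    counts = {"log": log_name, "errors": 0, "warnings": 0}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # continuation line or a layout this sketch does not know
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if stamp <= since:
            continue  # already covered by the previous run
        level = match.group(2)
        if level == "E":
            counts["errors"] += 1
        elif level == "W":
            counts["warnings"] += 1
    return counts


# Usage with in-memory sample lines; a real run would iterate over an open log file.
sample_lines = [
    "[2017-10-03 11:00:00.000001] E [MSGID: 1] error before the previous run",
    "[2017-10-03 12:00:01.000000] W [MSGID: 2] warning after the previous run",
    "[2017-10-03 12:00:02.500000] E [MSGID: 3] error after the previous run",
]
previous_run = datetime(2017, 10, 3, 12, 0, 0)
print(count_errors_and_warnings(sample_lines, "glusterd.log", previous_run))
```

The timestamp of the previous run would have to be stored somewhere between runs (for example in a small state file); that part is left out here.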

Idea 2 - Uptime Report

Look for all Gluster processes using the ps command and collect their uptime. This helps to identify whether any process was restarted recently. PID changes can also be detected by comparing with the previous report.

Command:

ps --no-header -ww -o pid,etimes,comm -C glusterd,glusterfsd,glustershd

Get the command-line arguments from /proc/<PID>/cmdline instead of from the ps command, to avoid issues when splitting the arguments on spaces.

Example,

{
    "name": "glusterd",
    "pid": 1254,
    "uptime_sec": 126
}

The above example shows that glusterd started around 2 minutes ago.
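
A rough Python sketch of Idea 2, under the assumption that the ps command above is available on a Linux host. The function names (parse_ps_output, read_cmdline, new_pids, collect_uptime_report) are hypothetical and chosen only for illustration.

```python
import subprocess

# The ps invocation from the idea above, split into an argument list.
PS_COMMAND = ["ps", "--no-header", "-ww", "-o", "pid,etimes,comm",
              "-C", "glusterd,glusterfsd,glustershd"]


def parse_ps_output(text):
    """Turn 'pid etimes comm' lines into report dictionaries."""
    reports = []
    for line in text.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, etimes, name = fields
        reports.append({"name": name, "pid": int(pid), "uptime_sec": int(etimes)})
    return reports


def read_cmdline(pid):
    """Read arguments from /proc/<PID>/cmdline, where they are NUL-separated."""
    with open(f"/proc/{pid}/cmdline", "rb") as handle:
        raw = handle.read()
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def new_pids(previous, current):
    """Return entries of `current` whose PID was absent from `previous`."""
    known = {entry["pid"] for entry in previous}
    return [entry for entry in current if entry["pid"] not in known]


def collect_uptime_report():
    """Run ps and parse its output; ps exits non-zero when nothing matches."""
    result = subprocess.run(PS_COMMAND, capture_output=True, text=True, check=False)
    return parse_ps_output(result.stdout)


# Usage with captured sample output instead of a live ps call.
print(parse_ps_output(" 1254     126 glusterd\n"))
```

Splitting /proc/<PID>/cmdline on NUL bytes keeps arguments that contain spaces intact, which is the issue the idea mentions.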

amarts commented 7 years ago

Can we do more of a 'Health Report' Tool instead? I am anticipating a tool which can have a plugin for any new diagnostic and can report '[OK]', '[NOT OK]', or '[WARNING]' as the status. More elaborate information can be logged to another file.

Think of something like below:


bash# gluster-health-report
CPU Usage: [OK]
Network Health: [OK]
Disconnect events: [WARNING]
Memory Usage: [WARNING]
Log rotate setup: [NOT OK]
Error logs in last day: [OK]
Changelog size: [WARNING]
....
You can find the detailed health-report at /var/log/glusterfs/health-report-$timestamp.log

It should output only the status; the more detailed reasoning, and the numbers used to arrive at those conclusions, can go in the log file.

Any feature can add its own health report by providing a bash or python script (or anything else) that runs and gives the above output. The tool should run all of these tests together and give a summary.
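
One way such a runner could look, as a Python sketch only. The exit-code convention (0 means OK, 1 means WARNING, anything else means NOT OK) is an assumption made for this example and is not something the thread specifies; run_report and run_all are hypothetical names.

```python
import io
import subprocess
import sys

# Hypothetical convention: a report signals its status through its exit code.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING"}


def run_report(command, timeout_sec=60):
    """Run one report executable and return (status, captured output)."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_sec)
    except (OSError, subprocess.TimeoutExpired) as err:
        # A report that cannot be started or hangs counts as a failure.
        return "NOT OK", str(err)
    status = STATUS_BY_EXIT_CODE.get(result.returncode, "NOT OK")
    return status, result.stdout + result.stderr


def run_all(reports, detail_log):
    """Run every report, write details to `detail_log`, return summary lines."""
    summary = []
    for title, command in reports.items():
        status, details = run_report(command)
        detail_log.write(f"== {title}: {status} ==\n{details}\n")
        summary.append(f"{title}: [{status}]")
    return summary


# Usage: two stand-in reports written as inline Python; real reports could be
# bash scripts or any other executables.
demo_reports = {
    "CPU Usage": [sys.executable, "-c", "print('load is fine')"],
    "Memory Usage": [sys.executable, "-c", "import sys; sys.exit(1)"],
}
demo_log = io.StringIO()  # a real tool would open a file under /var/log/glusterfs
for line in run_all(demo_reports, demo_log):
    print(line)
```

Because each report is just an executable, the runner does not care whether it is written in bash, Python, or something else; only the exit code and the captured output matter.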

Any further ideas on this would be welcome.

aravindavk commented 7 years ago

"Can we do more of a 'Health Report' Tool instead?"

+1

Started working on this tool https://github.com/aravindavk/gluster-health-report

The tool is in a usable state (only one report exists so far, which checks whether glusterd is running). Installation and usage instructions are in the README file of the repo.

Adding a new report is very easy and is documented in the README file. Please feel free to send a pull request with your report idea.

amarts commented 7 years ago

Tried the above tool; it looks neat and works almost as I expected. The only question I have is: what if I want to write a bash script? We don't have to answer it immediately, but it would be a good thing to pick up to make the tool more generic.

aravindavk commented 7 years ago

Support can be added for running bash scripts or any other executable scripts.

HaroldMiller commented 7 years ago

This is a good idea. Anything we can do to help the user/customer administrate their system will make for a happier experience.

1) Disk space - running out of disk space can cause serious issues. We should warn, then yell (grin) if necessary to prevent this.
2) Client/server incompatibilities - can we check versions and warn each time a client is started that is not compatible with the server?
3) Overall performance monitoring - end-to-end throughput, either as a hard number or, better, as trends.
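
The disk space suggestion (warn first, then escalate) could be sketched in Python as below. The 80% and 95% thresholds, the function names, and the choice of path are assumptions for illustration; the thread does not define any of them.

```python
import shutil


def classify_disk_usage(used_percent, warn_at=80.0, fail_at=95.0):
    """Map a used-space percentage to a health status (thresholds are assumed)."""
    if used_percent >= fail_at:
        return "NOT OK"
    if used_percent >= warn_at:
        return "WARNING"
    return "OK"


def disk_space_status(path, warn_at=80.0, fail_at=95.0):
    """Return (status, used_percent) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return "NOT OK", 0.0  # cannot compute a ratio on an empty filesystem
    used_percent = 100.0 * usage.used / usage.total
    return classify_disk_usage(used_percent, warn_at, fail_at), used_percent


# Usage: "." stands in for a brick directory, which would be site-specific.
status, used = disk_space_status(".")
print(f"Disk space: [{status}] ({used:.1f}% used)")
```

Keeping the threshold logic in its own function makes it easy to reuse for other capacity-style checks, such as inode usage.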

stale[bot] commented 4 years ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 4 years ago

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.