turnkeylinux / tracker

TurnKey Linux Tracker
https://www.turnkeylinux.org

Improve Monit integration and/or documentation #542

Open OnePressTech opened 8 years ago

OnePressTech commented 8 years ago

AFTER CONSIDERATION THIS ISSUE THREAD WAS RESET FROM ITS ORIGINAL REQUEST FOR WEBMIN SERVER & SYSTEM MODULE TO BE ENABLED BY DEFAULT TO A REQUEST FOR MONIT DOCUMENTATION WHEN IT WAS REVEALED THAT MONIT ALARMS HAD BEEN CONFIGURED TO DEFAULT SETTINGS IN V14.0

I think it is mandatory that an appliance be able to notify the administrator that a server is up / down and if disk is getting low.

Unless the TKLX team has a reason not to, I would recommend that the webmin System and Server status module be installed and enabled by default with monitors for SSH Server, Postfix Server, Apache Web Server, MySQL database server, and Diskspace.

Thanks for listening :-)


a3s7p commented 8 years ago

We kind of have Monit for that. There's a policy for CPU/RAM/swap/disk (space/inode) notifications and we've been planning to extend it.

@JedMeister, perhaps for 14.1?

OnePressTech commented 8 years ago

I saw the statement that monit was pre-installed on the appliances (https://www.turnkeylinux.org/blog/turnkey-14-0-release) but saw no sign of it in the Webmin console nor are there any instructions that I could find to indicate monit was directly accessible via server port or what that port number would be. The appliance pages provide a quick blurb for webmin and shellinabox but not Monit.

When I went to install the monit webmin component I was informed I needed to upgrade to v1.78 webmin. When I completed the webmin update the webmin module would not work because webmin was still looking for old monit config files. I mucked with the config files and got monit appearing in Webmin but the user interface was not the professional looking U.I. in the webmin documentation but a cheesy, obsolete looking U.I.

Looking at the feature set I don't really see Monit offering sufficient minimal feature capabilities over the webmin System & Server status module / AWS CloudWatch combination I already have in previous Turnkey core versions. I use Turnkey Linux as a minimal Debian vs. the default VM from the cloud provider so I am not sure what extra value Monit offers that would warrant adding another QA risk dependency in the core.

Cloud provider monitoring has progressed over the last few years and with components like Ansible combining monitoring with actions in a DevOps framework legacy monitoring components are starting to look a bit architecturally stale. But if that is what is of interest then a Nagios / Zabbix / etc monitoring tool would fit the bill rather than Monit.

So, as I see it, either 14.1 backs out Monit or fully commits. If it is the former then I am suggesting we pre-install the webmin System and Server status module. If it is the latter, then use an up-to-date version of Monit AND pre-install the webmin module AND fix the webmin config file dilemma AND document Monit in the core and appliance pages.

I am willing to be convinced of Monit's inclusion but I suggest that be in the form of a blog post with a link in this issue to avoid turning the issue tracker into a blog.

Just one man's 2 cents worth :-)

JedMeister commented 8 years ago

The Webmin-Mon package is available for install via apt-get as per usual (we package all the standard Webmin modules). I've never used it (or even installed it) though so I have no idea about it.

apt-get update && apt-get install webmin-mon

As @qrntz notes though, we added Monit for v14.0 but only configured it for minimal monitoring (CPU, RAM & HDD space). It is configured to email you via the secalerts inithook when resources exceed the preset limits (75%, 90% & 90% respectively).
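As a rough illustration of what threshold alerts like these can look like, here is a minimal sketch that writes a Monit fragment to a scratch directory. It is not the configuration TurnKey ships: the file name `resources.conf`, the service name `rootfs` and the scratch directory are assumptions for illustration, and only the 75% / 90% / 90% limits come from the comment above. Depending on the Monit version, the CPU test may need a `(user)`, `(system)` or `(wait)` qualifier.

```shell
#!/bin/sh
# Sketch only: write a Monit fragment mirroring the limits quoted above
# (CPU 75%, RAM 90%, disk space 90%). File and service names are
# hypothetical, not taken from TurnKey's actual config.
set -eu

conf_dir="$(mktemp -d)"               # stand-in for Monit's include directory
conf_file="$conf_dir/resources.conf"  # hypothetical file name

# Quoted 'EOF' keeps $HOST literal so Monit expands it, not the shell.
cat > "$conf_file" <<'EOF'
check system $HOST
    if cpu usage > 75% then alert
    if memory usage > 90% then alert

check filesystem rootfs with path /
    if space usage > 90% then alert
    if inode usage > 90% then alert
EOF

echo "wrote $conf_file"
```

If Monit is installed, a fragment like this can be copied into its include directory and syntax-checked with `monit -t` before reloading.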

You are right though that it isn't particularly strongly documented, although it is mentioned in passing in the v14.0 announcement (under the heading "security and system alerts").

It does include a (standalone) WebUI (not part of Webmin) but as it was only configured for those basics, the web UI didn't really add a lot of value. WRT looks, it's from the Debian repos, so it is unlikely that it's a really recent version; I imagine that tidying that up shouldn't be hard though (with CSS - like we did with Adminer).

Admittedly I haven't looked at Webmin-Mon much at all; but from a glance at the Webmin page for it vs Monit it appears to me that Monit is much more powerful and flexible. E.g. WRT the OOMKiller killing MySQL in LAMP; you could configure Monit to restart both Apache and MySQL if MySQL crashes due to running out of memory. Maybe that sort of thing is also possible with Webmin-Mon but on face value it doesn't appear to be designed to support that sort of usage (although like I say I've possibly missed something).
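To make the LAMP example concrete, the following is a hedged sketch of how such a rule could be expressed, using Monit's `depends on` statement so that Apache is cycled whenever MySQL is restarted. The pidfile paths, the `service` commands and the file name `lamp.conf` are assumptions based on a typical Debian layout, not something confirmed in this thread, and nothing here has been tested on a TurnKey appliance.

```shell
#!/bin/sh
# Sketch only: a Monit fragment that restarts MySQL when it stops answering
# and, via "depends on", stops and starts Apache along with it.
# Paths and commands are assumed Debian defaults; adjust before use.
set -eu

conf_dir="$(mktemp -d)"          # stand-in for Monit's include directory
conf_file="$conf_dir/lamp.conf"  # hypothetical file name

cat > "$conf_file" <<'EOF'
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
    start program = "/usr/sbin/service mysql start"
    stop program  = "/usr/sbin/service mysql stop"
    if failed host 127.0.0.1 port 3306 protocol mysql then restart

check process apache2 with pidfile /var/run/apache2/apache2.pid
    start program = "/usr/sbin/service apache2 start"
    stop program  = "/usr/sbin/service apache2 stop"
    depends on mysqld
EOF

echo "wrote $conf_file"
```

Whether restarting Apache alongside MySQL is actually desirable depends on the appliance; the dependency line is the part that would need community review.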

We have plans to ultimately integrate Monit (at least to some degree) with the Hub as another "value add" for Hub users (whilst obviously still making it useful for non-Hub users too).

Another reason why I am not too keen to introduce more Webmin stuff is that replacing Webmin at some point in the future is highly likely. We know that we will probably cop a lot of annoyance from current Webmin users/lovers but there would be many potential advantages to switching to a different front end. If we go for something Python based then the bar would be much lower for us (and/or the community) to make custom modules ourselves. One that jumps to mind is a (re-written) web-based ConfConsole.

So whilst extension of Monit is definitely on the cards, it's not a super high priority item. For v14.1, fixing bugs is first priority. Next priority will be introducing new features to overcome significant pain points; particularly making email delivery more consistent and reliable (not much point in monitoring if you never know when there's an issue!), as well as making legit SSL certs (an optional) part of the default config (via inithooks probably).

OnePressTech commented 8 years ago

Thanks Jed. Useful additional info.

My point was that System & Server Status Webmin module is dead simple for your low tech DIY crowd. I'm not sure what Monit would offer that I can't already do from AWS Cloudwatch.

OOM killer could be handled by a cron job. I'm not sure that learning Monit conventions moves the yardsticks sufficiently for the TKLX community to learn it as a cron job replacement.

Having said that...if it is the way you want to go and it is installed...could we get a short blog post to at least indicate how to access it (I had assumed via Webmin) :-)

a3s7p commented 8 years ago

Monit is not a cron replacement and handling OOM states from a cron job is not a valid approach. Reliability should never be sacrificed in favor of ease of use. Besides, anyone learning how Monit works is not the primary point — it just doing its job is, which means:

1) having a small footprint and being self-contained to avoid being killed or hindered in emergency states
2) being able to execute proactive measures before it is too late
3) reacting to dynamic and drastic changes in system health

While quite a few machine-internal solutions fit 2) and 3), not many fit 1). While Webmin has been around for a while, a webserver written in Perl, which miniserv.pl is, is not something I would trust with my system's health. No offense to webservers or Perl but there are different tools for different jobs and in my opinion, Monit is among the best tools you can get for the purpose of ensuring the machine does not crash.

What we had in mind is to supply Monit config files for central services in every appliance. This is not, per se, meant to be exposed to the user; this is so the appliance Just Works™. However, when many metrics are monitored anyway, it makes sense to present the full overview to the user and that is when the web interface may be used. It is not really for «pulling strings» — that is why Webmin is around and something else may be around later.

Accessing Monit right now does not really make much sense, and that is precisely why the web interface is disabled.

CloudWatch, while I admit is nice, is a single proprietary solution valid for a single type of deployment.

OnePressTech commented 8 years ago

I have been in the computing business for 36 years. Anyone who thinks they can configure software to "just work" (trademark...really!) is kidding themselves. Anyone on any of my teams over the years who has said that to me and attempted to do so has ended with catastrophic failure...machine control and auto-correction has its limits...be careful. Regarding OOM Killer correction I expect any proposed solution to be disclosed for comment. It has the potential for catastrophic failure.

Having said that I am willing to go with the flow on Monit. I am, however, disappointed in the lack of visibility / consultation of the Monit functionality addition and this is the first time I am hearing the intent to keep it "invisible" and autonomous. Your comment "Accessing Monit right now does not really make much sense, and that is precisely why the web interface is disabled." is, to be frank, condescending to those of us with more computing background than you may have. We want to know what is being implemented that can impact the quality (good or bad) of the VM. And we want access.

Regarding your comment on Perl...OTRS is the corporate standard in enterprise grade trouble ticket & ITSM change management...1M lines of Perl. Don't make sweeping generalisations on technology without doing your homework.

My comment regarding use of cron is that a switch to Monit as a cron / script replacement is to add a framework that, in itself, requires additional learning to use and becomes a new point of failure that was not there before.

For a DIY audience System & Server Status Webmin is simpler for basic monitoring and actions. Cron / scripts are more techie but well understood by the legacy sysadmins. Monit will require everyone to learn something new AND it is a new point of failure. Don't trivialise the impact of the decision for its inclusion and use as an auto-corrector in the VM.

Just one man's two cents worth :-)

JedMeister commented 8 years ago

Just to clarify Anton's comments for you Tim. The intention has never been to "hide" Monit; it was to keep it out of the way. Especially prior to it being configured in a more useful way. As I think I touched on earlier (but probably didn't elaborate enough) the rationale for disabling the web UI is that as it only measures CPU, RAM and HDD space it didn't seem like a good use of a port. IIRC it's just a big blank page with CPU, RAM and HDD usage on it. As I mentioned before, the Monit inclusion is noted in the v14.0 announcement, but not very well and not really elaborated on at all.

Having said that, you are right that we didn't communicate the inclusion very well. Providing info for users who do want to enable the web UI and/or configure it further themselves would be of value to our community I'm sure. And it is a good reminder to me to stop and ponder the bigger picture a bit more sometimes. TBH over the last year I have been totally swept away with trying to get things done and have somewhat lost focus on the bigger picture and the point of it all.
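For anyone who does want to look at the web UI in the meantime, a minimal sketch of the relevant Monit stanza follows. It binds the UI to localhost on Monit's default port 2812; the credentials `admin:changeme` and the file name `webui.conf` are placeholders invented for this example, and how this interacts with TurnKey's shipped config is an assumption that would need checking.

```shell
#!/bin/sh
# Sketch only: a Monit stanza enabling the built-in web UI on localhost.
# Credentials and file name are placeholders, not TurnKey defaults.
set -eu

conf_dir="$(mktemp -d)"           # stand-in for Monit's include directory
conf_file="$conf_dir/webui.conf"  # hypothetical file name

cat > "$conf_file" <<'EOF'
set httpd port 2812
    use address 127.0.0.1
    allow 127.0.0.1
    allow admin:changeme
EOF

echo "wrote $conf_file"
```

Keeping the listener on 127.0.0.1 avoids opening an extra public port; reaching it remotely would then go through something like an SSH tunnel.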

Whilst the idea of TurnKey is to make it easy to use and accessible to all, if the truth be told, in many ways passionate TurnKey users (such as yourself) who are tech savvy (including consultants, developers, service providers, etc) are hugely important to us. Not just in terms of influence and mindshare; but also in terms of planning and developing our products themselves.

So once again, thanks for your input and feedback. And next time I'm in Sydney I really look forward to catching up with you for a beer! :smile:

a3s7p commented 8 years ago

@OnePressTech, the trademark symbol was there precisely to signal ironic intent. :-)

First of all, I have no intent nor authority to challenge your credentials and I apologize if my reply came off as confrontational, condescending, misunderstood or illuminating the Ultimate Truth, which it was not meant to be. Your concern and your feedback are most welcome.

> this is the first time I am hearing the intent to keep it "invisible" and autonomous.

As Jeremy confirms, there is no intent to keep it invisible and autonomous. What I highlight is that it is in a preliminary stage right now where the extra open port is not going to bring that much to the table and this was a consensus decision (indeed, the web interface was enabled at the start — https://github.com/turnkeylinux/common/commit/e1b34d88384d6419541fd7854d81a1e4dd45d374). This will probably change when it gets some serious work done on it.

> Regarding your comment on Perl...OTRS is the corporate standard in enterprise grade trouble ticket & ITSM change management...1M lines of Perl.

OTRS is not a webserver for Webmin, miniserv.pl is. miniserv.pl is more of an afterthought to Webmin itself to make it more self-sufficient and it is acknowledged by the developers themselves that it is only a basic solution for what it is explicitly meant to do. It is only reasonable to infer it may not be the best solution to host a proactive system monitoring solution either.

> Monit will require everyone to learn something new AND it is a new point of failure.

Monit is a point of failure. That is true. However, it is also a tool for eliminating certain types of failure and more importantly, it does not take away any freedom from the user.

All-in-all, of course the decision to include it probably should have been documented better so that this issue would not be raised in the first place. But, failing that, your input here provides a valuable outlook on its implementation which will help us ensure the final product is up to expectations.

I thank you for that.

OnePressTech commented 8 years ago

No worries Jed / Anton :-)

I'm just flagging that the inclusion of Monit as an auto-management tool, if that is the intention, is actually a significant QA decision....possibly unprecedented in TKLX history (possibly not...but I am not aware of another example I would put in this category). For the most part my perception of TKLX technically is as a slimmed down Debian release with some convenience tools and some auto-config. I am not aware of any operationally significant code changes to Debian (I expect there are some but what we are not aware of we can't comment on).

If Monit is to be used as an auto-manager for the O/S, that would be a significant QA risk that could brick a VM. So there would need to be some disclosure and community experience contributions if we are to avoid bricked v14.x VMs.

Regarding the beer...I'm in :-)

OnePressTech commented 8 years ago

Getting back to the original issue submitted...System & Server Status Webmin is what I manually configure to provide me with RAM, Disk, Server up / down email notifications. You guys have waxed poetically about the wonders of Monit...but I am not getting duplicate status emails on these key system issues so I am assuming Monit is not providing them.

So I ask again...why not enable System & Server Status Webmin and preset the monitors. It is dead simple. If you tell me you will have equivalent notification emails from Monit in v14.1...great...if not...my suggestion, which is the point of this issue...stands. With respect :-)

JedMeister commented 8 years ago

> Getting back to the original issue submitted...System & Server Status Webmin is what I manually configure to provide me with RAM, Disk, Server up / down email notifications. You guys have waxed poetically about the wonders of Monit...but I am not getting duplicate status emails on these key system issues so I am assuming Monit is not providing them.

Monit is configured to only email if/when the preset thresholds are reached. So it won't give you status emails; only when there are issues.

This should be working as of v14.0. Although we have had ongoing issues with email delivery.

For background, as of v14.0 all TKL servers include postfix and are configured to forward emails (that are sent to root@localhost) to whatever email address is submitted in the secalerts firstboot script. Sending emails has always been a bit hit and miss; but now (that all servers are trying to send emails) it is much more of an issue (and one that we'd really like to solve for v14.1).

So the long and the short of it is that Monit should already (as of v14.0) be emailing root@localhost if/when the thresholds are reached/exceeded. It was intentional to not make a bigger deal about Monit inclusion (as the config was minimal) but as it turns out, with the emailing issues being bigger than we initially considered, it's probably good that we didn't make a bigger deal about it...

> So I ask again...why not enable System & Server Status Webmin and preset the monitors. It is dead simple. If you tell me you will have equivalent notification emails from Monit in v14.1...great...if not...my suggestion, which is the point of this issue...stands. With respect :-)

So you are recommending that the monitoring system email for info/status rather than warning? As I say, Monit should already be providing warnings (as of v14.0).

OnePressTech commented 8 years ago

Thanks for that Jeremy. The more we talk the more I think you guys need to take a pause and write a blog post as to how Monit is configured, the default settings, how we can change the default configurations and how you would like to move towards a self-correcting VM. Stop any actions aimed at auto-correcting server issues without opening it up to the community for consideration. None of us can afford angry clients with dead services due to bricked computers.

I will change the title of this issue to "Monit is undocumented" and you can close it once the blog post is written.

Sound like a way forward?

JedMeister commented 8 years ago

Beautiful! :+1: I'll add the "documentation" tag too! :smile:

OnePressTech commented 8 years ago

Cool :-)

JedMeister commented 7 years ago

renamed issue and pinned to v15.0 milestone.

JedMeister commented 4 years ago

FWIW Monit missed the boat for Buster, and v16.0 is so far overdue that we need to push ahead without fulfilling any further feature requests. However, Monit is in buster-backports so we could certainly still relook at it.