Open TWJW-SANGER opened 1 month ago
I think we could have limited the impact of the MySQL outage on M/Monit if we had turned off the other instances earlier and had a clear understanding of which processes needed turning off. It's also worth noting that we purposefully moved from SQLite to MySQL during the eta -> theta OpenStack migration 4 years ago. The reasoning given was 'it will eventually break'. M/Monit themselves have a guide on migrating off SQLite, which suggests:
"M/Monit comes bundled and configured with SQLite as its database system. No extra setup is required. If you plan to use M/Monit to monitor more than, say 40-50 hosts, you may want to use MySQL or PostgreSQL instead as these database systems are faster and scale much better. If in doubt, start with SQLite."
I think there are a couple of options here:
Move back to SQLite.
Move to a self-hosted MySQL instance, i.e. run MySQL directly on the instance instead of going through the DBAs.
Identify whether we can move the DBA MySQL instance to a separate instance that is upgraded at a different time from all other MySQL instances.
Feels like a symptom of the way we turn off instances
Two separate use cases for M/Monit:
The team agreed that the preferred solution is a self-hosted MySQL instance.
Next steps:
User story: As a team, we would like to minimise the dependencies of Monit.d so that it is less likely to be impacted by service outages.
Who are the primary contacts for this story: TW, PSD
Who is the nominated tester for UAT: This is to be tested by the PSD team
Acceptance criteria To be considered successful the solution must allow:
Additional context: When the MySQL databases were taken offline for patching, we discovered what we think is a dependency of Monit on a MySQL database provided by the DBA team. Ideally, our monitoring solution would have no dependencies in common with the applications being monitored; otherwise, an outage of a shared dependency risks taking down the applications AND the system that alerts us to problems. In practice, Monit will need to depend on OpenStack, its instances, images and networking (which are monitored by other teams and whose failure would have a large impact well beyond PSD).
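The "no dependencies in common" goal above could be checked mechanically against a dependency inventory. The following is a minimal sketch only: the application names, dependency names, and the `shared_dependencies` helper are all hypothetical and are not taken from any real PSD configuration. It treats OpenStack as an accepted shared dependency, as described above.

```python
# Hypothetical dependency inventory. All names below are illustrative
# assumptions, not a real description of the PSD estate.
MONITOR_DEPS = {"openstack", "dba-mysql"}

APP_DEPS = {
    "app-a": {"openstack", "dba-mysql"},
    "app-b": {"openstack", "object-store"},
}

# Shared dependencies we accept as unavoidable (OpenStack, per the context above).
ACCEPTED_SHARED = {"openstack"}


def shared_dependencies(monitor_deps, app_deps, accepted):
    """Return {app: [deps]} for dependencies the monitor shares with each app,
    ignoring dependencies listed in `accepted`."""
    result = {}
    for app, deps in app_deps.items():
        overlap = (monitor_deps & deps) - accepted
        if overlap:
            result[app] = sorted(overlap)
    return result


if __name__ == "__main__":
    # Monitoring backed by the DBA-provided MySQL shares a dependency with app-a.
    print(shared_dependencies(MONITOR_DEPS, APP_DEPS, ACCEPTED_SHARED))
    # With a self-hosted database the overlap disappears.
    print(shared_dependencies({"openstack", "local-mysql"}, APP_DEPS, ACCEPTED_SHARED))
```

In this sketch, an empty result means that the monitoring system shares only the accepted dependencies with the monitored applications, which is the state the self-hosted MySQL option aims for.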