mastodon / chart

Helm chart for Mastodon deployment in Kubernetes
GNU Affero General Public License v3.0

WIP: Create optional monitoring configuration for Mastodon using Grafana #12

Open ywwg opened 1 year ago

ywwg commented 1 year ago

Hi all, this is a draft PR of a change to add optional Grafana+Prometheus monitoring to Mastodon instances deployed with this chart set. The goal is to enable instance admins to very quickly add monitoring with all of the tricky parts of wiring together the various scrapers and config files done automatically.

This is a work in progress, but it is functional. The main question I'd like to ask with this PR is: is this a feature the Mastodon project is interested in including? I don't want to do the work of continuing to polish and refactor the config if it's not a direction you want to go. The current major issues are the hardcoded values inside values.yaml, which I would like to replace with auto-generated URLs; I am having trouble figuring out how to do that in Helm given the available ways to configure the services.
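To give a rough idea of the shape I'm aiming for, here is a sketch of the values.yaml layout (all keys and URLs below are illustrative placeholders, not the final structure in this PR):

```yaml
# values.yaml sketch -- keys and URLs are placeholders, not the final layout
monitoring:
  enabled: true                    # one switch to turn the whole stack on or off
  grafana:
    # currently a hardcoded URL; ideally this would be derived from the release name
    rootUrl: https://grafana.mastodon.example.com
  prometheus:
    # hardcoded scrape target for the statsd exporter; should be auto-generated
    # from the rendered service name instead
    statsdExporterAddress: mastodon-statsd-exporter:9102
```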

Major TODOS:

ineffyble commented 1 year ago

@ywwg It looks like Helm is trying to template your txt file 🙃

ywwg commented 1 year ago

> @ywwg It looks like Helm is trying to template your txt file 🙃

Any chance you can re-trigger the workflow? I think I might have fixed it, but I can't reproduce the issue locally, so I'm not sure.

ywwg commented 1 year ago

I am new to Helm, but it appears that NOTES.txt is treated as a template file that Helm renders, rather than being copied through verbatim.

I believe the version of helm in the CI is too old: https://github.com/prometheus-community/helm-charts/issues/2723

Updated Helm to 3.7 to see if that fixes it.
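For reference, the change amounts to pinning the Helm version used by the workflow, roughly like this (the actual workflow file and action used in this repo may differ; azure/setup-helm is just one way to do it):

```yaml
# Sketch of a workflow step pinning Helm -- file name and surrounding steps may differ
      - name: Set up Helm
        uses: azure/setup-helm@v3
        with:
          version: v3.7.2   # older Helm versions choke on the NOTES.txt templating
```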

deepy commented 1 year ago

Doesn't it seem a little bit heavy to add 4 more subcharts here?

ywwg commented 1 year ago

> Doesn't it seem a little bit heavy to add 4 more subcharts here?

That's one reason I wanted to post the in-progress chart: to get a sense of what the goal of this chart is and how full-featured y'all want it to be. For instance, if this chart exists just to deploy Mastodon, then it's probably superfluous to include the monitoring stack. I was thinking this chart could be used by admins who want to deploy a solid Mastodon instance but are maybe less familiar with running an important service like this. I have seen a lot of messages in my feeds from admins who were surprised to run out of disk, or to find that the server went down overnight. In that light, this chart could help these less-experienced admins set up a more robust instance without having to worry about configuring monitoring themselves. (And those who don't need the monitoring stack, or want to roll their own, can disable it easily.)

To answer your question specifically: in order to have a functional monitoring stack we need at least three of the charts: Prometheus, which gathers and stores the metrics; the statsd exporter, which records metrics from Mastodon; and Grafana, to view the metrics. We could easily drop the postgresql chart if we don't care about providing access to Postgres stats.
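Concretely, the dependency layout looks roughly like this in Chart.yaml (the versions are placeholders; the conditions let each piece be disabled independently):

```yaml
# Chart.yaml sketch -- versions are placeholders
dependencies:
  - name: prometheus
    version: "~15.0.0"
    repository: https://prometheus-community.github.io/helm-charts
    condition: prometheus.enabled
  - name: prometheus-statsd-exporter
    version: "~0.5.0"
    repository: https://prometheus-community.github.io/helm-charts
    condition: prometheus-statsd-exporter.enabled
  - name: grafana
    version: "~6.40.0"
    repository: https://grafana.github.io/helm-charts
    condition: grafana.enabled
```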

It also might help to talk about what "heavy" means in this case -- do you mean the maintenance burden of keeping them functional?

renchap commented 1 year ago

Hi there, and first of all thanks @ywwg for this PR!

I am working on moving mastodon.social & mastodon.online to K8s. For now, mastodon.online has been deployed using home-made K8s resources (we needed to move it fast), but we are looking into using the official chart for those servers.

We have our own existing monitoring deployment, and we don't want the Prometheus stack deployed by this chart. But it would probably be a good idea to have a way to get the dashboards and the various Monitor resources created by the chart, so that an existing prometheus-operator deployment on the cluster would start scraping the resources.

Would this be a valid alternative to what you have done?
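For example, something along these lines gated behind a flag (the value keys and helper names here are hypothetical, not taken from this PR):

```yaml
# templates/servicemonitor.yaml -- hypothetical sketch; value keys and helper names are assumptions
{{- if .Values.metrics.serviceMonitor.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ include "mastodon.fullname" . }}-statsd
  labels:
    {{- include "mastodon.labels" . | nindent 4 }}
spec:
  selector:
    matchLabels:
      {{- include "mastodon.selectorLabels" . | nindent 6 }}
  endpoints:
    - port: metrics
      interval: 30s
{{- end }}
```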

> I was thinking this chart could be used by admins who want to deploy a solid mastodon instance and are maybe less familiar with running an important service like this.

For this concern, I think we could have a "meta" chart that has more sub-chart dependencies and can (optionally) install all the parts of a Mastodon deployment, including Postgres, Redis, Prometheus… so people who want a fully-managed setup have it.

On another topic, I see you chose to deploy the statsd_exporter as its own Deployment. Did you consider running the statsd_exporter as a sidecar in the Puma & Sidekiq pods, with all of those being scraped by Prometheus? This is the setup recommended in the exporter's README and what I deployed on mastodon.online, with good results so far.
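Roughly this shape in the pod spec (the image tags and ports are illustrative, not a copy of what we run):

```yaml
# Sidecar sketch for the web (Puma) pod -- the Sidekiq pod gets the same sidecar
containers:
  - name: web
    image: ghcr.io/mastodon/mastodon:v4.2.0   # tag is a placeholder
    env:
      - name: STATSD_ADDR
        value: "localhost:9125"               # Mastodon sends statsd metrics to the sidecar
  - name: statsd-exporter
    image: prom/statsd-exporter:v0.26.0       # version is a placeholder
    args:
      - --statsd.listen-udp=:9125
      - --web.listen-address=:9102
    ports:
      - name: metrics
        containerPort: 9102                   # Prometheus scrapes this port
```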

The last issue I am looking to solve is being able to host multiple Mastodon instances on the same cluster / Prometheus monitoring. This is important for us, so I am looking into adding a label to all Mastodon-related metrics to specify the Mastodon server/instance the metrics refer to, and patching the dashboards to be able to view metrics only from the selected instance. Does this make sense to you?
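For example, a relabeling on the scrape endpoint could attach the per-server label (the label name and value are only a suggestion):

```yaml
# ServiceMonitor endpoint sketch -- the "mastodon_instance" label name is just a suggestion
endpoints:
  - port: metrics
    relabelings:
      - action: replace
        targetLabel: mastodon_instance
        replacement: mastodon.online
```

The Grafana dashboards would then get a template variable on mastodon_instance so each panel can be filtered to a single server.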