Collect system metrics via telegraf

bastelfreak commented 1 year ago

Hey @MartyEwings @m0dular, I would like your feedback on this one. I have one PE customer who has already raised a few support tickets for performance problems. We already have puppet_operational_dashboards in place, and it helps us get Puppet performance metrics, but it's tricky to map those to the available system resources: we currently can't tell whether the system is under heavy load. Our idea is to enable some more Telegraf inputs and store their data in InfluxDB as well, so that we can correlate system utilization with PE utilization. If you think that's a good addition to the module, I'm happy to improve the code and also add a dashboard. We're currently testing:

m0dular commented 1 year ago

Would this collect system metrics from Puppet infra nodes, or just the dashboard node itself? I haven't tested it yet, but at a glance it looks like it only adds a Telegraf input to the dashboard node to collect its local sar metrics.

I think this also hits on something we've been trying to figure out with this module, which is what should be included for customers vs what should be included for Support. We did add Charlie's sar script that can import sa files and populate the v2 System Metrics dashboard. This is great for us because most people have sar collection turned on and we collect them in a support script, but not useful for customers because the module doesn't automate it other than offering the load_metrics plan. I'm sure there's some way to collect sar metrics from remote nodes using the tools we have, but I don't know what the best way for this module to do it is.

One option might be to use the code from the import script to parse and ship local sar metrics to a remote InfluxDB. There could be a new profile class that you apply to Puppet infra nodes that does something like:

Ensure sar is installed and collecting metrics
Create a Telegraf output that points to our InfluxDB server
Configure an exec Telegraf input plugin that runs the script similar to here
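The three steps above could be sketched roughly as follows. This is only an illustration, not code from the module: the class name `profile::sar_metrics`, all of its parameters, and the script path are hypothetical, and it assumes the `telegraf::input` and `telegraf::output` defined types from the puppet-telegraf module are available on the node. How sar collection gets enabled differs between distributions, so the sketch only installs the package.

```puppet
# Hypothetical profile for Puppet infra nodes; names and paths are assumptions.
class profile::sar_metrics (
  String            $influxdb_url,  # URL of the dashboard node's InfluxDB (assumed parameter)
  String            $organization,  # InfluxDB organization (assumed parameter)
  String            $bucket,        # InfluxDB bucket (assumed parameter)
  Sensitive[String] $token,         # InfluxDB API token (assumed parameter)
  String            $script = '/opt/sar_metrics/sar_to_influx', # hypothetical script path
) {
  # Step 1: make sure sar is installed. Turning on periodic collection
  # (cron job or systemd timer) is distro-specific and left out here.
  package { 'sysstat':
    ensure => installed,
  }

  # Step 2: a Telegraf output that ships metrics to the remote InfluxDB.
  # Unwrapping the token puts it in the catalog; a real implementation
  # would want to handle the secret more carefully.
  telegraf::output { 'sar_influxdb_v2':
    plugin_type => 'influxdb_v2',
    options     => [{
      'urls'         => [$influxdb_url],
      'token'        => $token.unwrap,
      'organization' => $organization,
      'bucket'       => $bucket,
    }],
  }

  # Step 3: an exec input that runs the script. The script is assumed to
  # print InfluxDB line protocol on stdout.
  telegraf::input { 'sar_exec':
    plugin_type => 'exec',
    options     => [{
      'commands'    => [$script],
      'timeout'     => '30s',
      'data_format' => 'influx',
    }],
    require     => Package['sysstat'],
  }
}
```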

In general, we should decide whether this is a good solution for getting system metrics and whether it should live in the module or somewhere else. I did consider splitting out the plans and scripts we use internally into another repo, but haven't decided on anything yet.

bastelfreak commented 1 year ago

@m0dular thanks for the feedback. I think the module should be split. As far as I know, the plans and the system dashboard aren't used outside of the PE support team? And it's confusing for people that there's an empty dashboard (it's good that this is now configurable).

I cleaned up the Telegraf config and dropped the diskio and sysstat inputs because the dashboard doesn't use them. At the moment, neither the dashboard nor the inputs are configured by default. People who deploy everything on a single node can set collect_system_metrics and manage_telegraf_system_dashboard to true in the main class, or set collect_system_metrics to true in the telegraf::agent subclass. The class already configures an output, so I only add additional inputs.
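For a single-node deployment, that could look roughly like the snippet below. The parameter names are the ones used in this comment; whether they exist under these names in a released version of the module is an assumption.

```puppet
# Single node: enable the extra inputs and the system dashboard on the main class.
class { 'puppet_operational_dashboards':
  collect_system_metrics           => true,
  manage_telegraf_system_dashboard => true,
}

# Alternative (do not declare both): enable only the extra inputs on the
# Telegraf agent class, referred to elsewhere in this thread as
# puppet_operational_dashboards::telegraf::agent.
#
# class { 'puppet_operational_dashboards::telegraf::agent':
#   collect_system_metrics => true,
# }
```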

I prefer Telegraf over the sar implementation: why shouldn't we use the native Telegraf plugins? I had a longer conversation with David Sandilands about metrics, and I think it's important to understand whether and how heavily a system is utilized. That only works when system-specific and Puppet-specific metrics are combined.

m0dular commented 1 year ago

It looks like puppet_operational_dashboards::telegraf::system_metrics is only included via the main puppet_operational_dashboards class? So that would only allow for collecting system metrics from the main dashboard node, unless you manually configured a Telegraf output and applied puppet_operational_dashboards::telegraf::system_metrics to nodes.

Having puppet_operational_dashboards::telegraf::agent include the new class could work, though. If you added a new boolean to include it, set the parameters that depend on the main class, and added system metrics to the unless here, then I expect that would work. That would give you an input to collect the system metrics and an output to ship them.

MartyEwings commented 1 year ago

@m0dular can you have a look at the most recent updates?

m0dular commented 1 year ago

I tested this out, but I'm still unsure of what the desired behavior is. If the goal is to collect system metrics from PE infra nodes, this won't accomplish that because it only collects the new metrics from the dashboard node itself. If we want these metrics from PE nodes, we'd have to either create a new profile class or modify puppet_operational_dashboards::telegraf::agent like in my previous comment and apply that to said nodes.

puppetlabs / puppet_operational_dashboards

Collect system metrics via telegraf #182