prometheus-community / node-exporter-textfile-collector-scripts

Scripts for node-exporter's textfile collector
Apache License 2.0
512 stars 191 forks

Create instant_per_process_cpu_mem_usage.sh #157

Open deajan opened 1 year ago

deajan commented 1 year ago

Hello,

I've built this script (with a very small footprint) that collects per-process CPU metrics without having to name the processes in advance. Most other tools out there require setting up process group names in order to capture process metrics; other solutions only provide the non-instant CPU metrics given by ps.

Getting instant per process cpu usage is really useful for admins to quickly find a culprit.
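To make the "instant" versus "ps" distinction concrete: the %cpu column of ps is CPU time divided by the time the process has been running, so a long-lived process that has just started spinning still shows a low figure. An instant reading needs two samples a short interval apart. The following is a minimal Python sketch of that idea, not the script submitted in this pull request; the function names are hypothetical, and it assumes a Linux /proc filesystem.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # comm (field 2) may contain spaces or ')', so split after the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); utime is field 14 and stime is field 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU usage over the sampling interval, as a percentage of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def read_all_ticks(proc_root="/proc"):
    """Map pid -> cumulative CPU ticks for every process currently visible."""
    ticks = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                ticks[int(entry)] = parse_cpu_ticks(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited between listdir() and open()
    return ticks


def sample(interval_s=1.0):
    """Take two snapshots and return pid -> CPU percent over the interval."""
    hz = os.sysconf("SC_CLK_TCK")
    before = read_all_ticks()
    time.sleep(interval_s)
    after = read_all_ticks()
    return {
        pid: cpu_percent(before[pid], after[pid], interval_s, hz)
        for pid in after
        if pid in before  # skip processes that started during the interval
    }
```

Calling sample(1.0) on a Linux host would return one entry per process that existed at both snapshots. Processes that start and exit within the interval are missed, which is a limitation of any two-snapshot approach.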

I understand that you discourage shell scripts in favor of the Python client, but this one is mostly a big one-liner, and rewriting it in Python would only make its footprint larger.

Would you mind merging this one? I've found no other solution out there that achieves the same thing, so I don't think I've reinvented the wheel ;)

I can also provide the corresponding Grafana dashboard, of course: [image]

Hope this will help any other admins ;) Best regards.

dswarbrick commented 1 year ago

How does this compare to what process_exporter can do?

deajan commented 1 year ago

process_exporter needs to be configured to pick up processes by name or group, so you have to know beforehand which processes you want to monitor.

This one just picks up whatever uses CPU or RAM. So if a new process shows up, it will be reported by the script as long as it uses resources.
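As an illustration of "reports whatever uses resources", here is a minimal Python sketch of how such a collector could render its samples in the text exposition format that the textfile collector reads. It is not the submitted script: the metric name process_cpu_percent, the label names name and cmdline, and the render function are assumptions made for this example.

```python
def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(samples, metric="process_cpu_percent"):
    """Render (name, cmdline, value) tuples, skipping processes at zero.

    The metric and label names are hypothetical, chosen for illustration.
    """
    lines = [
        f"# HELP {metric} Instant CPU usage per process.",
        f"# TYPE {metric} gauge",
    ]
    for name, cmdline, value in samples:
        if value <= 0:
            continue  # only processes that currently use the resource
        labels = f'name="{escape_label(name)}",cmdline="{escape_label(cmdline)}"'
        lines.append(f"{metric}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render([
        ("python", "python /some/script.py", 12.5),
        ("sleep", "sleep 3600", 0.0),
    ]), end="")
```

In this sketch the idle sleep process produces no line at all, so it is simply absent from the output rather than reported as zero.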

dswarbrick commented 1 year ago

process_exporter matches process names by regular expression, which can be as concise or as vague as you like. The example in the README would match any process:

process_names:
  - name: "{{.Comm}}"
    cmdline:
      - '.+'

Generally, a textfile collector should not overlap with functionality provided by PromQL. That includes topk-like behaviour, which could lead to metrics appearing and disappearing as they oscillate in and out of the collector's selection criteria (e.g. cpu-hungry, memory-hungry). This tends to cause issues with Prometheus' default look-behind interval of 5m, resulting in apparently stale metrics.
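The appearing-and-disappearing effect can be shown with a small simulation. This is a hedged Python sketch with made-up process names and values: it only models which series a filtering collector would export on each run, and does not model Prometheus itself.

```python
def exported_series(scrapes, threshold=0.0):
    """For each collector run, return the set of process names exported.

    `scrapes` is a list of {process_name: cpu_percent} dicts, one per run.
    Only processes strictly above `threshold` are exported.
    """
    return [
        {name for name, cpu in scrape.items() if cpu > threshold}
        for scrape in scrapes
    ]


def gaps(exported, name):
    """Indices of the runs in which `name` is missing from the export."""
    return [i for i, names in enumerate(exported) if name not in names]


if __name__ == "__main__":
    # Hypothetical readings from three consecutive collector runs.
    runs = [
        {"backup": 40.0, "nginx": 3.0},
        {"backup": 0.0, "nginx": 2.5},
        {"backup": 35.0, "nginx": 0.0},
    ]
    exported = exported_series(runs)
    print(gaps(exported, "backup"))  # runs where the backup series vanishes
    print(gaps(exported, "nginx"))
```

Each gap is a run in which the series does not exist at all, as opposed to existing with the value 0. Exporting every process unconditionally and applying the selection at query time, for instance with topk, avoids the gaps at the cost of more series.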

Another thing that I would consider an absolute no-no is including process IDs as labels, since they are pretty much by definition high entropy, and would also result in similar problems as described above. The process_exporter README highlights that also:

Using PID or StartTime is discouraged: this is almost never what you want, and is likely to result in high cardinality metrics which Prometheus will have trouble with.
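A rough way to see the cardinality problem is to count distinct label sets. The Python sketch below uses an invented scenario, a job that is relaunched once a minute and gets a new PID each time; the label names and the count_series helper are assumptions for illustration.

```python
def count_series(observations, label_keys):
    """Count the distinct label sets (time series) among the observations."""
    return len({tuple(obs[key] for key in label_keys) for obs in observations})


if __name__ == "__main__":
    # Hypothetical: the same command relaunched 60 times, one PID per launch.
    observations = [
        {"name": "python", "cmdline": "python /some/script.py", "pid": str(1000 + i)}
        for i in range(60)
    ]
    print(count_series(observations, ("name", "cmdline", "pid")))
    print(count_series(observations, ("name", "cmdline")))
```

With the pid label, every launch creates a new series; without it, all launches share one series.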

deajan commented 1 year ago

process_exporter matches process names by regular expression, which can be as concise or as vague as you like

I've actually played with process_exporter before trying to reinvent the wheel.

On the good side of process_exporter:

Therefore the tool is not meant for the same job. This one is more like a "record my top command output" tool, with process name and command line, without grouping anything, which is exactly what some people could need for diagnostics.

Generally, a textfile collector should not overlap with functionality provided by PromQL. That includes topk-like behaviour

What do you mean? My script doesn't "aggregate" anything like topk would do. Do you mean it should keep zero values to avoid stale metrics?

Another thing that I would consider an absolute no-no is including process IDs as labels, since they are pretty much by definition high entropy, and would also result in similar problems as described above. The process_exporter README highlights that also:

Makes sense. I'll remove the PIDs, even though I still think it makes sense, at an admin's diagnostic level, to know whether the process python /some/script.py is the same process as python /some/script.py ten minutes earlier, or a new instance with a different PID.
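One consequence of dropping the PID label is that two processes with the same name and command line end up with identical label sets, and a metric may not contain the same label set twice. A possible way around this, sketched below in Python, is to sum the values per (name, cmdline) and keep an instance count alongside. The aggregate function and the field names are hypothetical, not part of the submitted script.

```python
from collections import defaultdict


def aggregate(samples):
    """Sum CPU per (name, cmdline) and count how many processes were merged.

    Each sample is a dict with "name", "cmdline" and "cpu" keys
    (hypothetical field names). Returns {(name, cmdline): (cpu_sum, count)}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for sample in samples:
        key = (sample["name"], sample["cmdline"])
        totals[key][0] += sample["cpu"]
        totals[key][1] += 1
    return {key: (cpu, count) for key, (cpu, count) in totals.items()}


if __name__ == "__main__":
    merged = aggregate([
        {"name": "python", "cmdline": "python /some/script.py", "cpu": 10.0},
        {"name": "python", "cmdline": "python /some/script.py", "cpu": 5.5},
        {"name": "nginx", "cmdline": "nginx: worker process", "cpu": 1.0},
    ])
    for (name, cmdline), (cpu, count) in merged.items():
        print(name, cmdline, cpu, count)
```

The instance count does not tell an admin whether a process was restarted, but it does show when several copies of the same command are running at once.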