deajan opened this issue 1 year ago
How does this compare to what process_exporter can do?
process_exporter needs to be configured to pick up processes by name or group, so you have to know beforehand which processes you want to monitor.
This one just picks up whatever uses CPU or RAM. So if a new process shows up, it will be reported by the script as long as it uses resources.
process_exporter matches process names by regular expression, which can be as concise or as vague as you like. The example in the README would match any process:
```yaml
process_names:
  - name: "{{.Comm}}"
    cmdline:
    - '.+'
```
Generally, a textfile collector should not overlap with functionality provided by PromQL. That includes topk-like behaviour, which could lead to metrics appearing and disappearing as they oscillate in and out of the collector's selection criteria (e.g. CPU-hungry, memory-hungry). This tends to cause issues with Prometheus' default look-behind interval of 5m, resulting in apparently stale metrics.
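Selection of that kind belongs in the query rather than in the collector: the exporter keeps exporting every series, and topk only narrows the result at read time. A minimal sketch, using the process_exporter metrics shown further down:

```promql
# Five busiest process groups by CPU over the last 5 minutes,
# computed at query time instead of inside the collector.
topk(5, sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m])))
```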
Another thing that I would consider an absolute no-no is including process IDs as labels, since they are pretty much by definition high entropy, and would result in problems similar to those described above. The process_exporter README also highlights this:
> Using PID or StartTime is discouraged: this is almost never what you want, and is likely to result in high cardinality metrics which Prometheus will have trouble with.
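For illustration, the pid label could be aggregated away at query time, but that would not undo the damage: every PID churn still creates a brand-new series in storage. A sketch against the script's own metric (shown in the examples below):

```promql
# Collapses the high-entropy pid label at query time; the per-pid
# series are still created and stored, which is the actual problem.
sum without (pid) (top_process_cpu_usage)
```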
> process_exporter matches process names by regular expression, which can be as concise or as vague as you like
I've actually played with process_exporter before trying to reinvent the wheel.
On the not-so-good side of process_exporter:
You don't get to know the process's arguments. For example, if you run a Python script (or cockpit, or Ansible, or anything else invoked as a Python script), you will not know which Python program creates the CPU usage spike; you'll only get to know that it's python.
Example for setroubleshootd eating 100% CPU, as shown by my script:
```
top_process_cpu_usage{pid="2584",process="/usr/bin/python3",sanitized_args=" -Es /usr/sbin/tuned -l -P"} 0.2
top_process_cpu_usage{pid="15501",process="/usr/bin/python3",sanitized_args=" -s /usr/sbin/firewalld --nofork --nopid"} 0.1
top_process_cpu_usage{pid="23299",process="python3",sanitized_args=" test.py"} 5.0
top_process_cpu_usage{pid="45921",process="/usr/bin/python3",sanitized_args=" -Es /usr/sbin/setroubleshootd -f"} 99.7
```
Example for setroubleshootd eating 100% CPU, as shown by process_exporter:
```
namedprocess_namegroup_context_switches_total{ctxswitchtype="voluntary",groupname="python3"} 525551
namedprocess_namegroup_cpu_seconds_total{groupname="python3",mode="system"} 27.43
namedprocess_namegroup_cpu_seconds_total{groupname="python3",mode="user"} 18.41
namedprocess_namegroup_major_page_faults_total{groupname="python3"} 0
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="proportionalResident"} 4.164608e+06
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="proportionalSwapped"} 0
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="resident"} 7.090176e+06
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="swapped"} 0
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="virtual"} 9.658368e+06
namedprocess_namegroup_minor_page_faults_total{groupname="python3"} 877
namedprocess_namegroup_num_procs{groupname="python3"} 1
namedprocess_namegroup_num_threads{groupname="python3"} 1
namedprocess_namegroup_oldest_start_time_seconds{groupname="python3"} 1.685868499e+09
namedprocess_namegroup_open_filedesc{groupname="python3"} 3
namedprocess_namegroup_read_bytes_total{groupname="python3"} 0
namedprocess_namegroup_states{groupname="python3",state="Other"} 0
namedprocess_namegroup_states{groupname="python3",state="Running"} 0
namedprocess_namegroup_states{groupname="python3",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="python3",state="Waiting"} 0
namedprocess_namegroup_states{groupname="python3",state="Zombie"} 0
namedprocess_namegroup_threads_wchan{groupname="python3",wchan="do_select"} 1
namedprocess_namegroup_worst_fd_ratio{groupname="python3"} 0.0029296875
namedprocess_namegroup_write_bytes_total{groupname="python3"} 0
```
In process_exporter, CPU usage is shown as cumulative seconds since process start (condensed into one metric for all similarly named processes). While you can calculate a CPU usage percentage from that, it involves knowing two more variables: the total available CPU time in seconds and the number of CPU cores, the latter not being easy to obtain. Using something like `irate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100` would just provide CPU usage percent relative to a single core, not to total capacity (see the sketch below).
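For reference, a rough sketch of what that calculation takes, assuming node_exporter's node_cpu_seconds_total is also scraped on the same (single) host to derive the core count:

```promql
# Per-group CPU usage as a percentage of total machine capacity:
# consumed CPU seconds per second, divided by the core count.
sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m]))
  / scalar(count(node_cpu_seconds_total{mode="idle"}))
  * 100
```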
The two tools are therefore not meant for the same job: this one is akin to "record my `top` command output", with process name and command line, without grouping anything, which is exactly what some people need for diagnostics.
> Generally, a textfile collector should not overlap with functionality provided by PromQL. That includes topk-like behaviour

What do you mean? My script doesn't "aggregate" anything the way topk would.
Do you mean it should keep zero values to avoid stale metrics?
> Another thing that I would consider an absolute no-no is including process IDs as labels, since they are pretty much by definition high entropy, and would result in problems similar to those described above. The process_exporter README also highlights this:
Makes sense. I'll have the PIDs removed, even if I still think it makes sense at an admin's diagnostic level to know whether the process `python /some/script.py` is the same process as `python /some/script.py` ten minutes before, or a new instance with a different PID.
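In the meantime, the start-time metric process_exporter exposes (visible in the output above) could probably recover that signal at query time; a sketch, assuming the groupname from the earlier example:

```promql
# Greater than zero when the oldest start time of the group changed
# in the last 10 minutes, i.e. a matched process was (re)started.
changes(namedprocess_namegroup_oldest_start_time_seconds{groupname="python3"}[10m]) > 0
```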
Hello,
I've built this (very tiny footprint) script that provides per-process CPU metrics without requiring processes to be named in advance. Most other tools out there require setting up process group names in order to capture process metrics; other solutions provide non-instant CPU metrics as given by `ps`. Getting instant per-process CPU usage is really useful for admins to quickly find a culprit.
I understand that you discourage shell scripts in favor of the Python client, but this one is mostly a big one-liner, and would only lose its tiny footprint if rewritten in Python.
Would you mind merging this one? I've found no other solution out there that achieves the same, so hopefully I did not reinvent the wheel ;)
I can of course also provide the corresponding Grafana dashboard.
Hope this helps other admins ;) Best regards.