It is a good idea but... how do we identify the process (primary key)?
TBD
My primary goal for this request is to be able to hover over a line in a grafana graph and a tooltip will pop up with the full command line of the process I'm interested in. How can this best be accomplished?
primary = PID: This would be a good choice if we can figure out how to get Grafana to show a second field in the tooltip instead of the one we're querying on. I'm not so interested in the PID per se, but it does offer the convenience of guaranteeing uniqueness for each process.
primary = Command Line: I think this would be the most straightforward approach. But these lines can get pretty long, with lots of funny characters. Can InfluxDB handle this? Perhaps truncate at a set number of characters? This approach also leaves the danger of a program like Apache spawning multiple processes for the same thing; these wouldn't look unique in this case.
primary = PID + Command Line: Concatenating the PID and the command line might be the best approach to guarantee uniqueness if we can't solve the tooltip-shows-secondary-info problem.
primary = command: This does not get granular enough for me. Think of the Apache web server, or in my case a Jenkins node allowing simultaneous runs.
Thank you for considering this feature!
Wow! I also need this feature! My purpose is to capture the load of specific processes while running stress tests on several solutions. I think only nmon currently exports this information.
Regarding the primary key, I think a combination of the PID and the start time of the process (because of PID reuse) might be enough to uniquely identify a process.
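For illustration, a minimal sketch of that identity using psutil (the library Glances builds on); the PID here is made up:

import psutil

# (pid, create_time) survives PID reuse: a recycled PID gets a new start time.
p = psutil.Process(1234)  # hypothetical PID
identity = (p.pid, p.create_time())
print(identity)  # e.g. (1234, 1571234567.89)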
My team desperately needs this feature too. I've had to do some crazy stuff with the per process utilization flat file and rsync. Having it export directly to influxdb is our number one feature request
+1
I've been working my way through the data collection agents (collectd, telegraf, topbeat) and have been completely surprised, shocked even, that none of these agents collect all process CPU, memory and disk metrics. Some will allow specific process monitors to be defined in configuration files, but I'm looking for comprehensive process monitoring metrics stored in InfluxDB or Logstash.
Hate to say it, but this is straight forward in Windows performance logging.
I am about to test Glances specifically for this capability; I was expecting it to be there based on the data displayed in the command-line interface.
@lawre Another approach: use the PID as primary key and use tags to store the process name, full command line, process start time...
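As a sketch only (assuming the Python influxdb client; the measurement, tag and field values below are made up), a data point could look like:

from influxdb import InfluxDBClient

# PID as the identifying tag, with name and command line as extra tags.
client = InfluxDBClient(host="localhost", port=8086, database="glances")
client.write_points([{
    "measurement": "processlist",
    "tags": {"pid": "1234", "name": "nginx", "cmdline": "nginx: master process"},
    "fields": {"cpu_percent": 1.2, "memory_percent": 0.8},
}])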
What do you think?
@nicolargo that will work for me
I think that would work, good suggestion.
My proposal to use the PID as primary key and store the process name and command line in tags would only work for InfluxDB: the current architecture of the Glances export module is not tied to the storage backend (the same functions are used to export to CSV, InfluxDB or Cassandra...), and tags only exist in the InfluxDB ecosystem.
Two other points to keep in mind:
My new proposal: define a process list (for example, using a regular expression on the command line); only the filtered processes will be exported.
For example, if you want to monitor the NGinx process:
[processlist]
# Export process stats (export_* lines)
# Export NGinx processes (name matching ^nginx.* regexp) to foonginx key
export_foonginx=^nginx.*
The key is foonginx. It will export all the mandatory stats for processes whose command line starts with nginx (one line per process, or one line for all the processes; to be discussed).
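Extending the proposed syntax (a sketch; the second filter and its key are invented for illustration), several filters could sit side by side:

[processlist]
# Export NGinx processes to the foonginx key
export_foonginx=^nginx.*
# Export PostgreSQL processes to the barpg key (hypothetical)
export_barpg=^postgres.*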
What do you think?
@lawre: Any heads-up concerning my last proposal?
Sorry, I didn't see the alert for your question before.
collectd's processes plugin already allows regex matching of processes, so in our case we would just continue to use that. Is there a possibility for this to be an "InfluxDB only" feature, as it is probably the only database uniquely qualified for this type of data? I'm not sure any other fixed-key DB would even be able to handle this. I couldn't imagine reading it in a CSV where there could be 20-100 ever-changing PIDs to key off.
I just tried to code a first version in a local branch. The main problem is that the process stats update is done asynchronously in a specific thread, so when the stats are exported we are not sure that they are complete. One workaround is to force the update before the export, but it costs a lot of CPU...
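A minimal sketch of that workaround (the function and method names are illustrative, not the actual Glances internals):

def export_processes(exporter, processes):
    # Force a synchronous refresh right before exporting; this is what
    # costs the extra CPU mentioned above.
    processes.update()
    exporter.export(processes.getlist())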
Will be implemented in Glances version 3.0 (after the complete code refactor and change of the software architecture).
No time to work on this request for Glances 3.0. Need a contributor!
Just getting into Glances, and I doubt I (or anyone on my team) will have the bandwidth to be that contributor... but I want to broaden the scope beyond InfluxDB; let me know if it is better to spin this off into a new issue:
from the original issue:
better pinpoint what process is hogging our memory, network or cpu and for how long.
I would love to be able to configure actions/alerts to show the current "hogs", e.g.:
Warning or critical alerts (last 10 entries)
2019-01-27 10:14:30 (00:00:22) - CRITICAL on MEM (96.0) [Hogs: apached in dev (55%), gitlab worker in gitlab (41%)]
So the format I envision is: [{{process_name}} in {{container_name}} ({{hog%}})]
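A rough sketch of building that line in Python, using the values from the example above:

hogs = [("apached", "dev", 55), ("gitlab worker", "gitlab", 41)]
# [{{process_name}} in {{container_name}} ({{hog%}})]
hog_str = ", ".join(f"{name} in {container} ({pct}%)" for name, container, pct in hogs)
print(f"2019-01-27 10:14:30 (00:00:22) - CRITICAL on MEM (96.0) [Hogs: {hog_str}]")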
If this info could be logged to a file (that I could add to logstash) and displayed in the web interface alert list, it would be mega sexy.
Is this at all possible with the current API? or even with the command line somehow?
I ask for it in the alerts list because it persists in the UI for longer, so when I check intermittently it would be great to have this glimpse into the previously critical moments.
For now, I'm going to use a combination of https://github.com/ncabatoff/process-exporter and Glances (for container and GPU measurement). Ideally, I'd route all of this through the Grafana agent to expose a single target in Prometheus.
Feature previously implemented in the branch: https://github.com/nicolargo/glances/tree/issue794
The list of processes to export can be defined in the Glances configuration file:
#
# Define the list of processes to export.
# export is a comma-separated list of Glances filters.
export=.*firefox.*,username:nicolargo
or with the --export-process-filter option:
glances -C --export csv --export-csv-file /tmp/glances.csv --disable-plugin all --enable-plugin processlist --quiet --export-process-filter ".*python.*"
Feature merged into develop.
The documentation is here: https://github.com/nicolargo/glances/blob/develop/docs/aoa/ps.rst#export-process
Need beta testers for this feature!
cc: @unlikelyzero @gotjoshua @lawre @johnhill2 @gabrieljames @wingsof @alex-ruhl
What a great day to be alive! I’ve been looking for such a feature for years, it’s been ages since I last touched Glances, and I just redid my monitoring setup with Glances + InfluxDB + Grafana over the past few days. Just to learn my long-awaited feature has been implemented 2 weeks ago ❤️
I deployed upstream’s develop Glances on my server, and the process exporter works like a charm! I see data coming into InfluxDB and rendered in Grafana already. I don’t see anything wrong there; I’ll keep you updated if you want confirmation that it’s stable in the long run.
I’m wondering (as I’m investigating disks I/O bottlenecks from unmonitored processes), is there any reason per-process disk I/O metrics are not exported?
I am talking about the bytes R/s and W/s metrics which are already present in the curses display.
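For reference, psutil already exposes the underlying counters, so a sketch of computing those rates could look like this (the PID is made up, and io_counters() is not available on every platform):

import time
import psutil

p = psutil.Process(1234)  # hypothetical PID
io0, t0 = p.io_counters(), time.time()
time.sleep(1)
io1, t1 = p.io_counters(), time.time()
dt = t1 - t0
print(f"R/s: {(io1.read_bytes - io0.read_bytes) / dt:.0f}")
print(f"W/s: {(io1.write_bytes - io0.write_bytes) / dt:.0f}")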
EDIT: I faced a weird behaviour where Glances ended up claiming several dozen GiB of RAM, along with a few CPU cores at 100% (IIRC, 40 GiB and 4 cores at 100%). I’m investigating to check whether it comes from misuse, from develop, or from the process list export.
I cannot reproduce. I believe it came from my first experiments without filtering the output. CPU is still a bit high (2 cores at ~100%), but I believe that is known and expected.
I find the cardinality a bit odd (and difficult to manipulate afterwards). Is there a specific reason/constraint that requires an export such as:
| timestamp | 821811.num_threads | 821811.name | 821811.cpu_timesuser | 2950324.num_threads | 2950324.name | 2950324.cpu_timesuser |
|---|---|---|---|---|---|---|
| N | 1 | kworker/13:5-events | 0.02 | 22 | dotnet | 233.79 |
| N+1 | 1 | kworker/13:5-events | 0.02 | 22 | dotnet | 233.79 |

Instead of a format with 1 row per process and fewer columns? Such as:

| timestamp | pid | num_threads | name | cpu_timesuser |
|---|---|---|---|---|
| N | 821811 | 1 | kworker/13:5-events | 0.02 |
| N | 2950324 | 22 | dotnet | 233.79 |
| N+1 | 821811 | 1 | kworker/13:5-events | 0.02 |
| N+1 | 2950324 | 22 | dotnet | 233.79 |
I’m currently struggling to parse the results to build a Grafana dashboard from the InfluxDB backend, and the latter format would make it easier.
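For what it's worth, here is a sketch of the reshaping I currently do by hand (assuming pandas, and assuming the CSV's first column is named timestamp):

import pandas as pd

wide = pd.read_csv("/tmp/glances.csv")
# Columns look like "<pid>.<metric>"; split them back into two keys.
long = wide.melt(id_vars=["timestamp"], var_name="col", value_name="value")
long[["pid", "metric"]] = long["col"].str.split(".", n=1, expand=True)
long = (long.pivot_table(index=["timestamp", "pid"], columns="metric",
                         values="value", aggfunc="first")
            .reset_index())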
Additionally, regarding the format, I noticed a weird behavior where the processes were not aligned in the proper columns (have a look at the 270.pid column).
I reproduce consistently using the following command (and my own process list which includes a lot of processes):
glances -C --export csv --export-csv-file /tmp/glances.csv --export-process-filter ".*dotnet.*,*java*,*qemu*,*samba*,*lxd*,*docker*" --disable-plugin all --enable-plugin processlist --quiet
I’m wondering if it comes from the fact that I monitor a lot of processes (some of which terminate, while others spawn)?
Hi @bLuka and thanks for the feedback.
Concerning the CSV export, I want to keep one line per timestamp because the header should be the same for each line.
This behavior is also used for other plugins like network, where stats are exported with the same kind of flattened format (one column group per interface).
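For illustration, the network CSV header follows the same scheme (the interface names here are just examples):

timestamp | eth0.rx | eth0.tx | lo.rx | lo.tx | ...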
For the InfluxDB export, this is not the same behavior: all the stats are exported line by line, and the pid becomes a tag (in the InfluxDB data model). I just added the process name as another tag. So it should simplify the way InfluxDB stores the information, and also simplify Grafana dashboard creation.
For the "weird" behavior, I also reproduce it on my side. Without any creation or deletion of an exported process, the columns are generated in a different order every minute...
The problem is related to the cache_timeout=60 in the processes.py file:
class GlancesProcesses(object):
    """Get processed stats using the psutil library."""

    def __init__(self, cache_timeout=60):
        """Init the class to collect stats about processes."""
        ...
If I change cache_timeout from 60 to 30, the glitch appears every 30 seconds...
@bLuka The last commit should correct the issue with column alignment. Stats are now sorted before export.
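Conceptually, the fix is something like this (a sketch, not the exact code from the commit):

def sort_for_export(stats):
    # Sort by PID so the CSV columns keep a stable order between exports.
    return sorted(stats, key=lambda p: p['pid'])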
For the last point (a new process appearing, or an existing one disappearing), I do not know what the best solution is:
1) Generate a new CSV file with a new header.
2) Change the header and add a new column if a new process is created.
3) With 2), if a process is stopped, the column will be filled with empty values.
Any advice?
Hey all. I've also been testing this feature, so I'll report here in the future in case I have some feedback.
Just a quick question: is there any way to export to the CSV only the N (user-defined) processes with the highest CPU usage? Or is the process filter with regular expressions the only way to filter what is saved in the CSV (or anywhere else)? I guess it would be hard to have PIDs changing at every timestamp, as the schema of the exported table is defined at the beginning.
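Conceptually I am after something like this (a sketch of the requested behaviour, not existing Glances code):

def top_n_by_cpu(processes, n):
    # Keep only the n processes with the highest CPU usage.
    return sorted(processes, key=lambda p: p["cpu_percent"], reverse=True)[:n]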
@guidocioni For the moment it is not possible, but it would be a nice feature. Can you open a new issue?
Thanks !
I would really like some way to export the individual process metrics to InfluxDB so we can better pinpoint which process is hogging our memory, network or CPU, and for how long. I work in a software build environment, so these kinds of metrics can really help nail down bottlenecks in a Jenkins build.