It is a good idea but... how do we identify the process (primary key)?
TBD
My primary goal for this request is to be able to hover over a line in a grafana graph and a tooltip will pop up with the full command line of the process I'm interested in. How can this best be accomplished?
primary = PID: This would be a good choice if we can figure out how to get Grafana to show a second field in the tooltip instead of the one we're querying on. I'm not so interested in the PID per se, but it does offer the convenience of guaranteeing uniqueness for each process.
primary = Command Line: I think this would be the most straightforward approach. But these lines can get pretty long, with lots of funny characters. Can InfluxDB handle this? Perhaps truncate at a set number of characters? This approach also leaves the danger of a program like Apache spawning multiple processes for the same thing; these wouldn't look unique in this case.
primary = PID + Command Line: Concatenating the PID and the command line might be the best approach to guarantee uniqueness if we can't solve the tooltip-shows-secondary-info problem.
primary = command: This does not get granular enough for me. Think of the Apache web server, or in my case a Jenkins node allowing simultaneous runs.
Thank you for considering this feature!
Wow! I also need this feature! My purpose is to capture the load of specific processes while running stress tests on several solutions. I think only nmon currently exports this information.
Regarding the primary key, I think a combination of the PID and the start time of the process (because of PID reuse) might be enough to uniquely identify a process.
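For illustration, a minimal sketch of that identity using psutil (the library Glances builds on); the PID here is made up:

import psutil

# (pid, create_time) survives PID reuse: a recycled PID gets a new start time.
p = psutil.Process(1234)  # hypothetical PID
identity = (p.pid, p.create_time())
print(identity)  # e.g. (1234, 1571234567.89)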
My team desperately needs this feature too. I've had to do some crazy stuff with the per process utilization flat file and rsync. Having it export directly to influxdb is our number one feature request
+1
I've been working my way through the data collection agents (collectd, telegraf, topbeat) and have been completely surprised, shocked even, that none of these agents collect all process CPU, memory and disk metrics. Some will allow specific process monitors to be defined in configuration files, but I'm looking for comprehensive process monitoring metrics stored in InfluxDB or Logstash.
Hate to say it, but this is straight forward in Windows performance logging.
I am about to test Glances specifically for this capability; I was expecting it to be there based on the data displayed in the command-line interface.
@lawre Another approach: use the PID as primary key and use tags to store the process name, full command line, process start time...
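As a sketch only (assuming the Python influxdb client; the measurement, tag and field values below are made up), a data point could look like:

from influxdb import InfluxDBClient

# PID as the identifying tag, with name and command line as extra tags.
client = InfluxDBClient(host="localhost", port=8086, database="glances")
client.write_points([{
    "measurement": "processlist",
    "tags": {"pid": "1234", "name": "nginx", "cmdline": "nginx: master process"},
    "fields": {"cpu_percent": 1.2, "memory_percent": 0.8},
}])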
What do you think?
@nicolargo that will work for me
I think that would work, good suggestion.
My proposal to use the PID as primary key and store the process name and command line in tags would only work for InfluxDB: the current architecture of the Glances export module is not tied to the storage backend (the same functions are used to export to CSV, InfluxDB or Cassandra...), and tags only exist in the InfluxDB ecosystem.
Two other points to keep in mind:
My new proposal: define a process list (for example, using a regular expression on the command line); only the filtered processes will be exported.
For example, if you want to monitor the NGinx process:
[processlist]
# Export process stats (export_* lines)
# Export NGinx processes (name matching ^nginx.* regexp) to foonginx key
export_foonginx=^nginx.*
The key is foonginx. It will export all the mandatory stats for processes whose command line starts with nginx (one line per process, or one line for all the processes; to be discussed).
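Extending the proposed syntax (a sketch; the second filter and its key are invented for illustration), several filters could sit side by side:

[processlist]
# Export NGinx processes to the foonginx key
export_foonginx=^nginx.*
# Export PostgreSQL processes to the barpg key (hypothetical)
export_barpg=^postgres.*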
What do you think?
@lawre: Any heads-up concerning my last proposal?
Sorry, I didn't see the alert for your question before.
collectd's processes plugin already allows regex matching of processes, so in our case we would just continue to use that. Is there a possibility for this to be an "InfluxDB only" feature, as it is probably the only database uniquely qualified for this type of data? I'm not sure any other fixed-key DB would even be able to handle this. I couldn't imagine reading it in a CSV where there could be 20-100 ever-changing PIDs to key off.
I just tried to code a first version in a local branch. The main problem is that the process stats update is done asynchronously in a specific thread, so when the stats are exported we are not sure that they are complete. One workaround is to force the update before the export, but it costs a lot of CPU...
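A minimal sketch of that workaround (the function and method names are illustrative, not the actual Glances internals):

def export_processes(exporter, processes):
    # Force a synchronous refresh right before exporting; this is what
    # costs the extra CPU mentioned above.
    processes.update()
    exporter.export(processes.getlist())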
Will be implemented in Glances version 3.0 (after the complete code refactor and change of the software architecture).
No time to work on this request for Glances 3.0. Need a contributor!
Just getting into Glances, and I doubt I (or anyone on my team) will have the bandwidth to be that contributor... but I want to broaden the scope beyond InfluxDB; let me know if it is better to spin this off into a new issue:
from the original issue:
better pinpoint what process is hogging our memory, network or cpu and for how long.
I would love to be able to configure actions/alerts to show the current "hogs", e.g.:
Warning or critical alerts (last 10 entries)
2019-01-27 10:14:30 (00:00:22) - CRITICAL on MEM (96.0) [Hogs: apached in dev (55%), gitlab worker in gitlab (41%)]
So the format I envision is: [{{process_name}} in {{container_name}} ({{hog%}})]
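A rough sketch of building that line in Python, using the values from the example above:

hogs = [("apached", "dev", 55), ("gitlab worker", "gitlab", 41)]
# [{{process_name}} in {{container_name}} ({{hog%}})]
hog_str = ", ".join(f"{name} in {container} ({pct}%)" for name, container, pct in hogs)
print(f"2019-01-27 10:14:30 (00:00:22) - CRITICAL on MEM (96.0) [Hogs: {hog_str}]")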
If this info could be logged to a file (that I could add to logstash) and displayed in the web interface alert list, it would be mega sexy.
Is this at all possible with the current API? or even with the command line somehow?
I ask for it in the alerts list because it persists in the UI for longer, so when I check intermittently it would be great to have this glimpse into the previously critical moments.
For now, I'm going to use a combination of https://github.com/ncabatoff/process-exporter and Glances (for container and GPU measurement). Ideally, I'd route all of this through the Grafana agent to expose a single target in Prometheus.
Feature previously implemented in the branch: https://github.com/nicolargo/glances/tree/issue794
The list of processes to export can be defined in the Glances configuration file:
#
# Define the list of processes to export.
# export is a comma-separated list of Glances filters.
export=.*firefox.*,username:nicolargo
or with the --export-process-filter option:
glances -C --export csv --export-csv-file /tmp/glances.csv --disable-plugin all --enable-plugin processlist --quiet --export-process-filter ".*python.*"
Feature merged into develop.
The documentation is here: https://github.com/nicolargo/glances/blob/develop/docs/aoa/ps.rst#export-process
Need beta testers for this feature!
cc: @unlikelyzero @gotjoshua @lawre @johnhill2 @gabrieljames @wingsof @alex-ruhl
What a great day to be alive! I’ve been looking for such a feature for years, it’s been ages since I last touched Glances, and I just redid my monitoring setup with Glances + InfluxDB + Grafana over the past few days. Just to learn my long-awaited feature has been implemented 2 weeks ago ❤️
I deployed upstream’s develop Glances on my server, and the process exporter works like a charm! I see data coming into InfluxDB and rendered in Grafana already. I don’t see anything wrong there; I’ll keep you updated if you want confirmation that it’s stable in the long run.
I’m wondering (as I’m investigating disks I/O bottlenecks from unmonitored processes), is there any reason per-process disk I/O metrics are not exported?
I am talking about the bytes R/s and W/s metrics which are already present in the curses display.
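For reference, psutil already exposes the underlying counters, so a sketch of computing those rates could look like this (the PID is made up, and io_counters() is not available on every platform):

import time
import psutil

p = psutil.Process(1234)  # hypothetical PID
io0, t0 = p.io_counters(), time.time()
time.sleep(1)
io1, t1 = p.io_counters(), time.time()
dt = t1 - t0
print(f"R/s: {(io1.read_bytes - io0.read_bytes) / dt:.0f}")
print(f"W/s: {(io1.write_bytes - io0.write_bytes) / dt:.0f}")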
EDIT: I faced a weird behaviour where Glances ended up claiming several dozen GiB of RAM, along with a few CPU cores at 100% (IIRC, 40 GiB and 4 cores at 100%). I’m investigating to check whether it comes from misuse, from develop, or from the process list export.
I cannot reproduce. I believe it came from my first experiments without filtering the output. CPU is still a bit high (2 cores at ~100%), but I believe that is known and expected.
I find the cardinality a bit odd (and difficult to manipulate afterwards). Is there a specific reason/constraint that requires an export such as:
| timestamp | 821811.num_threads | 821811.name | 821811.cpu_timesuser | 2950324.num_threads | 2950324.name | 2950324.cpu_timesuser |
|---|---|---|---|---|---|---|
| N | 1 | kworker/13:5-events | 0.02 | 22 | dotnet | 233.79 |
| N+1 | 1 | kworker/13:5-events | 0.02 | 22 | dotnet | 233.79 |

Instead of a format with 1 row per process and fewer columns? Such as:

| timestamp | pid | num_threads | name | cpu_timesuser |
|---|---|---|---|---|
| N | 821811 | 1 | kworker/13:5-events | 0.02 |
| N | 2950324 | 22 | dotnet | 233.79 |
| N+1 | 821811 | 1 | kworker/13:5-events | 0.02 |
| N+1 | 2950324 | 22 | dotnet | 233.79 |
I’m currently struggling to parse the results to build a Grafana dashboard from the InfluxDB backend, and the latter format would make it easier.
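For what it's worth, here is a sketch of the reshaping I currently do by hand (assuming pandas, and assuming the CSV's first column is named timestamp):

import pandas as pd

wide = pd.read_csv("/tmp/glances.csv")
# Columns look like "<pid>.<metric>"; split them back into two keys.
long = wide.melt(id_vars=["timestamp"], var_name="col", value_name="value")
long[["pid", "metric"]] = long["col"].str.split(".", n=1, expand=True)
long = (long.pivot_table(index=["timestamp", "pid"], columns="metric",
                         values="value", aggfunc="first")
            .reset_index())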
Additionally, regarding the format, I noticed a weird behavior where the processes were not aligned in the proper columns (have a look at the 270.pid column).
I reproduce consistently using the following command (and my own process list which includes a lot of processes):
glances -C --export csv --export-csv-file /tmp/glances.csv --export-process-filter ".*dotnet.*,*java*,*qemu*,*samba*,*lxd*,*docker*" --disable-plugin all --enable-plugin processlist --quiet
I’m wondering if it comes from the fact that I monitor a lot of processes (some of which terminate, while others spawn)?
Hi @bLuka and thanks for the feedback.
Concerning the CSV export, I want to keep one line per timestamp because the header should be the same for each line.
This behavior is also used for other plugins like network, where stats are exported with the same kind of flattened format (one column group per interface).
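For illustration, the network CSV header follows the same scheme (the interface names here are just examples):

timestamp | eth0.rx | eth0.tx | lo.rx | lo.tx | ...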
For the InfluxDB export, this is not the same behavior: all the stats are exported line by line, and the pid becomes a tag (in the InfluxDB data model). I just added the process name as another tag. So it should simplify the way InfluxDB stores the information, and also simplify Grafana dashboard creation.
For the "weird" behavior, I also reproduce it on my side. Without any creation or deletion of an exported process, the columns are generated in a different order every minute...
The problem is related to the cache_timeout=60 in the processes.py file:
class GlancesProcesses(object):
    """Get processed stats using the psutil library."""

    def __init__(self, cache_timeout=60):
        """Init the class to collect stats about processes."""
        ...
If I change cache_timeout from 60 to 30, the glitch appears every 30 seconds...
@bLuka The last commit should correct the issue with column alignment. Stats are now sorted before export.
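Conceptually, the fix is something like this (a sketch, not the exact code from the commit):

def sort_for_export(stats):
    # Sort by PID so the CSV columns keep a stable order between exports.
    return sorted(stats, key=lambda p: p['pid'])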
For the last point (a new process appearing, or an existing one disappearing), I do not know what the best solution is:
1) Generate a new CSV file with a new header.
2) Change the header and add a new column if a new process is created.
3) With 2), if a process is stopped, the column will be filled with empty values.
Any advice?
Hey all. I've also been testing this feature, so I'll report here in the future in case I have some feedback.
Just a quick question: is there any way to export to the CSV only the N (user-defined) processes with the highest CPU usage? Or is the process filter with regular expressions the only way to filter what is saved in the CSV (or anywhere else)? I guess it would be hard to have PIDs changing at every timestamp, as the schema of the exported table is defined at the beginning.
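Conceptually I am after something like this (a sketch of the requested behaviour, not existing Glances code):

def top_n_by_cpu(processes, n):
    # Keep only the n processes with the highest CPU usage.
    return sorted(processes, key=lambda p: p["cpu_percent"], reverse=True)[:n]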
@guidocioni For the moment it is not possible, but it would be a nice feature. Can you open a new issue?
Thanks !
I would really like some way to export the individual process metrics to InfluxDB so we can better pinpoint which process is hogging our memory, network or CPU, and for how long. I work in a software build environment, so these kinds of metrics can really help nail down bottlenecks in a Jenkins build.