Closed by gordan-bobic 1 year ago
I made an MVP version of this feature. Most of the metrics have different units or need a different calculation to get the value, so I had to add some extra queries to the grafana dashboards for those metrics.
Note that snmp_exporter fetches the metrics based on the SNMP MIBs. I customized the config based on the MIBs that net-snmp uses (IF-MIB, HOST-RESOURCES-MIB and UCD-SNMP-MIB). I believe net-snmp is commonly used on Linux-based systems? If a device uses different MIBs, we won't get its metrics data.
And please let me know if the following features are necessary:
It would be good to default to public / v2c, but have support for a custom/selectable community string/version/username/password if it isn't too much hassle.
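A minimal sketch of the defaulting behaviour requested here: public / v2c unless the user overrides it, with a username and password for v3. The `SnmpOptions` class and its field names are hypothetical and are not taken from ssm-server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SnmpOptions:
    """Hypothetical container for the SNMP settings of one remote node."""
    version: str = "v2c"          # default requested in the thread
    community: str = "public"     # default requested in the thread
    username: Optional[str] = None
    password: Optional[str] = None

    def validate(self) -> "SnmpOptions":
        # Only the three versions mentioned in the thread are accepted.
        if self.version not in ("v1", "v2c", "v3"):
            raise ValueError(f"unsupported SNMP version: {self.version}")
        # Assumption: v3 is user-based, so a username is required.
        if self.version == "v3" and not self.username:
            raise ValueError("SNMP v3 needs a username")
        return self


defaults = SnmpOptions().validate()
custom = SnmpOptions(version="v3", username="monitor", password="secret").validate()
```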
Is it better to pre-process different units into the same units that are currently used? Or is it better to put in detection and conditionals in the graphs themselves?
Pre-processing the metrics would need an extra process. I don't think there is an easy way to do that in prometheus; we may need to fetch the original metrics first and then put the new metrics back into prometheus. I think it's better/easier to just put different queries into the grafana dashboards.
I meant modifying the snmp exporter to read whatever snmp sends and then write it out to prometheus in the same format that node_exporter uses. We only really care about the subset of snmp data that we also get from the regular node_exporter. If snmp sends a little more, that's fine; if it sends multiples more, we should probably filter it to avoid excessive prometheus bloat.
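The rename, rescale and filter idea can be sketched roughly as follows. This is only an illustration, not the change that was made to snmp_exporter: the `CONVERSIONS` table and the `convert` function are hypothetical, and the node_exporter-style names on the right are assumptions about which series the dashboards use. The factor of 1024 relies on the UCD-SNMP-MIB memory objects being reported in kB.

```python
# Hypothetical mapping: SNMP object name -> (node_exporter-style name, scale factor).
# Both the selection of objects and the target names are illustrative assumptions.
CONVERSIONS = {
    "memTotalReal": ("node_memory_MemTotal_bytes", 1024),   # kB -> bytes
    "memTotalSwap": ("node_memory_SwapTotal_bytes", 1024),  # kB -> bytes
    "memAvailSwap": ("node_memory_SwapFree_bytes", 1024),   # kB -> bytes
}


def convert(samples):
    """Rename and rescale known SNMP samples; drop everything else.

    Dropping unknown samples is the filtering step that keeps series the
    dashboards never use out of prometheus.
    """
    converted = {}
    for name, value in samples.items():
        if name not in CONVERSIONS:
            continue  # not part of the node_exporter subset we care about
        target, factor = CONVERSIONS[name]
        converted[target] = value * factor
    return converted


result = convert({"memTotalReal": 2048, "ifInOctets": 5})
```

With this table, `memTotalReal` is kept and scaled to bytes, while `ifInOctets` is dropped because it is not listed.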
I think I have a very slight preference toward modifying the exporter, but if it is hugely more difficult than modifying every graph that needs modifying to work with either, I'm OK with that. As long as it doesn't introduce more fragility or anomalies on the dashboards side.
OK, I see. I wouldn't say it's hugely difficult; it's workable, I only need some time to dig into the snmp_exporter project. You may need to fork the project first if we want to do it this way.
Just out of curiosity, since I haven't paid attention to prometheus resource usage: how bad does the bloat get as the metrics grow?
On small deployments it isn't a problem and can be safely ignored. When you have hundreds of servers, it starts to become problematic. I aim for a ball park of 1GB of RAM per monitored server.
OK, thanks for the explanation
OK, I converted all the snmp metrics to the format that node_exporter uses. There are plenty of metrics that only exist in node_exporter because net-snmp doesn't collect them. You may want to take a look at the dashboards running on flak; I added an SNMP instance there.
I also added more options for SNMP v1|v2c|v3 when adding the snmp instance, see the GIF below. Any advice on the UI/UX/functionality?
Node add screen looks good. Please make sure that if the monitored server with the specified name already exists, the metrics are recorded against that server, e.g. if we have a remote monitored node called flak as a remote mysql node, and we add a remote monitored node called flak for snmp monitoring, those should be treated as the same node, not two separate nodes with the same name, regardless of which is added first. For remote nodes we may have to combine multiple sources to get a complete picture.
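The same-node requirement can be illustrated with a small sketch that groups sources by node name, so the result does not depend on which source was added first. The `group_by_node` function and the `(name, source)` tuples are hypothetical and do not reflect how ssm-server stores nodes.

```python
from collections import defaultdict


def group_by_node(instances):
    """Group (node_name, source) pairs so each name maps to exactly one node."""
    nodes = defaultdict(set)
    for name, source in instances:
        nodes[name].add(source)
    return dict(nodes)


# Adding mysql first or snmp first gives the same single node "flak".
first = group_by_node([("flak", "mysql"), ("flak", "snmp")])
second = group_by_node([("flak", "snmp"), ("flak", "mysql")])
```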
Hmm... Something doesn't seem right. The home page shows CPU usage but system overview doesn't. The fact that the home page is showing CPU usage implies there is enough to bring up at least some kind of CPU usage representation.
Home page shows disk reads and disk writes, but the Disk Performance dashboards do not.
Disk I/O is out by an order of magnitude or two. I'm pretty sure flak isn't running with 1250% Disk I/O utilisation.
Disk I/O Size also seems off, I don't think a Disk I/O size shown at between 2MB and 10MB looks sane.
Please make sure that if the monitored server with the specified name already exists, the metrics are recorded against that server
Yeah, this is already done.
Hmm... Something doesn't seem right. The home page shows CPU usage but system overview doesn't. The fact that the home page is showing CPU usage implies there is enough to bring up at least some kind of CPU usage representation.
About this: those cpu metrics are read from /proc/stat (example below; the first line is the overall cpu data and the following lines are for each cpu core). The problem is that node_exporter collects the data for each cpu core, but net-snmp only collects the overall cpu data, so we can't convert this cpu metric from snmp to the node_exporter format.
The home page is showing CPU usage because I added an extra query to that dashboard; there is an hrProcessorLoad metric in snmp that reports the cpu load.
We can make the overall/average graphs work on the system overview page, though, but not the graphs for each cpu core.
[jason@flak ~]$ cat /proc/stat
cpu 6114661 24671 1602813 84552530 217994 0 160676 0 0 0
cpu0 1414341 5336 403623 21162769 52734 0 79029 0 0 0
cpu1 1572167 6025 399515 21129535 51669 0 32911 0 0 0
cpu2 1566016 6604 400451 21127957 56214 0 26940 0 0 0
cpu3 1562137 6706 399224 21132269 57376 0 21796 0 0 0
...
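To make the aggregate versus per-core distinction concrete, here is a rough sketch that splits /proc/stat text into the overall `cpu` line and the `cpuN` lines and derives a busy fraction from the counters. The function names are hypothetical. The counters are cumulative since boot, so a real graph would use the difference between two samples rather than a single reading.

```python
def parse_proc_stat(text):
    """Return (aggregate_counters, per_core_counters) from /proc/stat text."""
    aggregate, per_core = None, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue  # skip non-cpu lines such as "intr", "ctxt" or "..."
        counters = [int(v) for v in fields[1:]]
        if fields[0] == "cpu":
            aggregate = counters          # overall line: all that net-snmp has
        else:
            per_core[fields[0]] = counters  # cpu0, cpu1, ...: node_exporter only
    return aggregate, per_core


def busy_fraction(counters):
    """Fraction of time not spent in idle or iowait, since boot."""
    # Columns: user nice system idle iowait irq softirq steal (guest columns
    # are already included in user/nice, so they are left out of the total).
    total = sum(counters[:8])
    idle = counters[3] + counters[4]
    return (total - idle) / total


sample = """cpu 6114661 24671 1602813 84552530 217994 0 160676 0 0 0
cpu0 1414341 5336 403623 21162769 52734 0 79029 0 0 0
..."""
overall, cores = parse_proc_stat(sample)
overall_busy = busy_fraction(overall)
```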
Home page shows disk reads and disk writes, but the Disk Performance dashboards do not.
The disk reads and writes dashboards on the home page use the pgpgin/pgpgout data from /proc/vmstat, which node_exporter and net-snmp both collect. But the reads/writes dashboards on Disk Performance use data from /proc/diskstats, which net-snmp doesn't collect.
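For reference, /proc/vmstat is a list of `name value` lines, so the two counters behind the home page graphs can be picked out with a few lines of code. This is a sketch only: `read_vmstat_counters` is a hypothetical helper, and the numbers in the sample text are made up for illustration.

```python
def read_vmstat_counters(text, wanted=("pgpgin", "pgpgout")):
    """Extract the requested counters from /proc/vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            counters[parts[0]] = int(parts[1])
    return counters


# Made-up sample values, in the "name value" layout used by /proc/vmstat.
vmstat_sample = """nr_free_pages 12345
pgpgin 1000
pgpgout 2500
pswpin 0"""
paging = read_vmstat_counters(vmstat_sample)
```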
Disk I/O is out by an order of magnitude or two. I'm pretty sure flak isn't running with 1250% Disk I/O utilisation.
OK, this is fixed.
Disk I/O Size also seems off, I don't think a Disk I/O size shown at between 2MB and 10MB looks sane.
OK, this is fixed.
Current implementation looks good. Please send merge requests. Closing this as completed.
While we have better ways of getting stats from a process running on a server, for remote monitoring various devices and legacy servers, snmp_exporter would be quite handy.
Please add prometheus/snmp_exporter ( https://github.com/prometheus/snmp_exporter ) to ssm-server, and plumb it into remote server addition from the server side. I cannot think of a good reason to add it to the ssm-client package, though, since node_exporter already provides the required data.
This should feed the data into prometheus with the same metric names and units as the time series we use in grafana, so that the graphs don't have to be modified. That may necessitate modifying some of the snmp_exporter time series labels.