Add prometheus snmp_exporter to ssm-server

gordan-bobic commented 1 year ago

While we have better ways of getting stats from a process running on a server, for remote monitoring various devices and legacy servers, snmp_exporter would be quite handy.

Please add prometheus/snmp_exporter ( https://github.com/prometheus/snmp_exporter ) to ssm-server, and plumb it into remote server addition from the server side. I cannot think of a good reason to add it to the ssm-client package, though, since node_exporter already provides the required data.

This should feed the data into prometheus with same metric names and units for the time series we use in grafana so that the graphs don't have to be modified. That may necessitate modifying some of the snmp_exporter time series labels.

oblitorum commented 1 year ago

I made an MVP version of this feature. Most of the metrics use different units or need a different calculation, so I had to add some extra queries to the Grafana dashboards for those metrics.

[Screenshot: CleanShot 2023-02-23 at 22 01 45]

Note that snmp_exporter fetches metrics based on the SNMP MIBs. I customised the config around the MIBs that net-snmp uses (IF-MIB, HOST-RESOURCES-MIB and UCD-SNMP-MIB); I believe net-snmp is commonly used on Linux-based systems? If a device uses different MIBs, it won't return the metrics data.

And please let me know if the following features are necessary:

The default SNMP community is public. Should we add an option to let the user set a custom community value?
The default SNMP version is v2c. Should we add an option to let the user choose the version? (Version 3 supports username/password authentication.)

gordan-bobic commented 1 year ago

It would be good to default to public / v2c, but have support for a custom/selectable community string/version/username/password if it isn't too much hassle.
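As a rough illustration of the defaults discussed here, the options could be modelled as below. This is only a sketch: `SnmpOptions` and its field names are hypothetical and are not taken from ssm-server, and it assumes v3 always needs both a username and a password, as described in this thread.

```python
from dataclasses import dataclass
from typing import Optional

SUPPORTED_VERSIONS = ("v1", "v2c", "v3")


@dataclass(frozen=True)
class SnmpOptions:
    """Hypothetical per-node SNMP settings (not ssm-server's real type)."""

    version: str = "v2c"        # default version suggested in the thread
    community: str = "public"   # default community suggested in the thread
    username: Optional[str] = None
    password: Optional[str] = None

    def __post_init__(self) -> None:
        if self.version not in SUPPORTED_VERSIONS:
            raise ValueError(f"unsupported SNMP version: {self.version!r}")
        # Assumption: v3 is always used with username/password credentials.
        if self.version == "v3" and not (self.username and self.password):
            raise ValueError("SNMP v3 requires a username and password")
        # v1 and v2c authenticate with the community string only.
        if self.version != "v3" and not self.community:
            raise ValueError("SNMP v1/v2c requires a community string")


defaults = SnmpOptions()
secured = SnmpOptions(version="v3", username="monitor", password="secret")
```

With no arguments the object falls back to public / v2c, and a v3 configuration without credentials is rejected up front rather than failing at scrape time.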

Is it better to pre-process different units into the same units that are currently used? Or is it better to put in detection and conditionals in the graphs themselves?

oblitorum commented 1 year ago

Pre-processing the metrics would need an extra processing step, and I don't think there is an easy way to do that in Prometheus: we would have to fetch the original metrics first and then write the converted metrics back to Prometheus. I think it's better/easier to just put different queries into the Grafana dashboards.

gordan-bobic commented 1 year ago

I meant modify the snmp exporter to read whatever SNMP sends and then write it out to Prometheus in the same format that node_exporter uses. We only really care about the subset of SNMP data that we get from the regular node_exporter. If SNMP sends a little more, that's fine; if it sends multiples more, we should probably filter it to avoid excessive Prometheus bloat.
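A minimal sketch of that idea, assuming a flat name-to-value view of one scrape: rename known SNMP objects to node_exporter-style names, convert units, and drop everything else. The mapping table and the `translate` helper are hypothetical and only illustrate the approach; they are not the conversion that was eventually implemented, and labels such as the interface name are ignored here.

```python
# Hypothetical mapping: SNMP object name -> (node_exporter-style name, scale).
# memTotalReal (UCD-SNMP-MIB) is reported in kB; ifHC*Octets (IF-MIB) are bytes.
SNMP_TO_NODE = {
    "memTotalReal": ("node_memory_MemTotal_bytes", 1024),
    "ifHCInOctets": ("node_network_receive_bytes_total", 1),
    "ifHCOutOctets": ("node_network_transmit_bytes_total", 1),
}


def translate(samples):
    """Rename and rescale known SNMP samples; drop anything not in the map."""
    converted = {}
    for name, value in samples.items():
        target = SNMP_TO_NODE.get(name)
        if target is None:
            continue  # filtered out so it never reaches Prometheus
        new_name, scale = target
        converted[new_name] = value * scale
    return converted


scrape = {"memTotalReal": 2048, "ifHCInOctets": 500, "sysUpTime": 12345}
result = translate(scrape)
```

Keeping the allow-list explicit means anything SNMP sends beyond the node_exporter subset is dropped by default, which addresses the bloat concern.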

I think I have a very slight preference toward modifying the exporter, but if it is hugely more difficult than modifying every graph that needs modifying to work with either, I'm OK with that. As long as it doesn't introduce more fragility or anomalies on the dashboards side.

oblitorum commented 1 year ago

OK, I see. I wouldn't say it's hugely difficult; it's workable, I just need some time to dig into the snmp_exporter project. You may need to fork the project first if we want to do it this way.

oblitorum commented 1 year ago

Just out of curiosity, since I haven't paid attention to Prometheus resource usage: how badly does it bloat as the number of metrics grows?

gordan-bobic commented 1 year ago

On small deployments it isn't a problem and can be safely ignored. When you have hundreds of servers, it starts to become problematic. I aim for a ballpark of 1GB of RAM per monitored server.

oblitorum commented 1 year ago

OK, thanks for the explanation

oblitorum commented 1 year ago

OK, I converted all the SNMP metrics to the format that node_exporter uses. Plenty of metrics exist only in node_exporter because net-snmp doesn't collect them. You may want to take a look at the dashboards running on flak; I added an SNMP instance there.

oblitorum commented 1 year ago

I also added more options for SNMP v1|v2c|v3 when adding the SNMP instance; see the GIF below. Any advice on the UI/UX/functionality?

[GIF: CleanShot 2023-03-06 at 08 31 39]

gordan-bobic commented 1 year ago

Node add screen looks good. Please make sure that if a monitored server with the specified name already exists, the metrics are recorded against that server. For example, if we have a remote monitored node called flak as a remote MySQL node, and we add a remote monitored node called flak for SNMP monitoring, those should be treated as the same node, not two separate nodes with the same name, regardless of which is added first. For remote nodes we may have to combine multiple sources to get a complete picture.
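The requirement can be sketched as a registry keyed by node name, where each name accumulates its monitoring sources. `NodeRegistry` and the source labels are hypothetical names used only for illustration; this is not how ssm-server stores nodes.

```python
from collections import defaultdict


class NodeRegistry:
    """Hypothetical registry: one entry per node name, many sources per entry."""

    def __init__(self):
        self._sources = defaultdict(set)

    def add(self, name, source):
        # Adding a source to an existing name extends that node
        # instead of creating a second node with the same name.
        self._sources[name].add(source)

    def sources(self, name):
        return sorted(self._sources.get(name, ()))

    def __len__(self):
        return len(self._sources)


registry = NodeRegistry()
registry.add("flak", "snmp")
registry.add("flak", "mysql")  # same node, regardless of which is added first
```

Because the sources are held in a set under the node name, the result is the same whichever source is registered first.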

Hmm... Something doesn't seem right. The home page shows CPU usage but system overview doesn't. The fact that the home page is showing CPU usage implies there is enough to bring up at least some kind of CPU usage representation.

Home page shows disk reads and disk writes, but the Disk Performance dashboards do not.

Disk I/O is out by an order of magnitude or two. I'm pretty sure flak isn't running with 1250% Disk I/O utilisation.

Disk I/O Size also seems off. I don't think a Disk I/O size of between 2MB and 10MB looks sane.

oblitorum commented 1 year ago

> Please make sure that if the monitored server with the specified name already exists, the metrics are recorded against that server

Yeah, this is already done.

> Hmm... Something doesn't seem right. The home page shows CPU usage but system overview doesn't. The fact that the home page is showing CPU usage implies there is enough to bring up at least some kind of CPU usage representation.

These CPU metrics are read from /proc/stat (example below: the first line is the overall CPU data and the following lines are per core). The problem is that node_exporter collects per-core data but net-snmp only collects the overall CPU data, so we can't convert this CPU metric from SNMP to the node_exporter format. The home page shows CPU usage because I added an extra query to that dashboard; there is an hrProcessorLoad metric in SNMP that reports the CPU load. We can make the overall/average graphs work on the System Overview page, but not the per-core graphs.

[jason@flak ~]$ cat /proc/stat 
cpu  6114661 24671 1602813 84552530 217994 0 160676 0 0 0
cpu0 1414341 5336 403623 21162769 52734 0 79029 0 0 0
cpu1 1572167 6025 399515 21129535 51669 0 32911 0 0 0
cpu2 1566016 6604 400451 21127957 56214 0 26940 0 0 0
cpu3 1562137 6706 399224 21132269 57376 0 21796 0 0 0
...
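To make the overall versus per-core distinction concrete, here is a small sketch that parses the sample above and computes a busy fraction for each line. The helper names are hypothetical, and the field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) is the one documented for /proc/stat. The counters are cumulative since boot, so a single sample gives a since-boot average; a live graph would use the difference between two samples.

```python
SAMPLE = """\
cpu  6114661 24671 1602813 84552530 217994 0 160676 0 0 0
cpu0 1414341 5336 403623 21162769 52734 0 79029 0 0 0
cpu1 1572167 6025 399515 21129535 51669 0 32911 0 0 0
cpu2 1566016 6604 400451 21127957 56214 0 26940 0 0 0
cpu3 1562137 6706 399224 21132269 57376 0 21796 0 0 0
"""


def parse_cpu_lines(text):
    """Return {"cpu": [...], "cpu0": [...], ...} from /proc/stat content."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(v) for v in fields[1:]]
    return counters


def busy_fraction(values):
    """Share of time not spent in idle or iowait, averaged since boot."""
    core = values[:8]          # guest time is already counted in user/nice
    total = sum(core)
    idle = core[3] + core[4]   # idle + iowait
    return (total - idle) / total if total else 0.0


stats = parse_cpu_lines(SAMPLE)
overall = busy_fraction(stats["cpu"])  # all that an aggregate-only source allows
per_core = {k: busy_fraction(v) for k, v in stats.items() if k != "cpu"}
```

Only `overall` can be derived from an aggregate line; the `per_core` values need the `cpu0`..`cpuN` lines, which is the data described above as missing from net-snmp.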

> Home page shows disk reads and disk writes, but the Disk Performance dashboards do not.

The disk reads and writes graphs on the home page use pgpgin/pgpgout from /proc/vmstat, which both node_exporter and net-snmp collect. The reads/writes graphs on Disk Performance use data from /proc/diskstats, which net-snmp doesn't collect.

> Disk I/O is out by an order of magnitude or two. I'm pretty sure flak isn't running with 1250% Disk I/O utilisation.

OK, this is fixed.

> Disk I/O Size also seems off, I don't think Disk I/O size shown at between 2MB and 10MB look sane.

OK, this is fixed.

gordan-bobic commented 1 year ago

Current implementation looks good. Please send merge requests. Closing this as completed.

shatteredsilicon / ssm-submodules

Add prometheus snmp_exporter to ssm-server #73