influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.6k stars 5.57k forks source link

ipmi_sensors should split multiple sensor readings with the same name #3416

Closed envy closed 6 years ago

envy commented 6 years ago

Feature Request

I'm monitoring a Dell R630 with telegraf and ipmi_sensors. The server has two CPU sockets and therefore reports temperatures for both CPUs. It would be nice, if I could access those two readings seperately as they are currently combined into one reading.

Proposal:

If multiple instances of a sensor are detected, they should be split. Example the Temp sensor:

$ ipmitool -I lanplus -H 10.1.100.103 -U root sdr
Password:
SEL              | Not Readable      | ns
Intrusion        | 0x00              | ok
Fan1A            | 5160 RPM          | ok
Fan2A            | 5160 RPM          | ok
Fan3A            | 5040 RPM          | ok
Fan4A            | 4920 RPM          | ok
Fan5A            | 5160 RPM          | ok
Fan6A            | 5040 RPM          | ok
Inlet Temp       | 18 degrees C      | ok
CPU Usage        | 8 percent         | ok
IO Usage         | 0 percent         | ok
MEM Usage        | 0 percent         | ok
SYS Usage        | 9 percent         | ok
Exhaust Temp     | 29 degrees C      | ok
Temp             | 39 degrees C      | ok
Temp             | disabled          | ns
... 

Current behavior:

The output comes from a server, which has only one CPU but it still applies:

$ telegraf --config-directory /etc/telegraf/telegraf.d/ --input-filter ipmi_sensor --test
* Plugin: inputs.ipmi_sensor, Collection 1
* Internal: 30s
...
> ipmi_sensor,unit=degrees_c,host=monitoring,name=temp,server=esxi0-mgmt.example.org status=1i,value=42 1509562189000000000
> ipmi_sensor,host=monitoring,name=temp,server=esxi0-mgmt.example.org value=0,status=0i 1509562189000000000
...

Two values are inserted for temp: 42 and 0. This makes it either difficult (only on CPU) or impossible (two or more CPUs) to find out the temperature of CPU0/CPU1/.../CPUn.

Desired behavior:

Instead of just one temp reading, a second one should be created to hold the second value. I propose the following naming schemes for further discussion:

  1. temp_0, temp_1, ..., temp_n Could be not that easy to implement, as you would need to look first for multiple instances before recording data.
  2. temp, temp_1, ..., temp_n First measurement is always called by its name. Further instances are suffixed with a number

For backwards compatibility reasons I would suggest the second option. This is also easier to implement, as you just need to create a new one if one with the same name already exists for the current parsing run.

Use case: [Why is this important (helps with prioritizing requests)]

Currently it is not possible to distinguish both temperature readings after the data has left telegraf, as only one measurement exists. Monitoring a multi CPU system's temperatures is therefore not possible. For servers with only one CPU but two sockets it is possbile but difficult to get the temperature as you need to employ e.g. the max function of InfluxDB to get only the non-zero value out of the series.

danielnelson commented 6 years ago

I like the second suggestion, though in the specific case above it seems like we should skip values like disabled that do not have a value.

envy commented 6 years ago

Yes, that would also be nice. I try to get some ipmitool output of some other Dell servers tomorrow that actually have two CPUs to see what the difference in output is (if there is any).

envy commented 6 years ago

So I gathered some output from two servers at work, one with IDRAC 6 and one with IDRAC 7: https://gist.github.com/envy/91cb29370cbf8414c58402014ff4f8df

The IDRAC 7 output matches the IDRAC 8 output of my R630 shown above. The IDRAC 6 output lists Temp even more often, however, no actual value is given. I don't know why yet, maybe it's a permission problem as I'm not the administrator of that machine.

Regarding skipping values, maybe it would be good to combine this with #3207 to filter out readings that have a status of ns/na ?

danielnelson commented 6 years ago

Oh right, I had forgotten about this other issue. So we'll just do the temp, temp_1 suggestion to complete this issue.

jhg03a commented 6 years ago

The information to properly disambiguate the information is present, but it changes the output to the parser.

> sudo ipmitool sdr
SEL              | Not Readable      | ns
Intrusion        | 0x00              | ok
Fan1             | 5040 RPM          | ok
Fan2             | 5160 RPM          | ok
...snip...
Riser 2 Presence | 0x00              | ok
Riser 3 Presence | 0x00              | ok
Temp             | 58 degrees C      | ok
Temp             | 71 degrees C      | ok
> sudo ipmitool sdr elist
SEL              | 72h | ns  |  7.1 | No Reading
Intrusion        | 73h | ok  |  7.1 |
Fan1             | 30h | ok  |  7.1 | 5040 RPM
Fan2             | 31h | ok  |  7.1 | 5160 RPM
...snip...
Riser 2 Presence | 4Eh | ok  |  7.1 | Connected
Riser 3 Presence | 4Fh | ok  |  7.1 | Connected
Temp             | 0Eh | ok  |  3.1 | 58 degrees C
Temp             | 0Fh | ok  |  3.2 | 69 degrees C

So in the case where you have a duplicate sensor name, you can probably disabiguate it using the entity id, for example Temp and 3.1 versus 3.2 in this case.

danielnelson commented 6 years ago

One option is to add a switch for the new format, which would free up breaking changes and we could add the entity_id as a tag.

[[inputs.ipmi_sensor]]
  schema_version = 2
jhg03a commented 6 years ago

I've got a complete PR for this. Working through process to get it submitted upstream.