influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

lustre: please add /sys/fs/lustre/health_check #12836

Closed lukeyeager closed 1 year ago

lukeyeager commented 1 year ago

Use Case

I would like to switch from the defunct HewlettPackard/lustre_exporter to telegraf, but the fact that the health of the lustre storage targets isn't monitored is a blocker.

Expected behavior

I would expect to see something like the following in :9273/metrics:

# HELP lustre2_health Current health check status
# TYPE lustre2_health gauge
lustre2_health 1

Actual behavior

Instead, I don't see any monitoring of the current health status.

Additional info

If it helps, here's how lustre_exporter added it: https://github.com/HewlettPackard/lustre_exporter/commit/762b9838152323b942ac74737e7ec8d5099b88ac

powersj commented 1 year ago

Hi,

After a quick read, the current lustre2 input plugin (which I assume is what you are using) looks to monitor Lustre stats via files under /proc/fs/lustre/*/stats.

/sys/fs/lustre/health_check

Is this different than what is documented in this paper, namely /proc/fs/lustre/healthcheck? Does one exist in some versions and not others?

Based on the HP code, it looks like if this file contains "healthy", then all is well and a 1 is returned; if anything else is read, a 0 is returned. Can you confirm that is the intended behavior?

If so, you could easily do this today with:

[[inputs.file]]
    name_override = "lustre2_health"
    files = ["/sys/fs/lustre/health_check"]
    data_format = "value"
    data_type = "string"
lustre2_health value="healthy" 1678477350000000000

And if you want a value of 1 for healthy and 0 for everything else:

[[processors.enum]]
  [[processors.enum.mapping]]
    namepass = "lustre2_health"
    field = "value"
    dest = "value"
    default = 0
    [processors.enum.mapping.value_mappings]
      healthy = 1

which produces:

lustre2_health value=1i 1678477561000000000
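
Putting the two snippets together, a minimal end-to-end sketch could look like the following (the prometheus_client output and its default :9273 port are assumed from the issue description, and namepass is shown at the processor level, where metric-filtering options normally go):

[[inputs.file]]
  ## Read the Lustre health file and report its contents as a string field
  name_override = "lustre2_health"
  files = ["/sys/fs/lustre/health_check"]
  data_format = "value"
  data_type = "string"

[[processors.enum]]
  ## Only rewrite the lustre2_health metric
  namepass = ["lustre2_health"]
  [[processors.enum.mapping]]
    ## Map "healthy" to 1; anything else falls back to the default of 0
    field = "value"
    dest = "value"
    default = 0
    [processors.enum.mapping.value_mappings]
      healthy = 1

[[outputs.prometheus_client]]
  ## Expose metrics at :9273/metrics (the default)
  listen = ":9273"
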
lukeyeager commented 1 year ago

/sys/fs/lustre/health_check Is this different than what is documented in this paper, namely /proc/fs/lustre/healthcheck? Does one exist in some versions and not others?

The location switched from sysfs to procfs in 2019: https://github.com/lustre/lustre-release/commit/5d368bd0b203aee8011426fd147fad3e42ac9f7f

If so, you could easily do this today with:

Nice, thanks for the workaround! This unblocks me. I still feel this would be a useful improvement to add to telegraf, so I'll leave this issue open.

powersj commented 1 year ago

The location switched from sysfs to procfs in 2019: https://github.com/lustre/lustre-release/commit/5d368bd0b203aee8011426fd147fad3e42ac9f7f

Thanks for looking into that.

Nice, thanks for the workaround! This unblocks me. I still feel this would be a useful improvement to add to telegraf, so I'll leave this issue open.

Glad it helps - I did make a small edit to the file example, but it should do the same thing.

We can look into adding the health check.

Next steps: extend the lustre2 input plugin to read the health_check file and report 0 (not healthy) or 1 (healthy).

lukeyeager commented 1 year ago

I'm finally transitioning from lustre_exporter to telegraf and I'm discovering that the workaround you gave isn't working - I don't get a health metric in the output:

$ cat /sys/fs/lustre/health_check
healthy

$ cat telegraf.conf
[agent]
  interval = "10s"
  omit_hostname = true
  debug = true
[[inputs.file]]
  name_override = "lustre2_health"
  files = ["/sys/fs/lustre/health_check"]
  data_format = "value"
  data_type = "string"
[[outputs.prometheus_client]]

$ telegraf -config /etc/telegraf/telegraf.conf
2023-06-26T16:12:14Z I! Loading config: /etc/telegraf/telegraf.conf
2023-06-26T16:12:14Z I! Starting Telegraf 1.27.1
2023-06-26T16:12:14Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-06-26T16:12:14Z I! Loaded inputs: file
2023-06-26T16:12:14Z I! Loaded aggregators:
2023-06-26T16:12:14Z I! Loaded processors:
2023-06-26T16:12:14Z I! Loaded secretstores:
2023-06-26T16:12:14Z I! Loaded outputs: prometheus_client
2023-06-26T16:12:14Z I! Tags enabled:
2023-06-26T16:12:14Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2023-06-26T16:12:14Z D! [agent] Initializing plugins
2023-06-26T16:12:14Z D! [agent] Connecting outputs
2023-06-26T16:12:14Z D! [agent] Attempting connection to [outputs.prometheus_client]
2023-06-26T16:12:14Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2023-06-26T16:12:14Z D! [agent] Successfully connected to outputs.prometheus_client
2023-06-26T16:12:14Z D! [agent] Starting service inputs
2023-06-26T16:12:24Z D! [outputs.prometheus_client] Wrote batch of 1 metrics in 36.537µs
2023-06-26T16:12:24Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics

$ curl -s localhost:9273/metrics | grep -Ev 'go_|process_' | wc -l
0

I've verified with strace that the file is being read successfully, but I can't yet work out why the series doesn't make it into the Prometheus output. It's quite likely that this is a newbie error.

powersj commented 1 year ago

Please add outputs.file and see what is produced there as well. I have a feeling this is because, without the enum processor, the value is still a string.
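
For reference, a minimal outputs.file sketch for this kind of debugging (writing to stdout is an assumption; any writable path works), added alongside the existing prometheus_client output:

[[outputs.file]]
  ## Print metrics to stdout in line protocol so the field types are visible
  files = ["stdout"]
  data_format = "influx"
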

lukeyeager commented 1 year ago

Ah! Yes, that was it. Thanks!

powersj commented 1 year ago

@lukeyeager - can you give the artifacts in #13756 a try and see if the health check value comes through?

telegraf-tiger[bot] commented 1 year ago

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem; if not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!