Closed lukeyeager closed 1 year ago
Hi,
After a quick read, the current lustre2 input plugin, which I assume is what you are using, looks to monitor lustre stats via files under /proc/fs/lustre/*/stats.
/sys/fs/lustre/health_check
Is this different than what is documented in this paper, namely /proc/fs/lustre/healthcheck? Does one exist in some versions and not others?
Based on the HP code, it looks like if this file contains "healthy", then all is well and a 1 is returned. And if anything else is read, then return 0. Can you confirm that is the intended behavior as well?
If so, you could easily do this today with:
[[inputs.file]]
name_override = "lustre2_health"
files = ["/sys/fs/lustre/health_check"]
data_format = "value"
data_type = "string"
lustre2_health value="healthy" 1678477350000000000
And if you wanted a number of 1 for healthy and 0 for everything else:
[[processors.enum]]
namepass = ["lustre2_health"]
[[processors.enum.mapping]]
field = "value"
dest = "value"
default = 0
[processors.enum.mapping.value_mappings]
healthy = 1
lustre2_health value=1i 1678477561000000000
> /sys/fs/lustre/health_check Is this different than what is documented in this paper, namely /proc/fs/lustre/healthcheck? Does one exist in some versions and not others?
The location switched from procfs to sysfs in 2019: https://github.com/lustre/lustre-release/commit/5d368bd0b203aee8011426fd147fad3e42ac9f7f
> If so, you could easily do this today with:
Nice, thanks for the workaround! This unblocks me. I still feel this would be a useful improvement to add to telegraf, so I'll leave this issue open.
> The location switched from procfs to sysfs in 2019: https://github.com/lustre/lustre-release/commit/5d368bd0b203aee8011426fd147fad3e42ac9f7f
Thanks for looking into that.
> Nice, thanks for the workaround! This unblocks me. I still feel this would be a useful improvement to add to telegraf, so I'll leave this issue open.
Glad it helps. I did make a small edit to the file example, but it should do the same thing.
We can look into adding the health check.
Next steps: extend the lustre2 input plugin to read the health_check file and report 0 (not healthy) or 1 (healthy).
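For anyone picking this up, the mapping described above could look roughly like the sketch below. This is not the lustre2 plugin's code: the names healthValue and readHealth are hypothetical, the path is the sysfs location mentioned earlier in this thread, and treating only an exact (whitespace-trimmed) "healthy" as 1 is an assumption about the intended behavior.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// healthValue maps the raw contents of a health_check file to a gauge value:
// 1 when the trimmed contents are exactly "healthy", 0 for anything else.
// (Hypothetical helper; exact-match semantics are an assumption.)
func healthValue(contents string) int {
	if strings.TrimSpace(contents) == "healthy" {
		return 1
	}
	return 0
}

// readHealth reads the file at path and converts its contents with healthValue.
// A read error is returned to the caller instead of being reported as 0,
// so "file missing" can be told apart from "not healthy".
func readHealth(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return healthValue(string(data)), nil
}

func main() {
	// Path taken from this thread; older releases may expose it under procfs.
	v, err := readHealth("/sys/fs/lustre/health_check")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check unavailable:", err)
		os.Exit(1)
	}
	fmt.Printf("lustre2_health value=%di\n", v)
}
```

Returning the read error separately is a design choice in this sketch; a plugin might prefer to log it and skip the metric for that interval.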
I'm finally transitioning from lustre_exporter to telegraf and I'm discovering that the workaround you gave isn't working - I don't get a health metric in the output:
$ cat /sys/fs/lustre/health_check
healthy
$ cat telegraf.conf
[agent]
interval = "10s"
omit_hostname = true
debug = true
[[inputs.file]]
name_override = "lustre2_health"
files = ["/sys/fs/lustre/health_check"]
data_format = "value"
data_type = "string"
[[outputs.prometheus_client]]
$ telegraf -config /etc/telegraf/telegraf.conf
2023-06-26T16:12:14Z I! Loading config: /etc/telegraf/telegraf.conf
2023-06-26T16:12:14Z I! Starting Telegraf 1.27.1
2023-06-26T16:12:14Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-06-26T16:12:14Z I! Loaded inputs: file
2023-06-26T16:12:14Z I! Loaded aggregators:
2023-06-26T16:12:14Z I! Loaded processors:
2023-06-26T16:12:14Z I! Loaded secretstores:
2023-06-26T16:12:14Z I! Loaded outputs: prometheus_client
2023-06-26T16:12:14Z I! Tags enabled:
2023-06-26T16:12:14Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"", Flush Interval:10s
2023-06-26T16:12:14Z D! [agent] Initializing plugins
2023-06-26T16:12:14Z D! [agent] Connecting outputs
2023-06-26T16:12:14Z D! [agent] Attempting connection to [outputs.prometheus_client]
2023-06-26T16:12:14Z I! [outputs.prometheus_client] Listening on http://[::]:9273/metrics
2023-06-26T16:12:14Z D! [agent] Successfully connected to outputs.prometheus_client
2023-06-26T16:12:14Z D! [agent] Starting service inputs
2023-06-26T16:12:24Z D! [outputs.prometheus_client] Wrote batch of 1 metrics in 36.537µs
2023-06-26T16:12:24Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
$ curl -s localhost:9273/metrics | grep -Ev 'go_|process_' | wc -l
0
I've verified with strace that the file is being read successfully. But I can't yet understand why the series doesn't make it into the prometheus output. I feel it's highly likely that this is a newbie error.
Please add outputs.file as well and see what is produced there. I have a feeling this is because the value is still a string, since this config does not include the enum processor.
Ah! Yes, that was it. Thanks!
@lukeyeager - can you give the artifacts in #13756 a try and see if the health check value comes through?
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem. If not, please try posting this question in our Community Slack or Community Forums, or provide additional details in this issue and request that it be re-opened. Thank you!
Use Case
I would like to switch from the defunct HewlettPackard/lustre_exporter to telegraf, but the fact that the health of the lustre storage targets isn't monitored is a blocker.
Expected behavior
I would expect to see something like the following in :9273/metrics:
Actual behavior
Instead, I don't see any monitoring of the current health status
Additional info
If it helps, here's how lustre_exporter added it: https://github.com/HewlettPackard/lustre_exporter/commit/762b9838152323b942ac74737e7ec8d5099b88ac