edit: The data definitely has not made it to the db:
> show series from zfs where host = 'atlas'
key
---
zfs,host=atlas
> show series from zfs_pool where host = 'atlas'
> show series from zfs_dataset where host = 'atlas'
(same result for the other truenas host)
Hey @zeus86, note that the --test option will prevent any output from being sent! It will only print the collected metrics to the console (as documented in the option's help text)! Please use --once if you want to collect only once and send the result to the output.
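For example, a one-shot run that actually sends the metrics to the configured outputs would look roughly like this (assuming the config file is in the current working directory):
./telegraf --config telegraf.conf --input-filter zfs --once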
I know this; this exact config is running as a service, please reopen the issue! The --test option was just used to provide output for this ticket.
Ok, so from the output it seems like the data is collected. Can you please show the output configuration (please redact sensitive information)? Do you use any processors or metric filtering somewhere?
Yes, that's what I've also wondered about. That's why I think some of Freenas/Truenas' pool and dataset names might be the issue here, because data from other input modules is submitted correctly to the database, as are zfs metrics from other linux hosts. There is nothing special in my config: I do not transform data by hand in any way, no filters, no processors, just inputs, one output and the agent configuration. The db endpoint is just a plain influx1 db with an http endpoint:
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
hostname = "atlas"
omit_hostname = false
[[outputs.influxdb]]
urls = [ "http://xxx:8086" ]
database = "hosts"
precision = "s"
retention_policy = "autogen"
username = "xxx"
password = "xxx"
The DB endpoint runs on Ubuntu Focal with InfluxDB packaged from the PPA (influxdb 1.8.10-1).
Hmm, can you maybe copy those line-protocol lines to a file and then use an inputs.file input together with your output section in a separate telegraf instance, to see whether we have a problem in the config or in the metrics themselves?
Alternatively, you could add an outputs.file output with stdout to your config and use namepass = ["zfs_pool", "zfs_dataset"] to see if the metrics arrive at the output level (see https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#selectors)...
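Roughly, the two variants could look like this (file paths are just placeholders, adjust to your setup):
# variant 1: replay the captured line-protocol lines in a separate telegraf instance
[[inputs.file]]
  files = ["/tmp/zfs-metrics.lp"]
  data_format = "influx"
# variant 2: print only the suspicious measurements to stdout
[[outputs.file]]
  files = ["stdout"]
  namepass = ["zfs_pool", "zfs_dataset"]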
Uhm, I don't know what you mean by "line-protocol lines", but as I already said: I use the same config across multiple hosts and everything works, except for BOTH truenas instances, which run on different hardware at different locations. So I'd assume the config is not the problem here, as it is "known working" on other machines.
I mean the lines starting with > zfs. Anyway, from what I see on the telegraf side this should work. Do you see any metrics from the truenas instances in the database? It seems like you do, judging from your queries above...
My understanding is that you do get zfs metrics (measurement zfs) on the database side, while the zfs_pool and zfs_dataset metrics are missing on the database side. Is this understanding correct? If so, there either has to be some filtering happening somewhere, or metrics are dropped for some reason... Can you maybe check the log files on the truenas machines for "buffer fullness" or similar?
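If the service logs to a file (e.g. /var/log/telegraf.log; adjust to wherever your service writes), a quick check could be something like:
grep -iE "buffer fullness|dropped" /var/log/telegraf.log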
Do you see any metrics from the truenas instances in the database?
Yes, I do, even from the zfs input itself, but not for datasets or pools (see the db output in the second post). The rest you understood correctly, yes.
The log on the client side is basically empty (just the startup notification), but I'll try later with debug logging.
Tried the debug options in telegraf: no difference, no errors are logged, no dedicated debug output was recorded.
This is all very strange... Can you please run your test above with
./telegraf --config telegraf.conf --input-filter zfs --once --debug
just for me to see the metric flush messages...
sure, there you go:
# ./telegraf --config telegraf.conf --input-filter zfs --once --debug
2021-10-19T11:36:57Z I! Starting Telegraf 1.20.2
2021-10-19T11:36:57Z D! [agent] Initializing plugins
2021-10-19T11:36:57Z D! [agent] Connecting outputs
2021-10-19T11:36:57Z D! [agent] Attempting connection to [outputs.influxdb]
2021-10-19T11:36:57Z D! [agent] Successfully connected to outputs.influxdb
2021-10-19T11:36:57Z D! [agent] Starting service inputs
2021-10-19T11:36:58Z D! [agent] Stopping service inputs
2021-10-19T11:36:58Z D! [agent] Input channel closed
2021-10-19T11:36:58Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-10-19T11:36:58Z D! [outputs.influxdb] Wrote batch of 41 metrics in 56.025696ms
2021-10-19T11:36:58Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:36:58Z I! [agent] Stopping running outputs
2021-10-19T11:36:58Z D! [agent] Stopped Successfully
#
THIS is strange: I compared the db content for this host before and after the command above, and after executing with --once there is now dataset data in the DB, so it only fails while running as a service.
Ok, so it flushed 41 metrics to the output. I think this includes all zfs metrics, including zfs_pool and zfs_dataset. If we want to be 100% sure, you can add
[[outputs.file]]
files = ["stdout"]
to your config and repeat the command.
This means that telegraf thinks it sent out the data...
THIS is strange: I compared the db content for this host before and after the command above, and after executing with --once there is now dataset data in the DB, so it only fails while running as a service.
Ok, can you please show the command you use to run it as a service?
In truenas/freenas you cannot easily alter parts of the OS, so you have to add a startup task that links the service into /etc on boot and then starts the service.
Init script added via the Web UI:
ln -s /mnt/nas/home/telegraf/telegraf.init /usr/local/etc/rc.d/telegraf && service telegraf start
initfile that got linked:
#!/bin/sh
# $FreeBSD$
# PROVIDE: telegraf
# REQUIRE: DAEMON NETWORKING
# BEFORE: LOGIN
# KEYWORD: shutdown
# Add the following lines to /etc/rc.conf to enable telegraf:
# telegraf_enable="YES"
#
# telegraf_enable (bool): Set to YES to enable telegraf
# Default: NO
# telegraf_conf (str): telegraf configuration file
# Default: ${PREFIX}/etc/telegraf.conf
# telegraf_flags (str): Extra flags passed to telegraf
. /etc/rc.subr
directory="/mnt/nas/home/telegraf/"
name="telegraf"
rcvar=telegraf_enable
load_rc_config $name
: ${telegraf_enable:="YES"}
: ${telegraf_flags:="-quiet"}
: ${telegraf_conf:="${directory}${name}.conf"}
# daemon
start_precmd=telegraf_prestart
pidfile="/var/run/${name}.pid"
command=/usr/sbin/daemon
command_args="-crP ${pidfile} ${directory}${name} ${telegraf_flags} -config=${telegraf_conf} >> /var/log/telegraf.log 2>&1"
telegraf_prestart() {
# Have to empty rc_flags so they don't get passed to daemon(8)
rc_flags=""
}
run_rc_command "$1"
Two things that I noticed (assuming in the above test your PWD is $directory):
1. You use -config instead of --config in your service file. Not sure if this works, but if it doesn't, telegraf will look for /etc/telegraf.conf...
2. You could set log_file in the telegraf config instead of redirecting the output.
Can you try to use --config in the service and check if this changes things? Oh, and you currently use -config=$config, but usually we use --config $config... Not sure if this is relevant...
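The command line in the rc script would then become something like this (untested sketch, only the flag spelling changes):
command_args="-crP ${pidfile} ${directory}${name} ${telegraf_flags} --config ${telegraf_conf} >> /var/log/telegraf.log 2>&1"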
Good point, I changed -config and added a dash, but it definitely had read the config anyway (otherwise it would never have been able to write to the correct db host). I also removed the quiet flag, but still no changes, only the single datapoint from the manual run. With debug, a dedicated logfile and without the quiet option, we do indeed see some warnings:
2021-10-19T11:55:13Z I! Loaded inputs: cpu disk diskio exec kernel linux_sysctl_fs mem net netstat processes smart swap system zfs
2021-10-19T11:55:13Z I! Loaded aggregators:
2021-10-19T11:55:13Z I! Loaded processors:
2021-10-19T11:55:13Z I! Loaded outputs: influxdb
2021-10-19T11:55:13Z I! Tags enabled: host=backup
2021-10-19T11:55:13Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"backup", Flush Interval:10s
2021-10-19T11:55:13Z D! [agent] Initializing plugins
2021-10-19T11:55:13Z W! [inputs.smart] nvme not found: verify that nvme is installed and it is in your PATH (or specified in config) to gather vendor specific attributes: provided path does not exist: []
2021-10-19T11:55:13Z D! [agent] Connecting outputs
2021-10-19T11:55:13Z D! [agent] Attempting connection to [outputs.influxdb]
2021-10-19T11:55:13Z D! [agent] Successfully connected to outputs.influxdb
2021-10-19T11:55:13Z D! [agent] Starting service inputs
2021-10-19T11:55:20Z W! [inputs.zfs] Invalid number of columns for line:
2021-10-19T11:55:23Z D! [outputs.influxdb] Wrote batch of 149 metrics in 64.864319ms
2021-10-19T11:55:23Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:55:30Z W! [inputs.zfs] Invalid number of columns for line:
2021-10-19T11:55:33Z D! [outputs.influxdb] Wrote batch of 152 metrics in 67.326364ms
2021-10-19T11:55:33Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:55:40Z W! [inputs.zfs] Invalid number of columns for line:
2021-10-19T11:55:43Z D! [outputs.influxdb] Wrote batch of 152 metrics in 62.46143ms
2021-10-19T11:55:43Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:55:50Z W! [inputs.zfs] Invalid number of columns for line:
nvme does not exist on freenas, that's okay, but the last line is interesting.
Most likely this comes from https://github.com/influxdata/telegraf/blob/master/plugins/inputs/zfs/zfs_freebsd.go#L106-L112
Judging from the source, the warning (which, from a severity point of view, is not even an error) is generated by the dataset setting, and indeed, if I enable either dataset or pool metrics independently, the warning only occurs when dataset metrics are enabled. Either way, neither metric type made it to the db, regardless of which of those two settings is enabled.
Hmmm, did you notice that the warning outputs the line it tries to parse? However, in your log this line is empty...
Can you please run zfs list -Hp -o name,avail,used,usedsnap,usedds manually on that host to see what goes into the parsing?
This might be the clue! On a linux system, all lines have the same number of fields, like so:
rpool 489257426944 476036472832 0 106496
rpool/ROOT 489257426944 320486854656 0 98304
rpool/ROOT/pve-1 489257426944 320486756352 0 320486756352
rpool/data 489257426944 155398995968 0 147456
rpool/data/subvol-100-disk-1 471621873664 65249038336 0 65249038336
...and so on
On truenas, this is not the case if a pool has child entities:
backup-pool 5572787747328 50075223552 83328 166656
backup-pool/.system 5572787747328 1587701952 3178368 810787392
backup-pool/.system/configs-1ae6384c4fbd402fb437ba266f4c6876 5572787747328 40515264 3095040 37420224
backup-pool/.system/cores 1071182464 2559360 2416512 142848
backup-pool/.system/rrd-1ae6384c4fbd402fb437ba266f4c6876 5572787747328 710478336 691741440 18736896
backup-pool/.system/samba4 5572787747328 5898432 5332992 565440
backup-pool/.system/services 5572787747328 142848 0 142848
backup-pool/.system/syslog-1ae6384c4fbd402fb437ba266f4c6876 5572787747328 13999104 9683904 4315200
backup-pool/.system/webui 5572787747328 142848 0 142848
...
I'm just blind, the field count is the same, it was just formatted differently in the shell (linux gives pretty output with aligned cells), so this was not the answer.
Yeah, I think this cannot be the issue, as you get datapoints when running manually... But this caught my attention:
2021-10-19T11:55:13Z I! Tags enabled: host=backup
2021-10-19T11:55:13Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"backup", Flush Interval:10s
Here Hostname is shown as backup... Can you maybe check if you have data for a host named backup in your database... Not sure why the hostname is set to this value, though...
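e.g. with the same kind of queries as above:
> show series from zfs_pool where host = 'backup'
> show series from zfs_dataset where host = 'backup'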
I have 2 freenas hosts, named "backup" and "atlas", and I set the hostname manually via the config so that I have consistency in the DB regarding hostname vs. hostname+fqdn. Most of the things in this thread I test on both machines, so that I don't miss anything that applies to only one of them, and I always check the db for entries of both hosts, so this is expected:
[agent]
...
hostname = "backup"
omit_hostname = false
...
Just do me one favor: please add a file output to your config above
[[outputs.file]]
files = ["/tmp/telegraf.data"]
to really see if we get the zfs stuff in the service. I'm still not sure whether, in the service case, the data is not collected on the input side or not written on the output side.
Ermm, one more thing that comes to my mind... In the service, which user runs telegraf? Does that user have the permissions to run zfs?
Okay, this is interesting, have a look. If run as a service:
backup# > /tmp/telegraf.data; service telegraf start; sleep 10 ; service telegraf stop; grep ^zfs /tmp/telegraf.data | wc -l
Starting telegraf.
2021-10-19T13:28:32Z I! Starting Telegraf 1.20.2
Stopping telegraf.
Waiting for PIDS: 36026, 36026.
1
backup#
if run manually:
backup# > /tmp/telegraf.data; ./telegraf --config telegraf.conf --input-filter zfs --once --debug >/dev/null ; grep ^zfs /tmp/telegraf.data | wc -l
2021-10-19T13:30:03Z I! Starting Telegraf 1.20.2
39
backup#
...so indeed, if run as a service, a different set of metrics is collected.
Are both commands run with the same user?
I am not that experienced with the bsd rc structure, but as far as I can tell, yes. It is a PITA to elevate permissions via sudo on freenas for scripts, because you cannot persist stuff in "/etc" over reboots, so there is basically no passwordless sudo; so yes, it should run as root. Besides that, all other metrics are collected so far; I'd assume that if it were not running as root, many more metrics would be missing, but the rest is there (e.g. smartctl also requires root permissions, and that works).
Quick check with ps: it runs as root when started as a daemon.
A quick search revealed https://www.truenas.com/community/threads/telegraf-monitoring.84356/ which says that PATH is not set on services... If this is really the case, telegraf will not be able to find the zfs command and thus will not be able to gather the metrics...
I'm not familiar with TrueNAS, so I don't know if you can set the PATH in the service, but it might be worth a try...
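If you want to double-check what the daemonized process actually sees, inspecting its environment might help; something like this, assuming the pidfile path from the rc script and that procstat's environment option is available on your release:
procstat -e `cat /var/run/telegraf.pid`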
This might indeed be the problem. I had a similar issue with lsof, so I linked that to /bin, but it might be safer to research how a PATH can be set properly. And the zfs plugin should then throw an error that it cannot find the executable... I'll test this later.
Maybe this helps: https://www.truenas.com/community/threads/startup-script.68588/post-470459? That guy just does export PATH=$PATH:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/root/bin in the start part. Maybe you can do something similar in telegraf_prestart?
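For example, something along these lines might work (untested sketch; the PATH value is adapted from that thread):
telegraf_prestart() {
    # services start with a minimal environment, so extend PATH explicitly
    export PATH=$PATH:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin
    # Have to empty rc_flags so they don't get passed to daemon(8)
    rc_flags=""
}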
Yep, couldn't wait anyway; that was indeed the issue. I placed an export for PATH in the daemon section and this worked, the metrics are now being collected. Thank you so much for tinkering with me.
In the end, the zfs input plugin should get an additional warning if the binaries are not found; other modules do that (like smartctl, as shown in the log output). That would have made resolving this much faster :).
@zeus86 can you please check #9956 once it is built? It should fire an error if the command is not found but works as usual with the PATH set...
Awesome, works like a charm:
2021-10-19T15:10:50Z E! [inputs.zfs] Error in plugin: zpool error: exec: "zpool": executable file not found in $PATH
Can you please also add this as a comment to the PR... Thanks for testing!
sure, just done that
On truenas-core 12-u6 the zfs section is not working properly. I do get some basic metrics (see below), but they do not make it to the influxdb (the zfs_pool and zfs_dataset measurements did not even get created for this host, so these metrics have no entries) and are therefore not usable within grafana. Other hosts with zfs pools work fine, but those are debian hosts (deb9/10 with proxmox 6/7). The relevant config snippets are identical across systems, so I assume something in the output breaks some filtering logic.
I'd assume that this might come from an awkwardly named pool or dataset. If I were using silly names with special characters, I could deal with it, but I assume the truenas default pools might not be parsed correctly, because on a second, similarly configured truenas instance with only one other pool named backup, the behaviour is exactly the same: no dataset or pool metrics, so this would very likely affect everyone else as well.
Relevant telegraf.conf:
System info:
Expected behavior:
zfs-plugin collecting dataset and pool metrics
Actual behavior:
only very limited basic metrics
Additional info:
I attached a sample output from the questionable machine, generated with:
./telegraf --config telegraf.conf --input-filter zfs --test
telegraf-zfs.txt
Ultimately I don't know whether this is an influx(1) or telegraf issue, but as I said, the same setup works fine on linux.