influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf

[inputs.zfs][freebsd12] (TrueNAS-12.0-U6 Core) No zfs_pool and zfs_dataset metrics made it to influxdb #9946

Closed. zeus86 closed this issue 3 years ago.

zeus86 commented 3 years ago

On TrueNAS Core 12-U6 the zfs input is not working properly. I do get some basic metrics (see below), but they do not make it to InfluxDB (zfs_pool and zfs_dataset did not even get created for this host, so these measurements have no entries) and are therefore not usable within Grafana. Other hosts with zfs pools work fine, but these are Debian hosts (deb9/10 with Proxmox 6/7). The relevant config snippets are identical across systems, so I assume something in the output breaks some filtering logic.

I'd assume this might come from an awkwardly named pool or dataset. If I were using silly names with special characters I could deal with that, but I assume the TrueNAS default pools might not be parsed correctly, because on a second, similarly configured TrueNAS instance with only one other pool named backup the behaviour is exactly the same: no dataset or pool metrics. So this would very likely affect everyone else as well.

Relevant telegraf.conf:

[[inputs.zfs]]
poolMetrics = true
datasetMetrics = true

System info:

Telegraf version (packaged binary from GitHub, sha256: d87514f06cd2efa6990383906d3b4664cb17222fde6755442b653a13a42c201a)
  Telegraf 1.20.2 (git: HEAD f721f53d) 
Kernel: FreeBSD 12.2-RELEASE-p10 b26f74b5984(HEAD) TRUENAS  amd64
zfs: 
  zfs-2.0.5-1
  zfs-kmod-v2021071201-zfs_f7ba541d64cbc60b21507bd7781331bea1abb12e

Expected behavior:

zfs-plugin collecting dataset and pool metrics

Actual behavior:

only very limited basic-metrics

Additional info:

I attached a sample output from the machine in question, generated with ./telegraf --config telegraf.conf --input-filter zfs --test: telegraf-zfs.txt

Ultimately I don't know whether this is an InfluxDB 1.x or a Telegraf issue, but as I said, the same setup works fine on Linux.

zeus86 commented 3 years ago

Edit: the data definitely has not made it to the DB:

> show series from zfs where host = 'atlas'
key
---
zfs,host=atlas

> show series from zfs_pool where host = 'atlas'

> show series from zfs_dataset where host = 'atlas'

(same result for the other truenas host)

srebhan commented 3 years ago

Hey @zeus86, note that the --test option prevents any output from being sent! It only prints the collected metrics to the console (as documented in the option's help text)! Please use --once if you want to collect only once and send the result to the outputs.
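For reference, the two invocations side by side (assuming the same working directory and config as the test above):

./telegraf --config telegraf.conf --input-filter zfs --test   # print collected metrics to the console only, nothing is sent to outputs
./telegraf --config telegraf.conf --input-filter zfs --once   # collect one round and flush it to the configured outputs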

zeus86 commented 3 years ago

I know this; this exact config is running as a service, please reopen the issue! The --test option was just to provide output for this ticket.

srebhan commented 3 years ago

Ok, so from the output it seems like the data is collected. Can you please show the output configuration (please redact sensitive information)? Do you use any processors or metric filtering somewhere?

zeus86 commented 3 years ago

Yes, that's what I've been wondering about too; that's why I think some of FreeNAS/TrueNAS's pool and dataset names might be the issue here, because data from other input plugins is submitted to the database correctly, as is zfs data from other Linux hosts. There is nothing special in my config: I do not transform data by hand in any way, no filters, no processors, just inputs, one output and the agent configuration. The DB endpoint is just a plain InfluxDB 1.x with an HTTP endpoint:

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = "atlas"
  omit_hostname = false
[[outputs.influxdb]]
  urls = [ "http://xxx:8086" ]
  database = "hosts"
  precision = "s"
  retention_policy = "autogen"
  username = "xxx"
  password = "xxx"

The DB endpoint runs on Ubuntu Focal with InfluxDB packaged from the PPA (influxdb 1.8.10-1).

srebhan commented 3 years ago

Hmm, can you maybe copy those line-protocol lines to a file and then use an inputs.file plugin together with your output section in a separate Telegraf instance, to see whether we have a problem in the config or in the metrics themselves?
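A rough sketch of such a replay config (the file name is just an example; data_format = "influx" is needed so the line protocol is parsed back into metrics):

[[inputs.file]]
  files = ["/tmp/zfs-sample.lp"]
  data_format = "influx"

plus your existing [[outputs.influxdb]] section in an otherwise minimal config.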

srebhan commented 3 years ago

Alternatively, you could add an outputs.file with stdout to your config and use namepass = ["zfs_pool", "zfs_dataset"] to see if the metrics arrive at the output level (see https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#selectors)...
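A sketch of what that could look like (the selector is applied at the plugin level, as in the linked docs):

[[outputs.file]]
  files = ["stdout"]
  namepass = ["zfs_pool", "zfs_dataset"]

If the two measurements show up on stdout, the input side is fine and the problem is somewhere between Telegraf and InfluxDB.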

zeus86 commented 3 years ago

Uhm, I don't know what you mean by "line-protocol lines", but as I already said: I use the same config across multiple hosts and everything works, except for BOTH TrueNAS instances, which run on different hardware at different locations. So I'd assume the config is not the problem here, as it is "known working" on other machines.

srebhan commented 3 years ago

I mean the lines starting with > zfs. Anyway, from what I see on the telegraf side this should work. Do you see any metric from the truenas instances in the database? It seems like you do from your queries above...

My understanding is

  1. We see the metrics are collected on the truenas hosts.
  2. We see output from other metrics (including zfs) on the database side.
  3. Only zfs_pool and zfs_dataset metrics are missing on the database side.
  4. The same config works for other hosts.

Is this understanding correct? If so, there either has to be some filtering happening somewhere or metrics are being dropped for some reason... Can you maybe check the log files on the truenas machines for buffer fullness or similar?

zeus86 commented 3 years ago

Do you see any metric from the truenas instances in the database?

Yes, I do, even from the zfs input itself, but not for datasets or pools (see the DB output in my second post). The rest you understood correctly, yes.

The log on the client side is basically empty (just the startup notification), but I'll try again later with debug logging.

zeus86 commented 3 years ago

Tried the debug options in Telegraf: no difference, no errors logged, no dedicated debug output recorded.

srebhan commented 3 years ago

This is all very strange... Can you please run your test above with ./telegraf --config telegraf.conf --input-filter zfs --once --debug just for me to see the metric flush messages...

zeus86 commented 3 years ago

sure, there you go:

# ./telegraf --config telegraf.conf --input-filter zfs --once --debug
2021-10-19T11:36:57Z I! Starting Telegraf 1.20.2
2021-10-19T11:36:57Z D! [agent] Initializing plugins
2021-10-19T11:36:57Z D! [agent] Connecting outputs
2021-10-19T11:36:57Z D! [agent] Attempting connection to [outputs.influxdb]
2021-10-19T11:36:57Z D! [agent] Successfully connected to outputs.influxdb
2021-10-19T11:36:57Z D! [agent] Starting service inputs
2021-10-19T11:36:58Z D! [agent] Stopping service inputs
2021-10-19T11:36:58Z D! [agent] Input channel closed
2021-10-19T11:36:58Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-10-19T11:36:58Z D! [outputs.influxdb] Wrote batch of 41 metrics in 56.025696ms
2021-10-19T11:36:58Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:36:58Z I! [agent] Stopping running outputs
2021-10-19T11:36:58Z D! [agent] Stopped Successfully
#
zeus86 commented 3 years ago

THIS is strange. I compared the DB content for this host before and after the command above; after executing with --once there now is dataset data in the DB, so it does not work while running as a service.

srebhan commented 3 years ago

Ok, so it flushed 41 metrics to the output. I think this includes all zfs metrics with zfs_pool and zfs_dataset. If we want to be 100% sure you can add

[[outputs.file]]
  files = ["stdout"]

to your config and repeat the command.

This means that telegraf thinks it sent out the data...

srebhan commented 3 years ago

THIS is strange. I compared the DB content for this host before and after the command above; after executing with --once there now is dataset data in the DB, so it does not work while running as a service.

Ok, can you please show the command you use to run it as a service?

zeus86 commented 3 years ago

On TrueNAS/FreeNAS you cannot easily alter parts of the OS, so you have to add a startup task which links the service script into /usr/local/etc/rc.d on boot and then starts the service. Init command added via the web UI: ln -s /mnt/nas/home/telegraf/telegraf.init /usr/local/etc/rc.d/telegraf && service telegraf start

The init file that gets linked:

#!/bin/sh
# $FreeBSD$

# PROVIDE: telegraf
# REQUIRE: DAEMON NETWORKING
# BEFORE: LOGIN
# KEYWORD: shutdown

# Add the following lines to /etc/rc.conf to enable telegraf:
# telegraf_enable="YES"
#
# telegraf_enable (bool): Set to YES to enable telegraf
# Default: NO
# telegraf_conf (str): telegraf configuration file
# Default: ${PREFIX}/etc/telegraf.conf
# telegraf_flags (str): Extra flags passed to telegraf
. /etc/rc.subr

directory="/mnt/nas/home/telegraf/"
name="telegraf"
rcvar=telegraf_enable
load_rc_config $name

: ${telegraf_enable:="YES"}
: ${telegraf_flags:="-quiet"}
: ${telegraf_conf:="${directory}${name}.conf"}

# daemon
start_precmd=telegraf_prestart
pidfile="/var/run/${name}.pid"
command=/usr/sbin/daemon
command_args="-crP ${pidfile} ${directory}${name} ${telegraf_flags} -config=${telegraf_conf} >> /var/log/telegraf.log 2>&1"

telegraf_prestart() {
# Have to empty rc_flags so they don't get passed to daemon(8)
 rc_flags=""
}

run_rc_command "$1"
srebhan commented 3 years ago

Two things that I noticed (assuming in the above test your PWD is $directory):

  1. You use -config instead of --config in your service file. Not sure if this works, but if it doesn't, telegraf will look for /etc/telegraf.conf...
  2. You can specify a logfile in the telegraf config instead of redirecting the output (see the sketch after this comment).

Can you try to use --config in the service and check if this changes things? Oh and you use -config=$config currently but usually we use --config $config... Not sure if this is relevant...
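For example (a sketch; the path is only a suggestion), instead of the >> /var/log/telegraf.log redirect in the service file you could set in the config:

[agent]
  ...
  logfile = "/var/log/telegraf.log"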

zeus86 commented 3 years ago

Good point, I changed -config and added a dash. But anyway, it definitely had read the config (otherwise it would never have been able to write to the correct DB host), and I also removed the quiet flag, but still no change, only the single data point from the manual run. With debug, a dedicated logfile and without the quiet option, we do indeed see some warnings:

2021-10-19T11:55:13Z I! Loaded inputs: cpu disk diskio exec kernel linux_sysctl_fs mem net netstat processes smart swap system zfs
2021-10-19T11:55:13Z I! Loaded aggregators:
2021-10-19T11:55:13Z I! Loaded processors:
2021-10-19T11:55:13Z I! Loaded outputs: influxdb
2021-10-19T11:55:13Z I! Tags enabled: host=backup
2021-10-19T11:55:13Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"backup", Flush Interval:10s
2021-10-19T11:55:13Z D! [agent] Initializing plugins
2021-10-19T11:55:13Z W! [inputs.smart] nvme not found: verify that nvme is installed and it is in your PATH (or specified in config) to gather vendor specific attributes: provided path does not exist: []
2021-10-19T11:55:13Z D! [agent] Connecting outputs
2021-10-19T11:55:13Z D! [agent] Attempting connection to [outputs.influxdb]
2021-10-19T11:55:13Z D! [agent] Successfully connected to outputs.influxdb
2021-10-19T11:55:13Z D! [agent] Starting service inputs
2021-10-19T11:55:20Z W! [inputs.zfs] Invalid number of columns for line:
2021-10-19T11:55:23Z D! [outputs.influxdb] Wrote batch of 149 metrics in 64.864319ms
2021-10-19T11:55:23Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:55:30Z W! [inputs.zfs] Invalid number of columns for line:
2021-10-19T11:55:33Z D! [outputs.influxdb] Wrote batch of 152 metrics in 67.326364ms
2021-10-19T11:55:33Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:55:40Z W! [inputs.zfs] Invalid number of columns for line:
2021-10-19T11:55:43Z D! [outputs.influxdb] Wrote batch of 152 metrics in 62.46143ms
2021-10-19T11:55:43Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2021-10-19T11:55:50Z W! [inputs.zfs] Invalid number of columns for line:

nvme does not exist on FreeNAS, that's okay, but the last line is interesting.

zeus86 commented 3 years ago

most likely, this comes from https://github.com/influxdata/telegraf/blob/master/plugins/inputs/zfs/zfs_freebsd.go#L106-L112

zeus86 commented 3 years ago

According to the source, the warning (which in the end isn't even an error from a severity point of view) is generated by the dataset setting, and indeed, if I enable dataset and pool metrics independently, the warning only occurs when dataset metrics are enabled. But anyway, neither metric type made it to the DB, regardless of which of those two settings is enabled.

srebhan commented 3 years ago

Hmmm did you notice that the warning will output the line it tries to parse? However, in your log this line is empty... Can you please run zfs list -Hp -o name,avail,used,usedsnap,usedds manually on that host to see what goes into the parsing?

zeus86 commented 3 years ago

This might be the clue! On a Linux system, all lines have the same number of fields, like so:

rpool   489257426944    476036472832    0       106496
rpool/ROOT      489257426944    320486854656    0       98304
rpool/ROOT/pve-1        489257426944    320486756352    0       320486756352
rpool/data      489257426944    155398995968    0       147456
rpool/data/subvol-100-disk-1    471621873664    65249038336     0       65249038336
...and so on

On TrueNAS this is not the case if a pool has child entities:

backup-pool     5572787747328   50075223552     83328   166656
backup-pool/.system     5572787747328   1587701952      3178368 810787392
backup-pool/.system/configs-1ae6384c4fbd402fb437ba266f4c6876    5572787747328   40515264        3095040 37420224
backup-pool/.system/cores       1071182464      2559360 2416512 142848
backup-pool/.system/rrd-1ae6384c4fbd402fb437ba266f4c6876        5572787747328   710478336       691741440       18736896
backup-pool/.system/samba4      5572787747328   5898432 5332992 565440
backup-pool/.system/services    5572787747328   142848  0       142848
backup-pool/.system/syslog-1ae6384c4fbd402fb437ba266f4c6876     5572787747328   13999104        9683904 4315200
backup-pool/.system/webui       5572787747328   142848  0       142848
...
zeus86 commented 3 years ago

I'm just blind, the field count is the same, it was just formatted differently in the shell (Linux gives pretty output with aligned cells), so this was not the answer.

srebhan commented 3 years ago

Yeah, I think this cannot be the issue, as you do get data points when running manually... But this caught my attention:

2021-10-19T11:55:13Z I! Tags enabled: host=backup
2021-10-19T11:55:13Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"backup", Flush Interval:10s

Here Hostname is shown as backup... Can you maybe check if you have the data for a host named backup in your database... Not sure why the hostname is set to this value though...
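For example, analogous to the queries above:

> show series from zfs_pool where host = 'backup'
> show series from zfs_dataset where host = 'backup'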

zeus86 commented 3 years ago

I have two FreeNAS hosts, named "backup" and "atlas", and I set the hostname manually via the config so that the DB is consistent regarding hostname vs. hostname+FQDN. Most of the things in this thread I test on both machines so that I don't miss anything that applies to only one of them, and I always check the DB for entries from both hosts, so this is expected:

[agent]
...
  hostname = "backup"
  omit_hostname = false
...
srebhan commented 3 years ago

Just do me one favor, please add a file output to your config above

[[outputs.file]]
  files = ["/tmp/telegraf.data"]

to really see if we get the zfs stuff in the service. I'm still not sure whether the data is not collected on the input side or not written on the output side in case of the service.

srebhan commented 3 years ago

Ermm one more thing that comes to my mind... In the service, which user runs telegraf? Does that user have the permissions to run zfs?
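A quick way to check (a sketch, using the pidfile from the rc script above):

ps -o user,command -p $(cat /var/run/telegraf.pid)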

zeus86 commented 3 years ago

okay, this is interesting, have a look: if run as a service:

backup# > /tmp/telegraf.data; service telegraf start; sleep 10 ; service telegraf stop; grep ^zfs /tmp/telegraf.data | wc -l
Starting telegraf.
2021-10-19T13:28:32Z I! Starting Telegraf 1.20.2
Stopping telegraf.
Waiting for PIDS: 36026, 36026.
       1
backup#

if run manually:

backup# > /tmp/telegraf.data; ./telegraf --config telegraf.conf --input-filter zfs --once --debug >/dev/null ; grep ^zfs /tmp/telegraf.data | wc -l
2021-10-19T13:30:03Z I! Starting Telegraf 1.20.2
      39
backup#

...so indeed, when run as a service, a different set of metrics is collected.

srebhan commented 3 years ago

Are both commands run as the same user?

zeus86 commented 3 years ago

I am not that experienced with the BSD rc structure, but as far as I can tell, yes, because it is a PITA to elevate permissions via sudo for scripts on FreeNAS, due to the fact that you cannot persist stuff in /etc over reboots, so basically no password-less sudo. So yes, it should run as root. Besides that, all other metrics are collected so far; I'd assume that if it were not running as root, many more metrics would be missing, but the rest is there (e.g. smartctl also requires root permissions, and that works).

zeus86 commented 3 years ago

Quick check with ps: it runs as root when started as a daemon.

srebhan commented 3 years ago

A quick search revealed https://www.truenas.com/community/threads/telegraf-monitoring.84356/ which says that PATH is not set on services... If this is really the case, telegraf will not be able to find the zfs command and thus will not be able to gather the metrics... I'm not familiar with TrueNAS, so I don't know if you can set the PATH in the service, but it might be worth a try...

zeus86 commented 3 years ago

This might indeed be the problem. I had a similar issue with lsof, so I linked that into /bin, but it might be safer to research how the path can be set properly. And the zfs plugin should then throw an error that it cannot find the executable... I'll test this later.

srebhan commented 3 years ago

Maybe this helps: https://www.truenas.com/community/threads/startup-script.68588/post-470459? That guy just does export PATH=$PATH:/sbin:/bin:/usr/sbin:/usr/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/root/bin in the start part. Maybe you can do something similar in telegraf_prestart?
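A minimal sketch of that idea, applied to the init script above (untested; the path list is taken from the linked post):

telegraf_prestart() {
# Have to empty rc_flags so they don't get passed to daemon(8)
 rc_flags=""
 # per the linked thread, PATH may not be set for services, so make sure zfs/zpool can be found
 export PATH=$PATH:/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin
}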

zeus86 commented 3 years ago

Yep, I couldn't wait anyway; that was the issue. I placed an export for PATH in the daemon section and this indeed worked, the metrics are collected now. Thank you so much for tinkering with me.

In the end, the zfs input plugin should get an additional warning if its binaries are not found; other plugins do that (like the smart input, as shown in the log output). This might have made resolving this much faster :).

srebhan commented 3 years ago

@zeus86 can you please check #9956 once it is built? It should fire an error if the command is not found but works as usual with the PATH set...

zeus86 commented 3 years ago

Awesome, works like a charm:

2021-10-19T15:10:50Z E! [inputs.zfs] Error in plugin: zpool error: exec: "zpool": executable file not found in $PATH

srebhan commented 3 years ago

Can you please also add this as a comment to the PR... Thanks for testing!

zeus86 commented 3 years ago

sure, just done that