influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

zfs parsing error #11089

Closed: mdtancsa closed this issue 1 year ago

mdtancsa commented 2 years ago

Relevant telegraf.conf

[[inputs.zfs]]
poolMetrics = true
datasetMetrics = true

Logs from Telegraf

[inputs.zfs] Error in plugin: Error parsing avail "-": strconv.ParseInt: parsing "-": invalid syntax

It seems to be an issue with parsing zfs volumes which don't have a mount point? Not sure.
But with debug on:

> zfs_dataset,dataset=zbackup22/zrepl/sink2/sentex-cam.sentex.ca/tank1,host=b4.sentex.ca avail=11328588255296i,used=486662039936i,usedds=142848i,usedsnap=0i 1652372245000000000
> zfs_dataset,dataset=zbackup22/zrepl/sink2/sentex-cam.sentex.ca/tank1/sqlBACKUP,host=b4.sentex.ca avail=11328588255296i,used=83328i,usedds=83328i,usedsnap=0i 1652372245000000000
2022-05-12T16:17:25Z E! [inputs.zfs] Error in plugin: Error parsing avail "-": strconv.ParseInt: parsing "-": invalid syntax

 zfs list -r zbackup2s/zrepl/sink2/sentex-cam.sentex.ca/tank1
NAME                                                         USED  AVAIL  REFER  MOUNTPOINT
zbackup22/zrepl/sink2/sentex-cam.sentex.ca/tank1             453G  10.3T   140K  none
zbackup22/zrepl/sink2/sentex-cam.sentex.ca/tank1/sqlBACKUP  81.4K  10.3T  81.4K  -
zbackup22/zrepl/sink2/sentex-cam.sentex.ca/tank1/sqlDB       197G  10.3T   197G  -
zbackup22/zrepl/sink2/sentex-cam.sentex.ca/tank1/tank1-vms   256G  10.3T   249G  none

System info

telegraf --version Telegraf 1.22.3, FreeBSD 12.3

Docker

No response

Steps to reproduce

telegraf --debug --test --config telegraf.conf

But it doesn't seem to cause issues on other servers where I have volumes, so I'm not sure what else is going on.

Expected behavior

Telegraf should not throw that error, and should not stop collecting arcstats

Actual behavior

Telegraf hits that error and does not proceed to collect / report the rest of the datasets nor arcstats

Additional info

No response

powersj commented 2 years ago

Error parsing avail "-": strconv.ParseInt: parsing "-": invalid syntax

This error says that parsing the avail column failed because the value it was given was -.
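As a minimal sketch of why that value fails: Go's strconv.ParseInt rejects a lone dash. The helper name parseField and the way it wraps the error are assumptions chosen to mirror the log line, not Telegraf's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseField is a hypothetical helper; its name and error wrapping are
// assumptions made to mirror the log message, not the plugin's real code.
func parseField(name, value string) (int64, error) {
	n, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("Error parsing %s %q: %w", name, value, err)
	}
	return n, nil
}

func main() {
	// A dash is what zfs prints when a property has no value for that row.
	if _, err := parseField("avail", "-"); err != nil {
		fmt.Println(err)
	}

	// A plain byte count, as printed with -p, parses without trouble.
	n, err := parseField("avail", "11328588255296")
	fmt.Println(n, err)
}
```

The first call reports the same strconv.ParseInt "invalid syntax" failure seen in the log, while the second returns the number unchanged.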

The FreeBSD code path literally runs a couple of commands and parses the output. Can you collect the output of the following two commands please:

zpool list -Hp -o name,health,size,alloc,free,fragmentation,capacity,dedupratio
zfs list -Hp -o name,avail,used,usedsnap,usedds

Thanks!

mdtancsa commented 2 years ago

OK, I see what's going on! Now the question is... WHY!? It's actually not the datasets in question. It's another pool. For some reason on this server (which started out as RELENG_10, then RELENG_11, then RELENG_12), zfs list gives different output on the boot pool.
Of the 3 pools, one shows snapshots when doing a zfs list, for some reason.
e.g.

zroot   25024442368     77953568768     0       98304
zroot@6 -       0       -       -
zroot@0 -       0       -       -
zroot@1 -       0       -       -
zroot@2 -       0       -       -
zroot@3 -       0       -       -
zroot@4 -       0       -       -
zroot/ROOT      25024442368     16143228928     0       98304
zroot/ROOT@6    -       0       -       -
zroot/ROOT@0    -       0       -       -
zroot/ROOT@1    -       0       -       -
zroot/ROOT@2    -       0       -       -
zroot/ROOT@3    -       0       -       -
zroot/ROOT@4    -       0       -       -
zroot/ROOT/default      25024442368     16143130624     672641024       15470489600
zroot/ROOT/default@6    -       102952960       -       -
zroot/ROOT/default@0    -       75075584        -       -
zroot/ROOT/default@1    -       74461184        -       -
zroot/ROOT/default@2    -       77942784        -       -
zroot/ROOT/default@3    -       77799424        -       -
zroot/ROOT/default@4    -       76488704        -       -
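In the listing above, every row whose name contains an @ is a snapshot, and those are the rows that carry - in the avail column. One way a consumer of this output could cope is to skip snapshot rows before parsing any numbers. The sketch below assumes tab-separated input as produced by -H; the function name datasetRows is hypothetical, and this is not necessarily how Telegraf handles it.

```go
package main

import (
	"fmt"
	"strings"
)

// datasetRows splits tab-separated `zfs list -Hp` output into rows and drops
// snapshot rows, which are recognisable by the "@" in the name column.
// Hypothetical helper for illustration only.
func datasetRows(output string) [][]string {
	var rows [][]string
	for _, line := range strings.Split(output, "\n") {
		if line == "" {
			continue // skip blank lines, including the one after the final newline
		}
		fields := strings.Split(line, "\t")
		if strings.Contains(fields[0], "@") {
			continue // snapshot row, e.g. zroot@6
		}
		rows = append(rows, fields)
	}
	return rows
}

func main() {
	out := "zroot\t25024442368\t77953568768\t0\t98304\n" +
		"zroot@6\t-\t0\t-\t-\n" +
		"zroot/ROOT\t25024442368\t16143228928\t0\t98304\n"

	for _, row := range datasetRows(out) {
		fmt.Println(row[0])
	}
}
```

With the three sample rows, only zroot and zroot/ROOT survive the filter.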

If I nuke the boot pool's snapshots, telegraf works as expected, because telegraf is amazing :) But if I do

zfs snapshot -r zroot@test2
 zfs list -r zroot
NAME                       USED  AVAIL  REFER  MOUNTPOINT
zroot                     71.0G  24.9G    96K  none
zroot@test2                   0      -    96K  -
zroot/ROOT                14.4G  24.9G    96K  none
zroot/ROOT@test2              0      -    96K  -
zroot/ROOT/default        14.4G  24.9G  14.4G  /
zroot/ROOT/default@test2      0      -  14.4G  -
zroot/coredisk0           21.2G  41.4G  4.68G  -
zroot/coredisk0@test2         0      -  4.68G  -
zroot/snappydisk0         20.6G  41.4G  4.06G  -
zroot/snappydisk0@test2       0      -  4.06G  -
zroot/tmp                 3.68G  24.9G  3.68G  /tmp
zroot/tmp@test2               0      -  3.68G  -
zroot/usr                 10.9G  24.9G    96K  /usr
zroot/usr@test2               0      -    96K  -
zroot/usr/home            4.52G  24.9G  4.52G  /usr/home
zroot/usr/home@test2          0      -  4.52G  -
zroot/usr/ports           4.68G  24.9G  4.68G  /usr/ports
zroot/usr/ports@test2         0      -  4.68G  -
zroot/usr/src             1.74G  24.9G  1.74G  /usr/src
zroot/usr/src@test2           0      -  1.74G  -
zroot/var                  165M  24.9G    96K  /var
zroot/var@test2               0      -    96K  -
zroot/var/crash            348K  24.9G   348K  /var/crash
zroot/var/crash@test2         0      -   348K  -
zroot/var/log              135M  24.9G   135M  /var/log
zroot/var/log@test2           0      -   135M  -
zroot/var/mail            29.7M  24.9G  29.7M  /var/mail
zroot/var/mail@test2          0      -  29.7M  -
zroot/var/tmp              144K  24.9G   144K  /var/tmp
zroot/var/tmp@test2           0      -   144K  -

On a different RELENG_12 box, this is not an issue, and the other 2 large pools on the troubled box in question don't do this either. I guess the next question is why / how did it get that way.

I don't see any properties that would affect this. I will ask on a FreeBSD list to see how I can control this behaviour, as it seems to be some bug. I am guessing something along the way got messed up when upgrading from older pools. I tried the obvious one (set snapdir=visible, then snapdir=hidden) but it didn't change anything for this pool.

 zfs get all zroot
NAME   PROPERTY              VALUE                  SOURCE
zroot  type                  filesystem             -
zroot  creation              Fri Oct 24 15:25 2014  -
zroot  used                  62.3G                  -
zroot  available             33.6G                  -
zroot  referenced            96K                    -
zroot  compressratio         2.00x                  -
zroot  mounted               no                     -
zroot  quota                 none                   default
zroot  reservation           none                   default
zroot  recordsize            128K                   default
zroot  mountpoint            none                   local
zroot  sharenfs              off                    default
zroot  checksum              on                     default
zroot  compression           lz4                    local
zroot  atime                 on                     local
zroot  devices               on                     default
zroot  exec                  on                     default
zroot  setuid                on                     default
zroot  readonly              off                    default
zroot  jailed                off                    default
zroot  snapdir               hidden                 local
zroot  aclmode               discard                default
zroot  aclinherit            restricted             default
zroot  createtxg             1                      -
zroot  canmount              on                     default
zroot  xattr                 on                     default
zroot  copies                1                      default
zroot  version               5                      -
zroot  utf8only              off                    -
zroot  normalization         none                   -
zroot  casesensitivity       sensitive              -
zroot  vscan                 off                    default
zroot  nbmand                off                    default
zroot  sharesmb              off                    default
zroot  refquota              none                   default
zroot  refreservation        none                   default
zroot  guid                  15617863627434061981   -
zroot  primarycache          all                    default
zroot  secondarycache        all                    default
zroot  usedbysnapshots       0                      -
zroot  usedbydataset         96K                    -
zroot  usedbychildren        62.3G                  -
zroot  usedbyrefreservation  0                      -
zroot  logbias               latency                default
zroot  objsetid              21                     -
zroot  dedup                 off                    default
zroot  mlslabel                                     -
zroot  sync                  standard               default
zroot  dnodesize             legacy                 default
zroot  refcompressratio      1.00x                  -
zroot  written               96K                    -
zroot  logicalused           73.0G                  -
zroot  logicalreferenced     9.50K                  -
zroot  volmode               default                default
zroot  filesystem_limit      none                   default
zroot  snapshot_limit        none                   default
zroot  filesystem_count      none                   default
zroot  snapshot_count        none                   default
zroot  redundant_metadata    all                    default
zroot  special_small_blocks  0                      default

mdtancsa commented 2 years ago

OK, I found it! I was trying zfs, not zpool. It's a pool property, not a dataset property!

 zpool get all zroot
NAME   PROPERTY                       VALUE                          SOURCE
zroot  size                           99G                            -
zroot  capacity                       36%                            -
zroot  altroot                        -                              default
zroot  health                         ONLINE                         -
zroot  guid                           7300997414167702310            default
zroot  version                        -                              default
zroot  bootfs                         zroot/ROOT/default             local
zroot  delegation                     on                             default
zroot  autoreplace                    off                            default
zroot  cachefile                      -                              default
zroot  failmode                       wait                           default
zroot  listsnapshots                  on                             local

and

 zfs snapshot -r zroot@test3
0{backup4}# zfs list -r zroot
NAME                 USED  AVAIL  REFER  MOUNTPOINT
zroot               68.7G  27.2G    96K  none
zroot/ROOT          14.4G  27.2G    96K  none
zroot/ROOT/default  14.4G  27.2G  14.4G  /
zroot/coredisk0     21.2G  43.7G  4.68G  -
zroot/snappydisk0   20.6G  43.7G  4.06G  -
zroot/tmp           1.37G  27.2G  1.37G  /tmp
zroot/usr           10.9G  27.2G    96K  /usr
zroot/usr/home      4.52G  27.2G  4.52G  /usr/home
zroot/usr/ports     4.68G  27.2G  4.68G  /usr/ports
zroot/usr/src       1.74G  27.2G  1.74G  /usr/src
zroot/var            166M  27.2G    96K  /var
zroot/var/crash      348K  27.2G   348K  /var/crash
zroot/var/log        135M  27.2G   135M  /var/log
zroot/var/mail      29.7M  27.2G  29.7M  /var/mail
zroot/var/tmp        144K  27.2G   144K  /var/tmp

powersj commented 2 years ago

Thanks for this @mdtancsa, so at this point it sounds like from telegraf's perspective you are good to go now?

mdtancsa commented 2 years ago

@powersj yes, I am good to go. It might be helpful to mention this in the documentation, for the edge cases where users have listsnapshots on like I did.
e.g.

--- config      2022-05-12 14:17:23.391727000 -0400
+++ config.new  2022-05-12 14:17:07.987455000 -0400
@@ -6718,7 +6718,8 @@
 #   #   "dmu_tx", "fm", "vdev_mirror_stats", "zfetchstats", "zil"]
 #   ## By default, don't gather zpool stats
 #   # poolMetrics = false
-#   ## By default, don't gather zdataset stats
+#   ## By default, don't gather zdataset stats. Note  listsnapshots=off  must be set for 
+#   ## the pool for this metric collection to work
 #   # datasetMetrics = false

powersj commented 2 years ago

Can you take a look at https://github.com/influxdata/telegraf/pull/11091?

mdtancsa commented 2 years ago

Thanks again !

mdtancsa commented 1 year ago

It looks like this issue is back in RELENG_14 of FreeBSD and ZFS. The output of the two commands is different in the newer version of FreeBSD and ZFS.

In RELENG_13, the output does not include a - for an empty value, e.g. in RELENG_13 we see

zfs list -Hp -o name,avail,used,usedsnap,usedds
tortank1        964702415808    1841073050688   0       142848
tortank1/vms    964702415808    1838508696960   1079164131456   693932894976
tortank1/vms/wsus       964702415808    65411670528     38910081024     26501589504
zrootnfsa1      363369832448    102768336896    0       98304
zrootnfsa1/ROOT 363369832448    27220209664     0       98304

In RELENG_14 however,

# zfs list -Hp -o name,avail,used,usedsnap,usedds
zroot   21170147328     4314443776      -       -
zroot/ROOT      21170147328     4301705216      -       -
zroot/ROOT/default      21170147328     4301271040      -       -
zroot/home      21170147328     778240  -       -
zroot/tmp       21170147328     589824  -       -
zroot/usr       21170147328     1724416 -       -
zroot/usr/obj   21170147328     430080  -       -
zroot/usr/ports 21170147328     430080  -       -
zroot/usr/src   21170147328     430080  -       -
zroot/var       21170147328     4358144 -       -
zroot/var/audit 21170147328     438272  -       -
zroot/var/log   21170147328     2555904 -       -
zroot/var/mail  21170147328     491520  -       -
zroot/var/tmp   21170147328     438272  -       -

The parser chokes on that.
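A tolerant parser could treat - as "no value reported" and simply leave that field out instead of failing the whole gather. The following is a minimal sketch under that assumption; the names columns and datasetFields are hypothetical, and this is not necessarily the approach taken in any Telegraf fix.

```go
package main

import (
	"fmt"
	"strconv"
)

// columns lists the numeric properties requested after the name column,
// matching `-o name,avail,used,usedsnap,usedds`.
var columns = []string{"avail", "used", "usedsnap", "usedds"}

// datasetFields converts one row into numeric fields. A "-" is treated as a
// missing value and the field is omitted; anything else must be an integer.
// Hypothetical helper for illustration only.
func datasetFields(row []string) (map[string]int64, error) {
	if len(row) != len(columns)+1 {
		return nil, fmt.Errorf("expected %d columns, got %d", len(columns)+1, len(row))
	}
	fields := make(map[string]int64, len(columns))
	for i, name := range columns {
		value := row[i+1]
		if value == "-" {
			continue // no value reported for this property
		}
		n, err := strconv.ParseInt(value, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %s %q: %w", name, value, err)
		}
		fields[name] = n
	}
	return fields, nil
}

func main() {
	row := []string{"zroot", "21170147328", "4314443776", "-", "-"}
	fields, err := datasetFields(row)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fields)
}
```

For the zroot row from the RELENG_14 output, this keeps avail and used and drops usedsnap and usedds. Whether omitting the fields or reporting them as zero is the better choice depends on how the metrics are consumed downstream.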

 telegraf --test --debug --config telegraf.conf.min
2023-10-19T20:16:47Z I! Loading config: telegraf.conf.min
2023-10-19T20:16:47Z I! Starting Telegraf unknown brought to you by InfluxData the makers of InfluxDB
2023-10-19T20:16:47Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 4 secret-stores
2023-10-19T20:16:47Z I! Loaded inputs: zfs
2023-10-19T20:16:47Z I! Loaded aggregators: 
2023-10-19T20:16:47Z I! Loaded processors: 
2023-10-19T20:16:47Z I! Loaded secretstores: 
2023-10-19T20:16:47Z W! Outputs are not used in testing mode!
2023-10-19T20:16:47Z I! Tags enabled: host=wsus-windows-firewall.sentex.ca
2023-10-19T20:16:47Z D! [agent] Initializing plugins
2023-10-19T20:16:47Z D! [agent] Starting service inputs
> zfs_pool,health=ONLINE,host=wsus-windows-firewall.sentex.ca,pool=zroot allocated=4315226112i,capacity=16i,dedupratio=1,fragmentation=1i,free=21991448576i,size=26306674688i 1697746607000000000
2023-10-19T20:16:47Z E! [inputs.zfs] Error in plugin: Error parsing usedsnap "-": strconv.ParseInt: parsing "-": invalid syntax
2023-10-19T20:16:47Z D! [agent] Stopping service inputs
2023-10-19T20:16:47Z D! [agent] Input channel closed
2023-10-19T20:16:47Z D! [agent] Stopped Successfully
2023-10-19T20:16:47Z E! [telegraf] Error running agent: input plugins recorded 1 errors

mdtancsa commented 1 year ago

This might actually be a FreeBSD bug. The -p flag should show parsable numbers and not dashes. Also, there should be values, but even with snapshots they are always a -. Going to see what the FreeBSD people say.

mdtancsa commented 1 year ago

Tracking via https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=274613

srebhan commented 1 year ago

@mdtancsa can you please test the binary in #14176, which will be available once CI has finished the tests? With this I can successfully run Telegraf on a 14.0-RC2 machine...