Closed mdtancsa closed 1 year ago
Error parsing avail "-": strconv.ParseInt: parsing "-": invalid syntax
This error says it is failing to parse the avail column while trying to parse "-".
The FreeBSD code path literally runs a couple of commands and parses the output. Can you collect the output of the following two commands please:
zpool list -Hp -o name,health,size,alloc,free,fragmentation,capacity,dedupratio
zfs list -Hp -o name,avail,used,usedsnap,usedds
Thanks!
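To illustrate why a single "-" aborts collection, here is a minimal sketch of a strict line parser for the second command's output. It is written in Python for brevity and is not Telegraf's actual Go code; the function name parse_dataset_line is hypothetical, and it assumes the tab-separated fields that -H produces.

```python
# Column order matches: zfs list -Hp -o name,avail,used,usedsnap,usedds
COLUMNS = ("avail", "used", "usedsnap", "usedds")


def parse_dataset_line(line):
    """Strictly parse one line of scripted `zfs list` output.

    Hypothetical helper for illustration only. Assumes fields are
    separated by single tabs, as -H emits them.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS) + 1:
        raise ValueError(f"expected {len(COLUMNS) + 1} fields, got {len(fields)}")
    name = fields[0]
    metrics = {}
    for column, raw in zip(COLUMNS, fields[1:]):
        try:
            metrics[column] = int(raw)
        except ValueError as exc:
            # A "-" placeholder is not an integer, so a strict parser
            # fails on the first such column it meets.
            raise ValueError(f'Error parsing {column} "{raw}"') from exc
    return name, metrics
```

A line whose numeric columns are all integers parses cleanly, while any line carrying "-" in a numeric column raises on that column, which matches the shape of the reported error.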
OK, I see what's going on! Now the question is... WHY!? It's actually not the datasets in question. It's another pool. For some reason on this server (which started out as RELENG_10, then RELENG_11, then RELENG_12), zfs list gives different output on the boot pool.
Of the 3 pools, one shows snapshots when doing a zfs list, for some reason.
e.g.
zroot 25024442368 77953568768 0 98304
zroot@6 - 0 - -
zroot@0 - 0 - -
zroot@1 - 0 - -
zroot@2 - 0 - -
zroot@3 - 0 - -
zroot@4 - 0 - -
zroot/ROOT 25024442368 16143228928 0 98304
zroot/ROOT@6 - 0 - -
zroot/ROOT@0 - 0 - -
zroot/ROOT@1 - 0 - -
zroot/ROOT@2 - 0 - -
zroot/ROOT@3 - 0 - -
zroot/ROOT@4 - 0 - -
zroot/ROOT/default 25024442368 16143130624 672641024 15470489600
zroot/ROOT/default@6 - 102952960 - -
zroot/ROOT/default@0 - 75075584 - -
zroot/ROOT/default@1 - 74461184 - -
zroot/ROOT/default@2 - 77942784 - -
zroot/ROOT/default@3 - 77799424 - -
zroot/ROOT/default@4 - 76488704 - -
If I nuke the boot pool's snapshots, telegraf works as expected, because telegraf is amazing :) But if I do
zfs snapshot -r zroot@test2
zfs list -r zroot
NAME USED AVAIL REFER MOUNTPOINT
zroot 71.0G 24.9G 96K none
zroot@test2 0 - 96K -
zroot/ROOT 14.4G 24.9G 96K none
zroot/ROOT@test2 0 - 96K -
zroot/ROOT/default 14.4G 24.9G 14.4G /
zroot/ROOT/default@test2 0 - 14.4G -
zroot/coredisk0 21.2G 41.4G 4.68G -
zroot/coredisk0@test2 0 - 4.68G -
zroot/snappydisk0 20.6G 41.4G 4.06G -
zroot/snappydisk0@test2 0 - 4.06G -
zroot/tmp 3.68G 24.9G 3.68G /tmp
zroot/tmp@test2 0 - 3.68G -
zroot/usr 10.9G 24.9G 96K /usr
zroot/usr@test2 0 - 96K -
zroot/usr/home 4.52G 24.9G 4.52G /usr/home
zroot/usr/home@test2 0 - 4.52G -
zroot/usr/ports 4.68G 24.9G 4.68G /usr/ports
zroot/usr/ports@test2 0 - 4.68G -
zroot/usr/src 1.74G 24.9G 1.74G /usr/src
zroot/usr/src@test2 0 - 1.74G -
zroot/var 165M 24.9G 96K /var
zroot/var@test2 0 - 96K -
zroot/var/crash 348K 24.9G 348K /var/crash
zroot/var/crash@test2 0 - 348K -
zroot/var/log 135M 24.9G 135M /var/log
zroot/var/log@test2 0 - 135M -
zroot/var/mail 29.7M 24.9G 29.7M /var/mail
zroot/var/mail@test2 0 - 29.7M -
zroot/var/tmp 144K 24.9G 144K /var/tmp
zroot/var/tmp@test2 0 - 144K -
On a different RELENG_12 box, this is not an issue, and the other 2 large pools on the troubled box in question don't do this either. I guess the next question is why / how did it get that way.
I don't see any properties that would affect this. I will ask on a FreeBSD list to see how I can control this behaviour, as it seems to be some bug. I am guessing something along the way got messed up when upgrading from older pools. I tried the obvious one (setting snapdir=visible, then snapdir=hidden) but it didn't change anything for this pool.
zfs get all zroot
NAME PROPERTY VALUE SOURCE
zroot type filesystem -
zroot creation Fri Oct 24 15:25 2014 -
zroot used 62.3G -
zroot available 33.6G -
zroot referenced 96K -
zroot compressratio 2.00x -
zroot mounted no -
zroot quota none default
zroot reservation none default
zroot recordsize 128K default
zroot mountpoint none local
zroot sharenfs off default
zroot checksum on default
zroot compression lz4 local
zroot atime on local
zroot devices on default
zroot exec on default
zroot setuid on default
zroot readonly off default
zroot jailed off default
zroot snapdir hidden local
zroot aclmode discard default
zroot aclinherit restricted default
zroot createtxg 1 -
zroot canmount on default
zroot xattr on default
zroot copies 1 default
zroot version 5 -
zroot utf8only off -
zroot normalization none -
zroot casesensitivity sensitive -
zroot vscan off default
zroot nbmand off default
zroot sharesmb off default
zroot refquota none default
zroot refreservation none default
zroot guid 15617863627434061981 -
zroot primarycache all default
zroot secondarycache all default
zroot usedbysnapshots 0 -
zroot usedbydataset 96K -
zroot usedbychildren 62.3G -
zroot usedbyrefreservation 0 -
zroot logbias latency default
zroot objsetid 21 -
zroot dedup off default
zroot mlslabel -
zroot sync standard default
zroot dnodesize legacy default
zroot refcompressratio 1.00x -
zroot written 96K -
zroot logicalused 73.0G -
zroot logicalreferenced 9.50K -
zroot volmode default default
zroot filesystem_limit none default
zroot snapshot_limit none default
zroot filesystem_count none default
zroot snapshot_count none default
zroot redundant_metadata all default
zroot special_small_blocks 0 default
OK, I found it! I was trying zfs, not zpool. It's a pool property, not a dataset property!
zpool get all zroot
NAME PROPERTY VALUE SOURCE
zroot size 99G -
zroot capacity 36% -
zroot altroot - default
zroot health ONLINE -
zroot guid 7300997414167702310 default
zroot version - default
zroot bootfs zroot/ROOT/default local
zroot delegation on default
zroot autoreplace off default
zroot cachefile - default
zroot failmode wait default
zroot listsnapshots on local
and
zfs snapshot -r zroot@test3
0{backup4}# zfs list -r zroot
NAME USED AVAIL REFER MOUNTPOINT
zroot 68.7G 27.2G 96K none
zroot/ROOT 14.4G 27.2G 96K none
zroot/ROOT/default 14.4G 27.2G 14.4G /
zroot/coredisk0 21.2G 43.7G 4.68G -
zroot/snappydisk0 20.6G 43.7G 4.06G -
zroot/tmp 1.37G 27.2G 1.37G /tmp
zroot/usr 10.9G 27.2G 96K /usr
zroot/usr/home 4.52G 27.2G 4.52G /usr/home
zroot/usr/ports 4.68G 27.2G 4.68G /usr/ports
zroot/usr/src 1.74G 27.2G 1.74G /usr/src
zroot/var 166M 27.2G 96K /var
zroot/var/crash 348K 27.2G 348K /var/crash
zroot/var/log 135M 27.2G 135M /var/log
zroot/var/mail 29.7M 27.2G 29.7M /var/mail
zroot/var/tmp 144K 27.2G 144K /var/tmp
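For setups where listsnapshots has to stay on, a collector could instead skip the snapshot rows itself. The sketch below is a hedged illustration in Python, not what Telegraf does; the helper name drop_snapshot_rows is hypothetical. It relies on the fact that snapshot names always contain "@", which is not permitted in filesystem or volume names, and it assumes tab-separated -H output.

```python
def drop_snapshot_rows(lines):
    """Return only the rows whose name column is not a snapshot.

    Hypothetical helper for illustration only. Snapshot names have the
    form dataset@snapname, so an "@" in the first column identifies them.
    """
    kept = []
    for line in lines:
        # The name is the first tab-separated field of each row.
        name = line.split("\t", 1)[0]
        if "@" in name:
            continue
        kept.append(line)
    return kept
```

Restricting the listing at the source with the -t option of zfs list (for example, filesystems and volumes only) would be another way to get the same effect without post-filtering.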
Thanks for this @mdtancsa, so at this point it sounds like from telegraf's perspective you are good to go now?
@powersj yes, I am good to go. It might be helpful to mention this in the documentation, for the edge cases where users have this turned on like I did.
e.g.
--- config 2022-05-12 14:17:23.391727000 -0400
+++ config.new 2022-05-12 14:17:07.987455000 -0400
@@ -6718,7 +6718,8 @@
# # "dmu_tx", "fm", "vdev_mirror_stats", "zfetchstats", "zil"]
# ## By default, don't gather zpool stats
# # poolMetrics = false
-# ## By default, don't gather zdataset stats
+# ## By default, don't gather zdataset stats. Note listsnapshots=off must be set for
+# ## the pool for this metric collection to work
# # datasetMetrics = false
Can you take a look at https://github.com/influxdata/telegraf/pull/11091?
Thanks again!
It looks like this issue is back in RELENG_14 of FreeBSD and ZFS. The output of the two commands is different in the newer version of FreeBSD and ZFS.
In RELENG_13, the output does not include a - for an empty value, e.g. in RELENG_13 we see
zfs list -Hp -o name,avail,used,usedsnap,usedds
tortank1 964702415808 1841073050688 0 142848
tortank1/vms 964702415808 1838508696960 1079164131456 693932894976
tortank1/vms/wsus 964702415808 65411670528 38910081024 26501589504
zrootnfsa1 363369832448 102768336896 0 98304
zrootnfsa1/ROOT 363369832448 27220209664 0 98304
In RELENG_14 however,
# zfs list -Hp -o name,avail,used,usedsnap,usedds
zroot 21170147328 4314443776 - -
zroot/ROOT 21170147328 4301705216 - -
zroot/ROOT/default 21170147328 4301271040 - -
zroot/home 21170147328 778240 - -
zroot/tmp 21170147328 589824 - -
zroot/usr 21170147328 1724416 - -
zroot/usr/obj 21170147328 430080 - -
zroot/usr/ports 21170147328 430080 - -
zroot/usr/src 21170147328 430080 - -
zroot/var 21170147328 4358144 - -
zroot/var/audit 21170147328 438272 - -
zroot/var/log 21170147328 2555904 - -
zroot/var/mail 21170147328 491520 - -
zroot/var/tmp 21170147328 438272 - -
The parser chokes on that.
telegraf --test --debug --config telegraf.conf.min
2023-10-19T20:16:47Z I! Loading config: telegraf.conf.min
2023-10-19T20:16:47Z I! Starting Telegraf unknown brought to you by InfluxData the makers of InfluxDB
2023-10-19T20:16:47Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 4 secret-stores
2023-10-19T20:16:47Z I! Loaded inputs: zfs
2023-10-19T20:16:47Z I! Loaded aggregators:
2023-10-19T20:16:47Z I! Loaded processors:
2023-10-19T20:16:47Z I! Loaded secretstores:
2023-10-19T20:16:47Z W! Outputs are not used in testing mode!
2023-10-19T20:16:47Z I! Tags enabled: host=wsus-windows-firewall.sentex.ca
2023-10-19T20:16:47Z D! [agent] Initializing plugins
2023-10-19T20:16:47Z D! [agent] Starting service inputs
> zfs_pool,health=ONLINE,host=wsus-windows-firewall.sentex.ca,pool=zroot allocated=4315226112i,capacity=16i,dedupratio=1,fragmentation=1i,free=21991448576i,size=26306674688i 1697746607000000000
2023-10-19T20:16:47Z E! [inputs.zfs] Error in plugin: Error parsing usedsnap "-": strconv.ParseInt: parsing "-": invalid syntax
2023-10-19T20:16:47Z D! [agent] Stopping service inputs
2023-10-19T20:16:47Z D! [agent] Input channel closed
2023-10-19T20:16:47Z D! [agent] Stopped Successfully
2023-10-19T20:16:47Z E! [telegraf] Error running agent: input plugins recorded 1 errors
This might actually be a FreeBSD bug. The -p flag should show parsable numbers and not dashes. Also, there should be values here, but even with snapshots present they are always a -. Going to see what the FreeBSD people say.
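Whatever the upstream answer turns out to be, a parser can be made tolerant of the dash. The following is a minimal Python sketch of that idea, treating "-" as "no value" and omitting the field. It is an assumption about one reasonable approach and is not taken from the fix in #14176; the names parse_optional_int and parse_dataset_fields are hypothetical.

```python
# Column order matches: zfs list -Hp -o name,avail,used,usedsnap,usedds
DATASET_COLUMNS = ("avail", "used", "usedsnap", "usedds")


def parse_optional_int(raw):
    """Return None for the "-" placeholder, otherwise the integer value."""
    if raw == "-":
        return None
    return int(raw)


def parse_dataset_fields(line):
    """Parse one tab-separated row, skipping columns reported as "-".

    Hypothetical helper for illustration only. Real garbage such as
    "abc" still raises ValueError, so only the placeholder is tolerated.
    """
    name, *values = line.rstrip("\n").split("\t")
    fields = {}
    for column, raw in zip(DATASET_COLUMNS, values):
        value = parse_optional_int(raw)
        if value is not None:
            fields[column] = value
    return name, fields
```

With this approach a RELENG_14 row still yields avail and used, and the two columns shown as "-" are simply left out rather than failing the whole gather.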
@mdtancsa can you please test the binary in #14176 available once CI has finished the tests. With this I can successfully run Telegraf on a 14.0-RC2 machine...
Relevant telegraf.conf
Logs from Telegraf
System info
telegraf --version Telegraf 1.22.3, FreeBSD 12.3
Docker
No response
Steps to reproduce
telegraf --debug --test --config telegraf.conf
But it doesn't seem to cause issues on other servers where I have volumes, so I'm not sure what else is going on.
Expected behavior
Telegraf should not throw that error and should not stop collecting arcstats
Actual behavior
Telegraf hits that error and does not proceed to collect / report the rest of the datasets nor arcstats
Additional info
No response