Closed sempervictus closed 9 years ago
Cache device was definitely the problem: send went from ~2MB/s to >70MB/s, and the massive (50-70%) IOWait on reads is gone as well.
I've checked systems running prior builds, mostly using #2129 and iSCSI stacks, and I'm not seeing this. May be coming from #3115, may be coming from somewhere else in #3189 (and/or #3216 since it's a very similar stack).
@kernelOfTruth: are you seeing this behavior elsewhere? It doesn't seem apparent until the cache device has been in use for a while, and the SCST load + ZFS send/receives we've been doing on these hosts fit the bill quite nicely to create this condition. The perf degradation is catastrophic - overarching subscribers time out on IO, databases fail, life sucks in general. Without a ZFS unload, removing the cache device did increase performance for a couple of minutes, but the system stalled out very quickly back into a sad iowait-related crawl (dmesg clean). Rmmod zfs failed, claiming the module was still in use. Soft reboot failed; had to ipmi the sucker down hard to unload.
@sempervictus oh dear, that sounds grave !
unfortunately no - this system hardly has an uptime greater than 2.5-3 days and the l2arc isn't completely filled most of the time
Jumping in,
I've opened #3358 a few days ago and I think it's somehow related. #1420 points in that direction too. I have migrated SAN nodes from SCST/iSCSI over DRBD to SCST/iSCSI over ZVOLs, starting with 0.6.3 and now 0.6.4.1 (nodes have 32G of RAM)
L2ARC is just unusable in that context for me. I can confirm 'l2_size' going way beyond my partition size. For instance:
[root@sanlab2 ~]# lsblk /dev/sdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 477G 0 disk
├─sdb1 8:17 0 59.6G 0 part
├─sdb2 8:18 0 59.6G 0 part
├─sdb3 8:19 0 59.6G 0 part
└─sdb4 8:20 0 59.6G 0 part
Partition 1 (sdb1, 60G) has this WWN:
lrwxrwxrwx 1 root root 9 Apr 29 15:27 wwn-0x50025385a014444d -> ../../sdb
lrwxrwxrwx 1 root root 10 Apr 30 09:56 wwn-0x50025385a014444d-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part4 -> ../../sdb4
I'm using this WWN as cache vdev:
NAME STATE READ WRITE CKSUM
p02 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
wwn-0x5000c5006c5f04ab ONLINE 0 0 0
wwn-0x5000c5006c5f3b6f ONLINE 0 0 0
wwn-0x5000c5006c5f24bf ONLINE 0 0 0
wwn-0x5000c5006c5f36ab ONLINE 0 0 0
cache
wwn-0x50025385a014444d-part1 ONLINE 0 0 0
However, 'arcstat.sh' reports a 274G L2ARC:
[root@sanlab2 ~]# arcstat.sh
|---------------------------------------------------------------------------------------------------------------------|
|l1reads l1miss l1hits l1hit% size | l2reads l2misses l2hits l2hit% size disk_access% |
|---------------------------------------------------------------------------------------------------------------------|
|463645439 132874797 330770642 71.341% 11 GB | 131706937 49127270 82579667 62.699% 274GB 10.595% |
/proc/spl/kstat/zfs/arcstats seems to agree:
[root@sanlab2 ~]# cat /proc/spl/kstat/zfs/arcstats
...
l2_size 4 294606549504
...
l2_hdr_size 4 12651900104
...
Over time, I also see performance going down and the 'arc_adapt' process using a lot of CPU. More importantly, if I try to remove this cache vdev then IOs are blocked to all the pool ZVOLs for several minutes. According to strace, 'zpool remove' spends this time on ioctl 0x5a0c:
22:03:56 ioctl(3, 0x5a0c, 0x7fffa57f1c40) = 0
I can read a lot of people claiming that L2ARC can have a negative impact depending on the workload, but I still have the feeling that something is going wrong here.
More than happy to provide more logs/traces/info.
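For anyone comparing these numbers by hand, a small sketch can pull l2_size out of the kstat text and check it against the partition size. This is purely illustrative: the parse_arcstats/l2_overflow helpers and the sample values are my own, not part of ZoL.

```python
# Sketch: parse the /proc/spl/kstat/zfs/arcstats layout ("name type data")
# and flag an l2_size that exceeds the cache device's physical capacity.

def parse_arcstats(text):
    """Return {name: value} from kstat 'name type data' lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats

def l2_overflow(stats, device_bytes):
    """Bytes by which l2_size overruns the cache device (0 if sane)."""
    return max(0, stats.get("l2_size", 0) - device_bytes)

# Values from the report above: l2_size ~274G on a ~60G partition.
sample = """l2_size 4 294606549504
l2_hdr_size 4 12651900104"""
stats = parse_arcstats(sample)
print(l2_overflow(stats, 64 * 1024**3))  # overflow vs. a 64GiB partition
```

Feeding it the real file (`open("/proc/spl/kstat/zfs/arcstats").read()`) and the partition size from lsblk would give the same check on a live system.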
okay, so it seems the changes from
https://github.com/zfsonlinux/zfs/pull/2110 (merged)
and https://github.com/zfsonlinux/zfs/pull/3115 (to be merged, included in e.g. #3189 , #3190 , #3216 )
are not enough to plug this issue
slowing down might be partly due to: https://github.com/zfsonlinux/zfs/issues/361
Things to read into:
https://github.com/zfsonlinux/zfs/issues/361#issuecomment-18578208 https://www.illumos.org/issues/3794
https://github.com/zfsonlinux/zfs/issues/361#issuecomment-77006614 http://lists.open-zfs.org/pipermail/developer/2015-January/001222.html
Explanation from @behlendorf ( https://github.com/zfsonlinux/zfs/issues/1420#issuecomment-16831663 )
What's happening is that virtually all of your 4GB of ARC space is being consumed managing the 125GB of data in the L2ARC. This means there's basically no memory available for anything else which is why your system is struggling.
To explain a little more, when a data buffer gets removed from the primary ARC cache and migrated to the L2ARC, a reference to the L2ARC buffer must be left in memory. Depending on how large your L2ARC device is and what your default block size is, it can take a significant amount of memory to manage these headers. This can get particularly bad for ZVOLs because they have a small 8k block size vs 128k for a file system. This means the ARC's memory requirements for L2ARC headers increase by 16x.
You can check for this in the l2_hdr_size field of the arcstats if you know what you're looking for. There are really only two ways to handle this at the moment.
1) Add additional memory to your system so the ARC is large enough to manage your entire L2ARC device. 2) Manually partition your L2ARC device so it's smaller.
Arguably ZFS should internally limit its L2ARC usage to prevent this pathological behavior and that's something we'll want to look in to. The upstream code also suffers from this issue but it's somewhat hidden because the vendors will carefully size the ARC and L2ARC to avoid this case.
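The header arithmetic above can be sketched numerically. A minimal model, assuming a flat ~180 bytes of ARC header per cached buffer (the real arc_buf_hdr_t size varies across ZoL versions), shows why an 8k-blocksize ZVOL needs 16x the header RAM of a 128k-recordsize filesystem:

```python
# Rough model of the ARC (RAM) cost of indexing a full L2ARC device.
# HDR_BYTES is an assumption; the real per-buffer header size differs
# between ZFS versions, but the 16x ratio only depends on block size.
HDR_BYTES = 180

def l2arc_header_overhead(l2arc_bytes, block_bytes, hdr_bytes=HDR_BYTES):
    """Bytes of RAM consumed by ARC headers for a completely full L2ARC."""
    buffers = l2arc_bytes // block_bytes
    return buffers * hdr_bytes

GiB = 1024 ** 3
zvol = l2arc_header_overhead(125 * GiB, 8 * 1024)    # 8k ZVOL blocks
fs = l2arc_header_overhead(125 * GiB, 128 * 1024)    # 128k filesystem records
print(zvol / GiB, fs / GiB)  # ~2.7 GiB vs ~0.17 GiB of header RAM
```

With those assumptions, the 125GB L2ARC from the quote costs a few GB of ARC just for ZVOL headers, which is most of a 4GB ARC.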
not sure how much this still applies after all changes merged
@kernelOfTruth: the L2ARC in that example is a 64G device, and there's 12G of ARC. The only data being served is a pair of ZVOLs, so not metadata heavy (or shouldn't be). The host has 24G of physical memory, and there's 20% unused even when SCST buffers start eating away at it. If the RAM is accounting for all the insanely misplaced pointers to L2ARC space which has been wrapped around and doesn't really exist, it's still a problem with the wraparound.
@sempervictus I got a simple idea you might wanna give a try. My system has 32G installed; after firing up the VMs and all services it takes about 16G, leaving 16G free, so it seems reasonable to give the ZFS ARC 8GB, right?
Wrong - in the end it either panicked really soon or got caught in a kswapd0 spin within a few days, so I lowered it to 4GB and it's now been up and running for 2 weeks. I have a 64G device for L2ARC too; my l2_hdr_size is 162,072,864. It serves mostly 128k record sizes, so it won't hit the metadata limit.
I'm using 0.6.4.1 now. On 0.6.3 I didn't have the kswapd0 spin issue, but if I gave it a lower arc_max value it would always blow past the limit under heavy metadata workload; it wouldn't crash if I used the default 16G arc_max. 0.6.4 honors the limit much better, but has a weird issue with kswapd0 spinning. I hit the same bottomless L2ARC issue on 0.6.3 a few times, but never caught the reason.
Just an idea for you to test while trying to survive this issue, but it's really hard to catch where things go wrong.
might be related to the observation that @odoucet made in https://github.com/zfsonlinux/zfs/issues/3259#issuecomment-91361348
https://github.com/zfsonlinux/zfs/pull/3190#issuecomment-88399082 https://github.com/zfsonlinux/zfs/pull/3216#issuecomment-89738971
one of his test cases was scanning of files (thus mostly reading) with clamav which also lead to some strange behavior
the behaviour I observed was due to an ARCsize set too high. I suggest you try to lower ARCSIZE and see if you observe the same.
Here's another one using #3216 and experiencing serious lag. Receive of a 4T ZVOL went from 100MB/s to several bytes if I'm lucky once the SSDs filled up:
root@storage-host:~# zpool list -v
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
fn00-pool00 14.5T 1.70T 12.8T - 4% 11% 1.00x ONLINE -
raidz2 14.5T 1.70T 12.8T - 4% 11%
scsi-1AMCC_serial_number_removed - - - - - -
scsi-1AMCC_serial_number_removed - - - - - -
scsi-1AMCC_serial_number_removed - - - - - -
scsi-1AMCC_serial_number_removed - - - - - -
scsi-1AMCC_serial_number_removed - - - - - -
scsi-1AMCC_serial_number_removed - - - - - -
scsi-1AMCC_serial_number_removed - - - - - -
scsi-1AMCC_serial_number_removed - - - - - -
mirror 46.5G 0 46.5G - 0% 0%
ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part1 - - - - - -
ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part1 - - - - - -
cache - - - - - -
ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part2 186G 518G 16.0E - 0% 277%
ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part2 186G 519G 16.0E - 0% 278%
root@storage-host:~# arcstat
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
15:25:15 0 0 0 0 0 0 0 0 0 24G 24G
root@storage-host:~# cat /proc/spl/kstat/zfs/arcstats | grep size
size 4 26245258968
hdr_size 4 729092784
data_size 4 0
meta_size 4 102400
other_size 4 136616
anon_size 4 18432
mru_size 4 83968
mru_ghost_size 4 25769576960
mfu_size 4 0
mfu_ghost_size 4 213504
l2_size 4 1051616213504
l2_asize 4 1046476891648
l2_hdr_size 4 25515927168
duplicate_buffers_size 4 0
root@storage-host:~#
The fun part is that the CPU is completely idle, no churn on any kernel tasks, and there's no IOWait to speak of (it's just screwing itself to the wall doing a receive from another host).
This isn't an ARC sizing issue; it's a problem with the L2ARC devices showing unreal capacities and the ARC being flooded with references into the dead space, by the looks of it.
EDIT: removing the cache devices has pegged a CPU and I'm slowly watching the l2_size drop - down to 800G, which is nuts, given that there's only 372G of L2ARC total.
EDIT2: after removal of the cache devices the pool was still unresponsive to receiving a send from another host. After exporting and importing the pool, same deal. zfs module could not be unloaded, claimed in use, showed no consumer. After reboot the pool imported and immediately went to 30% CPU use for at least 20 minutes running txg_sync. This pool is empty aside from an empty DS storing an empty ZVOL which is the recipient target of the send that caused the hangup in the first place. Something is seriously rotten in the state of Denmark.
This is starting to look a bit like Illumos issue 5701 - https://reviews.csiden.org/r/175/. From reading the relevant ZoL PRs, I gather we need #3115 and #3038 to port onto.
I pushed a test port of this to #3216
upstream: https://github.com/illumos/illumos-gate/commit/a52fc310ba80fa3b2006110936198de7f828cd94
https://github.com/kernelOfTruth/zfs/commit/c99dfad8b839e9560349caada1e08135c82e41e7
let's see what the buildbots say
From the wording of the Illumos issue entry it at first sounds as if only
zpool list is broken - thus cosmetic - but:
https://www.illumos.org/issues/5701
The l2arc vdev space accounting was broken as of the landing of the l2arc RAM reduction patch (89c86e3), as that patch completely removed a couple calls to vdev_space_update(). The issue was compounded with the lock change in l2arc_write_buffers() that went in with the arc_state contention patch (244781f), as buffers could now be released with their ARC_FLAG_L2_WRITING flag set without having been issued to the l2arc device (this results in decrements without accompanying increments).
so a regression, bug
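A toy model, assuming nothing beyond ordinary unsigned 64-bit counter arithmetic, shows how a decrement without a matching increment produces the bogus 16.0E readings seen in zpool list. The vdev_space_update below is a stand-in for illustration, not the real function:

```python
# Toy model of the 5701 accounting bug: vdev space stats live in unsigned
# 64-bit counters, so a decrement with no prior increment wraps around
# and displays as ~16 EiB ("16.0E") of allocated/free space.
U64 = 2 ** 64

def vdev_space_update(alloc, delta):
    """Apply a signed delta to an unsigned 64-bit allocation counter."""
    return (alloc + delta) % U64

alloc = 0
# A buffer released with ARC_FLAG_L2_WRITING set but never actually
# written to the l2arc device: the decrement lands without an increment.
alloc = vdev_space_update(alloc, -8192)

EiB = 2 ** 60
print(round(alloc / EiB, 1))  # ~16.0, i.e. the bogus 16.0E column
```

The same wraparound explains the 16.0E FREE column in the zpool list output earlier in the thread.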
@sempervictus looks like all works well, the buildbots give green light in #3216 for those two additional commits - feel free to give it a good testing - I've also referenced this issue for @dweeezil
thanks
I have made a few tests here with #3216 . Things have improved on the performance side, but I still end up with a reported L2 size of 300+ GB on a 60G partition!
My setup:
I export a ZVOL with lz4 compression over SCST/iSCSI (blockio) to ESXi (2x10G ethernet, MPIO). I have a Win 2008R2 VM with iometer. 1 OS disk plus 2 disks for measurements sitting on the exported ZVOL.
I configured 4 threads (2 per disk) with 256K transfers, 80% random, 50% read, 16 outstanding IOs. The 4 threads are hitting the same ZVOL. Following are the iops (r+w) and bandwidth (r+w) graphs for a single thread over 8 hours:
0.6.4.1, no cache, iops http://i.imgur.com/K301KlG.png
0.6.4.1, no cache, bw http://i.imgur.com/wjA7eKh.png
0.6.4.1, cache, iops http://i.imgur.com/hWsbFI2.png
0.6.4.1, cache, bw http://i.imgur.com/YlphS3i.png
0.6.4-50_g19b6408, cache, iops http://i.imgur.com/IexinvM.png
0.6.4-50_g19b6408, cache, bw http://i.imgur.com/zC1LNYT.png
I should also make a test with the patch but without cache. Moreover, I have Storage IO control enabled on the ESXi side, which I should probably disable for a better view.
Still, the perf decrease over time with an L2 cache on 0.6.4.1 is very real but seems to be less dramatic with #3216
@pzwahlen thanks for your tests !
Just a poke in the dark: the SSD is running without TRIM, right ? tried disabling NCQ, e.g. libata.force=noncq if that makes a change ?
what is the setting of zfs_arc_max ? could you - as an experiment - set l2arc size to roughly less than twice of the ARC max ?
@sempervictus did you do any new stress testing with abd-next ? does this also happen there ? if not it could be either that the regression wasn't fully fixed by "5701 zpool list reports incorrect "alloc" value for cache" (not sure if that "fix" even was fully & correctly ported - but according to buildbots it appears so) or if it DOES happen with ABD only and without the changes from #3216 that it's possibly related to ABD-changes
@dweeezil since you now have access to some bigger "muscle" - did you observe anything similar in your tests ?
@kernelOfTruth: looks like abd_next is the culprit:
dm-uuid-CRYPT-LUKS1-...-sddc_crypt_unformatted 52.2G 69.4G 16.0E - 0% 132%
This host is running abd_next on master with no other ARC related patches. Time to ping @tuxoko on #2129, I suppose.
@sempervictus Uhm, if my reading serves me right, the 16.0E thing should be fixed by illumos 5701? Have you tried it on top of abd_next?
@kernelOfTruth I've not been doing anything with l2arc lately (working on #3115).
@tuxoko: as @pzwahlen reported, the addition of 5701 to @kernelOfTruth's stack did not mitigate the problem. 5701 fixed an issue introduced by the Illumos changes before it; this problem seems to have started from ABD since metadata moved into the buffers, and occurs even without the relevant Illumos ports which required 5701 in the first place (if I'm understanding this correctly).
@sempervictus But the problem also happens on illumos, which doesn't have ABD in the first place.
The problem behavior doesn't occur with master as far as I can tell, and only applies to the newer revisions of 2129.
@tuxoko recent changes in Illumos ( #3115 ) from the end of last year introduced a regression which got fixed by "Illumos 5701 zpool list reports incorrect "alloc" value for cache devices"
current ZFSonLinux master doesn't show this behavior, but master + ABD seems to show a similar behavior - if I understood the report correctly.
Since the upstream (Illumos) changes that led to this broken state, regression (according to the Illumos issue description) are not in ZFSonLinux master, the fix won't help in ABD's case
@sempervictus I hope that's the correct summarization
@sempervictus @kernelOfTruth ABD doesn't touch anything related to the vdev size stats, so I don't see how it could cause such bug. Also, in the first post by @pzwahlen https://github.com/zfsonlinux/zfs/issues/3400#issuecomment-101399683, he indicated that he saw l2_size grow over the disk size on 0.6.4.1. I assume that was on master?
@tuxoko so that's probably another building site - perhaps ABD stresses areas of ARC & L2ARC accounting that wouldn't otherwise be stressed ? ( if I remember correctly @sempervictus pointed out that high i/o wait and/or other issues occurred without ABD so he couldn't possibly hit this without the ABD improvements)
but referring to @pzwahlen 's report it must have been there prior to ABD changes - and perhaps a regression introduced between 0.6.3 and 0.6.4.1
thanks for pointing that out
I can confirm that in my case, L2ARC reported size goes way beyond the actual partition/disk size since 0.6.3 (I started using ZoL with this version).
0.6.4.1 with #3216 seems to improve performance a bit, but nothing changed on the reported size issue.
Also keep in mind I'm doing ZVOL only, I don't have a single "file" in my zfs datasets.
Thanks for looking into this, being able to use L2ARC would be really cool...
referencing #3114 (which includes the potential fix in the top comment)
adding some more: https://github.com/zfsonlinux/zfs/pull/1612 (superseded) , https://github.com/zfsonlinux/zfs/pull/1936 (superseded), https://github.com/zfsonlinux/zfs/pull/1967 (superseded), https://github.com/zfsonlinux/zfs/pull/2110 (merged) - Improve ARC hit rate with metadata heavy workloads #2110
https://github.com/zfsonlinux/zfs/pull/1622 (merged) - Illumos #3137 L2ARC compression #1622
https://github.com/zfsonlinux/zfs/pull/1522 (superseded), https://github.com/zfsonlinux/zfs/pull/1542 (merged) - Fix inaccurate arcstat_l2_hdr_size calculations #1542
https://github.com/zfsonlinux/zfs/issues/936 (closed, too many unknowns)
The issue referenced @ the bottom of the FreeNAS Redmine seems to indicate that this is now resolved in their kernel. Will pull their gh repo later and see what that actually was (though I'm sure by then @kernelOfTruth will have ported, tested, and polished it).
403
You are not authorized to access this page.
WTF ?!
I took the time and effort to create an account and take a look at the code - apparently it's not open source
@sempervictus you, by chance could point me to the actual commit ?
I'm sure I'm missing something blatantly obvious :question:
Right there with ya. Looks like the GH repo is not much help either; it's a build system, requiring that we build atop an existing freenas... digging further, but yeah, "WTF?!" is about right.
alright, let's go a few hierarchies up (since FreeBSD -> m0n0wall -> NanoBSD -> FreeNAS ; according to wikipedia)
http://lists.freebsd.org/pipermail/freebsd-bugs/2014-December/059376.html "[Bug 195746] New: zfs L2ARC wrong alloc/free size"
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195746
http://lists.freebsd.org/pipermail/svn-src-head/2014-December/065692.html
http://lists.freebsd.org/pipermail/svn-src-head/2014-November/065195.html https://svnweb.freebsd.org/base?view=revision&revision=273060
<-- that's supposedly the actual fix
@sempervictus @pzwahlen please take a look at #3216 , it contains that fix:
https://github.com/kernelOfTruth/zfs/commit/d7e1fd0a2b46f38815789ee7c9c425d0801a8d16
took me quite some time to track this down ^^
Let's see what the buildbots say,
now hopefully - with those 2 fixes - l2arc works properly and with ABD + the preliminary changes included in #3216 should offer a great deal of performance improvements
it's obviously not the end of the line since @tuxoko improved it further with abd_next (metadata support, large block support) and @dweeezil has discovered & fixed a few further mutex contention issues
Bright times ahead :+1:
@kernelOfTruth I don't see how that patch would fix this issue. Every other place calls vdev_space_update with b_asize (or a cumulative of them). Changing a single call to vdev_psize_to_asize(b_asize) would just make things more wrong than right.
Also, while not caused by the patch itself, the naming of the variables is completely screwed up. How is write_psize = vdev_psize_to_asize(write_asize)
supposed to make sense?
@tuxoko me neither XD
The thing is that this was mentioned in several of these mailing list threads and bug reports - but seemingly it helps with this issue; the reports are not clear, however. It would be nice to have a working L2ARC anyway
Let's do things one at a time: it appears that L2ARC never (?) actually worked under stressful conditions - having it working, if only for a preliminary time would be nice to stress-test current master (if it's not prevented by high i/o wait, etc.) to have a base that can be compared to #2129 (ABD/next) and #3115 - at least on ZFS on Linux. Even if it needs to be fixed again afterwards - in the end it would lead to a cleaner (and more logical, thus easier to maintain) codebase and functioning code
Also we need a buildbot with an L2ARC for zfs/master that runs through lots of cycles of stress-testing to find e.g. these kind of overflow and other related issues (high i/o wait ?) - in perhaps 3 main configurations (ARC to L2ARC ratio)
I've noticed several variables whose naming is screwed up - this is supposed to be fixed (partly ?) by https://github.com/dweeezil/zfs/commit/7deafbb55237217a6d915dbe1eea5bc2f2abe0aa
5369 arc flags should be an enum 5370 consistent arc_buf_hdr_t naming scheme
Thanks
I have L2ARCs working just fine under SCST workloads using 0.6.3-based patch stacks, with ABD, and a few other minor changes. One of these systems has been running for >180 days now without breaking its caches (also a ZVOL/SCST export). So L2ARC works, just not recently.
I'll build another set off of 3216 and go from there. This bug is a bit of a PITA to reproduce, and not all systems have an L2ARC on them. If anyone else is testing this, consider doing a large send/recv once you attach your cache devs, as this helps with flooding the sucker (still takes a few hours, but better than waiting days for it to happen of its own accord).
Thanks @kernelOfTruth and @tuxoko for tracking on this - still would love to know how and when we actually broke the L2ARCs in 0.6.4 or at about the timeframe of that tag, since it's preventing me from deploying to any environment that doesn't have a ZFS-aware systems team (appliance-style deployments only work when it functions like an appliance, consistently).
I just wanted to mention that testing with a small cache partition makes the issue appear much faster.
I now have a 4GB partition for cache on my SSD, and here's the arcstats after just 10 minutes of my 4 threads IOMeter (-> l2_size is almost 9 GB):
[root@sanlab2 ~]# cat /proc/spl/kstat/zfs/arcstats | grep ^l2
l2_hits 4 45977605
l2_misses 4 48415752
l2_feeds 4 54958
l2_rw_clash 4 37
l2_read_bytes 4 376204858368
l2_write_bytes 4 691504770048
l2_writes_sent 4 54891
l2_writes_done 4 54891
l2_writes_error 4 0
l2_writes_lock_retry 4 2196
l2_evict_lock_retry 4 85
l2_evict_reading 4 7
l2_evict_l1cached 4 1225217
l2_free_on_write 4 119994
l2_cdata_free_on_write 4 65
l2_abort_lowmem 4 0
l2_cksum_bad 4 556832
l2_io_error 4 8292
l2_size 4 9051697664
l2_asize 4 9028545024
l2_hdr_size 4 4110311712
l2_compress_successes 4 153006
l2_compress_zeros 4 0
l2_compress_failures 4 0
I'll do my best to test with #3216
Cheers
OK, #3216 with this one-liner doesn't seem to make a difference. My 4G cache has grown to 22G (l2_size) after about 30 minutes.
I'm running zfs-0.6.4-51_gd7e1fd0.
@pzwahlen thanks for the report !
If there's a chance - could you also please test that change ("l2arc space accounting mismatch") against current master
WITH / WITHOUT L2ARC compression ?
like @tuxoko indicated, this was unlikely to fix this issue - I'll have to take a closer look for who and under which circumstances it fixed this issue
commit 3a17a7a99a1a6332d0999f9be68e2b8dc3933de1
Author: Saso Kiselkov skiselkov@gmail.com
Date: Thu Aug 1 13:02:10 2013 -0700
Illumos #3137 L2ARC compression
3137 L2ARC compression
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Matthew Ahrens mahrens@delphix.com
Approved by: Dan McDonald danmcd@nexenta.com
References: illumos/illumos-gate@aad02571bc59671aa3103bb070ae365f531b0b62 https://www.illumos.org/issues/3137 http://wiki.illumos.org/display/illumos/L2ARC+Compression
Notes for Linux port:
A l2arc_nocompress module option was added to prevent the compression of l2arc buffers regardless of how a dataset's compression property is set. This allows the legacy behavior to be preserved.
Ported by: James H james@kagisoft.co.uk Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov Closes #1379
Could anyone who is running into these problems run their L2ARC for the tests related to this issue WITHOUT compression ?
Best practice (and effect) probably would be achieved by setting it at module load
zfs l2arc_nocompress=1
or via
echo 1 > /sys/module/zfs/parameters/l2arc_nocompress
This requires some in-depth testing of all eventualities to rule out that it e.g. is caused by introduction of the compression support
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195746#c1
Checksum and IO errors appear after an L2ARC device fills completely with cache data on any release of FreeBSD after L2ARC compression
Notice the lines
l2_cksum_bad 4 556832
l2_io_error 4 8292
posted from @pzwahlen which also show checksum errors
It would be interesting to see whether these kind of errors also would appear with e.g. LZJB
@sempervictus also reported a performance limitation regarding lz4 compression, that @behlendorf referenced a few days ago in a different issue and/or pull request
currently can't find it but at the back of my mind currently is the compression algorithm that might cause trouble - needs further investigation ... (but enough for today - at least for me)
Disabling compression makes no difference, which matches what I already observed with 0.6.3 and 0.6.4.1 (but I never reported on the tests I did with nocompress=1)
currently waiting for the buildbots (would they even produce meaningful results related to L2ARC stuff @behlendorf ?)
@pzwahlen and @sempervictus and @jflandry since this isn't widely tested - the patch suggested in #3433 (at least for now) would be only suitable for internal or "real testing" (non-production) test-runs
perhaps @avg-I can shed some light on the issue since he's the one who created that pull-request over at FreeBSD
@kernelOfTruth currently the buildbots don't provide much l2arc test coverage so I'm not sure how much they're going to reveal. As for this exact issue, I haven't had a chance to look into it, but it would be very helpful to know when it was introduced and how it can be reproduced.
@behlendorf It probably would be helpful to introduce these kind of tests into the buildbot in the future to be able to track similar issues down - I'm not entirely sure though in what exact way
Also we need a buildbot with an L2ARC for zfs/master that runs through lots of cycles of stress-testing to find e.g. these kind of overflow and other related issues (high i/o wait ?) - in perhaps 3 main configurations (ARC to L2ARC ratio)
The question probably comes down to the following: whether the servers hosting the buildbots are seated on top of SSDs (other servers needed ?) and how to reserve those partitions and how to make sure to generate the needed load even when the server works on other things in parallel.
As to the reproducibility: still looking for some clear info from the mailing lists and bugtrackers, best bet may be on pzwahlen
Thanks
@pzwahlen you seem to be able to trigger this rather quickly - could you please post a sample scenario + the steps on how to reproduce this ?
From your postings - it seems to involve a rather small L2ARC partition, a not too large ARC cache and repeated transfers of rather huge files (or just huge amounts of files ?) compared to the size of ARC - if I understood correctly ?
I still have the suspicion that this seems to be two issues that exhibit similar symptoms ( @sempervictus systems e.g. ran fine with a 0.6.3-based patch stack) - but according to those numerous reports on e.g. the NetBSD, FreeNAS, FreeBSD mailing lists it might actually be the same issue ... :confused: (with the common denominator that comparatively massive i/o load is necessary)
If it's not too much trouble and work & it's possible could you , @pzwahlen , please run a pretty recent master with L2ARC and compression disabled and post the stats, results here ?
Thanks
As far as 0.6.3 builds go - we run 10G iSCSI on it, with some hosts running SSD-only LUNs on 10G arista backed iSCSI via SCST for >8 months with no downtime or noticeable regression. These hosts are running 0.6.3 from the repos. Whatever this is, it was either introduced in the interim, or exposed by the kmem changes and subsequent patches.
I have 0.6.4.1 with 1TB L2ARC (on SSD) - 256GB RAM / 128G ARC size - and did not trigger the bug (maybe not enough IO ?). Is this a case that mixes a small ARC, a small L2ARC and lots of IO ? I have a duplicate system (same hardware) to try to reproduce it, if it helps ...
@pzwahlen Just to be sure and have a "control" state:
Does this performance degradation also occur WITHOUT an L2ARC device ?
@odoucet Thanks for your offer ! Much appreciated - I'm still looking for a reproducer - albeit slowly (since life's demanding its toll)
The one patch mentioned at https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?sortby=file&r1=256889&r2=256888&pathrev=256889 and #3436 doesn't seem applicable to ZFSonLinux (different implementation)
Sorry, life is demanding on this side, too ;-)
@sempervictus We started using ZFS with 0.6.3-1.1 (so we skipped 0.6.3). As soon as we started hitting our ZVOLS with real workloads (something like 40 VMs spread across 6 ZVOLS on 2 pools) we had the L2ARC size issue. We now have the same with 0.6.4.1 so in my case I always had this problem.
@kernelOfTruth I tested #3216 using the buildbot commands for checking out master and fetching the patch. I would say I was on master when doing my tests. Am I wrong ? To answer your last question, I don't see performance degradation without L2ARC.
I will do my best to document a replication scenario over the week-end. I also would like to find an FIO test that matches my IOMeter test, and run that locally on my ZVOLs, completely removing the network and SCST from the picture. Another option would be to replicate on a VM using L2ARC over a virtual disk and exporting a ZVOL over iSCSI to the Windows iSCSI Initiator (with IOMeter running on the Windows iSCSI client). This would be easy to package and share, then.
Finally, could that be related to ESXi sending ASYNC writes only and me running with sync=standard ? I'm also on ashift=12 if that matters.
Cheers!
Finally had time to make some more tests.
I have applied #3451 using the following commands from a buildbot log: http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-builder/builds/2391/steps/git_1/logs/stdio
Things are definitely much better. With a small (4G) l2arc partition that I could previously "overfill" in less than 10 minutes, I now couldn't reach more than 3G of used cache with my iometer test.
On a larger 60G partition, cache usage went up to 29G (after about 2 hours) and then stayed there.
I don't know if reaching half the cache size is normal with my workload or if it's the sign of another size calculation issue, though.
Performance doesn't seem to suffer in any way.
Are there other tests I could perform now that I have this running ?
Thx for the hard work!
@pzwahlen Thanks for sharing your stats =)
If it's possible - I'd say leave it running (if there's automated tests or some scripts) over a longer period of days to stress it some more.
Indeed, it could be that this (seemingly correctly behaving L2arc) shows some bug in the calculation or it's simply saturated and might fill up more over time.
Cheers
Quick update: after a few Storage vMotion I have been able to bring l2arc size to 81G out of my 128G partition. It means it definitely can grow over half the total size.
Another point is that during low activity times, l2arc size actually decreases, which is something I never saw before!
@sempervictus Do you have some additional feedback ?
@pzwahlen did you observe any stalls, performance issues or other regression-like behavior ?
@sempervictus not so far. However, I just applied #3451 over master (as far as I understand Git), and nothing related to #2129.
Now that I read back this conversation, I even wonder if it's not #3433 that is supposed to fix this l2arc behavior, instead of #3451 (even though it definitely changed things in the right direction)
Any input much appreciated. Cheers!
@pzwahlen sorry, I should have updated the title of #3433 - done now
according to @avg-I it's not entirely correct and thus should be superseded by #3451 (and subsequent changes) , so you applied the correct patch
So the only change that applies to current zfs master is #3451 .
"5701 zpool list reports incorrect "alloc" value for cache devices" does not apply and will be relevant once #3115 has been merged.
Hope that clears things up
Thanks
While running #3189 i came across something very strange - the L2 ARC has developed magical powers and turned a 64G SSD into a bottomless pit of data. This in turn has resulted in crushing performance degradation and apparent demonic possession of the SCST host running this export.
My assumption is that the following output is based on the number of times L2ARC has wrapped around:
Reads from the pool are suffering badly, IOWait on a send operation is >50%. About to drop the cache device and hope it gets fixed, but figured i should put this up here for now.