openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

L2ARC Capacity and Usage FUBAR - significant performance penalty apparently associated #3400

Closed sempervictus closed 9 years ago

sempervictus commented 9 years ago

While running #3189 I came across something very strange - the L2ARC has developed magical powers and turned a 64G SSD into a bottomless pit of data. This in turn has resulted in crushing performance degradation and apparent demonic possession of the SCST host running this export.

My assumption is that the following output is based on the number of times L2ARC has wrapped around:

zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
(omitted since this is a client system and the underlying VDEVs are named with serial numbers)
...
...
cache      -      -      -      -      -      -
  dm-name-crypt_2718620040  59.6G   667G  16.0E         -     0%  1118%

Reads from the pool are suffering badly; IOWait on a send operation is >50%. About to drop the cache device and hope it gets fixed, but figured I should put this up here for now.
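For reference, dropping a cache vdev is non-destructive and can be done online; roughly this (pool name is a placeholder, since the real one is omitted above):

# remove the L2ARC device from the pool, then confirm it is gone
zpool remove tank dm-name-crypt_2718620040
zpool status tank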

sempervictus commented 9 years ago

Cache device was definitely the problem: send went from ~2MB/s to >70MB/s, and the massive (50-70%) IOWait on reads is gone as well.

I've checked systems running prior builds, mostly using #2129 and iSCSI stacks, and I'm not seeing this. It may be coming from #3115, or from somewhere else in #3189 (and/or #3216, since it's a very similar stack).

@kernelOfTruth: are you seeing this behavior elsewhere? It doesn't become apparent until the cache device has been in use for a while, and the SCST load + ZFS send/receives we've been doing on these hosts fit the bill quite nicely to create this condition. The perf degradation is catastrophic - overarching subscribers time out on IO, databases fail, life sucks in general. Without a ZFS unload, removing the cache device did increase performance for a couple of minutes, but the system stalled out very quickly back into a sad iowait-related crawl (dmesg clean). rmmod zfs failed, claiming the module was still in use. Soft reboot failed; had to IPMI the sucker down hard to unload.
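For anyone else who ends up wedged like this, the stuck-module state can at least be confirmed before reaching for IPMI (a sketch - none of this fixes the hang):

# check whether the kernel still thinks the module is held open
lsmod | grep zfs
cat /sys/module/zfs/refcnt
# and whether any pools are still known to the kernel
zpool list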

kernelOfTruth commented 9 years ago

@sempervictus oh dear, that sounds grave !

unfortunately no - this system rarely stays up longer than 2.5-3 days, and the L2ARC isn't completely filled most of the time

pzwahlen commented 9 years ago

Jumping in,

I opened #3358 a few days ago and I think it's somehow related; #1420 points in that direction too. I have migrated SAN nodes from SCST/iSCSI over DRBD to SCST/iSCSI over ZVOLs, starting with 0.6.3 and now on 0.6.4.1 (the nodes have 32G of RAM).

L2ARC is just unusable in that context for me. I can confirm 'l2_size' going way beyond my partition size. For instance:

[root@sanlab2 ~]# lsblk /dev/sdb
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb      8:16   0  477G  0 disk 
├─sdb1   8:17   0 59.6G  0 part 
├─sdb2   8:18   0 59.6G  0 part 
├─sdb3   8:19   0 59.6G  0 part 
└─sdb4   8:20   0 59.6G  0 part 

Partition 1 (sdb1, 60G) has this WWN:

lrwxrwxrwx 1 root root  9 Apr 29 15:27 wwn-0x50025385a014444d -> ../../sdb
lrwxrwxrwx 1 root root 10 Apr 30 09:56 wwn-0x50025385a014444d-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Apr 29 15:27 wwn-0x50025385a014444d-part4 -> ../../sdb4

I'm using this WWN as cache vdev:

        NAME                            STATE     READ WRITE CKSUM
        p02                             ONLINE       0     0     0
          raidz1-0                      ONLINE       0     0     0
            wwn-0x5000c5006c5f04ab      ONLINE       0     0     0
            wwn-0x5000c5006c5f3b6f      ONLINE       0     0     0
            wwn-0x5000c5006c5f24bf      ONLINE       0     0     0
            wwn-0x5000c5006c5f36ab      ONLINE       0     0     0
        cache
          wwn-0x50025385a014444d-part1  ONLINE       0     0     0

However, 'arcstat.sh' reports a 274G L2ARC:

[root@sanlab2 ~]# arcstat.sh 
|---------------------------------------------------------------------------------------------------------------------|
|l1reads    l1miss     l1hits     l1hit%     size  |  l2reads    l2misses   l2hits     l2hit%     size   disk_access% |
|---------------------------------------------------------------------------------------------------------------------|
|463645439  132874797  330770642  71.341%    11 GB  |  131706937  49127270   82579667   62.699%    274GB   10.595%    |

/proc/spl/kstat/zfs/arcstats seems to agree:

[root@sanlab2 ~]# cat /proc/spl/kstat/zfs/arcstats

...
l2_size                         4    294606549504
...
l2_hdr_size                     4    12651900104
...
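For anyone wanting to reproduce the comparison, this is roughly how I pull those numbers against the real partition size (sdb1 in my case):

# reported L2ARC accounting vs. the actual device size, both in bytes
grep -E '^l2_(size|asize|hdr_size)' /proc/spl/kstat/zfs/arcstats
blockdev --getsize64 /dev/sdb1   # ~64e9 bytes for the 59.6G partition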

Over time, I also see performance going down and the 'arc_adapt' process using a lot of CPU. More importantly, if I try to remove this cache vdev, IOs to all of the pool's ZVOLs are blocked for several minutes. According to strace, 'zpool remove' spends this time in ioctl 0x5a0c:

22:03:56 ioctl(3, 0x5a0c, 0x7fffa57f1c40) = 0

I've read a lot of people claiming that L2ARC can have a negative impact depending on the workload, but I still have the feeling that something is going wrong here.

More than happy to provide more logs/traces/info.

kernelOfTruth commented 9 years ago

okay, so it seems the changes from

https://github.com/zfsonlinux/zfs/pull/2110 (merged)

and https://github.com/zfsonlinux/zfs/pull/3115 (to be merged, included in e.g. #3189 , #3190 , #3216 )

are not enough to plug this issue

slowing down might be partly due to: https://github.com/zfsonlinux/zfs/issues/361

Things to read into:

https://github.com/zfsonlinux/zfs/issues/361#issuecomment-18578208 https://www.illumos.org/issues/3794

https://github.com/zfsonlinux/zfs/issues/361#issuecomment-77006614 http://lists.open-zfs.org/pipermail/developer/2015-January/001222.html

Explanation from @behlendorf ( https://github.com/zfsonlinux/zfs/issues/1420#issuecomment-16831663 )

What's happening is that virtually all of your 4GB of ARC space is being consumed managing the 125GB of data in the L2ARC. This means there's basically no memory available for anything else which is why your system is struggling.

To explain a little more, when a data buffer gets removed from the primary ARC cache and migrated to the L2ARC, a reference to the L2ARC buffer must be left in memory. Depending on how large your L2ARC device is and what your default block size is, it can take a significant amount of memory to manage these headers. This can get particularly bad for ZVOLs because they have a small 8k block size vs 128k for a file system. This means the ARC's memory requirements for L2ARC headers increase by 16x.

You can check for this in the l2_hdr_size field of the arcstats if you know what you're looking for. There are really only two ways to handle this at the moment.

1) Add additional memory to your system so the ARC is large enough to manage your entire L2ARC device.
2) Manually partition your L2ARC device so it's smaller.

Arguably ZFS should internally limit its L2ARC usage to prevent this pathological behavior and that's something we'll want to look in to. The upstream code also suffers from this issue but it's somewhat hidden because the vendors will carefully size the ARC and L2ARC to avoid this case.

not sure how much of this still applies after all the changes that have been merged
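To put rough numbers on the explanation above - treating ~180 bytes of ARC memory per L2ARC buffer header as an assumed ballpark figure (it varies between releases) - a quick back-of-the-envelope calculation:

# ARC memory needed just for L2ARC headers: (L2ARC size / block size) * header size
# 64 GiB of L2ARC holding 8K zvol blocks vs. 128K filesystem records, ~180 B/header assumed
echo "8K blocks:    $(( 64 * 2**30 / 8192   * 180 / 2**20 )) MiB of ARC"   # ~1440 MiB
echo "128K records: $(( 64 * 2**30 / 131072 * 180 / 2**20 )) MiB of ARC"   # ~90 MiB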

sempervictus commented 9 years ago

@kernelOfTruth: the L2ARC in that example is a 64G device, and there's 12G of ARC. The only data being served is a pair of ZVOLs, so not metadata heavy (or shouldn't be). The host has 24G of physical memory, and there's 20% unused even when SCST buffers start eating away at it. If the RAM is accounting for all the insanely misplaced pointers to L2ARC space which has been wrapped around and doesn't really exist, it's still a problem with the wraparound.

#361 is likely unrelated here as it deals with writes - this is specific to reads: once the L2ARC has wrapped, reads go to hell in a handbasket. The block-size boundary issue is a real PITA for us, since Xen 6.2 wants 512b iSCSI sectors, but we're moving to 6.5 in the near term, so at least we'll be able to map 4K iSCSI blocks to 16K volblocksizes (to prevent the metadata swell of small volblocksizes).
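For completeness, that mapping is set per zvol at creation time; something like the following (name and size are placeholders, not our actual config):

# create a sparse zvol with a 16K volblocksize to cut L2ARC header overhead vs. 8K
zfs create -s -V 4T -o volblocksize=16K tank/scst-lun0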

AndCycle commented 9 years ago

@sempervictus I have a simple idea you might want to try. My system has 32G installed; after firing up the VMs and all services, about 16G is used, leaving 16G free - so it should be reasonable to give the ZFS ARC 8GB, right?

Wrong - in the end it either panicked fairly soon or got caught in a kswapd0 spin within a few days. So I lowered it to 4GB, and it has now been up and running for 2 weeks. I have a 64G device for L2ARC too; my l2_hdr_size is 162,072,864. It serves mostly 128k record sizes, so it doesn't hit the metadata limit.

I am using 0.6.4.1 now. On 0.6.3 I didn't have the kswapd0 spin issue, but if I set a lower arc_max it would always blow past the limit under heavy metadata workloads (it wouldn't crash with the default 16G arc_max). 0.6.4 honors the limit much better, but has the weird kswapd0 spin issue. I hit the same bottomless-L2ARC issue on 0.6.3 a few times, but never caught the reason.

Just an idea for you to test to try to survive this issue - it's really hard to catch where things go wrong.
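In case you want to copy the workaround, this is roughly what I mean by capping the ARC at 4GB (value in bytes, adjust to your box):

# runtime change, takes effect immediately but does not persist
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
# persistent across reboots
echo "options zfs zfs_arc_max=4294967296" > /etc/modprobe.d/zfs.conf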

kernelOfTruth commented 9 years ago

might be related to the observation that @odoucet made in https://github.com/zfsonlinux/zfs/issues/3259#issuecomment-91361348

https://github.com/zfsonlinux/zfs/pull/3190#issuecomment-88399082 https://github.com/zfsonlinux/zfs/pull/3216#issuecomment-89738971

one of his test cases was scanning files (thus mostly reading) with clamav, which also led to some strange behavior

odoucet commented 9 years ago

the behaviour I observed was due to an ARC size set too high. I suggest you try lowering the ARC size and see if you observe the same.

sempervictus commented 9 years ago

Here's another one using #3216 and experiencing serious lag. Receive of a 4T ZVOL went from 100MB/s to a few bytes per second if I'm lucky once the SSDs filled up:

root@storage-host:~# zpool list -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
fn00-pool00  14.5T  1.70T  12.8T         -     4%    11%  1.00x  ONLINE  -
  raidz2  14.5T  1.70T  12.8T         -     4%    11%
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
    scsi-1AMCC_serial_number_removed      -      -      -         -      -      -
  mirror  46.5G      0  46.5G         -     0%     0%
    ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part1      -      -      -         -      -      -
    ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part1      -      -      -         -      -      -
cache      -      -      -      -      -      -
  ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part2   186G   518G  16.0E         -     0%   277%
  ata-Samsung_SSD_840_EVO_250GB_serial_number_removed-part2   186G   519G  16.0E         -     0%   278%
root@storage-host:~# arcstat 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
15:25:15     0     0      0     0    0     0    0     0    0    24G   24G  
root@storage-host:~#  cat /proc/spl/kstat/zfs/arcstats  | grep size
size                            4    26245258968
hdr_size                        4    729092784
data_size                       4    0
meta_size                       4    102400
other_size                      4    136616
anon_size                       4    18432
mru_size                        4    83968
mru_ghost_size                  4    25769576960
mfu_size                        4    0
mfu_ghost_size                  4    213504
l2_size                         4    1051616213504
l2_asize                        4    1046476891648
l2_hdr_size                     4    25515927168
duplicate_buffers_size          4    0
root@storage-host:~# 

The fun part is that the CPU is completely idle, no churn on any kernel tasks, and there's no IOWait to speak of (it's just screwing itself to the wall doing a receive from another host).

This isn't an ARC sizing issue; it's a problem with the L2ARC devices showing unreal capacities and the ARC being flooded with references into the dead space, by the looks of it.

EDIT: removing the cache devices has pegged a CPU and I'm slowly watching the l2_size drop - down to 800G, which is nuts, given that there's only 372G of L2ARC total.
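(I'm watching the drain with nothing fancier than this:)

# poll the L2ARC counters every few seconds while the removal grinds on
watch -n 5 "grep -E '^l2_(size|asize|hdr_size)' /proc/spl/kstat/zfs/arcstats"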

EDIT2: after removal of the cache devices the pool was still unresponsive to receiving a send from another host. After exporting and importing the pool, same deal. The zfs module could not be unloaded - it claimed to be in use but showed no consumer. After reboot the pool imported and immediately went to 30% CPU use for at least 20 minutes running txg_sync. This pool is empty aside from an empty dataset holding an empty ZVOL, which is the recipient target of the send that caused the hangup in the first place. Something is seriously rotten in the state of Denmark.

sempervictus commented 9 years ago

This is starting to look a bit like Illumos issue 5701 - https://reviews.csiden.org/r/175/. From reading the relevant ZoL PRs, I gather we need #3115 and #3038 to port onto.

kernelOfTruth commented 9 years ago

I pushed a test port of this to #3216

upstream: https://github.com/illumos/illumos-gate/commit/a52fc310ba80fa3b2006110936198de7f828cd94

https://github.com/kernelOfTruth/zfs/commit/c99dfad8b839e9560349caada1e08135c82e41e7

let's see what the buildbots say

From the wording of the Illumos issue entry it at first sounds as if only zpool list is broken - thus cosmetic - but:

https://www.illumos.org/issues/5701

The l2arc vdev space accounting was broken as of the landing of the l2arc RAM reduction patch (89c86e3), as that patch completely removed a couple calls to vdev_space_update(). The issue was compounded with the lock change in l2arc_write_buffers() that went in with the arc_state contention patch (244781f), as buffers could now be released with their ARC_FLAG_L2_WRITING flag set without having been issued to the l2arc device (this results in decrements without accompanying increments).

so a regression, bug

kernelOfTruth commented 9 years ago

@sempervictus looks like all is well - the buildbots give the green light in #3216 for those two additional commits - feel free to give it a good testing. I've also referenced this issue for @dweeezil.

thanks

pzwahlen commented 9 years ago

I have run a few tests here with #3216. Things have improved on the performance side, but I still end up with a reported L2 size of 300+ GB on a 60G partition!

My setup:

I export a ZVOL with lz4 compression over SCST/iSCSI (blockio) to ESXi (2x10G ethernet, MPIO). I have a Win 2008R2 VM with iometer. 1 OS disk plus 2 disks for measurements sitting on the exported ZVOL.

I configured 4 threads (2 per disk) with 256K transfers, 80% random, 50% read, 16 outstanding IOs. The 4 threads are hitting the same ZVOL. Following are the iops (r+w) and bandwidth (r+w) graphs for a single thread over 8 hours:

0.6.4.1, no cache, iops http://i.imgur.com/K301KlG.png

0.6.4.1, no cache, bw http://i.imgur.com/wjA7eKh.png

0.6.4.1, cache, iops http://i.imgur.com/hWsbFI2.png

0.6.4.1, cache, bw http://i.imgur.com/YlphS3i.png

0.6.4-50_g19b6408, cache, iops http://i.imgur.com/IexinvM.png

0.6.4-50_g19b6408, cache, bw http://i.imgur.com/zC1LNYT.png

I should also make a test with the patch but without cache. Moreover, I have Storage IO control enabled on the ESXi side, which I should probably disable for a better view.

Still, the perf decrease over time with an L2 cache on 0.6.4.1 is very real but seems to be less dramatic with #3216

kernelOfTruth commented 9 years ago

@pzwahlen thanks for your tests !

Just a poke in the dark: the SSD is running without TRIM, right? Have you tried disabling NCQ, e.g. libata.force=noncq, to see if that makes a difference?

What is the setting of zfs_arc_max? Could you - as an experiment - set the L2ARC size to roughly less than twice the ARC max?

@sempervictus did you do any new stress testing with abd_next? Does this also happen there? If not, it could be that the regression wasn't fully fixed by "5701 zpool list reports incorrect "alloc" value for cache" (not sure that "fix" was even fully & correctly ported - but according to the buildbots it appears so); or, if it DOES happen with ABD only and without the changes from #3216, it's possibly related to the ABD changes.

@dweeezil since you now have access to some bigger "muscle" - did you observe anything similar in your tests ?

sempervictus commented 9 years ago

@kernelOfTruth: looks like abd_next is the culprit:

dm-uuid-CRYPT-LUKS1-...-sddc_crypt_unformatted  52.2G  69.4G  16.0E         -     0%   132%

This host is running abd_next on master with no other ARC-related patches. Time to ping @tuxoko on #2129, I suppose.

tuxoko commented 9 years ago

@sempervictus Uhm, if my reading serves me right, the 16.0E thing should be fixed by illumos 5701? Have you tried it on top of abd_next?

dweeezil commented 9 years ago

@kernelOfTruth I've not been doing anything with l2arc lately (working on #3115).

sempervictus commented 9 years ago

@tuxoko: as @pzwahlen reported, the addition of 5701 to @kernelOfTruth's stack did not mitigate the problem. 5701 fixed an issue introduced by the Illumos changes before it; this problem seems to have started with ABD when metadata moved into the buffers, and it occurs even without the relevant Illumos ports that required 5701 in the first place (if I'm understanding this correctly).

tuxoko commented 9 years ago

@sempervictus But the problem also happens on illumos, which doesn't have ABD in the first place.

sempervictus commented 9 years ago

The problem behavior doesn't occur with master as far as I can tell, and only applies to the newer revisions of #2129.

kernelOfTruth commented 9 years ago

@tuxoko recent changes in Illumos ( #3115 ) from the end of last year introduced a regression which got fixed by "Illumos 5701 zpool list reports incorrect "alloc" value for cache devices"

current ZFSonLinux master doesn't show this behavior, but master + ABD seems to show a similar behavior - if I understood the report correctly.

Since the upstream (Illumos) changes that led to this broken state / regression (according to the Illumos issue description) are not in ZFSonLinux master, the fix won't help in ABD's case.

@sempervictus I hope that's a correct summary

tuxoko commented 9 years ago

@sempervictus @kernelOfTruth ABD doesn't touch anything related to the vdev size stats, so I don't see how it could cause such bug. Also, in the first post by @pzwahlen https://github.com/zfsonlinux/zfs/issues/3400#issuecomment-101399683, he indicated that he saw l2_size grow over the disk size on 0.6.4.1. I assume that was on master?

kernelOfTruth commented 9 years ago

@tuxoko so that's probably a separate construction site - perhaps ABD stresses areas of ARC & L2ARC accounting that otherwise wouldn't be exercised? (If I remember correctly, @sempervictus pointed out that high I/O wait and/or other issues occurred without ABD, so he couldn't possibly hit this without the ABD improvements.)

but referring to @pzwahlen 's report it must have been there prior to ABD changes - and perhaps a regression introduced between 0.6.3 and 0.6.4.1

thanks for pointing that out

pzwahlen commented 9 years ago

I can confirm that in my case, L2ARC reported size goes way beyond the actual partition/disk size since 0.6.3 (I started using ZoL with this version).

0.6.4.1 with #3216 seems to improve performance a bit, but nothing changed on the reported size issue.

Also keep in mind I'm doing ZVOL only, I don't have a single "file" in my zfs datasets.

Thanks for looking into this, being able to use L2ARC would be really cool...

kernelOfTruth commented 9 years ago

referencing #3114 (which includes the potential fix in the top comment)

adding some more:

https://github.com/zfsonlinux/zfs/pull/1612 (superseded)
https://github.com/zfsonlinux/zfs/pull/1936 (superseded)
https://github.com/zfsonlinux/zfs/pull/1967 (superseded)
https://github.com/zfsonlinux/zfs/pull/2110 (merged) - Improve ARC hit rate with metadata heavy workloads #2110

https://github.com/zfsonlinux/zfs/pull/1622 (merged) - Illumos #3137 L2ARC compression #1622

https://github.com/zfsonlinux/zfs/pull/1522 (superseded), https://github.com/zfsonlinux/zfs/pull/1542 (merged) - Fix inaccurate arcstat_l2_hdr_size calculations #1542

https://github.com/zfsonlinux/zfs/issues/936 (closed, too many unknowns)

sempervictus commented 9 years ago

The issue referenced at the bottom of the FreeNAS Redmine seems to indicate that this is now resolved in their kernel. Will pull their gh repo later and see what that actually was (though I'm sure by then @kernelOfTruth will have ported, tested, and polished it).

kernelOfTruth commented 9 years ago

403

You are not authorized to access this page.

WTF ?!

I took the time and effort to create an account and take a look at the code - apparently it's not open source

@sempervictus could you, by chance, point me to the actual commit?

I'm sure I'm missing something blatantly obvious :question:

sempervictus commented 9 years ago

Right there with ya. Looks like the GH repo is not much help either - it's a build system, requiring that we build atop an existing FreeNAS... digging further, but yeah, "WTF?!" is about right.

kernelOfTruth commented 9 years ago

alright, let's go a few hierarchies up (since FreeBSD -> m0n0wall -> NanoBSD -> FreeNAS ; according to wikipedia)

http://lists.freebsd.org/pipermail/freebsd-bugs/2014-December/059376.html "[Bug 195746] New: zfs L2ARC wrong alloc/free size"

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195746

http://lists.freebsd.org/pipermail/svn-src-head/2014-December/065692.html

http://lists.freebsd.org/pipermail/svn-src-head/2014-November/065195.html https://svnweb.freebsd.org/base?view=revision&revision=273060

https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?r1=273060&r2=273059&pathrev=273060

<-- that's supposedly the actual fix

kernelOfTruth commented 9 years ago

@sempervictus @pzwahlen please take a look at #3216 , it contains that fix:

https://github.com/kernelOfTruth/zfs/commit/d7e1fd0a2b46f38815789ee7c9c425d0801a8d16

took me quite some time to track this down ^^

Let's see what the buildbots say,

now hopefully - with those 2 fixes - L2ARC works properly, and together with ABD + the preliminary changes included in #3216 it should offer a great deal of performance improvement

it's obviously not the end of the line since @tuxoko improved it further with abd_next (metadata support, large block support) and @dweeezil has discovered & fixed a few further mutex contention issues

Bright times ahead :+1:

tuxoko commented 9 years ago

@kernelOfTruth I don't see how that patch would fix this issue. Every other place calls vdev_space_update with b_asize (or a cumulative sum of them). Changing a single call to vdev_psize_to_asize(b_asize) would just make things more wrong than right.

Also, while not caused by the patch itself, the naming of the variables is completely screwed up. How is write_psize = vdev_psize_to_asize(write_asize) supposed to make sense?

kernelOfTruth commented 9 years ago

@tuxoko me neither XD

The thing is, this was mentioned in several of those mailing list threads and bug reports, and it seemingly helps with this issue - though the reports are not clear. It would be nice to have a working L2ARC anyway.

Let's do things one at a time: it appears that L2ARC never (?) actually worked under stressful conditions - having it working, even if only for a preliminary time, would be nice for stress-testing current master (if that's not prevented by high I/O wait, etc.) to have a baseline that can be compared to #2129 (ABD/next) and #3115 - at least on ZFS on Linux. Even if it needs to be fixed again afterwards, in the end it would lead to a cleaner (and more logical, thus easier to maintain) codebase and functioning code.

Also, we need a buildbot with an L2ARC for zfs/master that runs through lots of cycles of stress-testing to find e.g. these kinds of overflow and other related issues (high I/O wait?) - in perhaps 3 main configurations (ARC-to-L2ARC ratio).

I've noticed several variables whose naming is screwed up - this is supposed to be fixed (partly ?) by https://github.com/dweeezil/zfs/commit/7deafbb55237217a6d915dbe1eea5bc2f2abe0aa

5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme

Thanks

sempervictus commented 9 years ago

I have L2ARCs working just fine under SCST workloads using 0.6.3-based patch stacks, with ABD, and a few other minor changes. One of these systems has been running for >180 days now without breaking its caches (also a ZVOL/SCST export). So L2ARC works, just not recently.

I'll build another set off of #3216 and go from there. This bug is a bit of a PITA to reproduce, and not all systems have an L2ARC on them. If anyone else is testing this, consider doing a large send/recv once you attach your cache devs, as this helps with flooding the sucker (it still takes a few hours, but that's better than waiting days for it to happen of its own accord).
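In other words, something like this (pool, device and snapshot names are placeholders):

# attach the cache device first
zpool add tank cache /dev/disk/by-id/example-ssd-part2
# then flood it with a large stream from another box
ssh other-host zfs send -R pool/src@snap | zfs receive -F tank/dst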

Thanks @kernelOfTruth and @tuxoko for tracking this - I'd still love to know how and when we actually broke the L2ARCs in 0.6.4, or around the timeframe of that tag, since it's preventing me from deploying to any environment that doesn't have a ZFS-aware systems team (appliance-style deployments only work when it functions like an appliance, consistently).

pzwahlen commented 9 years ago

I just wanted to mention that testing with a small cache partition makes the issue appear much faster.

I now have a 4GB partition for cache on my SSD, and here are the arcstats after just 10 minutes of my 4-thread IOMeter run (l2_size is almost 9 GB):

[root@sanlab2 ~]# cat /proc/spl/kstat/zfs/arcstats | grep ^l2
l2_hits                         4    45977605
l2_misses                       4    48415752
l2_feeds                        4    54958
l2_rw_clash                     4    37
l2_read_bytes                   4    376204858368
l2_write_bytes                  4    691504770048
l2_writes_sent                  4    54891
l2_writes_done                  4    54891
l2_writes_error                 4    0
l2_writes_lock_retry            4    2196
l2_evict_lock_retry             4    85
l2_evict_reading                4    7
l2_evict_l1cached               4    1225217
l2_free_on_write                4    119994
l2_cdata_free_on_write          4    65
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    556832
l2_io_error                     4    8292
l2_size                         4    9051697664
l2_asize                        4    9028545024
l2_hdr_size                     4    4110311712
l2_compress_successes           4    153006
l2_compress_zeros               4    0
l2_compress_failures            4    0

I'll do my best to test with #3216

Cheers

pzwahlen commented 9 years ago

OK, #3216 with this one-liner doesn't seem to make a difference. My 4G cache has grown to 22G (l2_size) after about 30 minutes.

I'm running zfs-0.6.4-51_gd7e1fd0.

kernelOfTruth commented 9 years ago

@pzwahlen thanks for the report !

If there's a chance - could you also please test that change ("l2arc space accounting mismatch") against current master

WITH / WITHOUT L2ARC compression ?

like @tuxoko indicated, this was unlikely to fix this issue - I'll have to take a closer look at who reported it as fixed and under which circumstances

commit 3a17a7a99a1a6332d0999f9be68e2b8dc3933de1
Author: Saso Kiselkov skiselkov@gmail.com
Date: Thu Aug 1 13:02:10 2013 -0700

Illumos #3137 L2ARC compression

3137 L2ARC compression
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Matthew Ahrens mahrens@delphix.com
Approved by: Dan McDonald danmcd@nexenta.com

References: illumos/illumos-gate@aad02571bc59671aa3103bb070ae365f531b0b62 https://www.illumos.org/issues/3137 http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the compression of l2arc buffers regardless of how a dataset's compression property is set. This allows the legacy behavior to be preserved.

Ported by: James H james@kagisoft.co.uk
Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov
Closes #1379

Could anyone who is running into these problems run their L2ARC for the tests related to this issue WITHOUT compression?

Best practice (and effect) probably would be achieved by setting it at module load

modprobe zfs l2arc_nocompress=1

or via

echo 1 > /sys/module/zfs/parameters/l2arc_nocompress

This requires some in-depth testing of all eventualities to rule out that it is, for example, caused by the introduction of compression support.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195746#c1

Checksum and IO errors appear after an L2ARC device fills completely with cache data on any release of FreeBSD after L2ARC compression

Notice the lines

l2_cksum_bad                    4    556832
l2_io_error                     4    8292

posted by @pzwahlen, which also show checksum errors

It would be interesting to see whether these kinds of errors would also appear with e.g. LZJB

@sempervictus also reported a performance limitation regarding lz4 compression, which @behlendorf referenced a few days ago in a different issue and/or pull request

I currently can't find it, but at the back of my mind is the suspicion that the compression algorithm might be causing trouble - needs further investigation ... (but enough for today - at least for me)

pzwahlen commented 9 years ago

Disabling compression makes no difference, which matches what I already observed with 0.6.3 and 0.6.4.1 (but I never reported the tests I did with nocompress=1).

kernelOfTruth commented 9 years ago

currently waiting for the buildbots (would they even produce meaningful results related to L2ARC stuff @behlendorf ?)

@pzwahlen, @sempervictus and @jflandry: since this isn't widely tested, the patch suggested in #3433 would (at least for now) only be suitable for internal or "real testing" (non-production) test runs

perhaps @avg-I can shed some light on the issue since he's the one who created that pull-request over at FreeBSD

behlendorf commented 9 years ago

@kernelOfTruth currently the buildbots don't provide much l2arc test coverage, so I'm not sure how much they're going to reveal. As for this exact issue, I haven't had a chance to look into it, but it would be very helpful to know when it was introduced and how it can be reproduced.

kernelOfTruth commented 9 years ago

@behlendorf It would probably be helpful to introduce these kinds of tests into the buildbot in the future to be able to track similar issues down - I'm not entirely sure in what exact way, though.

Also, we need a buildbot with an L2ARC for zfs/master that runs through lots of cycles of stress-testing to find e.g. these kinds of overflow and other related issues (high I/O wait?) - in perhaps 3 main configurations (ARC-to-L2ARC ratio).

The question probably comes down to the following: whether the servers hosting the buildbots sit on top of SSDs (or whether other servers are needed), how to reserve those partitions, and how to make sure the needed load is generated even when the server is working on other things in parallel.

As to the reproducibility: still looking for some clear info from the mailing lists and bugtrackers, best bet may be on pzwahlen

Thanks

@pzwahlen you seem to be able to trigger this rather quickly - could you please post a sample scenario + the steps on how to reproduce this ?

From your postings - it seems to involve a rather small L2ARC partition, a not too large ARC cache and repeated transfers of rather huge files (or just huge amounts of files ?) compared to the size of ARC - if I understood correctly ?

I still suspect that these are two issues exhibiting similar symptoms (@sempervictus' systems, for example, ran fine with a 0.6.3-based patch stack) - but according to the numerous reports on e.g. the NetBSD, FreeNAS and FreeBSD mailing lists it might actually be the same issue ... :confused: (with the common denominator that comparatively massive I/O load is necessary)

If it's not too much trouble and it's possible, could you, @pzwahlen, please run a fairly recent master with L2ARC and compression disabled and post the stats/results here?

Thanks

sempervictus commented 9 years ago

As far as 0.6.3 builds go - we run 10G iSCSI on them, with some hosts running SSD-only LUNs on 10G Arista-backed iSCSI via SCST for >8 months with no downtime or noticeable regression. These hosts are running 0.6.3 from the repos. Whatever this is, it was either introduced in the interim, or exposed by the kmem changes and subsequent patches.

odoucet commented 9 years ago

I have 0.6.4.1 with a 1TB L2ARC (on SSD) - 256GB RAM / 128G ARC size - and did not trigger the bug (maybe not enough IO?). Is this a case that mixes a small ARC, a small L2ARC and lots of IO? I have a duplicate system (same hardware) to try to reproduce it, if that helps ...

kernelOfTruth commented 9 years ago

@pzwahlen Just to be sure and have a "control" state:

Does this performance degradation also occur WITHOUT an L2ARC device ?

@odoucet Thanks for your offer! Much appreciated - I'm still looking for a reproducer, albeit slowly (since life is taking its toll)

The one patch mentioned at https://svnweb.freebsd.org/base/head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c?sortby=file&r1=256889&r2=256888&pathrev=256889 and #3436 doesn't seem applicable to ZFSonLinux (different implementation)

pzwahlen commented 9 years ago

Sorry, life is demanding on this side, too ;-)

@sempervictus We started using ZFS with 0.6.3-1.1 (so we skipped 0.6.3). As soon as we started hitting our ZVOLs with real workloads (something like 40 VMs spread across 6 ZVOLs on 2 pools) we had the L2ARC size issue. We now see the same with 0.6.4.1, so in my case I've always had this problem.

@kernelOfTruth I tested #3216 using the buildbot commands for checking out master and fetching the patch, so I would say I was on master when doing my tests - am I wrong? To answer your last question, I don't see performance degradation without L2ARC.

I will do my best to document a replication scenario over the weekend. I would also like to find an fio test that matches my IOMeter test and run it locally on my ZVOLs, completely removing the network and SCST from the picture (a first attempt is sketched at the end of this comment). Another option would be to replicate on a VM using L2ARC over a virtual disk and exporting a ZVOL over iSCSI to the Windows iSCSI Initiator (with IOMeter running on the Windows iSCSI client). That would be easy to package and share.

Finally, could that be related to ESXi sending ASYNC writes only and me running with sync=standard ? I'm also on ashift=12 if that matters.

Cheers!
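In the meantime, here's a first stab at an fio command that approximates the IOMeter pattern (256K transfers, 80% random, 50/50 read/write, 16 outstanding IOs, 4 workers shared across two zvol-backed test disks - not an exact match for the IOMeter thread layout). The zvol paths are placeholders:

# approximate the IOMeter workload directly against the zvols, bypassing iSCSI/SCST
fio --name=zvoltest --ioengine=libaio --direct=1 \
    --bs=256k --rw=randrw --rwmixread=50 --percentage_random=80 \
    --iodepth=16 --numjobs=4 --time_based --runtime=8h --group_reporting \
    --filename=/dev/zvol/p02/testvol1:/dev/zvol/p02/testvol2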

pzwahlen commented 9 years ago

Finally had time to make some more tests.

I have applied #3451 using the following commands from a buildbot log: http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-builder/builds/2391/steps/git_1/logs/stdio
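(For anyone wanting to do the same by hand without the buildbot, something like this should be equivalent - a sketch:)

# GitHub serves every pull request as a mailbox-format patch that git am can apply
git clone https://github.com/zfsonlinux/zfs.git && cd zfs
curl -L https://github.com/zfsonlinux/zfs/pull/3451.patch | git am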

Things are definitely much better. With a small (4G) L2ARC partition that I could previously "overfill" in less than 10 minutes, I now couldn't get past 3G of used cache with my IOMeter test.

On a larger 60G partition, cache usage went up to 29G (after about 2 hours) and then stayed there.

I don't know if reaching half the cache size is normal with my workload or if it's the sign of another size calculation issue, though.

Performance doesn't seem to suffer in any way.

Are there other tests I could perform now that I have this running ?

Thx for the hard work!

kernelOfTruth commented 9 years ago

@pzwahlen Thanks for sharing your stats =)

If it's possible, I'd say leave it running (if there are automated tests or some scripts) over a longer period of days to stress it some more.
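Even a dumb logging loop like the following (path and interval arbitrary) would give us a multi-day record of the counters to compare against:

# append the L2ARC counters to a log every 5 minutes
while true; do
  { date; grep -E '^l2_' /proc/spl/kstat/zfs/arcstats; } >> /var/tmp/l2arc-stats.log
  sleep 300
done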

Indeed, it could be that this (seemingly correctly behaving L2ARC) still hides a bug in the calculation, or it may simply be saturated and fill up more over time.

Cheers

pzwahlen commented 9 years ago

Quick update: after a few Storage vMotions I have been able to bring the l2arc size to 81G out of my 128G partition. It means it can definitely grow past half the total size.

Another point is that during low activity times, l2arc size actually decreases, which is something I never saw before!

@sempervictus Do you have some additional feedback ?

kernelOfTruth commented 9 years ago

@pzwahlen did you observe any stalls, performance issues or other regression-like behavior ?

pzwahlen commented 9 years ago

@sempervictus not so far. However, I just applied #3451 over master (as far as I understand Git), and nothing related to #2129.

Now that I re-read this conversation, I even wonder whether it isn't #3433 that is supposed to fix this L2ARC behavior rather than #3451 (even though #3451 definitely changed things in the right direction).

Any input much appreciated. Cheers!

kernelOfTruth commented 9 years ago

@pzwahlen sorry, I should have updated the title of #3433 - done now

according to @avg-I it's not entirely correct and thus should be superseded by #3451 (and subsequent changes) , so you applied the correct patch

So the only change that applies to current zfs master is #3451 .

"5701 zpool list reports incorrect "alloc" value for cache devices" does not apply and will be relevant once #3115 has been merged.

Hope that clears things up

Thanks