openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

Wired memory grows from less than 3GB to more than 10GB over several days - is it ZFS? #213

Closed alanruttenberg closed 9 years ago

alanruttenberg commented 10 years ago

I'm not sure what the best info would be to assess this - let me know. For now I've included the output of zfs list, the output of sysctl zfs after several days, and the output of zdb. When I reboot I'll add another sysctl zfs file to compare. (Why doesn't github let me attach .txt files???)
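(For reference, the diagnostics below can be reproduced with commands along these lines - the exact invocations are my assumption:)

zfs list
sysctl zfs
zdb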


NAME                     USED  AVAIL  REFER  MOUNTPOINT
zfsroot                 3.89T   586G  1.32M  /Volumes/zfsroot
zfsroot/name            283G   586G   283G  -
zfsroot/nameImage       218G  21.8G   218G  /Volumes/nameImage
zfsroot/Commons          206G   707G  85.2G  -
zfsroot/TimeMachine      309G   617G   278G  -
zfsroot/books            168G   132G   168G  /Volumes/books
zfsroot/iPhoto           206G   714G  78.5G  -
zfsroot/iTunes           309G   697G   198G  -
zfsroot/media           1.26T   205G  1.26T  /Volumes/media
zfsroot/media2           986G   114G   986G  /Volumes/media2

-------

zfs.arc_max: 0
zfs.arc_min: 0
zfs.arc_meta_used: 1702334208
zfs.arc_meta_limit: 3221225472
zfs.l2arc_write_max: 8388608
zfs.l2arc_write_boost: 8388608
zfs.l2arc_headroom: 2
zfs.l2arc_feed_secs: 1
zfs.l2arc_feed_min_ms: 200
zfs.l2arc_noprefetch: 1
zfs.l2arc_feed_again: 1
zfs.l2arc_norw: 0
zfs.anon_size: 18432
zfs.anon_metadata_lsize: 0
zfs.anon_data_lsize: 0
zfs.mru_size: 1998799872
zfs.mru_metadata_lsize: 266654720
zfs.mru_data_lsize: 1604591616
zfs.mru_ghost_size: 2296164352
zfs.mru_ghost_metadata_lsize: 1076072448
zfs.mru_ghost_data_lsize: 1220091904
zfs.mfu_size: 1158606848
zfs.mfu_metadata_lsize: 155009536
zfs.mfu_data_lsize: 974876160
zfs.mfu_ghost_size: 1888160768
zfs.mfu_ghost_metadata_lsize: 259149824
zfs.mfu_ghost_data_lsize: 1629010944
zfs.l2c_only_size: 55236383232
zfs.vnop_osx_debug: 0
zfs.vnop_ignore_negatives: 0
zfs.vnop_ignore_positives: 0
zfs.vnop_create_negatives: 1
zfs.reclaim_list: 0
zfs.vnop_reclaim_throttle: 33280
zfs.vfs_suspend_fs_begin_delay: 2
zfs.vfs_suspend_fs_end_delay: 2
zfs.vnop_force_formd_normalized_output: 0


zfsroot:
    version: 5000
    name: 'zfsroot'
    state: 0
    txg: 5465379
    pool_guid: 11525599921928347233
    errata: 0
    hostid: 3660128869
    hostname: 'localhost'
    vdev_children: 4
    vdev_tree:
        type: 'root'
        id: 0
        guid: 11525599921928347233
        children[0]:
            type: 'mirror'
            id: 0
            guid: 11659360043872116845
            metaslab_array: 38
            metaslab_shift: 34
            ashift: 12
            asize: 2000384688128
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 12789527930152168946
                path: '/dev/disk6s1'
                phys_path: '/dev/diskid/DISK-Y3CN2DTGSp1'
                whole_disk: 1
                DTL: 87
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 9631193934027935819
                path: '/dev/disk3s1'
                phys_path: '/dev/diskid/DISK-Y3CN2E0GSp1'
                whole_disk: 1
                DTL: 85
                create_txg: 4
        children[1]:
            type: 'mirror'
            id: 1
            guid: 1750213957219736340
            metaslab_array: 36
            metaslab_shift: 33
            ashift: 12
            asize: 1500287336448
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 5592394544075990067
                path: '/dev/disk8s1'
                phys_path: '/dev/diskid/DISK-S2HEJX0C900305p1'
                whole_disk: 1
                DTL: 89
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 3468705898681880350
                path: '/dev/disk4s1'
                phys_path: '/dev/diskid/DISK-S2HEJX0CA00263p1'
                whole_disk: 1
                DTL: 83
                create_txg: 4
        children[2]:
            type: 'mirror'
            id: 2
            guid: 18401647215779754242
            metaslab_array: 34
            metaslab_shift: 33
            ashift: 12
            asize: 1500287336448
            is_log: 0
            create_txg: 4
            children[0]:
                type: 'disk'
                id: 0
                guid: 5069286759723677467
                path: '/dev/disk7s1'
                phys_path: '/dev/diskid/DISK-S2HEJ1DB302077p1'
                whole_disk: 1
                DTL: 88
                create_txg: 4
            children[1]:
                type: 'disk'
                id: 1
                guid: 12133104694213597855
                path: '/dev/disk5s1'
                phys_path: '/dev/diskid/DISK-S2HEJ1DB302082p1'
                whole_disk: 1
                DTL: 81
                create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 1166490624700717247
            path: '/dev/disk1s4'
            phys_path: '/dev/ada7p4'
            whole_disk: 0
            metaslab_array: 116
            metaslab_shift: 25
            ashift: 9
            asize: 5726797824
            is_log: 1
            DTL: 203
            create_txg: 56949
    features_for_read:
brendonhumphrey commented 10 years ago

Alan,

It's almost certainly the ZFS ARC. On a 32 GB machine with default settings it will grow to about 10GB and remain static from then on.

alanruttenberg commented 10 years ago

Thanks for the response Brendon. My machine is 16GB - what should I expect? Is the ARC memory shown as wired/kernel memory?

I left my machine for an hour after rebooting (forgot to capture this right after the reboot) and see that it is already using quite a lot of memory:

8.4GB wired; kernel_task: 8GB

Thanks, Alan

zfs.arc_max: 0
zfs.arc_min: 0
zfs.arc_meta_used: 1499092352
zfs.arc_meta_limit: 3221225472
zfs.l2arc_write_max: 8388608
zfs.l2arc_write_boost: 8388608
zfs.l2arc_headroom: 2
zfs.l2arc_feed_secs: 1
zfs.l2arc_feed_min_ms: 200
zfs.l2arc_noprefetch: 1
zfs.l2arc_feed_again: 1
zfs.l2arc_norw: 0
zfs.anon_size: 40960
zfs.anon_metadata_lsize: 0
zfs.anon_data_lsize: 0
zfs.mru_size: 3368943104
zfs.mru_metadata_lsize: 751133696
zfs.mru_data_lsize: 2504770048
zfs.mru_ghost_size: 937030656
zfs.mru_ghost_metadata_lsize: 1233920
zfs.mru_ghost_data_lsize: 935796736
zfs.mfu_size: 292008960
zfs.mfu_metadata_lsize: 68096
zfs.mfu_data_lsize: 291777024
zfs.mfu_ghost_size: 125116416
zfs.mfu_ghost_metadata_lsize: 0
zfs.mfu_ghost_data_lsize: 125116416
zfs.l2c_only_size: 11672340480
zfs.vnop_osx_debug: 0
zfs.vnop_ignore_negatives: 0
zfs.vnop_ignore_positives: 0
zfs.vnop_create_negatives: 1
zfs.reclaim_list: 0
zfs.vnop_reclaim_throttle: 33280
zfs.vfs_suspend_fs_begin_delay: 2
zfs.vfs_suspend_fs_end_delay: 2
zfs.vnop_force_formd_normalized_output: 0
brendonhumphrey commented 10 years ago

About 40% of memory, assuming you have not increased the zfs.arc_max sysctl or changed the code in arc.c. And yes, ARC appears as wired memory. However, not all wired memory is ARC memory.

Perhaps you could post the current values of the spl sysctls?
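For what it's worth, the ARC's own accounting of its current size and target can be read directly (a sketch, assuming your build exposes the full arcstats tree via sysctl, as the kstat.zfs.misc.arcstats.* entries later in this thread suggest):

# current ARC size and its maximum target, in bytes
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max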

lundman commented 10 years ago

Also, when using sysctl, don't forget the spl tree, it shows total memory in use by all ZFS.

# sysctl spl
spl.kmem_bytes_total: 29010314
spl.num_threads: 103
alanruttenberg commented 10 years ago

Mine says:

spl.kmem_bytes_total: 4359841962
spl.num_threads: 107

TBH, however, I am suspicious that either this figure is low or that there is a memory leak: my kernel wired memory is 8.7GB, and I seem to remember there being substantially less non-ZFS wired memory than the 4.6GB that is currently wired exclusive of ZFS.

Looks like this amount of wired memory is up 300M in the last couple of days.

-Alan

rottegift commented 10 years ago

zfs.l2c_only_size: 55 236 383 232
zfs.l2c_only_size: 11 672 340 480

Those are big. What do your "zpool list -v" and "zpool status -vD" say?

brendonhumphrey commented 10 years ago

Alan,

As I explained above, the allocator in the SPL is not very space efficient. It would not surprise me if the ~4 GB of allocations reported above resulted in ~8 GB of actual wired memory.

The only aspect of this discussion that concerns me is that you are apparently reporting growth in use beyond the expected 40-50% of main memory. This may imply a leak somewhere.

alanruttenberg commented 10 years ago

Here are the outputs of those commands:

den:~ alanr$ zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zfsroot  4.53T  3.50T  1.03T    77%  1.00x  ONLINE  -
  mirror  1.81T  1.27T   554G         -
    disk8      -      -      -         -
    disk3      -      -      -         -
  mirror  1.36T  1.10T   265G         -
    disk6      -      -      -         -
    disk5      -      -      -         -
  mirror  1.36T  1.13T   235G         -
    disk7      -      -      -         -
    disk4      -      -      -         -
  disk1s4  5.31G   288K  5.31G         -
cache      -      -      -      -      -      -
  disk1s3  54.7G  54.4G   287M         -
den:~ alanr$ zpool status -vD 
  pool: zfsroot
 state: ONLINE
  scan: scrub repaired 0 in 10h42m with 0 errors on Sat Jul 26 12:10:02 2014
config:

    NAME        STATE     READ WRITE CKSUM
    zfsroot     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        disk8   ONLINE       0     0     0
        disk3   ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        disk6   ONLINE       0     0     0
        disk5   ONLINE       0     0     0
      mirror-2  ONLINE       0     0     0
        disk7   ONLINE       0     0     0
        disk4   ONLINE       0     0     0
    logs
      disk1s4   ONLINE       0     0     0
    cache
      disk1s3   ONLINE       0     0     0

errors: No known data errors

 dedup: no DDT entries
alanruttenberg commented 10 years ago

Attached is memory usage I logged at 1-minute intervals over 4 days. Unfortunately it doesn't cover the whole period starting at boot, but hopefully it is informative. The first spike at 12 hours is the only event I made a note of - a copy of 38G to the pool. I'm not sure what the other stuff is, particularly the dip to 1 gig. Maybe when I deleted a pool? During that time I also restored a hard disk from a 250G DMG on one of the pools.

Perhaps this can give you some insight into what is going on with memory - whether it is as you would expect/desire.

[attachment: zfs memory usage chart]
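(In case it helps anyone reproduce this, a crude logging loop along the following lines would capture similar data; the file name and one-minute interval are arbitrary, and the PhysMem line format from top varies by OS X release:)

# log the wired-memory summary and total SPL allocations once a minute
while true; do
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    physmem=$(top -l 1 | grep PhysMem)
    spl=$(sysctl -n spl.kmem_bytes_total)
    echo "$ts | $physmem | spl.kmem_bytes_total=$spl"
    sleep 60
done >> zfs-memory.log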

brendonhumphrey commented 10 years ago

I think your key may be inverted. Having said that, it looks nominal to me.

alanruttenberg commented 10 years ago

Yes, the key is inverted - oops [fixed]. Under what circumstances does a GC occur? Can it be triggered with some command-line incantation? In any case, I agree with Lundman's comment in issue 215. This is an unfortunate amount of wired memory sitting unused, and it would be nice if peak usage didn't go as high as 11.5 GB on a 16GB machine. I'm pretty sure the protracted period of higher memory usage was during the restore from the DMG. In that case I was restoring to FireWire, so the peak write rate was 30M/s or so and in many cases under 10M/s. Peak read rates in my setup are something like 180M/s. I don't think my machine needs that much buffer to keep reads ahead of writes. Having ZFS recognize that and not use all that space "would be nice". I'll try to get to uploading a fixed image, because the oops bugs me...

brendonhumphrey commented 10 years ago

There is no way to manually trigger a GC. The GC behavior is internal to bmalloc, and essentially results in all memory released by ZFS being returned to the OS ~120 seconds later, provided there is no further request to allocate that memory.

Alan, in your case I recommend that you simply reduce the size of the ARC. You can do it in code (zfs/module/zfs/arc.c - see https://github.com/openzfsonosx/zfs/commit/91dc3f5df0eb4644c24021f9f2c8e35807e901b6 for an example), or alternatively alter the zfs.arc_max sysctl to taste after the zfs and spl kexts are loaded.
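A minimal sketch of the sysctl route, assuming the tunable is writable once the kexts are loaded (the 4 GiB value is only an example, and the setting does not persist across reboots):

# cap the ARC at 4 GiB (value in bytes); run after the spl/zfs kexts are loaded
sudo sysctl -w zfs.arc_max=4294967296
# verify
sysctl zfs.arc_max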

I do intend to revisit some aspects of bmalloc in the hope of reducing its overhead, but this is not something I will be releasing without extensive testing, as failures in memory handling could easily screw up someone's data.

ilovezfs commented 10 years ago

The inefficiency is certainly not ideal. That said, if the allocations were indeed more efficient, there is nothing about 11.5 GB/16 GB that is in itself concerning. The default on illumos is actually up to 75% of memory:

    /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
    arc_c_min = MAX(arc_c / 4, 64<<20);
    /* set max to 3/4 of all memory, or all but 1GB, whichever is more */
    if (arc_c * 8 >= 1<<30)
        arc_c_max = (arc_c * 8) - (1<<30);
    else
        arc_c_max = arc_c_min;
    arc_c_max = MAX(arc_c * 6, arc_c_max);

https://github.com/illumos/illumos-gate/blob/63e911b6fce0acc8e2a1d31ebdaf0c4c12580a14/usr/src/uts/common/fs/zfs/arc.c#L3742-L3749
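To make that concrete, here is the same arithmetic worked through for a hypothetical 16 GB machine, assuming arc_c has already been set to 1/8 of physical memory at this point (which the "1/32 of all memory" comment implies); under these defaults the ARC could grow to roughly 15 GB:

# shell arithmetic mirroring the illumos snippet above, 16 GiB of RAM assumed
allmem=17179869184                     # 16 GiB, in bytes
arc_c=$((allmem / 8))                  # 2147483648  (2 GiB)
arc_c_min=$((arc_c / 4))               # 536870912   (512 MiB, already > 64 MiB)
arc_c_max=$((arc_c * 8 - (1 << 30)))   # 16106127360 (all but 1 GiB)
# arc_c * 6 = 12884901888 (12 GiB) is smaller, so the MAX() keeps ~15 GiB
echo "$arc_c_max"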

emory commented 10 years ago

As another data point, I see on average ~14-15GB of memory wired for kernel_task on a system (10.9.5) with 32GB of memory. Upon a fresh boot it will be about 1GB, and then after doing anything substantial with the pool (Arq running a backup job, enabling Spotlight on a dataset, etc.) it quickly escalates.

I'm interested in this because I have had instances where it seems like Arq or ZFS (or both) may get into a race to deplete memory, prompting the "Force quit, you're out of memory" message, but I don't have anything beyond anecdotes and observations from the time, and I haven't been able to chase anything down.

Presently after I enabled spotlight my sysctl variables look like:

kstat.spl.misc.bmalloc.apps_allocated: 13113991877
kstat.spl.misc.bmalloc.bmalloc_allocated: 13786657929
kstat.spl.misc.bmalloc.space_efficiency_percent: 95
kstat.spl.misc.bmalloc.active_threads: 110
kstat.spl.misc.bmalloc.pressure_thr_wakes: 1080615
kstat.spl.misc.bmalloc.pressure_pages_wanted: 874
kstat.spl.misc.bmalloc.pressure_pages_released: 903
kstat.spl.misc.bmalloc.gc_wake_count: 12236
kstat.spl.misc.bmalloc.gc_pages_released: 0

&&

hw.physmem = 2147483648
hw.usermem = 3344109568
hw.memsize = 34359738368
kern.memorystatus_vm_pressure_level: 1
kern.memorystatus_purge_on_warning: 2
kern.memorystatus_purge_on_urgent: 5
kern.memorystatus_purge_on_critical: 8
vm.memory_pressure: 0
hw.memsize: 34359738368
machdep.memmap.Conventional: 34319855616
machdep.memmap.RuntimeServices: 0
machdep.memmap.ACPIReclaim: 20480
machdep.memmap.ACPINVS: 2895872
machdep.memmap.PalCode: 0
machdep.memmap.Reserved: 95465472
machdep.memmap.Unusable: 0
machdep.memmap.Other: 0
audit.session.member_set_sflags_mask: 0
audit.session.member_clear_sflags_mask: 16384
kstat.zfs.misc.dmu_tx.dmu_tx_memory_reserve: 0
kstat.zfs.misc.dmu_tx.dmu_tx_memory_reclaim: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 0
kstat.zfs.misc.arcstats.memory_direct_count: 0
kstat.zfs.misc.arcstats.memory_indirect_count: 0
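For reference, the bmalloc kstats quoted above seem to make the allocator overhead easy to derive (my assumption about their meaning: apps_allocated is what ZFS asked for, bmalloc_allocated is what the allocator is actually holding):

# overhead calculation from the values in this comment
apps=13113991877
bmalloc=13786657929
echo $(( bmalloc - apps ))           # ~673 MB held beyond what was requested
echo $(( apps * 100 / bmalloc ))     # 95, matching space_efficiency_percent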
emory commented 10 years ago

Two days of uptime post-Yosemite update, same system with 32GB of memory. I recently did some shuffling of 400GB of data across this mirror (via mv between filesystems to clear copies=2 status after changing it to copies=1, fwiw) — no ZIL devices, no cache devices, just a two-disk pool.

kstat.spl.misc.bmalloc.apps_allocated: 14168822968
kstat.spl.misc.bmalloc.bmalloc_allocated: 16501253257
kstat.spl.misc.bmalloc.space_efficiency_percent: 85

This time I opted to spam my kstat.zfs* values into a gist: https://gist.github.com/emory/2e472eba7526dbed9262#file-gistfile1-txt

Question

When the pool heats up with activity it negatively impacts performance for the user at the console — UIs become less responsive and mouse movement is choppy. These are dedicated devices used as whole disks, and I have them on two different controllers (one external via eSATA, one internal via SATA, both identical disks). If I think about it, I'd rather have slower file operations on the pool than have everything else slow down, with windows moving irregularly while I'm trying to work.

My use case is that my Photography and Home Videos live on that pool — I use Lightroom and keep my catalogs on my Fusion Drive HFS+ volume, which is separate from a dedicated SSD boot volume. I think I'm doing everything right in terms of optimizing use (catalogs and indexes on HFS+ Fusion Drives for performance, and the referenced images all on my zpool for reliability and integrity), so it isn't my intent to hijack someone else's issue, but I think I'd prefer artificially capping resource utilization if it means the rest of my activities aren't impacted (obviously file operations will be impacted if they use that volume, but I think that's what people expect already?)

In my particular circumstance is everything working as designed and should I consider a cache or ZIL in spite of my somewhat modest needs? I'm using 2x2TB physical volumes without dedupe but I do use compression.
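(Not an answer to whether you need them, but for reference: cache and log vdevs can be added to and removed from an existing pool without rebuilding it. The pool and device names below are placeholders; check diskutil list first.)

# hypothetical pool "tank" and devices; adjust to your setup
sudo zpool add tank cache /dev/disk9s2   # L2ARC read cache
sudo zpool add tank log /dev/disk9s1     # SLOG for synchronous writes
sudo zpool remove tank /dev/disk9s2      # cache/log vdevs can be removed again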

ilovezfs commented 10 years ago

When was the last time you updated your source trees and upgraded spl and zfs? The memory model was just totally overhauled two days ago, and this issue should probably be closed since it was about the old memory model.
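For anyone following along, my assumption of the usual rebuild flow at the time (spl first, then zfs; the paths are placeholders, and the zfs configure step may need to be pointed at the spl tree on your setup):

# update and rebuild both kexts, then reload them or reboot
cd ~/src/spl && git pull && ./autogen.sh && ./configure && make && sudo make install
cd ~/src/zfs && git pull && ./autogen.sh && ./configure && make && sudo make install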

emory commented 10 years ago

It's been a couple of weeks; I rebuilt binaries last night but haven't tested or activated. I'll be sure to prioritize that!

alanruttenberg commented 10 years ago

FWIW I'm using trunk as of today and I'm also experiencing the choppy, delayed mouse and reduced responsiveness, though not always - seemingly only after some extended use. Memory, at least as reported by iStat, doesn't seem too problematic. I've tried limiting arc_max to between 3 and 6 GB and still have this happen.

Another anomaly is that it takes a longish time for my Time Machine ZVOL to mount. It's a 500G quota, 325G full. It got so bad that I ended up cloning to a new ZVOL, and while it is better now, it still has a delay mounting.

emory commented 10 years ago

I haven't noticed any delays in mounting personally; my openzfsosx workstation only has a 2TB mirror that is at about 70% capacity, but I don't have a TM volume on it — maybe Time Machine is the cause of the delay in your case?

I rebuilt my binaries and installed two days ago. Heavy utilization still results in some chop, but it isn't as severe as it used to be. Memory utilization via kernel_task isn't nearly as exaggerated, but holy hell, File Cache seems excessive, or maybe it's just reporting accurately now?

Right now memory_pressure is super low, but I'm using 31.46GB of memory and 174MB of swap: 5GB wired, 12.78GB of app memory, 220MB compressed, and, uh, 13.7GB of File Cache.

What really makes my machine get weird is doing something that traverses several hundred directories and looks for changed files — a common occurrence for Lightroom when checking for metadata updates. FWIW my workflow is not Adobe DNG but regular RAW files with sidecars, so each image has an accompanying file with it. In my most usual workflow it skims 1335 directories and checks about 90k files ranging in size from 400KB to 50MB, with some outliers that are much larger (exported TIFFs that stay in-library can be a few hundred MB).

brendonhumphrey commented 9 years ago

Current code addresses this issue. It is believed that the problem was that the allocator we built into the SPL was driving the OS's underlying allocator in such a way as to cause the stuttering/choking behaviours discussed above.