Can you show the output of free -m just after ARC shrinks?
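For reference, output in the form shown below comes from a simple timestamped capture loop; the exact command is not shown in the thread, but a sketch would be:

while true; do
  date        # timestamp line, as in the output below
  free -m     # memory summary in MiB
  sleep 2     # matches the ~2-second sample interval seen below
done >> free.log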
OK. Happened once again with almost no load:
ARCSTAT
=======
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
22:18:59 50 3 7 0 0 3 100 0 0 48G 48G
22:19:01 96 9 9 0 0 9 86 4 5 48G 48G
22:19:03 7.6K 5 0 0 0 5 83 0 0 48G 48G
22:19:05 70 5 7 0 0 5 90 0 0 48G 48G
22:19:07 236 1 0 1 0 0 50 0 0 48G 48G
22:19:09 35K 16 0 7 0 9 100 6 0 8.7G 48G <====
22:19:11 117 8 6 7 6 0 100 0 0 8.7G 48G
22:19:13 127 12 9 6 4 6 100 4 5 8.7G 48G
22:19:15 57 18 31 9 18 9 100 4 19 8.7G 48G
VMSTAT
======
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
r b swpd free inact active si so bi bo in cs us sy id wa st CEST
12 0 0 37439392 1547168 3295512 0 0 5120 12450 11416 28559 9 6 84 1 0 2020-05-07 22:19:03
0 0 0 37426532 1547172 3295280 0 0 5120 0 9392 29669 13 4 83 0 0 2020-05-07 22:19:05
10 1 0 37435864 1547176 3295284 0 0 1114 30896 11278 25410 5 9 85 0 0 2020-05-07 22:19:07
1 0 0 78722512 1547180 3295372 0 0 8224 4152 14119 34663 9 35 55 1 0 2020-05-07 22:19:09
6 0 0 78725712 1547180 3295172 0 0 5552 4908 9193 22802 3 4 92 1 0 2020-05-07 22:19:11
0 1 0 78754912 1547184 3295208 0 0 7486 27696 11428 29522 10 11 78 1 0 2020-05-07 22:19:13
FREE -m
=======
Thu May 7 22:18:53 CEST 2020
total used free shared buff/cache available
Mem: 128894 89843 36611 24 2440 33903
Swap: 7627 0 7627
Thu May 7 22:18:55 CEST 2020
total used free shared buff/cache available
Mem: 128894 89849 36604 24 2440 33897
Swap: 7627 0 7627
Thu May 7 22:18:57 CEST 2020
total used free shared buff/cache available
Mem: 128894 89871 36582 24 2440 33875
Swap: 7627 0 7627
Thu May 7 22:18:59 CEST 2020
total used free shared buff/cache available
Mem: 128894 89873 36580 24 2440 33873
Swap: 7627 0 7627
Thu May 7 22:19:01 CEST 2020
total used free shared buff/cache available
Mem: 128894 89876 36577 24 2440 33870
Swap: 7627 0 7627
Thu May 7 22:19:03 CEST 2020
total used free shared buff/cache available
Mem: 128894 89896 36558 24 2440 33851
Swap: 7627 0 7627
Thu May 7 22:19:05 CEST 2020
total used free shared buff/cache available
Mem: 128894 89905 36548 24 2440 33841
Swap: 7627 0 7627
Thu May 7 22:19:07 CEST 2020
total used free shared buff/cache available
Mem: 128894 89059 37396 24 2438 34687
Swap: 7627 0 7627
Thu May 7 22:19:09 CEST 2020
total used free shared buff/cache available
Mem: 128894 49681 76876 24 2336 74110
Swap: 7627 0 7627
Thu May 7 22:19:11 CEST 2020
total used free shared buff/cache available
Mem: 128894 49676 76880 24 2336 74114
Swap: 7627 0 7627
Thu May 7 22:19:13 CEST 2020
total used free shared buff/cache available
Mem: 128894 49649 76908 24 2336 74142
Swap: 7627 0 7627
Thu May 7 22:19:15 CEST 2020
total used free shared buff/cache available
Mem: 128894 49668 76889 24 2336 74123
Swap: 7627 0 7627
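For context, the samples above have the shape of the standard tools run at a 2-second interval; presumably invocations along these lines were used (an assumption, as the exact commands are not shown):

arcstat 2       # default columns: time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
vmstat -a -t 2  # -a adds the active/inactive columns, -t appends a timestamp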
Ok, can you show the output of smem -wkt, slabtop, cat /proc/meminfo and cat /proc/spl/kstat/zfs/arcstats (again, before and after ARC shrinks)?
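A small collection loop along these lines would capture matching before/after snapshots (a sketch; the file naming is illustrative):

while true; do
  ts=$(date +%H:%M:%S)
  smem -wkt                        > smem.$ts      # system-wide memory view, human units, totals
  slabtop -o                       > slabtop.$ts   # one-shot slab listing instead of interactive mode
  cat /proc/meminfo                > meminfo.$ts
  cat /proc/spl/kstat/zfs/arcstats > arcstats.$ts
  sleep 2
done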
Does the shrinking happen at a specific time? Can you check whether some cron script is running just before the ARC shrinks?
It does not happen at a specific time. The ARC grows, reaches (or almost reaches) zfs_arc_max, then drops, and then the cycle starts again. There are no cron jobs running that cause high I/O; you can see that from the VMSTAT output (columns "bi" and "bo").
Edit: Managed to reproduce it. Log files for download here: https://88.99.150.223/trace.tgz
Timestamps in the logs (+/- a few seconds; the usual sample interval is 2 seconds):
13:30:41 6.9K 367 5 3 0 364 57 52 0 47G 48G
13:30:43 3.7K 326 8 0 0 325 59 0 0 43G 48G <===
13:30:45 2.4K 276 11 13 0 263 79 19 1 15G 48G <===
13:30:48 5.1K 418 8 30 0 387 98 107 2 14G 48G
I have also included a "zpool iostat" for you, albeit without timestamps. But you can see that in general there is not much going on. At that time a "zfs send" was running with an average of 3 MB/s read, but that is just a coincidence.
Thanks for the detailed logs. From the spl-arcstat.txt it seems that you are constantly running above the metadata limit, causing frequent prune calls. For example, let's compare some stats between 13:30:42 and 13:30:46 (just before and after the ARC shrinks):
13:30:42
arc_prune 4 424606
arc_meta_used 4 17241616352
arc_meta_limit 4 8589934592
arc_dnode_limit 4 858993459
arc_meta_max 4 17493991168
13:30:46
arc_prune 4 476472
arc_meta_used 4 15572712200
arc_meta_limit 4 8589934592
arc_dnode_limit 4 858993459
arc_meta_max 4 17493991168
difference
arc_prune 4 476472 - 424606 = 51866 (prune thread invocations)
arc_meta_used 4 15572712200 - 17241616352 = -1668904152 (~1.6 GB)
arc_meta_limit 4 8589934592 - 8589934592 = 0
arc_dnode_limit 4 858993459 - 858993459 = 0
arc_meta_max 4 17493991168 - 17493991168 = 0
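Deltas like these can be computed mechanically from two saved copies of /proc/spl/kstat/zfs/arcstats (a sketch; arcstats.before and arcstats.after are hypothetical snapshot files):

awk 'FNR <= 2 { next }                    # skip the two kstat header lines
     NR == FNR { before[$1] = $3; next }  # first file: remember each value
     $3 != before[$1] {                   # second file: print what changed
       printf "%-20s %15d -> %15d (%+d)\n", $1, before[$1], $3, $3 - before[$1]
     }' arcstats.before arcstats.after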
It is my understanding that a large metadata prune can unpin a lot of data memory, leading to a sudden ARC shrink. I think the real question is why, and what consumes so much metadata memory (bringing it even above ZFS's self-imposed limit).
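To check at a glance whether a system is currently over the metadata limit, the relevant counters can be pulled directly (a sketch):

grep -E '^(arc_prune|arc_meta_used|arc_meta_limit|arc_dnode_limit|dnode_size)' /proc/spl/kstat/zfs/arcstats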
Can you describe your pool layout, datasets (and their properties, especially "primarycache"), workload, etc?
Usage is nothing fancy. Two pools, basic usage (small home NAS) for a few clients and some VMs. No heavy use. But the metadata size is strange.
primarycache is at its default (="all") for all file systems. No change here.
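One quick way to confirm that no dataset overrides primarycache (a sketch; any override would show a SOURCE other than "default"):

zfs get -r -t filesystem,volume primarycache pool1 pool4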
root@homenas:~# zpool get all
NAME PROPERTY VALUE SOURCE
pool1 size 3.53T -
pool1 capacity 21% -
pool1 altroot - default
pool1 health ONLINE -
pool1 guid 1658318829284017992 -
pool1 version - default
pool1 bootfs - default
pool1 delegation on default
pool1 autoreplace off default
pool1 cachefile - default
pool1 failmode wait default
pool1 listsnapshots off default
pool1 autoexpand off default
pool1 dedupditto 0 default
pool1 dedupratio 1.00x -
pool1 free 2.77T -
pool1 allocated 780G -
pool1 readonly off -
pool1 ashift 12 local
pool1 comment - default
pool1 expandsize - -
pool1 freeing 0 -
pool1 fragmentation 31% -
pool1 leaked 0 -
pool1 multihost off default
pool1 checkpoint - -
pool1 load_guid 14323955545827788521 -
pool1 autotrim off default
pool1 feature@async_destroy enabled local
pool1 feature@empty_bpobj active local
pool1 feature@lz4_compress active local
pool1 feature@multi_vdev_crash_dump enabled local
pool1 feature@spacemap_histogram active local
pool1 feature@enabled_txg active local
pool1 feature@hole_birth active local
pool1 feature@extensible_dataset active local
pool1 feature@embedded_data active local
pool1 feature@bookmarks enabled local
pool1 feature@filesystem_limits enabled local
pool1 feature@large_blocks active local
pool1 feature@large_dnode enabled local
pool1 feature@sha512 enabled local
pool1 feature@skein active local
pool1 feature@edonr enabled local
pool1 feature@userobj_accounting active local
pool1 feature@encryption active local
pool1 feature@project_quota active local
pool1 feature@device_removal enabled local
pool1 feature@obsolete_counts enabled local
pool1 feature@zpool_checkpoint enabled local
pool1 feature@spacemap_v2 active local
pool1 feature@allocation_classes enabled local
pool1 feature@resilver_defer enabled local
pool1 feature@bookmark_v2 enabled local
****************************
pool4 size 118T -
pool4 capacity 38% -
pool4 altroot - default
pool4 health ONLINE -
pool4 guid 14100976841961686166 -
pool4 version - default
pool4 bootfs - default
pool4 delegation on default
pool4 autoreplace off default
pool4 cachefile - default
pool4 failmode wait default
pool4 listsnapshots off default
pool4 autoexpand off default
pool4 dedupditto 0 default
pool4 dedupratio 1.00x -
pool4 free 72.1T -
pool4 allocated 45.9T -
pool4 readonly off -
pool4 ashift 12 local
pool4 comment - default
pool4 expandsize - -
pool4 freeing 0 -
pool4 fragmentation 6% -
pool4 leaked 0 -
pool4 multihost off default
pool4 checkpoint - -
pool4 load_guid 10042272303506965039 -
pool4 autotrim off default
pool4 feature@async_destroy enabled local
pool4 feature@empty_bpobj active local
pool4 feature@lz4_compress active local
pool4 feature@multi_vdev_crash_dump enabled local
pool4 feature@spacemap_histogram active local
pool4 feature@enabled_txg active local
pool4 feature@hole_birth active local
pool4 feature@extensible_dataset active local
pool4 feature@embedded_data active local
pool4 feature@bookmarks enabled local
pool4 feature@filesystem_limits enabled local
pool4 feature@large_blocks active local
pool4 feature@large_dnode enabled local
pool4 feature@sha512 active local
pool4 feature@skein active local
pool4 feature@edonr enabled local
pool4 feature@userobj_accounting active local
pool4 feature@encryption active local
pool4 feature@project_quota active local
pool4 feature@device_removal enabled local
pool4 feature@obsolete_counts enabled local
pool4 feature@zpool_checkpoint enabled local
pool4 feature@spacemap_v2 active local
pool4 feature@allocation_classes enabled local
pool4 feature@resilver_defer enabled local
pool4 feature@bookmark_v2 enabled local
root@homenas:~# zpool status
pool: pool1
state: ONLINE
scan: scrub repaired 0B in 0 days 04:57:17 with 0 errors on Sun May 10 05:21:19 2020
config:
NAME STATE READ WRITE CKSUM
pool1 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdl3 ONLINE 0 0 0
sdm3 ONLINE 0 0 0
sdj3 ONLINE 0 0 0
errors: No known data errors
pool: pool4
state: ONLINE
scan: scrub in progress since Sun May 10 00:24:09 2020
17.5T scanned at 290M/s, 17.5T issued at 289M/s, 45.9T total
0B repaired, 38.15% done, 1 days 04:33:21 to go
config:
NAME STATE READ WRITE CKSUM
pool4 ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdn ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
sdk ONLINE 0 0 0
sdo ONLINE 0 0 0
sdf ONLINE 0 0 0
sdp ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdb ONLINE 0 0 0
sdg ONLINE 0 0 0
sda ONLINE 0 0 0
Full details of all file systems here: https://88.99.150.223/zfs.tgz
I have increased the metadata limit to 16 GB and the problem is gone. I no longer see ARC evictions down to < 10 GB. But I'd still like to know what is consuming so much metadata. Dedup is off and has never been enabled.
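For reference, the limit can be raised at runtime and persisted across reboots roughly like this (a sketch; 17179869184 bytes = 16 GiB, matching the value mentioned above):

echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_meta_limit                # runtime change
echo 'options zfs zfs_arc_meta_limit=17179869184' >> /etc/modprobe.d/zfs.conf   # persist across reboots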
If I read the data you provided correctly, you have over 5000 snapshots and a very large number of relatively small files (i.e. your dnode_size is over the arc_dnode_limit, meaning many inodes are cached). This can explain the very high metadata size and the sudden ARC shrink when ZFS tries hard to reclaim some memory.
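Both numbers are easy to verify (a sketch): count the snapshots across all pools and compare the cached-dnode footprint against its limit:

zfs list -H -t snapshot -o name | wc -l   # total snapshot count
awk '$1 == "dnode_size" || $1 == "arc_dnode_limit" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats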
I suggest continuing the discussion on the mailing list and closing this issue.
System information
Describe the problem you're observing
After updating ZFS from 0.8.1 to 0.8.3 I see frequent ARC shrinks from zfs_arc_max (51 GB) to far below zfs_arc_min (50 GB), without any load on the system and with plenty of free memory.
The system has 128 GB of memory and less than half of it is used.
The following ZFS parameters are set:
zfs_arc_max = 51539607552
zfs_arc_min = 50539607552
zfs_arc_meta_limit = 8589934592
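Parameters like these are typically set as module options, e.g. in /etc/modprobe.d/zfs.conf (a sketch of the usual mechanism; the exact method used here is not shown):

options zfs zfs_arc_max=51539607552 zfs_arc_min=50539607552 zfs_arc_meta_limit=8589934592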
Describe how to reproduce the problem
Just run the system.
I sampled the arcstats, along with vmstat -a for memory stats, while the problem was happening; the VMSTAT output shown above was captured at the same time.