openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Frequent ARC shrinks after update to 0.8.3 #10298

ronnyegner closed this issue 4 years ago

ronnyegner commented 4 years ago

System information

Type                  Version/Name
Distribution Name     Ubuntu Server
Distribution Version  18.04 LTS
Linux Kernel          4.14.91-041491-generic
Architecture          x86_64
ZFS Version           v0.8.3-1
SPL Version           v0.8.3-1

Describe the problem you're observing

After updating ZFS from 0.8.1 to 0.8.3 I see frequent ARC shrinks from zfs_arc_max (51 GB) down to far below zfs_arc_min (50 GB), without any load on the system and with plenty of free memory.

The system has 128 GB of memory and less than half of it is used.

The following ZFS parameters are set:

zfs_arc_max        = 51539607552
zfs_arc_min        = 50539607552
zfs_arc_meta_limit = 8589934592
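
These were set as ZFS module parameters. As a quick sanity check, this is one way to read back the values actually in effect (a minimal sketch using the standard Linux ZFS parameter and kstat interfaces; nothing here is specific to this report):

# module parameters as currently loaded
cat /sys/module/zfs/parameters/zfs_arc_max \
    /sys/module/zfs/parameters/zfs_arc_min \
    /sys/module/zfs/parameters/zfs_arc_meta_limit

# the ARC's own view: c_max/c_min are the effective target sizes
grep -E '^(c_max|c_min|arc_meta_limit) ' /proc/spl/kstat/zfs/arcstats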

Describe how to reproduce the problem

Just run the system.

I sampled the arcstats along with vmstat -a for memory statistics while the problem was happening (a minimal sketch of such a sampling loop follows the output below):

    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
13:31:39  2.4K   337     14     1    0   336   98     0    0    48G   48G  
13:31:41  2.0K   273     13     0    0   272   98     0    0    48G   48G  
13:31:43  1.5K   247     16    13    1   234  100     5    0    10G   48G    <== here
13:31:45  1.7K   264     15     6    0   258  100     3    0    10G   48G  
13:31:47  1.8K   240     13     8    0   232  100     3    0    10G   48G  

VMSTAT at the same time:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st                CEST
 2  1      0 38849668 1533192 2799920    0    0 54700 100059 12972 43307  0 12 85  3  0 2020-05-07 13:31:33
 5  1      0 38436228 1533192 2799920    0    0 97168 142081 15443 52283  0 15 83  2  0 2020-05-07 13:31:35
 2  1      0 38192972 1533192 2799920    0    0 57572 92432 12617 40545  0 10 86  4  0 2020-05-07 13:31:37
 4  1      0 37906792 1533192 2799904    0    0 65442 152210 15057 51199  0 14 84  1  0 2020-05-07 13:31:39
 2  1      0 37693004 1533192 2799936    0    0 47748 92038 13909 42301  0 21 76  3  0 2020-05-07 13:31:41
 6  2      0 77466432 1533192 2800316    0    0 56084 56340 14867 39276  0 43 55  1  0 2020-05-07 13:31:43 <<== SAME TIME AS ABOVE
 1  1      0 77297528 1555448 2800352    0    0 67594 145510 16544 48996  1 17 81  2  0 2020-05-07 13:31:45
 5  1      0 77650968 1533116 2800332    0    0 44224 82849 13932 39277  0 11 84  5  0 2020-05-07 13:31:47
 4  0      0 77508592 1533120 2800344    0    0 67826 64636 12410 41300  0 10 86  3  0 2020-05-07 13:31:49
 1  1      0 77300688 1533120 2800344    0    0 52062 156014 16078 53211  0 14 83  3  0 2020-05-07 13:31:51
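
For reference, a minimal sketch of the kind of sampling loop that produces output like the above (assuming the arcstat script shipped with ZFS, installed as arcstat or arcstat.py depending on packaging; the 2-second interval matches the samples above, the log file names are illustrative):

# sample ARC and memory statistics in parallel, every 2 seconds
arcstat 2 > arcstat.log &
vmstat -a -t 2 > vmstat.log &
wait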
shodanshok commented 4 years ago

Can you show the output of free -m just after ARC shrinks?

ronnyegner commented 4 years ago

OK. Happened once again with almost no load:

ARCSTAT
=======
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
22:18:59    50     3      7     0    0     3  100     0    0    48G   48G  
22:19:01    96     9      9     0    0     9   86     4    5    48G   48G  
22:19:03  7.6K     5      0     0    0     5   83     0    0    48G   48G  
22:19:05    70     5      7     0    0     5   90     0    0    48G   48G  
22:19:07   236     1      0     1    0     0   50     0    0    48G   48G  
22:19:09   35K    16      0     7    0     9  100     6    0   8.7G   48G   <====
22:19:11   117     8      6     7    6     0  100     0    0   8.7G   48G  
22:19:13   127    12      9     6    4     6  100     4    5   8.7G   48G  
22:19:15    57    18     31     9   18     9  100     4   19   8.7G   48G  

VMSTAT
======
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st                CEST
12  0      0 37439392 1547168 3295512    0    0  5120 12450 11416 28559  9  6 84  1  0 2020-05-07 22:19:03
 0  0      0 37426532 1547172 3295280    0    0  5120     0 9392 29669 13  4 83  0  0 2020-05-07 22:19:05
10  1      0 37435864 1547176 3295284    0    0  1114 30896 11278 25410  5  9 85  0  0 2020-05-07 22:19:07
 1  0      0 78722512 1547180 3295372    0    0  8224  4152 14119 34663  9 35 55  1  0 2020-05-07 22:19:09
 6  0      0 78725712 1547180 3295172    0    0  5552  4908 9193 22802  3  4 92  1  0 2020-05-07 22:19:11
 0  1      0 78754912 1547184 3295208    0    0  7486 27696 11428 29522 10 11 78  1  0 2020-05-07 22:19:13

FREE -m
=======
Thu May  7 22:18:53 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89843       36611          24        2440       33903
Swap:          7627           0        7627
Thu May  7 22:18:55 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89849       36604          24        2440       33897
Swap:          7627           0        7627
Thu May  7 22:18:57 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89871       36582          24        2440       33875
Swap:          7627           0        7627
Thu May  7 22:18:59 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89873       36580          24        2440       33873
Swap:          7627           0        7627
Thu May  7 22:19:01 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89876       36577          24        2440       33870
Swap:          7627           0        7627
Thu May  7 22:19:03 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89896       36558          24        2440       33851
Swap:          7627           0        7627
Thu May  7 22:19:05 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89905       36548          24        2440       33841
Swap:          7627           0        7627
Thu May  7 22:19:07 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       89059       37396          24        2438       34687
Swap:          7627           0        7627

Thu May  7 22:19:09 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       49681       76876          24        2336       74110
Swap:          7627           0        7627

Thu May  7 22:19:11 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       49676       76880          24        2336       74114
Swap:          7627           0        7627

Thu May  7 22:19:13 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       49649       76908          24        2336       74142
Swap:          7627           0        7627
Thu May  7 22:19:15 CEST 2020
              total        used        free      shared  buff/cache   available
Mem:         128894       49668       76889          24        2336       74123
Swap:          7627           0        7627
shodanshok commented 4 years ago

OK, can you show the output of smem -wkt, slabtop, cat /proc/meminfo and cat /proc/spl/kstat/zfs/arcstats (again, both before and after the ARC shrinks)? A minimal capture loop is sketched at the end of this comment.

Does the shrinking happen at a specific time? Can you check whether some cron script is running just before the ARC shrinks?
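
A capture loop for these diagnostics might look like the sketch below (assuming smem and slabtop are installed; the 10-second interval and the log file name are illustrative):

while true; do
    date                             >> memdiag.log
    smem -wkt                        >> memdiag.log
    slabtop -o -s c | head -n 25     >> memdiag.log   # top slab caches by size, printed once
    cat /proc/meminfo                >> memdiag.log
    cat /proc/spl/kstat/zfs/arcstats >> memdiag.log
    sleep 10
done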

ronnyegner commented 4 years ago

Does the shrinking happen at specific time? Can you check if some cron scripts is running just before ARC reduces?

It does not happen at a specific time. The ARC grows, reaches (or almost reaches) zfs_arc_max, then drops, and the cycle starts again. There are no cron jobs running that cause high I/O; you can see that in the vmstat output (columns "bi" and "bo").

Edit: Managed to reproduce it. Log files for download here: https://88.99.150.223/trace.tgz

Timestamps in the logs (+/- a few seconds; the usual sample interval is 2 seconds):

13:30:41  6.9K   367      5     3    0   364   57    52    0    47G   48G  
13:30:43  3.7K   326      8     0    0   325   59     0    0    43G   48G   <===
13:30:45  2.4K   276     11    13    0   263   79    19    1    15G   48G   <===
13:30:48  5.1K   418      8    30    0   387   98   107    2    14G   48G  

I have also included a "zpool iostat" for you, albeit without a timestamp, but you can see that in general there is not much going on. At that time a "zfs send" was running with an average of 3 MB/s read, but that is just a coincidence.

shodanshok commented 4 years ago

Thanks for the detailed logs. From the spl-arcstat.txt it seems that you are constantly running above the metadata limit, causing frequent prune calls. For example, let's compare some stats between 13:30:42 and 13:30:46 (just before and after the ARC shrinks):

13:30:42
arc_prune                       4    424606
arc_meta_used                   4    17241616352
arc_meta_limit                  4    8589934592
arc_dnode_limit                 4    858993459
arc_meta_max                    4    17493991168

13:30:46
arc_prune                       4    476472
arc_meta_used                   4    15572712200
arc_meta_limit                  4    8589934592
arc_dnode_limit                 4    858993459
arc_meta_max                    4    17493991168

difference
arc_prune                       4    476472 - 424606 = 51866 (prune thread invocations)
arc_meta_used                   4    15572712200 - 17241616352 = -1668904152 (~1.6 GB)
arc_meta_limit                  4    8589934592 - 8589934592 = 0
arc_dnode_limit                 4    858993459 - 858993459 = 0
arc_meta_max                    4    17493991168 - 17493991168 = 0

It is my understanding that a large metadata prune can unpin a lot of data memory, leading to a sudden ARC shrink. I think the real question is what consumes so much metadata memory (pushing it even above ZFS's self-imposed limit), and why.
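
A quick way to watch the metadata pressure described above (a sketch; the awk program assumes the standard name/type/data column layout of /proc/spl/kstat/zfs/arcstats):

# print arc_prune, arc_meta_used and arc_meta_limit every 2 seconds
while sleep 2; do
    printf '%s ' "$(date +%T)"
    awk '$1 ~ /^(arc_prune|arc_meta_used|arc_meta_limit)$/ { printf "%s=%s ", $1, $3 } END { print "" }' \
        /proc/spl/kstat/zfs/arcstats
done

If arc_meta_used stays above arc_meta_limit while arc_prune keeps climbing, that is the condition described above.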

Can you describe your pool layout, datasets (and their properties, especially "primarycache"), workload, etc?

ronnyegner commented 4 years ago

Usage is nothing fancy: two pools, basic usage (a small home NAS) for a few clients and some VMs. No heavy use. But the metadata size is strange.

primarycache is at its default ("all") for all file systems; no change there.

root@homenas:~# zpool get all
NAME   PROPERTY                       VALUE                          SOURCE
pool1  size                           3.53T                          -
pool1  capacity                       21%                            -
pool1  altroot                        -                              default
pool1  health                         ONLINE                         -
pool1  guid                           1658318829284017992            -
pool1  version                        -                              default
pool1  bootfs                         -                              default
pool1  delegation                     on                             default
pool1  autoreplace                    off                            default
pool1  cachefile                      -                              default
pool1  failmode                       wait                           default
pool1  listsnapshots                  off                            default
pool1  autoexpand                     off                            default
pool1  dedupditto                     0                              default
pool1  dedupratio                     1.00x                          -
pool1  free                           2.77T                          -
pool1  allocated                      780G                           -
pool1  readonly                       off                            -
pool1  ashift                         12                             local
pool1  comment                        -                              default
pool1  expandsize                     -                              -
pool1  freeing                        0                              -
pool1  fragmentation                  31%                            -
pool1  leaked                         0                              -
pool1  multihost                      off                            default
pool1  checkpoint                     -                              -
pool1  load_guid                      14323955545827788521           -
pool1  autotrim                       off                            default
pool1  feature@async_destroy          enabled                        local
pool1  feature@empty_bpobj            active                         local
pool1  feature@lz4_compress           active                         local
pool1  feature@multi_vdev_crash_dump  enabled                        local
pool1  feature@spacemap_histogram     active                         local
pool1  feature@enabled_txg            active                         local
pool1  feature@hole_birth             active                         local
pool1  feature@extensible_dataset     active                         local
pool1  feature@embedded_data          active                         local
pool1  feature@bookmarks              enabled                        local
pool1  feature@filesystem_limits      enabled                        local
pool1  feature@large_blocks           active                         local
pool1  feature@large_dnode            enabled                        local
pool1  feature@sha512                 enabled                        local
pool1  feature@skein                  active                         local
pool1  feature@edonr                  enabled                        local
pool1  feature@userobj_accounting     active                         local
pool1  feature@encryption             active                         local
pool1  feature@project_quota          active                         local
pool1  feature@device_removal         enabled                        local
pool1  feature@obsolete_counts        enabled                        local
pool1  feature@zpool_checkpoint       enabled                        local
pool1  feature@spacemap_v2            active                         local
pool1  feature@allocation_classes     enabled                        local
pool1  feature@resilver_defer         enabled                        local
pool1  feature@bookmark_v2            enabled                        local

****************************

pool4  size                           118T                           -
pool4  capacity                       38%                            -
pool4  altroot                        -                              default
pool4  health                         ONLINE                         -
pool4  guid                           14100976841961686166           -
pool4  version                        -                              default
pool4  bootfs                         -                              default
pool4  delegation                     on                             default
pool4  autoreplace                    off                            default
pool4  cachefile                      -                              default
pool4  failmode                       wait                           default
pool4  listsnapshots                  off                            default
pool4  autoexpand                     off                            default
pool4  dedupditto                     0                              default
pool4  dedupratio                     1.00x                          -
pool4  free                           72.1T                          -
pool4  allocated                      45.9T                          -
pool4  readonly                       off                            -
pool4  ashift                         12                             local
pool4  comment                        -                              default
pool4  expandsize                     -                              -
pool4  freeing                        0                              -
pool4  fragmentation                  6%                             -
pool4  leaked                         0                              -
pool4  multihost                      off                            default
pool4  checkpoint                     -                              -
pool4  load_guid                      10042272303506965039           -
pool4  autotrim                       off                            default
pool4  feature@async_destroy          enabled                        local
pool4  feature@empty_bpobj            active                         local
pool4  feature@lz4_compress           active                         local
pool4  feature@multi_vdev_crash_dump  enabled                        local
pool4  feature@spacemap_histogram     active                         local
pool4  feature@enabled_txg            active                         local
pool4  feature@hole_birth             active                         local
pool4  feature@extensible_dataset     active                         local
pool4  feature@embedded_data          active                         local
pool4  feature@bookmarks              enabled                        local
pool4  feature@filesystem_limits      enabled                        local
pool4  feature@large_blocks           active                         local
pool4  feature@large_dnode            enabled                        local
pool4  feature@sha512                 active                         local
pool4  feature@skein                  active                         local
pool4  feature@edonr                  enabled                        local
pool4  feature@userobj_accounting     active                         local
pool4  feature@encryption             active                         local
pool4  feature@project_quota          active                         local
pool4  feature@device_removal         enabled                        local
pool4  feature@obsolete_counts        enabled                        local
pool4  feature@zpool_checkpoint       enabled                        local
pool4  feature@spacemap_v2            active                         local
pool4  feature@allocation_classes     enabled                        local
pool4  feature@resilver_defer         enabled                        local
pool4  feature@bookmark_v2            enabled                        local

root@homenas:~# zpool status
  pool: pool1
 state: ONLINE
  scan: scrub repaired 0B in 0 days 04:57:17 with 0 errors on Sun May 10 05:21:19 2020
config:

    NAME        STATE     READ WRITE CKSUM
    pool1       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdl3    ONLINE       0     0     0
        sdm3    ONLINE       0     0     0
        sdj3    ONLINE       0     0     0

errors: No known data errors

  pool: pool4
 state: ONLINE
  scan: scrub in progress since Sun May 10 00:24:09 2020
    17.5T scanned at 290M/s, 17.5T issued at 289M/s, 45.9T total
    0B repaired, 38.15% done, 1 days 04:33:21 to go
config:

    NAME        STATE     READ WRITE CKSUM
    pool4       ONLINE       0     0     0
      raidz3-0  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdp     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sda     ONLINE       0     0     0

Full details of all file systems here: https://88.99.150.223/zfs.tgz

ronnyegner commented 4 years ago

I have increased the metadata limit to 16 GB and the problem is gone: I no longer see ARC evictions down to < 10 GB. But I'd still like to know what is consuming so much metadata. Dedup is off and has never been enabled.
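
For anyone hitting the same issue, a hedged sketch of how the limit can be raised (16 GiB = 17179869184 bytes; the /sys path is the standard Linux ZFS module parameter interface, and /etc/modprobe.d/zfs.conf is a common convention, not something taken from this report):

# raise zfs_arc_meta_limit to 16 GiB at runtime, no reboot needed
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_meta_limit

# make the change persistent across reboots
echo "options zfs zfs_arc_meta_limit=17179869184" >> /etc/modprobe.d/zfs.conf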

shodanshok commented 4 years ago

If I read the data you provided correctly, you have over 5000 snapshots and a very large number of relatively small files (i.e. your dnode_size is over the arc_dnode_limit, meaning many inodes are cached). This can explain the very high metadata size and the sudden ARC shrinks when ZFS tries hard to reclaim some memory.
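
A sketch of how to check these two indicators on a running system (standard ZFS CLI plus the arcstats kstat; nothing here is specific to this report):

# count snapshots across all pools
zfs list -H -t snapshot -o name | wc -l

# cached dnode size vs. its limit
grep -E '^(dnode_size|arc_dnode_limit) ' /proc/spl/kstat/zfs/arcstats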

I suggest continuing the discussion on the mailing list and closing this issue.