openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

OOM / Panic on files remove #16037

Open osleg opened 5 months ago

osleg commented 5 months ago

Problem

While testing OpenZFS versions 2.1.13-2.1.15 and 2.2.2-2.2.3 on CentOS 8 Stream with kernel versions from 4.18.0-408 to 4.18.0-547, on an Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz with 8GB ECC RAM, we hit a memory consumption issue that leads to a kernel panic during disk usage stress testing.

Test setup

Utilizing zpool with multiple configurations:

The test involves running multiple writers to fill the disk with random-sized files ranging from 1KB to 2GB. Once the disks are filled, all files are removed, and the process is repeated.
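The fill-and-remove cycle described above could be sketched roughly as follows. This is not the actual test harness: the function names, the file naming, and the per-writer file limit are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction sketch; names and defaults are assumptions,
# not the reporter's actual test harness.
set -u

# Pick a random size in bytes between min and max (inclusive).
random_size() {
    local min=$1 max=$2
    # shuf -i handles ranges wider than $RANDOM's 15 bits.
    shuf -i "${min}-${max}" -n 1
}

# One writer: keep creating random-sized files until a write fails
# (for example because the pool is full) or the file limit is reached.
writer() {
    local dir=$1 id=$2 min=$3 max=$4 limit=$5
    local n=0 size
    while [ "$n" -lt "$limit" ]; do
        size=$(random_size "$min" "$max")
        head -c "$size" /dev/urandom > "$dir/file-$id-$n" 2>/dev/null || break
        n=$((n + 1))
    done
}

# One cycle: run the writers in parallel, then remove everything they wrote.
fill_and_remove() {
    local dir=$1 writers=$2 min=$3 max=$4 limit=$5
    local i
    for i in $(seq 1 "$writers"); do
        writer "$dir" "$i" "$min" "$max" "$limit" &
    done
    wait
    rm -f "$dir"/file-*
}
```

With sizes from 1KB to 2GB, a repeated run might look like: while true; do fill_and_remove /mnt/dir2 8 1024 2147483648 1000000; done. The writer count of 8 and the file limit are placeholders, not values from the original test.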

Observed issue

Across all tested versions, particularly pronounced in versions prior to 2.2.3, significant memory consumption occurs when files are removed.

Memory usage spikes, consuming all available memory.

The OOM killer activates in an attempt to free memory, resulting in kernel panics when no further resources are available for the OOM killer to release.

With 8GB RAM, the issue consistently occurs in every test instance before version 2.2.3, with a decreased frequency in version 2.2.3 (5 out of 20 CentOS test instances experienced kernel panics).
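To see the spike while the removal runs, available memory can be sampled once per second. The sketch below is only one way to do that; the helper names are hypothetical, and it reads MemAvailable from /proc/meminfo by default.

```shell
# Print the MemAvailable value (in kB) from a meminfo-formatted file.
# Defaults to /proc/meminfo when no file is given.
mem_available_kb() {
    awk '$1 == "MemAvailable:" { print $2 }' "${1:-/proc/meminfo}"
}

# Print one timestamped sample: epoch seconds, then MemAvailable in kB.
sample_line() {
    printf '%s %s\n' "$(date +%s)" "$(mem_available_kb "${1:-/proc/meminfo}")"
}
```

A loop such as: while sleep 1; do echo "$(sample_line) freeing=$(zpool get -Hp -o value freeing mnt)"; done, started in a second session before rm -f, would log memory alongside the pool's freeing property. Writing the output somewhere off the pool makes it more likely to survive the panic.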

Logs

Machine info

The current instance is the only one I have left for testing right now:

# cat /etc/os-release
NAME="CentOS Stream"
VERSION="8"

# zfs --version
zfs-2.2.3-1
zfs-kmod-2.2.3-1

# dmidecode -t memory
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0008, DMI type 16, 23 bytes
Physical Memory Array
    Location: System Board Or Motherboard
    Use: System Memory
    Error Correction Type: Unknown
    Maximum Capacity: 8 GB
    Error Information Handle: Not Provided
    Number Of Devices: 1

Handle 0x0009, DMI type 17, 34 bytes
Memory Device
    Array Handle: 0x0008
    Error Information Handle: Not Provided
    Total Width: 72 bits
    Data Width: 64 bits
    Size: 8 GB
    Form Factor: DIMM
    Set: None
    Locator: Not Specified
    Bank Locator: Not Specified
    Type: DDR4
    Type Detail: Static Column Pseudo-static Synchronous Window DRAM
    Speed: 2933 MT/s
    Manufacturer: Not Specified
    Serial Number: Not Specified
    Asset Tag: Not Specified
    Part Number: Not Specified
    Rank: Unknown
    Configured Memory Speed: Unknown

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Intel(R) Corporation
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
BIOS Model name:     Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping:            4
CPU MHz:             2999.998
BogoMIPS:            5999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

# uname -srm
Linux 4.18.0-540.el8.x86_64 x86_64

issue demo

total 295G
-rw-r--r--. 1 root root 2.0G Mar 20 02:45 Swordsman-13505
-rw-r--r--. 1 root root 2.0G Mar 20 03:37 Swordsman-13554
....
-rw-r--r--. 1 root root 2.0G Mar 20 19:32 Swordsman-14370
-rw-r--r--. 1 root root 884M Mar 20 19:33 Swordsman-14371

# ll | wc -l
138

# du -ch
295G    .
295G    total

# rm -f *
# 
frclient_loop: send disconnect: Broken pipe

After reconnecting over SSH, the directory still has all the files:

138
# du -ch /mnt/dir2
295G    /mnt/dir2
295G    total

zpool status

  pool: mnt
 state: ONLINE
remove: Removal of vdev 19 copied 28.6G in 0h3m, completed on Thu Mar 21 20:28:28 2024
    14.7M memory used for removed device mappings
config:

    NAME           STATE     READ WRITE CKSUM
    mnt            ONLINE       0     0     0
      nvme4n1      ONLINE       0     0     0
      nvme3n1      ONLINE       0     0     0
      nvme1n1      ONLINE       0     0     0
      nvme2n1      ONLINE       0     0     0

errors: No known data errors

zpool list

NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mnt            3.05T  2.52T   549G        -         -     0%    82%  1.00x    ONLINE  -
  indirect-0       -      -      -        -         -      -      -      -    ONLINE
  indirect-1       -      -      -        -         -      -      -      -    ONLINE
  indirect-2       -      -      -        -         -      -      -      -    ONLINE
  indirect-3       -      -      -        -         -      -      -      -    ONLINE
  indirect-4       -      -      -        -         -      -      -      -    ONLINE
  indirect-5       -      -      -        -         -      -      -      -    ONLINE
  indirect-6       -      -      -        -         -      -      -      -    ONLINE
  indirect-7       -      -      -        -         -      -      -      -    ONLINE
  indirect-8       -      -      -        -         -      -      -      -    ONLINE
  indirect-9       -      -      -        -         -      -      -      -    ONLINE
  indirect-10      -      -      -        -         -      -      -      -    ONLINE
  indirect-11      -      -      -        -         -      -      -      -    ONLINE
  indirect-12      -      -      -        -         -      -      -      -    ONLINE
  indirect-13      -      -      -        -         -      -      -      -    ONLINE
  indirect-14      -      -      -        -         -      -      -      -    ONLINE
  indirect-15      -      -      -        -         -      -      -      -    ONLINE
  indirect-16      -      -      -        -         -      -      -      -    ONLINE
  nvme4n1      2.93T  2.47T   469G        -         -     0%  84.3%      -    ONLINE
  nvme3n1      25.0G  24.3G   169M        -         -     0%  99.3%      -    ONLINE
  indirect-19      -      -      -        -         -      -      -      -    ONLINE
  nvme1n1      25.0G  24.3G   232M        -         -    22%  99.1%      -    ONLINE
  nvme2n1      80.0G   680K  79.5G        -         -     0%  0.00%      -    ONLINE

zpool config

NAME  PROPERTY                       VALUE                          SOURCE
mnt   size                           3.05T                          -
mnt   capacity                       82%                            -
mnt   altroot                        -                              default
mnt   health                         ONLINE                         -
mnt   guid                           8946787721482689307            -
mnt   version                        -                              default
mnt   bootfs                         -                              default
mnt   delegation                     on                             default
mnt   autoreplace                    off                            default
mnt   cachefile                      -                              default
mnt   failmode                       wait                           default
mnt   listsnapshots                  off                            default
mnt   autoexpand                     on                             local
mnt   dedupratio                     1.00x                          -
mnt   free                           549G                           -
mnt   allocated                      2.52T                          -
mnt   readonly                       off                            -
mnt   ashift                         12                             local
mnt   comment                        -                              default
mnt   expandsize                     -                              -
mnt   freeing                        0                              -
mnt   fragmentation                  0%                             -
mnt   leaked                         0                              -
mnt   multihost                      off                            default
mnt   checkpoint                     -                              -
mnt   load_guid                      17249711793930708177           -
mnt   autotrim                       off                            default
mnt   compatibility                  off                            default
mnt   bcloneused                     0                              -
mnt   bclonesaved                    0                              -
mnt   bcloneratio                    1.00x                          -
mnt   feature@async_destroy          enabled                        local
mnt   feature@empty_bpobj            enabled                        local
mnt   feature@lz4_compress           active                         local
mnt   feature@multi_vdev_crash_dump  enabled                        local
mnt   feature@spacemap_histogram     active                         local
mnt   feature@enabled_txg            active                         local
mnt   feature@hole_birth             active                         local
mnt   feature@extensible_dataset     active                         local
mnt   feature@embedded_data          active                         local
mnt   feature@bookmarks              enabled                        local
mnt   feature@filesystem_limits      enabled                        local
mnt   feature@large_blocks           enabled                        local
mnt   feature@large_dnode            enabled                        local
mnt   feature@sha512                 enabled                        local
mnt   feature@skein                  enabled                        local
mnt   feature@edonr                  enabled                        local
mnt   feature@userobj_accounting     active                         local
mnt   feature@encryption             enabled                        local
mnt   feature@project_quota          active                         local
mnt   feature@device_removal         active                         local
mnt   feature@obsolete_counts        active                         local
mnt   feature@zpool_checkpoint       enabled                        local
mnt   feature@spacemap_v2            active                         local
mnt   feature@allocation_classes     enabled                        local
mnt   feature@resilver_defer         enabled                        local
mnt   feature@bookmark_v2            enabled                        local
mnt   feature@redaction_bookmarks    enabled                        local
mnt   feature@redacted_datasets      enabled                        local
mnt   feature@bookmark_written       enabled                        local
mnt   feature@log_spacemap           active                         local
mnt   feature@livelist               enabled                        local
mnt   feature@device_rebuild         enabled                        local
mnt   feature@zstd_compress          enabled                        local
mnt   feature@draid                  enabled                        local
mnt   feature@zilsaxattr             enabled                        local
mnt   feature@head_errlog            active                         local
mnt   feature@blake3                 enabled                        local
mnt   feature@block_cloning          enabled                        local
mnt   feature@vdev_zaps_v2           active                         local

zfs config

NAME  PROPERTY              VALUE                  SOURCE
mnt   type                  filesystem             -
mnt   creation              Sun Mar 10 13:54 2024  -
mnt   used                  2.52T                  -
mnt   available             451G                   -
mnt   referenced            2.52T                  -
mnt   compressratio         1.00x                  -
mnt   mounted               yes                    -
mnt   quota                 none                   default
mnt   reservation           none                   default
mnt   recordsize            128K                   default
mnt   mountpoint            /mnt                   local
mnt   sharenfs              off                    default
mnt   checksum              on                     default
mnt   compression           on                     default
mnt   atime                 on                     default
mnt   devices               on                     default
mnt   exec                  on                     default
mnt   setuid                on                     default
mnt   readonly              off                    default
mnt   zoned                 off                    default
mnt   snapdir               hidden                 default
mnt   aclmode               discard                default
mnt   aclinherit            restricted             default
mnt   createtxg             1                      -
mnt   canmount              on                     default
mnt   xattr                 on                     default
mnt   copies                1                      default
mnt   version               5                      -
mnt   utf8only              off                    -
mnt   normalization         none                   -
mnt   casesensitivity       sensitive              -
mnt   vscan                 off                    default
mnt   nbmand                off                    default
mnt   sharesmb              off                    default
mnt   refquota              none                   default
mnt   refreservation        none                   default
mnt   guid                  11115806655719226472   -
mnt   primarycache          all                    default
mnt   secondarycache        all                    default
mnt   usedbysnapshots       0B                     -
mnt   usedbydataset         2.52T                  -
mnt   usedbychildren        120M                   -
mnt   usedbyrefreservation  0B                     -
mnt   logbias               latency                default
mnt   objsetid              54                     -
mnt   dedup                 off                    default
mnt   mlslabel              none                   default
mnt   sync                  standard               default
mnt   dnodesize             legacy                 default
mnt   refcompressratio      1.00x                  -
mnt   written               2.52T                  -
mnt   logicalused           2.35T                  -
mnt   logicalreferenced     2.35T                  -
mnt   volmode               default                default
mnt   filesystem_limit      none                   default
mnt   snapshot_limit        none                   default
mnt   filesystem_count      none                   default
mnt   snapshot_count        none                   default
mnt   snapdev               hidden                 default
mnt   acltype               off                    default
mnt   context               none                   default
mnt   fscontext             none                   default
mnt   defcontext            none                   default
mnt   rootcontext           none                   default
mnt   relatime              on                     default
mnt   redundant_metadata    all                    default
mnt   overlay               on                     default
mnt   encryption            off                    default
mnt   keylocation           none                   default
mnt   keyformat             none                   default
mnt   pbkdf2iters           0                      default
mnt   special_small_blocks  0                      default

vmcore dmesg

dmesg.txt

Maybe related:

#14732

#15776

#14914

robn commented 5 months ago

@osleg can you post /proc/spl/kmem/slab from before and after the OOM event? Doesn't need to be exact, but I'd like to see what happens as more files are deleted, into the kernel attempting to reclaim memory, before finally giving up and killing something.

osleg commented 5 months ago

@robn sorry, it took me a while to get those. Here are three logs: the first is from before rm -f /mnt/dir2/* started, the second is from right after rm returned, and the third is the last one I was able to fetch before the kernel panic: slab_1711966416_169.log, slab_1711966417_170.log, slab_1711966419_171.log
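For anyone else collecting the same data, periodic snapshots of /proc/spl/kmem/slab can be taken with a small loop. The sketch below is an assumption about how such logs might be gathered, not the script used for the files above; the function names and the slab_<epoch>_<sequence>.log naming are chosen to mirror those file names.

```shell
# Hypothetical helper: copy a stats file to slab_<epoch>_<seq>.log
# and print the path of the file that was written.
snapshot() {
    local src=$1 outdir=$2 seq=$3
    local out="$outdir/slab_$(date +%s)_${seq}.log"
    cat "$src" > "$out" && printf '%s\n' "$out"
}

# Take count snapshots, interval seconds apart, calling sync after each
# one so the file is more likely to be on disk if the machine panics.
collect() {
    local src=$1 outdir=$2 count=$3 interval=$4
    local seq=0
    while [ "$seq" -lt "$count" ]; do
        snapshot "$src" "$outdir" "$seq" >/dev/null || return 1
        sync
        seq=$((seq + 1))
        sleep "$interval"
    done
}
```

For example, collect /proc/spl/kmem/slab /root/slab-logs 100000 1 & could be started just before the rm. The output directory here is a placeholder and should live outside the pool under test.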

eliran-zada-zesty commented 5 months ago

Got this issue as well... :-(

egadsthefuzz commented 4 months ago

I've also hit this recently and it looks like it is similar if not the same as https://github.com/openzfs/zfs/issues/6783