openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Random kernel BUG at mm/usercopy.c:99 from SLUB object 'zio_buf_comb_16384' #12543

Open · Rid opened this issue 3 years ago

Rid commented 3 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04.3
Kernel Version 5.8.0-63-generic
Architecture amd64
OpenZFS Version zfs-2.1.0-0york2~20.04

Describe the problem you're observing

Kernel randomly panics, our fleet has experienced 13 panics over 2 weeks since upgrading to zfs 2.1.0.

Describe how to reproduce the problem

Install zfs-2.1.0-0york2~20.04 on Ubuntu 20.04.3, run a high load.

Include any warning/errors/backtraces from the system logs

[62564.778489] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_buf_comb_16384' (offset 4629, size 28139)!
[62564.778550] ------------[ cut here ]------------
[62564.778551] kernel BUG at mm/usercopy.c:99!
[62564.778591] invalid opcode: 0000 [#1] SMP PTI
[62564.778610] CPU: 11 PID: 2425187 Comm: qbittorrent-nox Kdump: loaded Tainted: P           OE     5.8.0-63-generic #71~20.04.1-Ubuntu
[62564.778652] Hardware name: Dell Inc. PowerEdge R730xd/0WCJNT, BIOS 2.13.0 05/14/2021
[62564.778692] RIP: 0010:usercopy_abort+0x7b/0x7d
[62564.778712] Code: 4c 0f 45 de 51 4c 89 d1 48 c7 c2 9d 47 7e ab 57 48 c7 c6 e0 f4 7c ab 48 c7 c7 68 48 7e ab 48 0f 45 f2 4c 89 da e8 23 7c ff ff <0f> 0b 4c 89 e1 49 89 d8 44 89 ea 31 f6 48 29 c1 48 c7 c7 df 47 7e
[62564.778785] RSP: 0018:ffffad0ac3adbb50 EFLAGS: 00010246
[62564.778806] RAX: 0000000000000072 RBX: 0000000000006deb RCX: 0000000000000000
[62564.778832] RDX: 0000000000000000 RSI: ffff94a73f958cd0 RDI: ffff94a73f958cd0
[62564.778862] RBP: ffffad0ac3adbb68 R08: ffff94a73f958cd0 R09: ffffad0a48ce0020
[62564.778889] R10: ffff94972dba6820 R11: 0000000000000001 R12: ffff9498d4119215
[62564.778914] R13: 0000000000000001 R14: ffff9498d4120000 R15: 0000000000000000
[62564.778941] FS:  00007f4d527fc700(0000) GS:ffff94a73f940000(0000) knlGS:0000000000000000
[62564.778975] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[62564.778997] CR2: 00007f4cae86e000 CR3: 0000001227620002 CR4: 00000000003606e0
[62564.779027] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[62564.779057] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[62564.779088] Call Trace:
[62564.779117]  __check_heap_object+0xe6/0x120
[62564.779137]  __check_object_size+0x13f/0x150
[62564.779216]  zfs_uiomove_iter+0x61/0xf0 [zfs]
[62564.779304]  zfs_uiomove+0x25/0x30 [zfs]
[62564.779361]  dmu_read_uio_dnode+0xa5/0xf0 [zfs]
[62564.779451]  ? zfs_rangelock_enter_impl+0x271/0x5c0 [zfs]
[62564.779513]  dmu_read_uio_dbuf+0x47/0x60 [zfs]
[62564.779592]  zfs_read+0x136/0x3a0 [zfs]
[62564.779665]  zpl_iter_read+0xd8/0x180 [zfs]
[62564.779689]  do_iter_readv_writev+0x18b/0x1b0
[62564.779711]  do_iter_read+0xe2/0x1a0
[62564.779732]  vfs_readv+0x6e/0xb0
[62564.779751]  ? __secure_computing+0x42/0xe0
[62564.779774]  do_preadv+0x93/0xd0
[62564.779793]  __x64_sys_preadv+0x21/0x30
[62564.780999]  do_syscall_64+0x49/0xc0
[62564.782046]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[62564.783105] RIP: 0033:0x7f4e4b5424da
[62564.784158] Code: Unable to access opcode bytes at RIP 0x7f4e4b5424b0.
[62564.785165] RSP: 002b:00007f4d527f9cc0 EFLAGS: 00000246 ORIG_RAX: 0000000000000127
[62564.786122] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_buf_comb_16384' (offset 544, size 32224)!
[62564.786154] RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007f4e4b5424da
[62564.786157] RDX: 0000000000000020 RSI: 00007f4d527f9d00 RDI: 0000000000000068
[62564.787278] ------------[ cut here ]------------
[62564.788328] RBP: 00007f4d527f9d00 R08: 0000000000000000 R09: 0000000000000000
[62564.788331] R10: 000000002510fb15 R11: 0000000000000246 R12: 0000000000000020
[62564.789358] kernel BUG at mm/usercopy.c:99!
[62564.790352] R13: 000000002510fb15 R14: 000000002510fb15 R15: 0000000000000000
[62564.790354] Modules linked in: xt_multiport act_mirred cls_u32 sch_ingress sch_hfsc veth nf_conntrack_netlink nfnetlink xfrm_user sch_fq_codel bridge stp llc aufs overlay xt_MASQUERADE xt_nat xt_addrtype iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack iptable_filter bpfilter binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm rapl intel_cstate ipmi_ssif dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ifb ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 ses enclosure scsi_transport_sas mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops ghash_clmulni_intel
[62564.790459]  cec rc_core aesni_intel ixgbe crypto_simd xfrm_algo cryptd ahci mxm_wmi glue_helper drm lpc_ich dca libahci mdio tg3 megaraid_sas wmi
rincebrain commented 3 years ago

20.04 is shipping 5.4.x, and the HWE kernel is 5.11.x at this point (20.10 was 5.8.x). So you seem to be running the old 20.10 HWE kernel on 20.04, not "just" 20.04.

What version were you running before? What are the dataset/pool properties on one of the systems where this happens? What kind of workload are you running?

Rid commented 3 years ago

I was originally running the 5.13.12 mainline kernel (which doesn't have a debug symbols package) and it produced the same panic. I then switched to the 5.11.x HWE kernel, which produced the same panic, however due to this bug https://bugs.launchpad.net/ubuntu/+source/linux-hwe-5.11/+bug/1939287 it also has no debug symbols package.

I also tried 5.10.x, but its debug symbols package is missing as well.

So 5.8.x was the latest Ubuntu kernel I could find which has debug symbols and still produces the same panic, which is why I'm running that instead of 5.4.x.

The pools all have identical properties to this:

NAME  PROPERTY              VALUE                  SOURCE
sda   type                  filesystem             -
sda   creation              Wed Mar 13 13:10 2019  -
sda   used                  1.80T                  -
sda   available             1.65T                  -
sda   referenced            112K                   -
sda   compressratio         1.00x                  -
sda   mounted               yes                    -
sda   quota                 none                   default
sda   reservation           none                   default
sda   recordsize            16K                    local
sda   mountpoint            /cylostore/sda         local
sda   sharenfs              off                    default
sda   checksum              on                     default
sda   compression           lz4                    local
sda   atime                 off                    local
sda   devices               on                     default
sda   exec                  on                     default
sda   setuid                on                     default
sda   readonly              off                    default
sda   zoned                 off                    default
sda   snapdir               hidden                 default
sda   aclmode               discard                default
sda   aclinherit            restricted             default
sda   createtxg             1                      -
sda   canmount              on                     default
sda   xattr                 on                     default
sda   copies                1                      default
sda   version               5                      -
sda   utf8only              off                    -
sda   normalization         none                   -
sda   casesensitivity       sensitive              -
sda   vscan                 off                    default
sda   nbmand                off                    default
sda   sharesmb              off                    default
sda   refquota              none                   default
sda   refreservation        none                   default
sda   guid                  4550442006279899623    -
sda   primarycache          all                    default
sda   secondarycache        all                    default
sda   usedbysnapshots       0B                     -
sda   usedbydataset         112K                   -
sda   usedbychildren        1.80T                  -
sda   usedbyrefreservation  0B                     -
sda   logbias               latency                default
sda   objsetid              54                     -
sda   dedup                 off                    default
sda   mlslabel              none                   default
sda   sync                  standard               default
sda   dnodesize             legacy                 default
sda   refcompressratio      1.00x                  -
sda   written               112K                   -
sda   logicalused           1.80T                  -
sda   logicalreferenced     50K                    -
sda   volmode               default                default
sda   filesystem_limit      none                   default
sda   snapshot_limit        none                   default
sda   filesystem_count      none                   default
sda   snapshot_count        none                   default
sda   snapdev               hidden                 default
sda   acltype               off                    default
sda   context               none                   default
sda   fscontext             none                   default
sda   defcontext            none                   default
sda   rootcontext           none                   default
sda   relatime              off                    default
sda   redundant_metadata    all                    default
sda   overlay               on                     default
sda   encryption            off                    default
sda   keylocation           none                   default
sda   keyformat             none                   default
sda   pbkdf2iters           0                      default
sda   special_small_blocks  0                      default

The workload is very heavy random reads/writes on SATA disks. It's a multi-tenant container system with a wide range of apps: DBs, file transfer, streaming data, workstations and more. Around 7% of the disks we run are at 90%+ utilisation most of the time.

rincebrain commented 3 years ago

Okay, that's some data.

What about enabled features on the pools? (e.g. zpool get all [pool] | grep feature@) How many disks per system? Are they all single-disk pools? I'm guessing they're (mostly? all?) filesystems not zvols?

Are you exposing any of them over NFS/SMB/iSCSI? (I just touched this code for an NFS issue in much, much older kernels, though that patch isn't in 2.1.0 and I don't think was remotely related to this issue, I'm just wondering if there were more gremlins lurking...)

Rid commented 3 years ago

Here's the features:

sda   feature@empty_bpobj            active                         local
sda   feature@lz4_compress           active                         local
sda   feature@multi_vdev_crash_dump  enabled                        local
sda   feature@spacemap_histogram     active                         local
sda   feature@enabled_txg            active                         local
sda   feature@hole_birth             active                         local
sda   feature@extensible_dataset     active                         local
sda   feature@embedded_data          active                         local
sda   feature@bookmarks              enabled                        local
sda   feature@filesystem_limits      enabled                        local
sda   feature@large_blocks           enabled                        local
sda   feature@large_dnode            enabled                        local
sda   feature@sha512                 enabled                        local
sda   feature@skein                  enabled                        local
sda   feature@edonr                  enabled                        local
sda   feature@userobj_accounting     active                         local
sda   feature@encryption             enabled                        local
sda   feature@project_quota          active                         local
sda   feature@device_removal         enabled                        local
sda   feature@obsolete_counts        enabled                        local
sda   feature@zpool_checkpoint       enabled                        local
sda   feature@spacemap_v2            active                         local
sda   feature@allocation_classes     enabled                        local
sda   feature@resilver_defer         enabled                        local
sda   feature@bookmark_v2            enabled                        local
sda   feature@redaction_bookmarks    enabled                        local
sda   feature@redacted_datasets      enabled                        local
sda   feature@bookmark_written       enabled                        local
sda   feature@log_spacemap           active                         local
sda   feature@livelist               active                         local
sda   feature@device_rebuild         enabled                        local
sda   feature@zstd_compress          enabled                        local

There are 12 disks in the system, all single-disk pools, and all filesystems.

The pools share their disks with RAID1 and RAID10 mdadm partitions (ext4).

We're not exposing them over any interface such as NFS or SMB.

We also have the following udev rules:

ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/iosched/low_latency}="0"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/read_ahead_kb}="32"

The only zfs parameter we have set is: zfs_prefetch_disable=1
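
For anyone reproducing this setup, a parameter like this is normally set via an options line in a modprobe config file, e.g. /etc/modprobe.d/zfs.conf:

options zfs zfs_prefetch_disable=1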

Rid commented 3 years ago

I've tried downgrading to:

zfs-2.0.4-1 zfs-kmod-2.0.4-1

However we're still getting panics:

PID: 953856  TASK: ffff9c406c392f00  CPU: 23  COMMAND: "qbittorrent-nox"
 #0 [ffffac8a996cf840] machine_kexec at ffffffff8a46e973
 #1 [ffffac8a996cf8a0] __crash_kexec at ffffffff8a5536d2
 #2 [ffffac8a996cf970] crash_kexec at ffffffff8a5544e9
 #3 [ffffac8a996cf988] oops_end at ffffffff8a435cc9
 #4 [ffffac8a996cf9b0] die at ffffffff8a435f93
 #5 [ffffac8a996cf9e0] do_trap at ffffffff8a432c05
 #6 [ffffac8a996cfa30] do_error_trap at ffffffff8a432e90
 #7 [ffffac8a996cfa78] exc_invalid_op at ffffffff8af73561
 #8 [ffffac8a996cfaa0] asm_exc_invalid_op at ffffffff8b000a72
    [exception RIP: usercopy_abort+123]
    RIP: ffffffff8af2ca87  RSP: ffffac8a996cfb50  RFLAGS: 00010246
    RAX: 0000000000000073  RBX: 0000000000004a1e  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: ffff9c5a7fad8cd0  RDI: ffff9c5a7fad8cd0
    RBP: ffffac8a996cfb68   R8: ffff9c5a7fad8cd0   R9: ffffac89ccc41020
    R10: ffff9c4a71531680  R11: 0000000000000001  R12: ffff9c3e3af035e2
    R13: 0000000000000001  R14: ffff9c3e3af08000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffffac8a996cfb70] __check_heap_object at ffffffff8a6bcc36
#10 [ffffac8a996cfb80] __check_object_size at ffffffff8a6e52af
#11 [ffffac8a996cfbb0] uiomove_iter at ffffffffc0a71131 [zfs]
#12 [ffffac8a996cfbf0] uiomove at ffffffffc0a711e5 [zfs]
#13 [ffffac8a996cfc00] dmu_read_uio_dnode at ffffffffc09754a5 [zfs]
#14 [ffffac8a996cfc50] dmu_read_uio_dbuf at ffffffffc0975537 [zfs]
#15 [ffffac8a996cfc80] zfs_read at ffffffffc0a4c386 [zfs]
#16 [ffffac8a996cfcd8] zpl_iter_read at ffffffffc0a87ee8 [zfs]
#17 [ffffac8a996cfd58] do_iter_readv_writev at ffffffff8a6ebafb
#18 [ffffac8a996cfda8] do_iter_read at ffffffff8a6ed782
#19 [ffffac8a996cfde8] vfs_readv at ffffffff8a6f010e
#20 [ffffac8a996cfee0] do_preadv at ffffffff8a6f0343
#21 [ffffac8a996cff28] __x64_sys_preadv at ffffffff8a6f03a1
#22 [ffffac8a996cff38] do_syscall_64 at ffffffff8af72f89
#23 [ffffac8a996cff50] entry_SYSCALL_64_after_hwframe at ffffffff8b00008c
    RIP: 0000000000dd46ba  RSP: 00007f3e12ffa230  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 00000000000000ef  RCX: 0000000000dd46ba
    RDX: 0000000000000046  RSI: 00007f3e12ffa270  RDI: 00000000000000ef
    RBP: 00007f3e12ffa270   R8: 0000000000000000   R9: 0000000000000000
    R10: 00000000ece46732  R11: 0000000000000246  R12: 0000000000000046
    R13: 00000000ece46732  R14: 00000000ece46732  R15: 0000000000000000
    ORIG_RAX: 0000000000000127  CS: 0033  SS: 002b

They always appear to happen directly after an OOM kill:

[388192.408213] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-8681a4e5824d2034d2dccba0817fd22fd076ac97478c8a5e8b918209d46f49b8.scope,mems_allowed=0-1,oom_memcg=/system.slice/docker-8681a4e5824d2034d2dccba0817fd22fd076ac97478c8a5e8b918209d46f49b8.scope,task_memcg=/system.slice/docker-8681a4e5824d2034d2dccba0817fd22fd076ac97478c8a5e8b918209d46f49b8.scope,task=qbittorrent-nox,pid=944961,uid=14696000
[388192.408254] Memory cgroup out of memory: Killed process 944961 (qbittorrent-nox) total-vm:6594388kB, anon-rss:5794780kB, file-rss:0kB, shmem-rss:0kB, UID:14696000 pgtables:11572kB oom_score_adj:0

I have core dumps of a few crashes; please let me know what information you need to help debug the issue.

rincebrain commented 3 years ago

Interesting. So, based on the issue you referenced this one from, you're triggering an OOM based on a program ignoring cgroup limits and getting murdered when the system is otherwise not RAM-constrained.

It seems like docker being involved is probably worth mentioning in bug reports, FYI - it has a way of complicating many otherwise simple things in ways that are difficult to reproduce without it unless you know it's involved.

Rid commented 3 years ago

It looks like the PID which triggered the bug is a thread of the previously killed proc:

crash> ps -g 953856
PID: 944961  TASK: ffff9c4617ae0000  CPU: 45  COMMAND: "qbittorrent-nox"
  PID: 953856  TASK: ffff9c406c392f00  CPU: 23  COMMAND: "qbittorrent-nox"
  PID: 953857  TASK: ffff9c406c394680  CPU: 19  COMMAND: "qbittorrent-nox"

crash> ps 944961
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  944961  202227  45  ffff9c4617ae0000  ZO   0.0       0      0  qbittorrent-nox
crash> ps 953856
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
> 953856  202227  23  ffff9c406c392f00  RU   4.2 6594388 5717412  qbittorrent-nox
Rid commented 3 years ago

OOM kill was disabled; however, the bug still persists, so it is not related to the OOM kill event.

rincebrain commented 3 years ago

If you could put together a minimal reproducer, with all the components shared so someone could just run it and have the bug appear with even some percentage chance, it would likely be helpful.

Rid commented 3 years ago

There was another panic this morning hitting the same bug, this time triggered by a different process:

[114751.548897] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_buf_comb_16384' (offset 12288, size 20480)!
[114751.548956] ------------[ cut here ]------------
[114751.548957] kernel BUG at mm/usercopy.c:99!
[114751.548981] invalid opcode: 0000 [#1] SMP PTI
[114751.549003] CPU: 22 PID: 3885497 Comm: gclone Kdump: loaded Tainted: P           OE     5.8.0-63-generic #71~20.04.1-Ubuntu
[114751.549048] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.13.0 05/14/2021
[114751.549085] RIP: 0010:usercopy_abort+0x7b/0x7d
[114751.549107] Code: 4c 0f 45 de 51 4c 89 d1 48 c7 c2 9d 47 de bd 57 48 c7 c6 e0 f4 dc bd 48 c7 c7 68 48 de bd 48 0f 45 f2 4c 89 da e8 23 7c ff ff <0f> 0b 4c 89 e1 49 89 d8 44 89 ea 31 f6 48 29 c1 48 c7 c7 df 47 de
[114751.549180] RSP: 0018:ffffb76a3d507c10 EFLAGS: 00010246
[114751.549204] RAX: 0000000000000073 RBX: 0000000000005000 RCX: 0000000000000000
[114751.549234] RDX: 0000000000000000 RSI: ffff92ccbfad8cd0 RDI: ffff92ccbfad8cd0
[114751.549265] RBP: ffffb76a3d507c28 R08: ffff92ccbfad8cd0 R09: ffffb769c9240020
[114751.549294] R10: ffff92ccb8e7f590 R11: 0000000000000001 R12: ffff92cf11d7b000
[114751.549324] R13: 0000000000000001 R14: ffff92cf11d80000 R15: 0000000000000000
[114751.550563] FS:  000000c0005cc790(0000) GS:ffff92ccbfac0000(0000) knlGS:0000000000000000
[114751.551758] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[114751.552922] CR2: 00007f5015652000 CR3: 0000001689a56001 CR4: 00000000003606e0
[114751.554098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[114751.555267] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[114751.556408] Call Trace:
[114751.557529]  __check_heap_object+0xe6/0x120
[114751.558647]  __check_object_size+0x13f/0x150
[114751.559823]  zfs_uiomove_iter+0x61/0xf0 [zfs]
[114751.561207]  zfs_uiomove+0x25/0x30 [zfs]
[114751.562547]  dmu_read_uio_dnode+0xa5/0xf0 [zfs]
[114751.563645]  dmu_read_uio_dbuf+0x47/0x60 [zfs]
[114751.564737]  zfs_read+0x136/0x3a0 [zfs]
[114751.565813]  zpl_iter_read+0xd8/0x180 [zfs]
[114751.566855]  new_sync_read+0x10c/0x1a0
[114751.567871]  vfs_read+0x161/0x190
[114751.568853]  ksys_read+0x67/0xe0
[114751.569819]  __x64_sys_read+0x1a/0x20
[114751.570782]  do_syscall_64+0x49/0xc0
[114751.571741]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[114751.572676] RIP: 0033:0x4bde70
[114751.573592] Code: Unable to access opcode bytes at RIP 0x4bde46.
[114751.574529] RSP: 002b:000000c000684a38 EFLAGS: 00000216 ORIG_RAX: 0000000000000000
[114751.575564] RAX: ffffffffffffffda RBX: 000000c00004f400 RCX: 00000000004bde70
[114751.576491] RDX: 0000000000100000 RSI: 00007f501560a000 RDI: 0000000000000005
[114751.577582] RBP: 000000c000684a88 R08: 0000000000000000 R09: 0000000000000000
[114751.579173] R10: 0000000000000000 R11: 0000000000000216 R12: 00000000000000f4
[114751.580724] R13: 0000000000000000 R14: 00000000019c0546 R15: 0000000000000000
[114751.582269] Modules linked in: xt_multiport act_mirred cls_u32 sch_ingress sch_hfsc veth nf_conntrack_netlink nfnetlink xfrm_user sch_fq_codel bridge stp llc aufs overlay xt_MASQUERADE xt_nat xt_addrtype iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack iptable_filter bpfilter binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm rapl intel_cstate dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ifb ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 ses enclosure scsi_transport_sas mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt crct10dif_pclmul fb_sys_fops crc32_pclmul ghash_clmulni_intel
[114751.582327]  cec rc_core ixgbe aesni_intel crypto_simd ahci mxm_wmi xfrm_algo cryptd glue_helper lpc_ich drm megaraid_sas dca libahci tg3 mdio wmi

So far I have tried to reproduce it using:

docker -H unix:///cylostore/sda/18220/var/run/docker.sock run -m 500M --memory-swap 500M -it --rm alexeiled/stress-ng -v --iomix 12 --iomix-bytes 10% --vm 1 --vm-bytes 1G --timeout 120s --metrics-brief

I tried this with and without the OOM killer; when the OOM killer was enabled, I set the oom_score_adj of the io threads higher so that they were killed rather than the vm thread.

However I haven't yet been able to reproduce it myself.

Rid commented 3 years ago

We're still hitting this bug on a daily basis. Here's the latest crash from this morning:

[82151.529261] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_buf_comb_16384' (offset 4635, size 28133)!
[82151.529331] ------------[ cut here ]------------
[82151.529332] kernel BUG at mm/usercopy.c:99!
[82151.529364] invalid opcode: 0000 [#1] SMP PTI
[82151.529383] CPU: 4 PID: 2914085 Comm: ffmpeg Kdump: loaded Tainted: P           OE     5.8.0-63-generic #71~20.04.1-Ubuntu
[82151.529436] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.13.0 05/14/2021
[82151.529469] RIP: 0010:usercopy_abort+0x7b/0x7d
[82151.529490] Code: 4c 0f 45 de 51 4c 89 d1 48 c7 c2 9d 47 1e 83 57 48 c7 c6 e0 f4 1c 83 48 c7 c7 68 48 1e 83 48 0f 45 f2 4c 89 da e8 23 7c ff ff <0f> 0b 4c 89 e1 49 89 d8 44 89 ea 31 f6 48 29 c1 48 c7 c7 df 47 1e
[82151.529568] RSP: 0018:ffffbd2fd22e3c10 EFLAGS: 00010246
[82151.529588] RAX: 0000000000000072 RBX: 0000000000006de5 RCX: 0000000000000000
[82151.529624] RDX: 0000000000000000 RSI: ffff932cbf898cd0 RDI: ffff932cbf898cd0
[82151.529658] RBP: ffffbd2fd22e3c28 R08: ffff932cbf898cd0 R09: ffffbd2f495a0020
[82151.529684] R10: ffff932cb024d7e0 R11: 0000000000000001 R12: ffff931fca5b921b
[82151.529719] R13: 0000000000000001 R14: ffff931fca5c0000 R15: 0000000000000000
[82151.529763] FS:  00007f796f359340(0000) GS:ffff932cbf880000(0000) knlGS:0000000000000000
[82151.529792] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[82151.529825] CR2: 000055948358c000 CR3: 000000179c2f8002 CR4: 00000000003606e0
[82151.529865] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[82151.529900] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[82151.529930] Call Trace:
[82151.529952]  __check_heap_object+0xe6/0x120
[82151.529972]  __check_object_size+0x13f/0x150
[82151.530095]  zfs_uiomove_iter+0x61/0xf0 [zfs]
[82151.530206]  zfs_uiomove+0x25/0x30 [zfs]
[82151.530271]  dmu_read_uio_dnode+0xa5/0xf0 [zfs]
[82151.530351]  ? zfs_rangelock_enter_impl+0x271/0x5c0 [zfs]
[82151.530431]  dmu_read_uio_dbuf+0x47/0x60 [zfs]
[82151.530531]  zfs_read+0x136/0x3a0 [zfs]
[82151.530631]  zpl_iter_read+0xd8/0x180 [zfs]
[82151.530663]  new_sync_read+0x10c/0x1a0
[82151.530691]  vfs_read+0x161/0x190
[82151.530717]  ksys_read+0x67/0xe0
[82151.530743]  __x64_sys_read+0x1a/0x20
[82151.530771]  do_syscall_64+0x49/0xc0
[82151.530799]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[82151.530837] RIP: 0033:0x7f797a72e332
[82151.530867] Code: Unable to access opcode bytes at RIP 0x7f797a72e308.
[82151.530909] RSP: 002b:00007ffd3ebfd118 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[82151.530968] RAX: ffffffffffffffda RBX: 000055948339d440 RCX: 00007f797a72e332
[82151.531012] RDX: 00000000000ae404 RSI: 000055948358bed4 RDI: 0000000000000003
[82151.531058] RBP: 000055948358bed4 R08: 00007f797c0d8980 R09: 00005594833ad890
[82151.531104] R10: 00005594835a9000 R11: 0000000000000246 R12: 00000000000ae404
[82151.531149] R13: 0000000000000005 R14: 000055948339d2f0 R15: 000055948339d2c0
[82151.531197] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic xt_multiport sch_hfsc act_mirred cls_u32 sch_ingress veth nf_conntrack_netlink nfnetlink xfrm_user sch_fq_codel bridge stp llc aufs overlay xt_MASQUERADE xt_nat xt_addrtype iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack iptable_filter bpfilter binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd ipmi_ssif cryptd glue_helper rapl intel_cstate mgag200 drm_kms_helper cec mei_me rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt mei ipmi_si dcdbas mxm_wmi ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid tcp_bbr sch_fq
[82151.531247]  nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ifb drm ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 ses enclosure scsi_transport_sas ixgbe xfrm_algo crc32_pclmul lpc_ich ahci dca tg3 megaraid_sas libahci mdio wmi

We are currently trialling our own custom OOM killer, which sends SIGTERM instead of SIGKILL. However, the above crash happened despite this, so either the OOM killer was not fast enough (we'll try adjusting the parameters) or the crash is not related to hitting the cgroup RAM limit.

The trace looks slightly similar to the following:

https://lkml.org/lkml/2020/7/26/73

We may try using the kernel before this commit to confirm.

pinging @behlendorf for any insights.

Rid commented 3 years ago

We have implemented a custom OOM terminator in Go which uses fsnotify and containerd/cgroups to monitor Docker containers for memory events. Once a container starts emitting critical memory-pressure events and is over a certain threshold, we send a SIGTERM to the PID with the highest oom_score.

This works for most applications, which gradually increase their RAM usage. The only remaining problem is an application that spikes its RAM usage faster than we can kill it, and that appears to happen only with ffmpeg.

This still results in a crash 1-2 times per week. At the moment that seems to be the best we can do, and it's much better than the multiple crashes per day.
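
To give an idea of the approach, here is a heavily simplified sketch. This is not our actual implementation: it polls the cgroup v1 memory files directly rather than using fsnotify/containerd event notifications, and the paths, names and threshold are illustrative only.

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
    "strings"
    "syscall"
    "time"
)

// readInt reads a single decimal value from a cgroup or /proc file.
func readInt(path string) (int64, error) {
    b, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// highestOOMScorePID returns the cgroup member with the largest /proc/<pid>/oom_score.
func highestOOMScorePID(cgroupDir string) int {
    b, err := os.ReadFile(filepath.Join(cgroupDir, "cgroup.procs"))
    if err != nil {
        return 0
    }
    bestPID, bestScore := 0, int64(-1)
    for _, field := range strings.Fields(string(b)) {
        pid, err := strconv.Atoi(field)
        if err != nil {
            continue
        }
        score, err := readInt(fmt.Sprintf("/proc/%d/oom_score", pid))
        if err != nil {
            continue // process may already have exited
        }
        if score > bestScore {
            bestPID, bestScore = pid, score
        }
    }
    return bestPID
}

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: oomterm /sys/fs/cgroup/memory/docker/<container-id>")
        os.Exit(1)
    }
    cgroupDir := os.Args[1]
    const threshold = 0.95 // terminate once usage exceeds 95% of the cgroup limit

    for range time.Tick(time.Second) {
        usage, uerr := readInt(filepath.Join(cgroupDir, "memory.usage_in_bytes"))
        limit, lerr := readInt(filepath.Join(cgroupDir, "memory.limit_in_bytes"))
        if uerr != nil || lerr != nil || limit <= 0 {
            continue
        }
        if float64(usage)/float64(limit) > threshold {
            if pid := highestOOMScorePID(cgroupDir); pid > 0 {
                fmt.Printf("cgroup over threshold, sending SIGTERM to pid %d\n", pid)
                syscall.Kill(pid, syscall.SIGTERM) // SIGTERM rather than the kernel's SIGKILL
            }
        }
    }
}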

We would still like to debug this further if anyone is able to help.

Rid commented 3 years ago

This one is slightly different: the SLUB object here belongs to a different container (a1b14e7ba3a2e170d9a0b4475c61c541811de9f2c19d4d68929c71916b33271c) than the proc mentioned in the bug, which belongs to container 86b70f49d51a778366e5cc853f7c9107911b551decb0164dd0de9867cef88fd9.

There didn't appear to be any memory pressure on the containers.

I've started running bpftrace on some of these procs in order to debug further.

[1174955.186691] usercopy: Kernel memory exposure attempt detected from SLUB object 'sock_inode_cache(6899:a1b14e7ba3a2e170d9a0b4475c61c541811de9f2c19d4d68929c71916b33271c)' (offset 177, size 29007)!
[1174955.186821] ------------[ cut here ]------------
[1174955.186823] kernel BUG at mm/usercopy.c:99!
[1174955.186854] invalid opcode: 0000 [#1] SMP PTI
[1174955.186885] CPU: 16 PID: 2373466 Comm: qbittorrent-nox Kdump: loaded Tainted: P           OE     5.8.0-63-generic #71~20.04.1-Ubuntu
[1174955.186951] Hardware name: Dell Inc. PowerEdge R730xd/0WCJNT, BIOS 2.13.0 05/14/2021
[1174955.187000] RIP: 0010:usercopy_abort+0x7b/0x7d
[1174955.187026] Code: 4c 0f 45 de 51 4c 89 d1 48 c7 c2 9d 47 1e 87 57 48 c7 c6 e0 f4 1c 87 48 c7 c7 68 48 1e 87 48 0f 45 f2 4c 89 da e8 23 7c ff ff <0f> 0b 4c 89 e1 49 89 d8 44 89 ea 31 f6 48 29 c1 48 c7 c7 df 47 1e
[1174955.187111] RSP: 0018:ffffb7a230bffb50 EFLAGS: 00010246
[1174955.187144] RAX: 00000000000000b6 RBX: 000000000000714f RCX: 0000000000000000
[1174955.187185] RDX: 0000000000000000 RSI: ffff8ea73fa18cd0 RDI: ffff8ea73fa18cd0
[1174955.187226] RBP: ffffb7a230bffb68 R08: ffff8ea73fa18cd0 R09: ffffb7a1c8980020
[1174955.187267] R10: ffff8ea72e477220 R11: 0000000000000001 R12: ffff8e99d63e4eb1
[1174955.188947] R13: 0000000000000001 R14: ffff8e99d63ec000 R15: 0000000000000000
[1174955.190579] FS:  00007f0e977fe700(0000) GS:ffff8ea73fa00000(0000) knlGS:0000000000000000
[1174955.192228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1174955.193902] CR2: 00007f0d28e3e000 CR3: 000000157730a005 CR4: 00000000003606e0
[1174955.195573] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1174955.197170] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1174955.198743] Call Trace:
[1174955.200438]  __check_heap_object+0xe6/0x120
[1174955.201967]  __check_object_size+0x13f/0x150
[1174955.203643]  zfs_uiomove_iter+0x61/0xf0 [zfs]
[1174955.205557]  zfs_uiomove+0x25/0x30 [zfs]
[1174955.207163]  dmu_read_uio_dnode+0xa5/0xf0 [zfs]
[1174955.208707]  ? zfs_rangelock_enter_impl+0x271/0x5c0 [zfs]
[1174955.210376]  dmu_read_uio_dbuf+0x47/0x60 [zfs]
[1174955.211894]  zfs_read+0x136/0x3a0 [zfs]
[1174955.213397]  zpl_iter_read+0xd8/0x180 [zfs]
[1174955.214908]  do_iter_readv_writev+0x18b/0x1b0
[1174955.216427]  do_iter_read+0xe2/0x1a0
[1174955.217979]  vfs_readv+0x6e/0xb0
[1174955.219360]  ? perf_syscall_enter+0x209/0x210
[1174955.220799]  do_preadv+0x93/0xd0
[1174955.222196]  __x64_sys_preadv+0x21/0x30
[1174955.223520]  do_syscall_64+0x49/0xc0
[1174955.224815]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1174955.226121] RIP: 0033:0x7f0ebad5b6d8
[1174955.227403] Code: 3c 24 48 89 4c 24 18 e8 a6 00 f8 ff 4c 8b 54 24 18 8b 3c 24 45 31 c0 41 89 c1 8b 54 24 14 48 8b 74 24 08 b8 27 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2c 44 89 cf 48 89 04 24 e8 d4 00 f8 ff 48 8b
[1174955.229971] RSP: 002b:00007f0e977fbd80 EFLAGS: 00000246 ORIG_RAX: 0000000000000127
[1174955.230943] RAX: ffffffffffffffda RBX: 00007f0e977fbdb0 RCX: 00007f0ebad5b6d8
[1174955.231900] RDX: 0000000000000020 RSI: 00007f0e977fbdb0 RDI: 0000000000000049
[1174955.232838] RBP: 00007f0e977fc020 R08: 0000000000000000 R09: 0000000000000000
[1174955.233774] R10: 00000000a097a7c1 R11: 0000000000000246 R12: 00000000a097a7c1
[1174955.234707] R13: 00007f0e977fc0c8 R14: 00007f0ebad5b670 R15: 0000000000000000
[1174955.235605] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic xt_multiport act_mirred cls_u32 sch_ingress sch_hfsc veth nf_conntrack_netlink nfnetlink xfrm_user sch_fq_codel bridge stp llc aufs overlay xt_MASQUERADE xt_nat xt_addrtype iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack iptable_filter bpfilter binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm rapl intel_cstate ipmi_ssif dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ifb ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear
[1174955.235641]  raid10 raid1 ses enclosure scsi_transport_sas mgag200 i2c_algo_bit drm_kms_helper crct10dif_pclmul syscopyarea sysfillrect crc32_pclmul sysimgblt ghash_clmulni_intel fb_sys_fops cec aesni_intel ixgbe rc_core crypto_simd xfrm_algo cryptd mxm_wmi glue_helper drm ahci lpc_ich tg3 dca libahci megaraid_sas mdio wmi
Rid commented 3 years ago

So the SLUB cache has a 16K object size (from /proc/slabinfo):

zio_buf_comb_16384 171771 183223 16384 2 8 : tunables 0 0 0 : slabdata 92951 92951 0

and it looks like we're attempting to copy more than 20K out of it in one go (in each of the zio_buf_comb_16384 traces above, offset + size = 32768 bytes, i.e. exactly two 16K objects). I can see this code was changed recently in:

https://github.com/openzfs/zfs/commit/783784582225e8ddfbf07993d9fc278bf08025c5#diff-8bd8fe4c03392e93d0678c0b0f2437252d2dcd1073772c60358cbb2f384bc8de

However, as we're on amd64 Linux, that shouldn't have changed anything, since AFAIK our page size has always been 4K.

We're going to try setting spl_kmem_cache_slab_limit=0 to see if that immediately fixes things.

Rid commented 2 years ago

For anybody else hitting this bug in the future, it was resolved by setting spl_kmem_cache_slab_limit=0; there doesn't appear to be much of a performance hit (we cannot see any noticeable difference).
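
For anyone applying the same workaround: the parameter needs to be in effect when the spl/zfs modules load (the zio_buf_comb caches are created at module load time), so the usual approach is a modprobe options file plus a module reload or reboot, e.g.:

# /etc/modprobe.d/spl.conf
options spl spl_kmem_cache_slab_limit=0

If the modules are loaded from the initramfs, the initramfs may also need regenerating (update-initramfs -u on Ubuntu).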

Here's the trace of the bug through the stack:

https://github.com/torvalds/linux/blob/923dcc5eb0c111eccd51cc7ce1658537e3c38b25/mm/slub.c#L4524
https://github.com/torvalds/linux/blob/a4849f6000e29235a2707f22e39da6b897bb9543/mm/usercopy.c#L286
https://github.com/torvalds/linux/blob/a4849f6000e29235a2707f22e39da6b897bb9543/include/linux/thread_info.h#L218
https://github.com/torvalds/linux/blob/a4849f6000e29235a2707f22e39da6b897bb9543/include/linux/uio.h#L161
https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/os/linux/zfs/zfs_uio.c#L172
https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/os/linux/zfs/zfs_uio.c#L205

I think the bug can be fixed by properly checking the object size against the slab limit, but I would guess that an object larger than the limit shouldn't be possible in the first place? If so, could someone point me to the code that checks the object size?

cc @rincebrain

stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

Rid commented 1 year ago

Unmark stale

Rid commented 1 year ago

After upgrading to 2.1.6 last night we hit https://github.com/openzfs/zfs/issues/11574 on 50+ servers. All servers locked up and required a reboot into rescue mode, as normal reboots were causing endless lockups due to OOM.

We're currently investigating how to get the servers back to a stable state.

rincebrain commented 1 year ago

The two easy ways I can think of to make scrubbing OOM a system are having lots of pools (since the new scrub code currently caps each scrub at 10% of total system RAM, with no global maximum) or memory fragmentation.

Do either of those sound plausible?

Either way, you could try zfs_scrub_legacy=1 for now.
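
(Assuming your build exposes it as a writable module parameter, that's

echo 1 > /sys/module/zfs/parameters/zfs_scrub_legacy

at runtime, or an "options zfs zfs_scrub_legacy=1" line in /etc/modprobe.d/zfs.conf to persist it across reboots.)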

Rid commented 1 year ago

@rincebrain we have 12 pools on each server, and the RAM is always under pressure, so 10% per pool would add up to 120% of RAM before any other processes.

We're testing zfs_scrub_legacy=1 now.

rincebrain commented 1 year ago

There's an internal variable for this threshold, zfs_scan_mem_lim_fact, which is 20 and is used in the sense of 1/[that] of total RAM per scrub, I believe. (Unless it was reworked since I last had someone report this problem.)

So that'd be 60% of system RAM for you, I imagine.

It's not exposed as a tunable, but it would be simple to do so. You could also just import one pool, pause its scrub, repeat, and then stagger running them, though I would agree that shouldn't be necessary.