Rid opened this issue 3 years ago
20.04 is shipping 5.4.x, and the HWE kernel is 5.11.x at this point (20.10 was 5.8.x). So you seem to be running the old 20.10 HWE kernel on 20.04, not "just" 20.04.
What version were you running before? What are the dataset/pool properties on one of the systems where this happens? What kind of workload are you running?
I was originally running 5.13.12 mainline (which doesn't have a debug symbols package), and it was producing the same panic. I then switched to the 5.11.x HWE kernel, which also panicked; however, due to this bug https://bugs.launchpad.net/ubuntu/+source/linux-hwe-5.11/+bug/1939287 it also lacks a debug symbols package.
I also tried 5.10.x, but its debug symbols package is missing as well.
So 5.8.x was the latest Ubuntu kernel I could find that has debug symbols and still produces the same panic, which is why I'm running that instead of 5.4.x.
The pools all have identical properties to this:
NAME PROPERTY VALUE SOURCE
sda type filesystem -
sda creation Wed Mar 13 13:10 2019 -
sda used 1.80T -
sda available 1.65T -
sda referenced 112K -
sda compressratio 1.00x -
sda mounted yes -
sda quota none default
sda reservation none default
sda recordsize 16K local
sda mountpoint /cylostore/sda local
sda sharenfs off default
sda checksum on default
sda compression lz4 local
sda atime off local
sda devices on default
sda exec on default
sda setuid on default
sda readonly off default
sda zoned off default
sda snapdir hidden default
sda aclmode discard default
sda aclinherit restricted default
sda createtxg 1 -
sda canmount on default
sda xattr on default
sda copies 1 default
sda version 5 -
sda utf8only off -
sda normalization none -
sda casesensitivity sensitive -
sda vscan off default
sda nbmand off default
sda sharesmb off default
sda refquota none default
sda refreservation none default
sda guid 4550442006279899623 -
sda primarycache all default
sda secondarycache all default
sda usedbysnapshots 0B -
sda usedbydataset 112K -
sda usedbychildren 1.80T -
sda usedbyrefreservation 0B -
sda logbias latency default
sda objsetid 54 -
sda dedup off default
sda mlslabel none default
sda sync standard default
sda dnodesize legacy default
sda refcompressratio 1.00x -
sda written 112K -
sda logicalused 1.80T -
sda logicalreferenced 50K -
sda volmode default default
sda filesystem_limit none default
sda snapshot_limit none default
sda filesystem_count none default
sda snapshot_count none default
sda snapdev hidden default
sda acltype off default
sda context none default
sda fscontext none default
sda defcontext none default
sda rootcontext none default
sda relatime off default
sda redundant_metadata all default
sda overlay on default
sda encryption off default
sda keylocation none default
sda keyformat none default
sda pbkdf2iters 0 default
sda special_small_blocks 0 default
The workload is very heavy random reads/writes on SATA disks. It's a multi-tenant container system with a wide range of apps: DBs, file transfer, streaming data, workstations, and more. Around 7% of the disks we run are at 90%+ utilisation most of the time.
Okay, that's some data.
What about enabled features on the pools? (e.g. zpool get all [pool] | grep feature@)
How many disks per system? Are they all single-disk pools? I'm guessing they're (mostly? all?) filesystems not zvols?
Are you exposing any of them over NFS/SMB/iSCSI? (I just touched this code for an NFS issue in much, much older kernels, though that patch isn't in 2.1.0 and I don't think was remotely related to this issue, I'm just wondering if there were more gremlins lurking...)
Here's the features:
sda feature@empty_bpobj active local
sda feature@lz4_compress active local
sda feature@multi_vdev_crash_dump enabled local
sda feature@spacemap_histogram active local
sda feature@enabled_txg active local
sda feature@hole_birth active local
sda feature@extensible_dataset active local
sda feature@embedded_data active local
sda feature@bookmarks enabled local
sda feature@filesystem_limits enabled local
sda feature@large_blocks enabled local
sda feature@large_dnode enabled local
sda feature@sha512 enabled local
sda feature@skein enabled local
sda feature@edonr enabled local
sda feature@userobj_accounting active local
sda feature@encryption enabled local
sda feature@project_quota active local
sda feature@device_removal enabled local
sda feature@obsolete_counts enabled local
sda feature@zpool_checkpoint enabled local
sda feature@spacemap_v2 active local
sda feature@allocation_classes enabled local
sda feature@resilver_defer enabled local
sda feature@bookmark_v2 enabled local
sda feature@redaction_bookmarks enabled local
sda feature@redacted_datasets enabled local
sda feature@bookmark_written enabled local
sda feature@log_spacemap active local
sda feature@livelist active local
sda feature@device_rebuild enabled local
sda feature@zstd_compress enabled local
There are 12 disks in the system, all single-disk pools, and all filesystems.
The pools share their disks with RAID1 & RAID10 mdadm partitions (ext4).
We're not exposing them over any interface such as NFS etc.
We also have the following udev rules:
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/iosched/low_latency}="0"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/read_ahead_kb}="32"
The only zfs parameter we have set is:
zfs_prefetch_disable=1
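(For completeness, a minimal example of how a parameter like this is typically set, assuming the standard module parameter paths rather than our exact provisioning:)

echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable              # runtime
echo "options zfs zfs_prefetch_disable=1" > /etc/modprobe.d/zfs.conf  # persists across reboots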
I've tried downgrading to:
zfs-2.0.4-1 zfs-kmod-2.0.4-1
However we're still getting panics:
PID: 953856 TASK: ffff9c406c392f00 CPU: 23 COMMAND: "qbittorrent-nox"
#0 [ffffac8a996cf840] machine_kexec at ffffffff8a46e973
#1 [ffffac8a996cf8a0] __crash_kexec at ffffffff8a5536d2
#2 [ffffac8a996cf970] crash_kexec at ffffffff8a5544e9
#3 [ffffac8a996cf988] oops_end at ffffffff8a435cc9
#4 [ffffac8a996cf9b0] die at ffffffff8a435f93
#5 [ffffac8a996cf9e0] do_trap at ffffffff8a432c05
#6 [ffffac8a996cfa30] do_error_trap at ffffffff8a432e90
#7 [ffffac8a996cfa78] exc_invalid_op at ffffffff8af73561
#8 [ffffac8a996cfaa0] asm_exc_invalid_op at ffffffff8b000a72
[exception RIP: usercopy_abort+123]
RIP: ffffffff8af2ca87 RSP: ffffac8a996cfb50 RFLAGS: 00010246
RAX: 0000000000000073 RBX: 0000000000004a1e RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9c5a7fad8cd0 RDI: ffff9c5a7fad8cd0
RBP: ffffac8a996cfb68 R8: ffff9c5a7fad8cd0 R9: ffffac89ccc41020
R10: ffff9c4a71531680 R11: 0000000000000001 R12: ffff9c3e3af035e2
R13: 0000000000000001 R14: ffff9c3e3af08000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffffac8a996cfb70] __check_heap_object at ffffffff8a6bcc36
#10 [ffffac8a996cfb80] __check_object_size at ffffffff8a6e52af
#11 [ffffac8a996cfbb0] uiomove_iter at ffffffffc0a71131 [zfs]
#12 [ffffac8a996cfbf0] uiomove at ffffffffc0a711e5 [zfs]
#13 [ffffac8a996cfc00] dmu_read_uio_dnode at ffffffffc09754a5 [zfs]
#14 [ffffac8a996cfc50] dmu_read_uio_dbuf at ffffffffc0975537 [zfs]
#15 [ffffac8a996cfc80] zfs_read at ffffffffc0a4c386 [zfs]
#16 [ffffac8a996cfcd8] zpl_iter_read at ffffffffc0a87ee8 [zfs]
#17 [ffffac8a996cfd58] do_iter_readv_writev at ffffffff8a6ebafb
#18 [ffffac8a996cfda8] do_iter_read at ffffffff8a6ed782
#19 [ffffac8a996cfde8] vfs_readv at ffffffff8a6f010e
#20 [ffffac8a996cfee0] do_preadv at ffffffff8a6f0343
#21 [ffffac8a996cff28] __x64_sys_preadv at ffffffff8a6f03a1
#22 [ffffac8a996cff38] do_syscall_64 at ffffffff8af72f89
#23 [ffffac8a996cff50] entry_SYSCALL_64_after_hwframe at ffffffff8b00008c
RIP: 0000000000dd46ba RSP: 00007f3e12ffa230 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00000000000000ef RCX: 0000000000dd46ba
RDX: 0000000000000046 RSI: 00007f3e12ffa270 RDI: 00000000000000ef
RBP: 00007f3e12ffa270 R8: 0000000000000000 R9: 0000000000000000
R10: 00000000ece46732 R11: 0000000000000246 R12: 0000000000000046
R13: 00000000ece46732 R14: 00000000ece46732 R15: 0000000000000000
ORIG_RAX: 0000000000000127 CS: 0033 SS: 002b
They always appear to happen directly after an OOM kill:
[388192.408213] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-8681a4e5824d2034d2dccba0817fd22fd076ac97478c8a5e8b918209d46f49b8.scope,mems_allowed=0-1,oom_memcg=/system.slice/docker-8681a4e5824d2034d2dccba0817fd22fd076ac97478c8a5e8b918209d46f49b8.scope,task_memcg=/system.slice/docker-8681a4e5824d2034d2dccba0817fd22fd076ac97478c8a5e8b918209d46f49b8.scope,task=qbittorrent-nox,pid=944961,uid=14696000
[388192.408254] Memory cgroup out of memory: Killed process 944961 (qbittorrent-nox) total-vm:6594388kB, anon-rss:5794780kB, file-rss:0kB, shmem-rss:0kB, UID:14696000 pgtables:11572kB oom_score_adj:0
I have core dumps of a few crashes, please let me know what information you need to help debug the issue.
Interesting. So, based on the issue you referenced this one from, you're triggering an OOM based on a program ignoring cgroup limits and getting murdered when the system is otherwise not RAM-constrained.
It seems like docker being involved is probably worth mentioning in bug reports, FYI - it has a way of complicating many otherwise simple things in ways that are difficult to reproduce without it unless you know it's involved.
It looks like the PID which triggered the bug is a thread of the previously killed proc:
crash> ps -g 953856
PID: 944961 TASK: ffff9c4617ae0000 CPU: 45 COMMAND: "qbittorrent-nox"
PID: 953856 TASK: ffff9c406c392f00 CPU: 23 COMMAND: "qbittorrent-nox"
PID: 953857 TASK: ffff9c406c394680 CPU: 19 COMMAND: "qbittorrent-nox"
crash> ps 944961
PID PPID CPU TASK ST %MEM VSZ RSS COMM
944961 202227 45 ffff9c4617ae0000 ZO 0.0 0 0 qbittorrent-nox
crash> ps 953856
PID PPID CPU TASK ST %MEM VSZ RSS COMM
> 953856 202227 23 ffff9c406c392f00 RU 4.2 6594388 5717412 qbittorrent-nox
The OOM killer was disabled, however the bug still persists, so it is not related to the OOM kill event.
If you could put together a minimal reproducer, all of whose components you could share so someone could just run it and have the bug appear with even some percentage chance, it would likely be helpful.
There was another panic this morning hitting the same bug, this time triggered by a different process:
[114751.548897] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_buf_comb_16384' (offset 12288, size 20480)!
[114751.548956] ------------[ cut here ]------------
[114751.548957] kernel BUG at mm/usercopy.c:99!
[114751.548981] invalid opcode: 0000 [#1] SMP PTI
[114751.549003] CPU: 22 PID: 3885497 Comm: gclone Kdump: loaded Tainted: P OE 5.8.0-63-generic #71~20.04.1-Ubuntu
[114751.549048] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.13.0 05/14/2021
[114751.549085] RIP: 0010:usercopy_abort+0x7b/0x7d
[114751.549107] Code: 4c 0f 45 de 51 4c 89 d1 48 c7 c2 9d 47 de bd 57 48 c7 c6 e0 f4 dc bd 48 c7 c7 68 48 de bd 48 0f 45 f2 4c 89 da e8 23 7c ff ff <0f> 0b 4c 89 e1 49 89 d8 44 89 ea 31 f6 48 29 c1 48 c7 c7 df 47 de
[114751.549180] RSP: 0018:ffffb76a3d507c10 EFLAGS: 00010246
[114751.549204] RAX: 0000000000000073 RBX: 0000000000005000 RCX: 0000000000000000
[114751.549234] RDX: 0000000000000000 RSI: ffff92ccbfad8cd0 RDI: ffff92ccbfad8cd0
[114751.549265] RBP: ffffb76a3d507c28 R08: ffff92ccbfad8cd0 R09: ffffb769c9240020
[114751.549294] R10: ffff92ccb8e7f590 R11: 0000000000000001 R12: ffff92cf11d7b000
[114751.549324] R13: 0000000000000001 R14: ffff92cf11d80000 R15: 0000000000000000
[114751.550563] FS: 000000c0005cc790(0000) GS:ffff92ccbfac0000(0000) knlGS:0000000000000000
[114751.551758] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[114751.552922] CR2: 00007f5015652000 CR3: 0000001689a56001 CR4: 00000000003606e0
[114751.554098] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[114751.555267] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[114751.556408] Call Trace:
[114751.557529] __check_heap_object+0xe6/0x120
[114751.558647] __check_object_size+0x13f/0x150
[114751.559823] zfs_uiomove_iter+0x61/0xf0 [zfs]
[114751.561207] zfs_uiomove+0x25/0x30 [zfs]
[114751.562547] dmu_read_uio_dnode+0xa5/0xf0 [zfs]
[114751.563645] dmu_read_uio_dbuf+0x47/0x60 [zfs]
[114751.564737] zfs_read+0x136/0x3a0 [zfs]
[114751.565813] zpl_iter_read+0xd8/0x180 [zfs]
[114751.566855] new_sync_read+0x10c/0x1a0
[114751.567871] vfs_read+0x161/0x190
[114751.568853] ksys_read+0x67/0xe0
[114751.569819] __x64_sys_read+0x1a/0x20
[114751.570782] do_syscall_64+0x49/0xc0
[114751.571741] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[114751.572676] RIP: 0033:0x4bde70
[114751.573592] Code: Unable to access opcode bytes at RIP 0x4bde46.
[114751.574529] RSP: 002b:000000c000684a38 EFLAGS: 00000216 ORIG_RAX: 0000000000000000
[114751.575564] RAX: ffffffffffffffda RBX: 000000c00004f400 RCX: 00000000004bde70
[114751.576491] RDX: 0000000000100000 RSI: 00007f501560a000 RDI: 0000000000000005
[114751.577582] RBP: 000000c000684a88 R08: 0000000000000000 R09: 0000000000000000
[114751.579173] R10: 0000000000000000 R11: 0000000000000216 R12: 00000000000000f4
[114751.580724] R13: 0000000000000000 R14: 00000000019c0546 R15: 0000000000000000
[114751.582269] Modules linked in: xt_multiport act_mirred cls_u32 sch_ingress sch_hfsc veth nf_conntrack_netlink nfnetlink xfrm_user sch_fq_codel bridge stp llc aufs overlay xt_MASQUERADE xt_nat xt_addrtype iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack iptable_filter bpfilter binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm rapl intel_cstate dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ifb ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 ses enclosure scsi_transport_sas mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt crct10dif_pclmul fb_sys_fops crc32_pclmul ghash_clmulni_intel
[114751.582327] cec rc_core ixgbe aesni_intel crypto_simd ahci mxm_wmi xfrm_algo cryptd glue_helper lpc_ich drm megaraid_sas dca libahci tg3 mdio wmi
So far I have tried to reproduce it using:
docker -H unix:///cylostore/sda/18220/var/run/docker.sock run -m 500M --memory-swap 500M -it --rm alexeiled/stress-ng -v --iomix 12 --iomix-bytes 10% --vm 1 --vm-bytes 1G --timeout 120s --metrics-brief
I tried this both with and without the OOM killer; when the OOM killer was enabled, I set the oom_score_adj of the io threads higher so that they were killed rather than the vm thread.
However I haven't yet been able to reproduce it myself.
We're still hitting this bug on a daily basis, here's the latest crash from this morning:
[82151.529261] usercopy: Kernel memory exposure attempt detected from SLUB object 'zio_buf_comb_16384' (offset 4635, size 28133)!
[82151.529331] ------------[ cut here ]------------
[82151.529332] kernel BUG at mm/usercopy.c:99!
[82151.529364] invalid opcode: 0000 [#1] SMP PTI
[82151.529383] CPU: 4 PID: 2914085 Comm: ffmpeg Kdump: loaded Tainted: P OE 5.8.0-63-generic #71~20.04.1-Ubuntu
[82151.529436] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.13.0 05/14/2021
[82151.529469] RIP: 0010:usercopy_abort+0x7b/0x7d
[82151.529490] Code: 4c 0f 45 de 51 4c 89 d1 48 c7 c2 9d 47 1e 83 57 48 c7 c6 e0 f4 1c 83 48 c7 c7 68 48 1e 83 48 0f 45 f2 4c 89 da e8 23 7c ff ff <0f> 0b 4c 89 e1 49 89 d8 44 89 ea 31 f6 48 29 c1 48 c7 c7 df 47 1e
[82151.529568] RSP: 0018:ffffbd2fd22e3c10 EFLAGS: 00010246
[82151.529588] RAX: 0000000000000072 RBX: 0000000000006de5 RCX: 0000000000000000
[82151.529624] RDX: 0000000000000000 RSI: ffff932cbf898cd0 RDI: ffff932cbf898cd0
[82151.529658] RBP: ffffbd2fd22e3c28 R08: ffff932cbf898cd0 R09: ffffbd2f495a0020
[82151.529684] R10: ffff932cb024d7e0 R11: 0000000000000001 R12: ffff931fca5b921b
[82151.529719] R13: 0000000000000001 R14: ffff931fca5c0000 R15: 0000000000000000
[82151.529763] FS: 00007f796f359340(0000) GS:ffff932cbf880000(0000) knlGS:0000000000000000
[82151.529792] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[82151.529825] CR2: 000055948358c000 CR3: 000000179c2f8002 CR4: 00000000003606e0
[82151.529865] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[82151.529900] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[82151.529930] Call Trace:
[82151.529952] __check_heap_object+0xe6/0x120
[82151.529972] __check_object_size+0x13f/0x150
[82151.530095] zfs_uiomove_iter+0x61/0xf0 [zfs]
[82151.530206] zfs_uiomove+0x25/0x30 [zfs]
[82151.530271] dmu_read_uio_dnode+0xa5/0xf0 [zfs]
[82151.530351] ? zfs_rangelock_enter_impl+0x271/0x5c0 [zfs]
[82151.530431] dmu_read_uio_dbuf+0x47/0x60 [zfs]
[82151.530531] zfs_read+0x136/0x3a0 [zfs]
[82151.530631] zpl_iter_read+0xd8/0x180 [zfs]
[82151.530663] new_sync_read+0x10c/0x1a0
[82151.530691] vfs_read+0x161/0x190
[82151.530717] ksys_read+0x67/0xe0
[82151.530743] __x64_sys_read+0x1a/0x20
[82151.530771] do_syscall_64+0x49/0xc0
[82151.530799] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[82151.530837] RIP: 0033:0x7f797a72e332
[82151.530867] Code: Unable to access opcode bytes at RIP 0x7f797a72e308.
[82151.530909] RSP: 002b:00007ffd3ebfd118 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[82151.530968] RAX: ffffffffffffffda RBX: 000055948339d440 RCX: 00007f797a72e332
[82151.531012] RDX: 00000000000ae404 RSI: 000055948358bed4 RDI: 0000000000000003
[82151.531058] RBP: 000055948358bed4 R08: 00007f797c0d8980 R09: 00005594833ad890
[82151.531104] R10: 00005594835a9000 R11: 0000000000000246 R12: 00000000000ae404
[82151.531149] R13: 0000000000000005 R14: 000055948339d2f0 R15: 000055948339d2c0
[82151.531197] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic xt_multiport sch_hfsc act_mirred cls_u32 sch_ingress veth nf_conntrack_netlink nfnetlink xfrm_user sch_fq_codel bridge stp llc aufs overlay xt_MASQUERADE xt_nat xt_addrtype iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack iptable_filter bpfilter binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd ipmi_ssif cryptd glue_helper rapl intel_cstate mgag200 drm_kms_helper cec mei_me rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt mei ipmi_si dcdbas mxm_wmi ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid tcp_bbr sch_fq
[82151.531247] nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ifb drm ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 ses enclosure scsi_transport_sas ixgbe xfrm_algo crc32_pclmul lpc_ich ahci dca tg3 megaraid_sas libahci mdio wmi
We are currently trialling our own custom OOM killer which uses SIGTERM instead of SIGKILL; however, the above crash happened despite this, so either the OOM killer was not fast enough (we'll try adjusting the params) or it's not related to hitting the cgroup RAM limit.
The trace looks slightly similar to the following:
https://lkml.org/lkml/2020/7/26/73
We may try using the kernel before this commit to confirm.
pinging @behlendorf for any insights.
We have implemented a custom OOM terminator in golang which uses fsnotify and containerd/cgroups to monitor docker containers for memory events. Once a container starts emitting critical memory pressure events and is over a certain threshold, we send a SIGTERM to the PID with the highest oom_score.
This works for most applications, which gradually use more RAM; the only remaining problem case is an application spiking its RAM usage faster than we can kill it, and that appears to only happen with ffmpeg.
This still results in crashing 1-2 times per week. At the moment that seems to be the best we can do, and it's much better than the multiple crashes per day.
We would still like to debug this further if anyone is able to help.
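For illustration only, here's a minimal shell sketch of the selection/termination step described above (this is not our actual Go implementation, and the cgroup path is a placeholder): find the member of a container's memory cgroup with the highest oom_score and send it SIGTERM.

CGPATH=/sys/fs/cgroup/memory/system.slice/docker-<container-id>.scope   # placeholder
target=$(
  for pid in $(cat "$CGPATH/cgroup.procs"); do
    [ -r "/proc/$pid/oom_score" ] || continue      # skip PIDs that already exited
    echo "$(cat /proc/$pid/oom_score) $pid"        # pair each score with its PID
  done | sort -rn | head -n1 | awk '{print $2}'    # highest oom_score wins
)
[ -n "$target" ] && kill -TERM "$target"           # SIGTERM rather than SIGKILL

The monitoring side (fsnotify plus containerd/cgroups watching for memory events) is what triggers this step and isn't shown here.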
This one is slightly different: the SLUB object here belongs to a different container (a1b14e7ba3a2e170d9a0b4475c61c541811de9f2c19d4d68929c71916b33271c) than the proc mentioned in the bug, which belongs to container (86b70f49d51a778366e5cc853f7c9107911b551decb0164dd0de9867cef88fd9).
There didn't appear to be any memory pressure on the containers.
I've started running bpftrace on some of these procs in order to debug further.
[1174955.186691] usercopy: Kernel memory exposure attempt detected from SLUB object 'sock_inode_cache(6899:a1b14e7ba3a2e170d9a0b4475c61c541811de9f2c19d4d68929c71916b33271c)' (offset 177, size 29007)!
[1174955.186821] ------------[ cut here ]------------
[1174955.186823] kernel BUG at mm/usercopy.c:99!
[1174955.186854] invalid opcode: 0000 [#1] SMP PTI
[1174955.186885] CPU: 16 PID: 2373466 Comm: qbittorrent-nox Kdump: loaded Tainted: P OE 5.8.0-63-generic #71~20.04.1-Ubuntu
[1174955.186951] Hardware name: Dell Inc. PowerEdge R730xd/0WCJNT, BIOS 2.13.0 05/14/2021
[1174955.187000] RIP: 0010:usercopy_abort+0x7b/0x7d
[1174955.187026] Code: 4c 0f 45 de 51 4c 89 d1 48 c7 c2 9d 47 1e 87 57 48 c7 c6 e0 f4 1c 87 48 c7 c7 68 48 1e 87 48 0f 45 f2 4c 89 da e8 23 7c ff ff <0f> 0b 4c 89 e1 49 89 d8 44 89 ea 31 f6 48 29 c1 48 c7 c7 df 47 1e
[1174955.187111] RSP: 0018:ffffb7a230bffb50 EFLAGS: 00010246
[1174955.187144] RAX: 00000000000000b6 RBX: 000000000000714f RCX: 0000000000000000
[1174955.187185] RDX: 0000000000000000 RSI: ffff8ea73fa18cd0 RDI: ffff8ea73fa18cd0
[1174955.187226] RBP: ffffb7a230bffb68 R08: ffff8ea73fa18cd0 R09: ffffb7a1c8980020
[1174955.187267] R10: ffff8ea72e477220 R11: 0000000000000001 R12: ffff8e99d63e4eb1
[1174955.188947] R13: 0000000000000001 R14: ffff8e99d63ec000 R15: 0000000000000000
[1174955.190579] FS: 00007f0e977fe700(0000) GS:ffff8ea73fa00000(0000) knlGS:0000000000000000
[1174955.192228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1174955.193902] CR2: 00007f0d28e3e000 CR3: 000000157730a005 CR4: 00000000003606e0
[1174955.195573] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1174955.197170] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1174955.198743] Call Trace:
[1174955.200438] __check_heap_object+0xe6/0x120
[1174955.201967] __check_object_size+0x13f/0x150
[1174955.203643] zfs_uiomove_iter+0x61/0xf0 [zfs]
[1174955.205557] zfs_uiomove+0x25/0x30 [zfs]
[1174955.207163] dmu_read_uio_dnode+0xa5/0xf0 [zfs]
[1174955.208707] ? zfs_rangelock_enter_impl+0x271/0x5c0 [zfs]
[1174955.210376] dmu_read_uio_dbuf+0x47/0x60 [zfs]
[1174955.211894] zfs_read+0x136/0x3a0 [zfs]
[1174955.213397] zpl_iter_read+0xd8/0x180 [zfs]
[1174955.214908] do_iter_readv_writev+0x18b/0x1b0
[1174955.216427] do_iter_read+0xe2/0x1a0
[1174955.217979] vfs_readv+0x6e/0xb0
[1174955.219360] ? perf_syscall_enter+0x209/0x210
[1174955.220799] do_preadv+0x93/0xd0
[1174955.222196] __x64_sys_preadv+0x21/0x30
[1174955.223520] do_syscall_64+0x49/0xc0
[1174955.224815] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1174955.226121] RIP: 0033:0x7f0ebad5b6d8
[1174955.227403] Code: 3c 24 48 89 4c 24 18 e8 a6 00 f8 ff 4c 8b 54 24 18 8b 3c 24 45 31 c0 41 89 c1 8b 54 24 14 48 8b 74 24 08 b8 27 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2c 44 89 cf 48 89 04 24 e8 d4 00 f8 ff 48 8b
[1174955.229971] RSP: 002b:00007f0e977fbd80 EFLAGS: 00000246 ORIG_RAX: 0000000000000127
[1174955.230943] RAX: ffffffffffffffda RBX: 00007f0e977fbdb0 RCX: 00007f0ebad5b6d8
[1174955.231900] RDX: 0000000000000020 RSI: 00007f0e977fbdb0 RDI: 0000000000000049
[1174955.232838] RBP: 00007f0e977fc020 R08: 0000000000000000 R09: 0000000000000000
[1174955.233774] R10: 00000000a097a7c1 R11: 0000000000000246 R12: 00000000a097a7c1
[1174955.234707] R13: 00007f0e977fc0c8 R14: 00007f0ebad5b670 R15: 0000000000000000
[1174955.235605] Modules linked in: wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic xt_multiport act_mirred cls_u32 sch_ingress sch_hfsc veth nf_conntrack_netlink nfnetlink xfrm_user sch_fq_codel bridge stp llc aufs overlay xt_MASQUERADE xt_nat xt_addrtype iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_state xt_conntrack iptable_filter bpfilter binfmt_misc zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm rapl intel_cstate ipmi_ssif dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ifb ip_tables x_tables autofs4 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear
[1174955.235641] raid10 raid1 ses enclosure scsi_transport_sas mgag200 i2c_algo_bit drm_kms_helper crct10dif_pclmul syscopyarea sysfillrect crc32_pclmul sysimgblt ghash_clmulni_intel fb_sys_fops cec aesni_intel ixgbe rc_core crypto_simd xfrm_algo cryptd mxm_wmi glue_helper drm ahci lpc_ich tg3 dca libahci megaraid_sas mdio wmi
So the slub object has a limit of 16K:
zio_buf_comb_16384 171771 183223 16384 2 8 : tunables 0 0 0 : slabdata 92951 92951 0
and it looks like we're attempting to copy over 20K out of it. I can see this code was changed recently in:
However, as we're on amd64 Linux that shouldn't have changed anything, as AFAIK our page size has always been 4K.
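(The cache line above looks like /proc/slabinfo output; for anyone wanting to check the object size of this cache on their own system, something like the following should show it:)

grep zio_buf_comb_16384 /proc/slabinfo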
We're going to try setting spl_kmem_cache_slab_limit=0 to see if that immediately fixes things.
For anybody else hitting this bug in the future: it was resolved by setting spl_kmem_cache_slab_limit=0. There doesn't appear to be much of a performance hit (we cannot see any noticeable difference).
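A minimal example of setting it (assuming the standard spl module parameter paths; since the zio slab caches are created when the modules load, the change only fully takes effect after the modules are reloaded or the machine is rebooted):

echo "options spl spl_kmem_cache_slab_limit=0" > /etc/modprobe.d/spl.conf
cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit   # check the current value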
Here's the trace of the bug through the stack:
https://github.com/torvalds/linux/blob/923dcc5eb0c111eccd51cc7ce1658537e3c38b25/mm/slub.c#L4524
https://github.com/torvalds/linux/blob/a4849f6000e29235a2707f22e39da6b897bb9543/mm/usercopy.c#L286
https://github.com/torvalds/linux/blob/a4849f6000e29235a2707f22e39da6b897bb9543/include/linux/thread_info.h#L218
https://github.com/torvalds/linux/blob/a4849f6000e29235a2707f22e39da6b897bb9543/include/linux/uio.h#L161
https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/os/linux/zfs/zfs_uio.c#L172
https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/os/linux/zfs/zfs_uio.c#L205
I think the bug can be fixed by properly checking the object size against the slab limit, but I would guess that an object larger than the limit shouldn't be possible in the first place? If so, could someone point me to the code which checks the object size?
cc @rincebrain
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
After upgrading to 2.1.6 last night we hit https://github.com/openzfs/zfs/issues/11574 on 50+ servers. All servers locked up and required a reboot into rescue mode, as normal reboots were causing infinite lockups due to OOM.
We're currently investigating how to get the servers back to a stable state.
The two easy ways I can think of to make scrubbing OOM a system are having lots of pools (since currently the new scrub caps each scrub at 10% of total system RAM, with no global maximum usage), or memory fragmentation.
Do either of those sound plausible?
Either way, you could try zfs_scrub_legacy=1 for now.
@rincebrain we have 12 pools on each server, and the RAM is always under pressure, so 10% usage would add up to 120% RAM without any other processes.
We're testing zfs_scrub_legacy=1 now.
There's an internal variable for this threshold, zfs_scan_mem_lim_fact, which is 20 and is used in the sense of 1/[that] of total RAM per pool, I believe. (Unless it was reworked since I last had someone report this problem.)
So that'd be 60% of system RAM for you, I imagine.
It's not exposed as a tunable, but it would be simple to do so. You could also just import one pool, pause its scrub, repeat, and then stagger running them, though I would agree that shouldn't be necessary.
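A rough sketch of that staggered approach (pool names are examples; zpool scrub -p pauses an in-progress scrub, and running zpool scrub on the pool later resumes it):

zpool import sda
zpool scrub -p sda    # pause any scrub that resumed on import
zpool import sdb
zpool scrub -p sdb
# ...repeat for the remaining pools, then resume scrubs one pool at a time:
zpool scrub sda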
System information
Describe the problem you're observing
The kernel randomly panics; our fleet has experienced 13 panics over 2 weeks since upgrading to ZFS 2.1.0.
Describe how to reproduce the problem
Install zfs-2.1.0-0york2~20.04 on Ubuntu 20.04.3 and run a high load.
Include any warning/errors/backtraces from the system logs