openebs / mayastor

Dynamically provision stateful, persistent, replicated, cluster-wide fabric volumes & filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

Kernel panics with usercopy: Kernel memory exposure attempt detected from page alloc #1693

Open pckroon opened 3 months ago

pckroon commented 3 months ago

Hello world. I'm encountering the following kernel panics after switching from jiva to mayastor, which cause all sorts of chaos on my k8s cluster:

 kernel:[7182326.881293] usercopy: Kernel memory exposure attempt detected from page alloc (offset 2093608, size 18980)!

Message from syslogd@rohan2013 at Jul 15 16:09:39 ...
 kernel:[7182326.932442] Kernel panic - not syncing: Fatal exception

I'm not able to reliably reproduce the issue, which makes debugging harder.

OS info (please complete the following information):

```
$ helm list -n openebs
NAME     NAMESPACE  REVISION  UPDATED                                   STATUS    CHART          APP VERSION
openebs  openebs    2         2024-07-12 14:03:17.931253448 +0200 CEST  deployed  openebs-4.1.0  4.1.0
```


**Additional context**
I'm still testing mayastor, and I'm using file-backed loopback devices on 3 nodes.
<details><summary>kubectl mayastor get pools -n openebs</summary>

```
ID                        DISKS                                                       MANAGED  NODE        STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
rohan2013-hostpath-pool   aio:///dev/loop0?uuid=0cb2997e-39b5-4bbb-a831-5ee245d75e5c  true     rohan2013   Online  2.4TiB    228.7GiB   2.2TiB     1.5TiB
zix-hostpath-pool         aio:///dev/loop0?uuid=36610c71-f54d-4835-a10d-c5b912cf05e2  true     zix         Online  2.4TiB    12.9GiB    2.4TiB     1.9TiB
gondor2013-hostpath-pool  aio:///dev/loop0?uuid=4d1474bf-5a6e-479e-853e-0996dc9c9c53  true     gondor2013  Online  2.4TiB    44.1GiB    2.4TiB     1.7TiB
```

</details>
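For reference, a file-backed loop device of the kind used for these pools can be set up roughly like this (a hedged sketch; the path, size, and variable name are placeholders, not taken from this setup):

```shell
#!/bin/sh
# Placeholder path for the pool's backing file; adjust per node.
BACKING=${BACKING:-./mayastor-backing.img}

# Create a sparse 2TiB backing file (no real disk space is consumed yet).
truncate -s 2T "$BACKING"

# Attach it to the first free loop device (requires root); prints the
# device name, e.g. /dev/loop0, which the pool then references as
# aio:///dev/loop0. Commented out here because it needs root.
# sudo losetup --find --show "$BACKING"
```

The pool spec's `aio://` URI would then point at whichever `/dev/loopN` losetup prints.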

<details><summary>Logs</summary>

```
jul 16 11:31:53 zix kernel: usercopy: Kernel memory exposure attempt detected from page alloc (offset 0, size 16936)!
jul 16 11:31:53 zix kernel: ------------[ cut here ]------------
jul 16 11:31:53 zix kernel: kernel BUG at mm/usercopy.c:101!
jul 16 11:31:53 zix kernel: invalid opcode: 0000 [#1] PREEMPT SMP PTI
jul 16 11:31:53 zix kernel: CPU: 1 PID: 23165 Comm: io-engine Not tainted 6.1.0-22-amd64 #1 Debian 6.1.94-1
jul 16 11:31:53 zix kernel: Hardware name: Dell Inc. PowerEdge R720xd/0W7JN5, BIOS 2.2.2 01/16/2014
jul 16 11:31:53 zix kernel: RIP: 0010:usercopy_abort+0x75/0x77
jul 16 11:31:53 zix kernel: Code: d5 90 51 48 0f 45 d6 48 89 c1 49 c7 c3 28 41 d7 90 41 52 48 c7 c6 e7 94 d5 90 48 c7 c7 c8 40 d7 90 49 0f 45 f3 e8 da 54 ff ff <0f> 0b 48 89 f1 49 89 e8 44 89 e2 31 f6 48 c7 c7 72 41 d7 90 e8 72
jul 16 11:31:53 zix kernel: RSP: 0018:ffffb72eae827770 EFLAGS: 00010246
jul 16 11:31:53 zix kernel: RAX: 0000000000000059 RBX: ffff9ce746ee8000 RCX: 0000000000000000
jul 16 11:31:53 zix kernel: RDX: 0000000000000000 RSI: ffff9d053f8203a0 RDI: ffff9d053f8203a0
jul 16 11:31:53 zix kernel: RBP: 0000000000004228 R08: 0000000000000000 R09: ffffb72eae827608
jul 16 11:31:53 zix kernel: R10: 0000000000000003 R11: ffff9d057ff0f260 R12: 0000000000000001
jul 16 11:31:53 zix kernel: R13: ffff9ce746eec228 R14: 0000000000008117 R15: ffff9ce9261b1700
jul 16 11:31:53 zix kernel: FS: 00007fc0fdce7dc0(0000) GS:ffff9d053f800000(0000) knlGS:0000000000000000
jul 16 11:31:53 zix kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jul 16 11:31:53 zix kernel: CR2: 00007ffd893e0e80 CR3: 0000001284c72003 CR4: 00000000001706e0
jul 16 11:31:53 zix kernel: Call Trace:
jul 16 11:31:53 zix kernel: <TASK>
jul 16 11:31:53 zix kernel: ? __die_body.cold+0x1a/0x1f
jul 16 11:31:53 zix kernel: ? die+0x2a/0x50
jul 16 11:31:53 zix kernel: ? do_trap+0xc5/0x110
jul 16 11:31:53 zix kernel: ? usercopy_abort+0x75/0x77
jul 16 11:31:53 zix kernel: ? do_error_trap+0x6a/0x90
jul 16 11:31:53 zix kernel: ? usercopy_abort+0x75/0x77
jul 16 11:31:53 zix kernel: ? exc_invalid_op+0x4c/0x60
jul 16 11:31:53 zix kernel: ? usercopy_abort+0x75/0x77
jul 16 11:31:53 zix kernel: ? asm_exc_invalid_op+0x16/0x20
jul 16 11:31:53 zix kernel: ? usercopy_abort+0x75/0x77
jul 16 11:31:53 zix kernel: __check_object_size.cold+0x17/0xcb
jul 16 11:31:53 zix kernel: simple_copy_to_iter+0x25/0x40
jul 16 11:31:53 zix kernel: __skb_datagram_iter+0x19e/0x2f0
jul 16 11:31:53 zix kernel: ? skb_free_datagram+0x10/0x10
jul 16 11:31:53 zix kernel: skb_copy_datagram_iter+0x30/0x90
jul 16 11:31:53 zix kernel: tcp_recvmsg_locked+0x5ce/0x940
jul 16 11:31:53 zix kernel: tcp_recvmsg+0x83/0x1f0
jul 16 11:31:53 zix kernel: inet_recvmsg+0x52/0x130
jul 16 11:31:53 zix kernel: sock_read_iter+0x92/0x100
jul 16 11:31:53 zix kernel: do_iter_readv_writev+0x13c/0x150
jul 16 11:31:53 zix kernel: do_iter_read+0xe8/0x1e0
jul 16 11:31:53 zix kernel: vfs_readv+0xa7/0xe0
jul 16 11:31:53 zix kernel: do_readv+0xfa/0x160
jul 16 11:31:53 zix kernel: do_syscall_64+0x55/0xb0
jul 16 11:31:53 zix kernel: ? do_readv+0x117/0x160
jul 16 11:31:53 zix kernel: ? exit_to_user_mode_prepare+0x44/0x1f0
jul 16 11:31:53 zix kernel: ? syscall_exit_to_user_mode+0x1e/0x40
jul 16 11:31:53 zix kernel: ? do_syscall_64+0x61/0xb0
jul 16 11:31:53 zix kernel: ? __x64_sys_epoll_wait+0x6f/0x110
jul 16 11:31:53 zix kernel: ? exit_to_user_mode_prepare+0x44/0x1f0
jul 16 11:31:53 zix kernel: ? syscall_exit_to_user_mode+0x1e/0x40
jul 16 11:31:53 zix kernel: ? do_syscall_64+0x61/0xb0
jul 16 11:31:53 zix kernel: ? __fget_light+0x9d/0x100
jul 16 11:31:53 zix kernel: ? __fget_light+0x9d/0x100
jul 16 11:31:53 zix kernel: ? do_epoll_wait+0xb2/0x7d0
jul 16 11:31:53 zix kernel: ? __x64_sys_epoll_wait+0x6f/0x110
jul 16 11:31:53 zix kernel: ? exit_to_user_mode_prepare+0x44/0x1f0
jul 16 11:31:53 zix kernel: ? syscall_exit_to_user_mode+0x1e/0x40
jul 16 11:31:53 zix kernel: ? do_syscall_64+0x61/0xb0
jul 16 11:31:53 zix kernel: ? exit_to_user_mode_prepare+0x44/0x1f0
jul 16 11:31:53 zix kernel: ? syscall_exit_to_user_mode+0x1e/0x40
jul 16 11:31:53 zix kernel: ? do_syscall_64+0x61/0xb0
jul 16 11:31:53 zix kernel: ? do_syscall_64+0x61/0xb0
jul 16 11:31:53 zix kernel: ? do_syscall_64+0x61/0xb0
jul 16 11:31:53 zix kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
jul 16 11:31:53 zix kernel: RIP: 0033:0x7fc0fde32367
jul 16 11:31:53 zix kernel: Code: 77 51 c3 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b 0d f8 ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 13 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 44 89 c7 48 89 44 24 08 e8 74 0d f8 ff 48
jul 16 11:31:53 zix kernel: RSP: 002b:00007ffff3d5e970 EFLAGS: 00000293 ORIG_RAX: 0000000000000013
jul 16 11:31:53 zix kernel: RAX: ffffffffffffffda RBX: 000000000000024b RCX: 00007fc0fde32367
jul 16 11:31:53 zix kernel: RDX: 0000000000000002 RSI: 00007ffff3d5e9a0 RDI: 000000000000024b
jul 16 11:31:53 zix kernel: RBP: 00007ffff3d5e9a0 R08: 0000000000000000 R09: 0000000000000000
jul 16 11:31:53 zix kernel: R10: 0000000000000080 R11: 0000000000000293 R12: 0000000000000002
jul 16 11:31:53 zix kernel: R13: 00007ffff3d5ea00 R14: 00007ffff3d5e9a0 R15: 0000000000008240
jul 16 11:31:53 zix kernel: </TASK>
jul 16 11:31:53 zix kernel: Modules linked in: tcp_diag udp_diag inet_diag vhost_net vhost vhost_iotlb tap tun blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs iscsi_tcp libiscsi_tcp libis>
jul 16 11:31:53 zix kernel: ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl intel_cstate mgag200 intel_uncore iTCO_wdt intel_pmc_bxt iTCO_vendor_support watchdog mei_me drm>
jul 16 11:31:53 zix kernel: ---[ end trace 0000000000000000 ]---
jul 16 11:31:53 zix kernel: RIP: 0010:usercopy_abort+0x75/0x77
jul 16 11:31:53 zix kernel: Code: d5 90 51 48 0f 45 d6 48 89 c1 49 c7 c3 28 41 d7 90 41 52 48 c7 c6 e7 94 d5 90 48 c7 c7 c8 40 d7 90 49 0f 45 f3 e8 da 54 ff ff <0f> 0b 48 89 f1 49 89 e8 44 89 e2 31 f6 48 c7 c7 72 41 d7 90 e8 72
jul 16 11:31:53 zix kernel: RSP: 0018:ffffb72eae827770 EFLAGS: 00010246
jul 16 11:31:53 zix kernel: RAX: 0000000000000059 RBX: ffff9ce746ee8000 RCX: 0000000000000000
jul 16 11:31:53 zix kernel: RDX: 0000000000000000 RSI: ffff9d053f8203a0 RDI: ffff9d053f8203a0
jul 16 11:31:53 zix kernel: RBP: 0000000000004228 R08: 0000000000000000 R09: ffffb72eae827608
jul 16 11:31:53 zix kernel: R10: 0000000000000003 R11: ffff9d057ff0f260 R12: 0000000000000001
jul 16 11:31:53 zix kernel: R13: ffff9ce746eec228 R14: 0000000000008117 R15: ffff9ce9261b1700
jul 16 11:31:53 zix kernel: FS: 00007fc0fdce7dc0(0000) GS:ffff9d053f800000(0000) knlGS:0000000000000000
jul 16 11:31:53 zix kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jul 16 11:31:53 zix kernel: CR2: 00007ffd893e0e80 CR3: 0000001284c72003 CR4: 00000000001706e0
jul 16 11:31:53 zix kernel: Kernel panic - not syncing: Fatal exception
```

</details>

As I'm writing this it happened again:

```
Message from syslogd@zix at Jul 16 11:44:25 ...
 kernel:[ 480.982446] usercopy: Kernel memory exposure attempt detected from page alloc (offset 20480, size 12816)!
```



Let me know if I can provide more information.
pckroon commented 3 months ago

```
kernel:[ 300.153358] usercopy: Kernel memory exposure attempt detected from page alloc (offset 2076672, size 29224)!
```

~~I'm starting to develop a gut feeling it is related to the VM I'm trying to run using kubevirt.~~ Second guess: it's related to I/O load. When I try to pv-migrate a 1TB volume, it triggers the kernel panic on the node hosting the pv-migrate pod somewhat reliably.

tiagolobocastro commented 3 months ago

Hi @pckroon, we've never seen this; I wonder if it's related to the loop device. Would you mind using the files directly? A mayastor pool can be created with a file directly, without having to set up the loop device.
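A sketch of what a file-backed pool could look like as a DiskPool resource (assumptions on my part: the `openebs.io/v1beta2` API version and the pool/file names are placeholders, not from this thread; check the CRD version your chart ships):

```yaml
apiVersion: openebs.io/v1beta2
kind: DiskPool
metadata:
  name: zix-file-pool        # placeholder name
  namespace: openebs
spec:
  node: zix
  # Point aio:// at the backing file itself instead of a /dev/loopN device.
  disks: ["aio:///var/local/mayastor-backing.img"]
```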

Is there a larger trace from dmesg with more information? Also, this could simply be a kernel bug; would you be able to try a newer kernel version?

pckroon commented 3 months ago

Hello hello! Here's the dmesg traceback, fresh from this morning.

<details><summary>dmesg</summary>

```
[  367.640075] usercopy: Kernel memory exposure attempt detected from page alloc (offset 24576, size 33243)!
[  367.640347] ------------[ cut here ]------------
[  367.640349] kernel BUG at mm/usercopy.c:101!
[  367.640564] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  367.640763] CPU: 1 PID: 37226 Comm: io-engine Not tainted 6.1.0-22-amd64 #1 Debian 6.1.94-1
[  367.640999] Hardware name: Dell Inc. PowerEdge R720xd/0W7JN5, BIOS 2.2.2 01/16/2014
[  367.641243] RIP: 0010:usercopy_abort+0x75/0x77
[  367.641447] Code: 55 8f 51 48 0f 45 d6 48 89 c1 49 c7 c3 28 41 57 8f 41 52 48 c7 c6 e7 94 55 8f 48 c7 c7 c8 40 57 8f 49 0f 45 f3 e8 da 54 ff ff <0f> 0b 48 89 f1 49 89 e8 44 89 e2 31 f6 48 c7 c7 72 41 57 8f e8 72
[  367.641881] RSP: 0018:ffffba87ecc6f7f0 EFLAGS: 00010246
[  367.642135] RAX: 000000000000005d RBX: ffff94e9a9026000 RCX: 0000000000000000
[  367.642382] RDX: 0000000000000000 RSI: ffff95063f8203a0 RDI: ffff95063f8203a0
[  367.642640] RBP: 00000000000081db R08: 0000000000000000 R09: ffffba87ecc6f688
[  367.642883] R10: 0000000000000003 R11: ffff95067ff3ac28 R12: 0000000000000001
[  367.643140] R13: ffff94e9a902e1db R14: 0000000000000065 R15: ffff94e8fdfaaae0
[  367.643393] FS: 00007f7ea71f9dc0(0000) GS:ffff95063f800000(0000) knlGS:0000000000000000
[  367.643687] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  367.644005] CR2: 000000c0011a7fb0 CR3: 00000011a48c4001 CR4: 00000000001706e0
[  367.644262] Call Trace:
[  367.644516] <TASK>
[  367.644785] ? __die_body.cold+0x1a/0x1f
[  367.645085] ? die+0x2a/0x50
[  367.645371] ? do_trap+0xc5/0x110
[  367.645656] ? usercopy_abort+0x75/0x77
[  367.645921] ? do_error_trap+0x6a/0x90
[  367.646193] ? usercopy_abort+0x75/0x77
[  367.646448] ? exc_invalid_op+0x4c/0x60
[  367.646707] ? usercopy_abort+0x75/0x77
[  367.646979] ? asm_exc_invalid_op+0x16/0x20
[  367.647277] ? usercopy_abort+0x75/0x77
[  367.647576] __check_object_size.cold+0x17/0xcb
[  367.647878] simple_copy_to_iter+0x25/0x40
[  367.648168] __skb_datagram_iter+0x19e/0x2f0
[  367.648445] ? skb_free_datagram+0x10/0x10
[  367.648742] skb_copy_datagram_iter+0x30/0x90
[  367.649070] ? sock_read_iter+0x92/0x100
[  367.649372] tcp_recvmsg_locked+0x5ce/0x940
[  367.649669] tcp_recvmsg+0x83/0x1f0
[  367.649964] inet_recvmsg+0x52/0x130
[  367.650240] sock_read_iter+0x92/0x100
[  367.650512] do_iter_readv_writev+0x13c/0x150
[  367.650790] do_iter_read+0xe8/0x1e0
[  367.651068] vfs_readv+0xa7/0xe0
[  367.651347] do_readv+0xfa/0x160
[  367.651643] do_syscall_64+0x55/0xb0
[  367.651950] ? __x64_sys_epoll_wait+0x6f/0x110
[  367.652226] ? exit_to_user_mode_prepare+0x44/0x1f0
[  367.652505] ? syscall_exit_to_user_mode+0x1e/0x40
[  367.652804] ? do_syscall_64+0x61/0xb0
[  367.653110] ? do_epoll_wait+0xb2/0x7d0
[  367.653404] ? __x64_sys_epoll_wait+0x6f/0x110
[  367.653657] ? exit_to_user_mode_prepare+0x44/0x1f0
[  367.653944] ? syscall_exit_to_user_mode+0x1e/0x40
[  367.654208] ? do_syscall_64+0x61/0xb0
[  367.654501] ? syscall_exit_to_user_mode+0x1e/0x40
[  367.654773] ? do_syscall_64+0x61/0xb0
[  367.655000] ? do_syscall_64+0x61/0xb0
[  367.655218] ? syscall_exit_to_user_mode+0x1e/0x40
[  367.655434] ? do_syscall_64+0x61/0xb0
[  367.655665] ? do_syscall_64+0x61/0xb0
[  367.655889] ? do_syscall_64+0x61/0xb0
[  367.656108] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  367.656322] RIP: 0033:0x7f7ea7344367
[  367.656515] Code: 77 51 c3 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 1b 0d f8 ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 13 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 44 89 c7 48 89 44 24 08 e8 74 0d f8 ff 48
[  367.656941] RSP: 002b:00007ffe2879d7a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000013
[  367.657197] RAX: ffffffffffffffda RBX: 0000000000000212 RCX: 00007f7ea7344367
[  367.657457] RDX: 0000000000000002 RSI: 00007ffe2879d7d0 RDI: 0000000000000212
[  367.657674] RBP: 00007ffe2879d7d0 R08: 0000000000000000 R09: 000000000000ffb3
[  367.657904] R10: 0000000000000010 R11: 0000000000000293 R12: 0000000000000002
[  367.658136] R13: 00007ffe2879d830 R14: 00007ffe2879d7d0 R15: 0000000000008240
[  367.658370] </TASK>
[  367.658615] Modules linked in: vhost_net vhost vhost_iotlb tap tun blocklayoutdriver xt_multiport ipt_rpfilter ip_set_hash_net vxlan xfrm_user veth wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel ip6t_REJECT nf_reject_ipv6 nf_conntrack_netlink ipt_REJECT nf_reject_ipv4 xt_mark xt_addrtype xt_MASQUERADE xt_set ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ip ip_set_bitmap_port dummy nft_chain_nat nf_nat ip_vs_rr ip_set ip_vs rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink scsi_transport_iscsi sunrpc binfmt_misc nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 intel_rapl_msr intel_rapl_common ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3
[  367.658676] sha1_ssse3 aesni_intel ipmi_si ipmi_devintf crypto_simd cryptd ipmi_msghandler mei_me iTCO_wdt intel_pmc_bxt dcdbas mei mgag200 rapl iTCO_vendor_support evdev drm_shmem_helper joydev intel_cstate watchdog drm_kms_helper acpi_power_meter pcspkr button sg intel_uncore nvme_tcp nvme_fabrics nvme_core br_netfilter bridge dm_multipath stp llc overlay efi_pstore drm loop fuse configfs efivarfs ip_tables x_tables autofs4 hid_generic usbhid hid xfs libcrc32c crc32c_generic dm_mod sd_mod t10_pi crc64_rocksoft crc64 crc_t10dif crct10dif_generic ixgbe ehci_pci ehci_hcd megaraid_sas crct10dif_pclmul xfrm_algo crct10dif_common crc32_pclmul mdio_devres igb crc32c_intel usbcore scsi_mod libphy lpc_ich i2c_algo_bit dca usb_common scsi_common mdio wmi
[  367.663211] ---[ end trace 0000000000000000 ]---
[  367.668464] RIP: 0010:usercopy_abort+0x75/0x77
[  367.668907] Code: 55 8f 51 48 0f 45 d6 48 89 c1 49 c7 c3 28 41 57 8f 41 52 48 c7 c6 e7 94 55 8f 48 c7 c7 c8 40 57 8f 49 0f 45 f3 e8 da 54 ff ff <0f> 0b 48 89 f1 49 89 e8 44 89 e2 31 f6 48 c7 c7 72 41 57 8f e8 72
[  367.669683] RSP: 0018:ffffba87ecc6f7f0 EFLAGS: 00010246
[  367.670110] RAX: 000000000000005d RBX: ffff94e9a9026000 RCX: 0000000000000000
[  367.670543] RDX: 0000000000000000 RSI: ffff95063f8203a0 RDI: ffff95063f8203a0
[  367.670954] RBP: 00000000000081db R08: 0000000000000000 R09: ffffba87ecc6f688
[  367.671353] R10: 0000000000000003 R11: ffff95067ff3ac28 R12: 0000000000000001
[  367.671793] R13: ffff94e9a902e1db R14: 0000000000000065 R15: ffff94e8fdfaaae0
```

</details>

Note that this is still with the loopback device, I'll see if I can swap things around.

pckroon commented 3 months ago

I updated the kernel to 6.1.0-23, which is the newest available for Debian stable. I tried to delete the existing diskpool for one of the nodes and switch it for a file mounted directly into the io-engine pod (as a HostPath). The existing diskpool gets stuck in Terminating, though. To properly clean up after myself I emptied the backing file, and the io-engine does recognize this: it cannot import the loop-based device (insufficient space, since I removed the loop device), then fails to import the file-backed diskpool, recognizes it's empty, and reinitializes it.

I also noticed that the io-engine spews a lot of `ERROR io_engine::bdev::nvmx::handle:handle.rs:387] I/O completed with PI error` messages right before the kernel panic.

Any more help/advice would be much appreciated!

pckroon commented 3 months ago

Ok, I managed to remove the old diskpool using the instructions in #1656. It seemed a bit more stable, but as soon as I rescaled my postgresql server to 1 replica I got another kernel panic.

Further update: I moved my postgresql data back to the jiva storageclass, and everything seems stable, for now.

Further-further update: running VMs using kubevirt backed by mayastor also makes the systems unstable.

My conclusion for now is that any application that uses hugepages can/will cause a kernel panic on the node it's running on. I'll still try running VMs backed by jiva storage, but that may have to wait until after my holidays.

tiagolobocastro commented 3 months ago

A similar issue has been reported with SPDK. Would you be able to try kernel 6.7? Otherwise, would you be able to share steps to reproduce this so we can try it on our systems?

tiagolobocastro commented 3 months ago

I've tested a pv-migrate of 400GiB volumes without issue on 6.1.87.

pckroon commented 3 months ago

Thanks for investigating! 6.1.0 is the newest kernel for Debian stable, and I'm not eager to switch to a higher version. For me, a surefire way to trigger this seems to be starting the postgresql server when it's backed by a mayastor PVC. I installed it using the bitnami/postgresql helm chart:

```
NAME            NAMESPACE       REVISION        UPDATED                                         STATUS          CHART                   APP VERSION
postgresql      postgresql      11              2024-07-12 17:11:59.889804984 +0200 CEST        deployed        postgresql-15.5.16      16.3.0
```

With the following values:

```yaml
global:
  storageClass: mayastor-3
image:
  tag: 15-debian-12
  debug: true
tls:
  enabled: true
  autoGenerated: true
primary:
  pgHbaConfiguration: |-
    local all all trust
    host all all localhost trust
    host all all 10.0.0.0/8 md5
    hostssl all all 192.168.0.0/16 md5
  extendedConfiguration: |-
    huge_pages = off
  resourcesPreset: "medium"
  networkPolicy:
    enabled: false
  service:
    type: LoadBalancer
    externalTrafficPolicy: Local
  persistence:
    size: 200Gi
volumePermissions:
  enabled: true
```

My kubevirt VMs seem to run stably when backed by a jiva PVC, so my final diagnosis is that a pod which uses hugepages and a mayastor PVC will cause a kernel panic on the k8s node. Whether this is a mayastor bug, kubernetes bug, cgroups issue, or kernel bug I have no clue...
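One quick way to check whether a given node actually has hugepages configured (the standard Linux /proc interface; not a command from this thread) is:

```shell
# Non-zero HugePages_Total means the node has hugepages reserved.
# Note that mayastor's io-engine itself requires 2MiB hugepages, so
# nodes running it will always show some.
grep -E '^(HugePages_|Hugepagesize)' /proc/meminfo
```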

tiagolobocastro commented 3 months ago

@pckroon so you're saying we just need to install the postgresql? We don't even need to run any application using the postgresql?

pckroon commented 3 months ago

I'm not 100% sure, of course, but for me it crashes as soon as the psql database starts. If that doesn't do it, maybe create a small database with some funny data and open a connection?

tiagolobocastro commented 3 months ago

Just tried to set this up but it's failing with:

```
Bus error (core dumped)
```

Probably related to hugepages.

tiagolobocastro commented 3 months ago

Ok got it running:

```yaml
global:
  storageClass: mayastor-nvmf-3
image:
  tag: 15-debian-12
  debug: true
tls:
  enabled: true
  autoGenerated: true
primary:
  extendedConfiguration: |-
    huge_pages = off
  extraVolumeMounts:
    - name: pg-sample-config
      mountPath: /opt/bitnami/postgresql/share/postgresql.conf.sample
      subPath: postgresql.conf.sample
  extraVolumes:
    - configMap:
        name: pg-sample-config
      name: pg-sample-config
  resourcesPreset: "medium"
  networkPolicy:
    enabled: false
  service:
    type: LoadBalancer
    externalTrafficPolicy: Local
  persistence:
    size: 9Gi
extraDeploy:
  - apiVersion: v1
    kind: ConfigMap
    metadata:
      name: pg-sample-config
    data:
      postgresql.conf.sample: |-
        huge_pages = off
volumePermissions:
  enabled: true
```

No crash seen, though this was a smaller volume... This was on Ubuntu 22.04, kernel 6.2.0.

tiagolobocastro commented 3 months ago

Btw similar issue reported in SPDK: https://github.com/spdk/spdk/issues/2993#issuecomment-1619829992

pckroon commented 3 months ago

Thanks again for digging into this. The linked spdk issue seems relevant, but I'm not sure what to do with the information there. Maybe it's just an unlucky combination of Debian (with the hardened usercopy) and the kernel version. Either way, it seems... undesirable that applications that require/want huge pages can't run on mayastor storage.

I'll give postgres a go with the huge_pages configmap, but that'll have to wait until after my holidays I'm afraid. I'll get back to you at the end of August.

tiagolobocastro commented 3 months ago

Yes, at the moment this seems like a kernel bug that we may not be able to fix from our side (other than perhaps trying that configmap). In a future version we may allow running mayastor without hugepages, which would be another kind of solution to this.

That's great, thanks @pckroon enjoy your holidays!

pckroon commented 2 months ago

I hope you had an excellent summer. I can confirm that I can run my psql database backed by mayastor with the configmap you suggested. That said, I'm a little sad that mayastor is not completely application-agnostic. I'm not sure how to proceed from here, though.

pckroon commented 2 months ago

Hello hello! This allows me to run my postgresql server at least, but it seems I have more applications that use hugepages. This issue makes it really hard for me to use mayastor :(

avishnu commented 1 week ago

Investigation scoped for v4.3. This needs to be tested on the specified Debian version with `hardened_usercopy` enabled in the kernel.