ccoager opened 1 year ago
Hello Cory,
Thanx for the report.
Cory Coager:
I'm seeing kernel crashes on 5.15, 6.0 and 6.1. I'm using Raspberry Pi's on NFS netboot with AUFS storage on Crio/Kubernetes.
These logs show:
- you (or a task named "exe", or crio) are unmounting aufs
- aufs issues "sync" for its writable branch filesystems
- the nfs4 branch starts "syncing"
- nfs4 tries acquiring a mutex and stops there
- systemd tries writing a cgroup file under kernfs
- the cgroup file write starts rcu
- rcu detects the "syncing" work is stuck and cannot continue
So the point is why nfs4 stopped at that mutex. I see mutex_lock(&NFS_I(cinfo->inode)->commit_mutex); in v6.1 fs/nfs/write.c:nfs_scan_commit(), and I guess this is the point.
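For reference, the surrounding code looks roughly like this (an abridged sketch of nfs_scan_commit(); the work done under the lock is elided and summarized in a comment, so check the actual v6.1 source for the details):

int
nfs_scan_commit(struct inode *inode, struct list_head *dst,
                struct nfs_commit_info *cinfo)
{
        int ret = 0;

        if (!atomic_long_read(&cinfo->mds->ncommit))
                return 0;
        mutex_lock(&NFS_I(cinfo->inode)->commit_mutex);
        /* ... scan the mds/pnfs commit lists into *dst under the mutex ... */
        mutex_unlock(&NFS_I(cinfo->inode)->commit_mutex);
        return ret;
}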
I don't think anything other than nfs4 touches this commit_mutex, and I believe nfs4 is the thing to investigate first. If you can, try writing a small program to issue syncfs(2) on "/path/to/your/nfs4/branch" and see what happens; a minimal sketch of such a program follows.
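This is just a sketch, assuming the branch path is passed as argv[1] (the program name syncfs-test and the argument handling are only an example):

#define _GNU_SOURCE             /* for the syncfs() declaration in <unistd.h> */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s /path/to/your/nfs4/branch\n", argv[0]);
                return 1;
        }
        /* a read-only fd on the branch directory is enough for syncfs(2) */
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* sync only the filesystem backing this fd, i.e. the nfs4 branch */
        if (syncfs(fd) < 0) {
                perror("syncfs");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}

Compile it (e.g. cc -o syncfs-test syncfs-test.c) and run it against the nfs4 branch to see whether it completes, hangs, or crashes.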
Let me know what other information you need from me.
I want these.
(from aufs README file) When you have any problems or strange behaviour in aufs, please let me know with:
Also it MAY help a little to press MagicSysRq + d, l, and t. It is just a little help since you already gathered the stack traces, but 'd' may be able to help us.
J. R. Okajima
Cory Coager:
I'm seeing kernel crashes on 5.15, 6.0 and 6.1. I'm using Raspberry Pi's on NFS netboot with AUFS storage on Crio/Kubernetes.
I might not get the point correctly. Do you mean the first bad thing is this, and all stack traces showed up much later?
[Thu Dec 8 19:12:39 2022] Unable to handle kernel paging request at virtual address 0001010201019004
[Thu Dec 8 19:12:39 2022] Mem abort info:
[Thu Dec 8 19:12:39 2022] ESR = 0x0000000096000004
[Thu Dec 8 19:12:39 2022] EC = 0x25: DABT (current EL), IL = 32 bits
[Thu Dec 8 19:12:39 2022] SET = 0, FnV = 0
[Thu Dec 8 19:12:39 2022] EA = 0, S1PTW = 0
[Thu Dec 8 19:12:39 2022] FSC = 0x04: level 0 translation fault
[Thu Dec 8 19:12:39 2022] Data abort info:
[Thu Dec 8 19:12:39 2022] ISV = 0, ISS = 0x00000004
[Thu Dec 8 19:12:39 2022] CM = 0, WnR = 0
[Thu Dec 8 19:12:39 2022] [0001010201019004] address between user and kernel address ranges
[Thu Dec 8 19:12:39 2022] Internal error: Oops: 96000004 [#1] PREEMPT SMP
Then the important thing is this paging issue rather than the stack traces. Who (I mean which process or what operation) triggered this log?
J. R. Okajima
Caught a crash tonight...
[Tue Jan 24 19:21:10 2023] Unable to handle kernel paging request at virtual address 000000a2b9400394
[Tue Jan 24 19:21:10 2023] Mem abort info:
[Tue Jan 24 19:21:10 2023] ESR = 0x0000000096000004
[Tue Jan 24 19:21:10 2023] EC = 0x25: DABT (current EL), IL = 32 bits
[Tue Jan 24 19:21:10 2023] SET = 0, FnV = 0
[Tue Jan 24 19:21:10 2023] EA = 0, S1PTW = 0
[Tue Jan 24 19:21:10 2023] FSC = 0x04: level 0 translation fault
[Tue Jan 24 19:21:10 2023] Data abort info:
[Tue Jan 24 19:21:10 2023] ISV = 0, ISS = 0x00000004
[Tue Jan 24 19:21:10 2023] CM = 0, WnR = 0
[Tue Jan 24 19:21:10 2023] [000000a2b9400394] address between user and kernel address ranges
[Tue Jan 24 19:21:10 2023] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[Tue Jan 24 19:21:10 2023] Modules linked in: xt_multiport xt_set ipt_rpfilter ip_set_hash_ip ip_set_hash_net ipip tunnel4 ip_tunnel wireguard libchacha20poly1305 chacha_neon poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha veth nf_conntrack_netlink xt_statistic xt_nat xt_addrtype ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs rpcsec_gss_krb5 xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 binfmt_misc xt_comment nft_compat nf_tables nfnetlink aufs vc4 snd_soc_hdmi_codec drm_display_helper cec drm_dma_helper joydev drm_kms_helper snd_soc_core brcmfmac brcmutil snd_compress snd_bcm2835(C) snd_pcm_dmaengine snd_pcm cfg80211 snd_timer rfkill rpivid_hevc(C) bcm2835_codec(C) bcm2835_isp(C) bcm2835_v4l2(C) bcm2835_mmal_vchiq(C) v4l2_mem2mem v3d videobuf2_dma_contig snd videobuf2_vmalloc videobuf2_memops pwm_raspberrypi_poe videobuf2_v4l2 raspberrypi_hwmon videobuf2_common vc_sm_cma(C) syscopyarea gpu_sched
[Tue Jan 24 19:21:10 2023] sysfillrect videodev sysimgblt mc fb_sys_fops drm_shmem_helper uio_pdrv_genirq nvmem_rmem uio pwm_fan sch_fq_codel br_netfilter bridge stp llc drm fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6 spidev i2c_bcm2835 spi_bcm2835
[Tue Jan 24 19:21:10 2023] CPU: 3 PID: 47 Comm: kworker/u8:2 Tainted: G C 6.1.5-v8+ #8
[Tue Jan 24 19:21:10 2023] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[Tue Jan 24 19:21:10 2023] Workqueue: writeback wb_workfn (flush-0:28)
[Tue Jan 24 19:21:10 2023] pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[Tue Jan 24 19:21:10 2023] pc : __mutex_lock.constprop.0+0x80/0x604
[Tue Jan 24 19:21:10 2023] lr : __mutex_lock.constprop.0+0x48/0x604
[Tue Jan 24 19:21:10 2023] sp : ffffffc0085bb970
[Tue Jan 24 19:21:10 2023] x29: ffffffc0085bb970 x28: ffffff81026c4460 x27: 7fffffffffffffff
[Tue Jan 24 19:21:10 2023] x26: ffffff8164674108 x25: 0000000000000000 x24: 000000007fffffff
[Tue Jan 24 19:21:10 2023] x23: 0000000000000000 x22: ffffff8164674080 x21: 0000000000000002
[Tue Jan 24 19:21:10 2023] x20: ffffffc0085bba70 x19: ffffff8164673fe8 x18: 00000000dc6ea76a
[Tue Jan 24 19:21:10 2023] x17: 0000000000000001 x16: 0000a666386d735e x15: 022bf2782e921fd2
[Tue Jan 24 19:21:10 2023] x14: ffffffdce780bbe8 x13: 000000000204f835 x12: 0000000000000000
[Tue Jan 24 19:21:10 2023] x11: ffffffffffffffc0 x10: 0000000000000001 x9 : ffffffdce77d232c
[Tue Jan 24 19:21:10 2023] x8 : ffffffc0085bba48 x7 : 0000000000000000 x6 : 0000000000000000
[Tue Jan 24 19:21:10 2023] x5 : ffffffc0085bba70 x4 : ffffff8100b70000 x3 : 350000a2b9400360
[Tue Jan 24 19:21:10 2023] x2 : ffffff8100b70000 x1 : 350000a2b9400362 x0 : 350000a2b9400360
[Tue Jan 24 19:21:10 2023] Call trace:
[Tue Jan 24 19:21:10 2023] __mutex_lock.constprop.0+0x80/0x604
[Tue Jan 24 19:21:10 2023] __mutex_lock_slowpath+0x1c/0x30
[Tue Jan 24 19:21:10 2023] mutex_lock+0x60/0x70
[Tue Jan 24 19:21:10 2023] nfs_scan_commit.part.0.isra.0+0x2c/0xc4
[Tue Jan 24 19:21:10 2023] __nfs_commit_inode+0xa8/0x1a0
[Tue Jan 24 19:21:10 2023] nfs_write_inode+0x44/0x9c
[Tue Jan 24 19:21:10 2023] nfs4_write_inode+0x24/0x60
[Tue Jan 24 19:21:10 2023] __writeback_single_inode+0x368/0x4b0
[Tue Jan 24 19:21:10 2023] writeback_sb_inodes+0x214/0x49c
[Tue Jan 24 19:21:10 2023] wb_writeback+0xf4/0x3b0
[Tue Jan 24 19:21:10 2023] wb_workfn+0xec/0x5b4
[Tue Jan 24 19:21:10 2023] process_one_work+0x1dc/0x450
[Tue Jan 24 19:21:10 2023] worker_thread+0x154/0x450
[Tue Jan 24 19:21:10 2023] kthread+0x104/0x110
[Tue Jan 24 19:21:10 2023] ret_from_fork+0x10/0x20
[Tue Jan 24 19:21:10 2023] Code: 540000c1 f9400260 f27df000 54001f20 (b9403401)
[Tue Jan 24 19:21:10 2023] ---[ end trace 0000000000000000 ]---
[Tue Jan 24 19:21:10 2023] note: kworker/u8:2[47] exited with preempt_count 1
[Tue Jan 24 19:21:10 2023] ------------[ cut here ]------------
[Tue Jan 24 19:21:10 2023] WARNING: CPU: 0 PID: 47 at kernel/exit.c:765 do_exit+0x828/0x9e0
[Tue Jan 24 19:21:10 2023] Modules linked in: xt_multiport xt_set ipt_rpfilter ip_set_hash_ip ip_set_hash_net ipip tunnel4 ip_tunnel wireguard libchacha20poly1305 chacha_neon poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha veth nf_conntrack_netlink xt_statistic xt_nat xt_addrtype ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs rpcsec_gss_krb5 xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 binfmt_misc xt_comment nft_compat nf_tables nfnetlink aufs vc4 snd_soc_hdmi_codec drm_display_helper cec drm_dma_helper joydev drm_kms_helper snd_soc_core brcmfmac brcmutil snd_compress snd_bcm2835(C) snd_pcm_dmaengine snd_pcm cfg80211 snd_timer rfkill rpivid_hevc(C) bcm2835_codec(C) bcm2835_isp(C) bcm2835_v4l2(C) bcm2835_mmal_vchiq(C) v4l2_mem2mem v3d videobuf2_dma_contig snd videobuf2_vmalloc videobuf2_memops pwm_raspberrypi_poe videobuf2_v4l2 raspberrypi_hwmon videobuf2_common vc_sm_cma(C) syscopyarea gpu_sched
[Tue Jan 24 19:21:10 2023] sysfillrect videodev sysimgblt mc fb_sys_fops drm_shmem_helper uio_pdrv_genirq nvmem_rmem uio pwm_fan sch_fq_codel br_netfilter bridge stp llc drm fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6 spidev i2c_bcm2835 spi_bcm2835
[Tue Jan 24 19:21:11 2023] CPU: 0 PID: 47 Comm: kworker/u8:2 Tainted: G D C 6.1.5-v8+ #8
[Tue Jan 24 19:21:11 2023] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[Tue Jan 24 19:21:11 2023] Workqueue: writeback wb_workfn (flush-0:28)
[Tue Jan 24 19:21:11 2023] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[Tue Jan 24 19:21:11 2023] pc : do_exit+0x828/0x9e0
[Tue Jan 24 19:21:11 2023] lr : do_exit+0x64/0x9e0
[Tue Jan 24 19:21:11 2023] sp : ffffffc0085bb5b0
[Tue Jan 24 19:21:11 2023] x29: ffffffc0085bb5b0 x28: ffffffc0085bb6c3 x27: ffffffdce7a468a0
[Tue Jan 24 19:21:11 2023] x26: ffffffdce7a46898 x25: 0000000000000001 x24: ffffffdce77d2374
[Tue Jan 24 19:21:11 2023] x23: 0000000000000000 x22: 000000000000000b x21: ffffff8100b78000
[Tue Jan 24 19:21:11 2023] x20: ffffff8100b80000 x19: ffffff8100b70000 x18: 0000000000000000
[Tue Jan 24 19:21:11 2023] x17: 0000000000000000 x16: 0000000000000000 x15: ffffff81fefd10c0
[Tue Jan 24 19:21:11 2023] x14: 0000000000000000 x13: 0000000000000001 x12: 0000000000000000
[Tue Jan 24 19:21:11 2023] x11: 0000000000000000 x10: 0000000000001a60 x9 : ffffffdce77d8018
[Tue Jan 24 19:21:11 2023] x8 : ffffffc0085bb3e8 x7 : 0000000000000000 x6 : 0000000e31c6e800
[Tue Jan 24 19:21:11 2023] x5 : ffffffdce7fb9000 x4 : ffffffdce7fb90f0 x3 : 0000000000000000
[Tue Jan 24 19:21:11 2023] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffffc0085bbca0
[Tue Jan 24 19:21:11 2023] Call trace:
[Tue Jan 24 19:21:11 2023] do_exit+0x828/0x9e0
[Tue Jan 24 19:21:11 2023] make_task_dead+0x64/0x110
[Tue Jan 24 19:21:11 2023] die+0x1dc/0x218
[Tue Jan 24 19:21:11 2023] die_kernel_fault+0x280/0x338
[Tue Jan 24 19:21:11 2023] __do_kernel_fault+0x120/0x1c0
[Tue Jan 24 19:21:11 2023] do_translation_fault+0x58/0xdc
[Tue Jan 24 19:21:11 2023] do_mem_abort+0x4c/0xa0
[Tue Jan 24 19:21:11 2023] el1_abort+0x44/0x74
[Tue Jan 24 19:21:11 2023] el1h_64_sync_handler+0xd8/0xe4
[Tue Jan 24 19:21:11 2023] el1h_64_sync+0x64/0x68
[Tue Jan 24 19:21:11 2023] __mutex_lock.constprop.0+0x80/0x604
[Tue Jan 24 19:21:11 2023] __mutex_lock_slowpath+0x1c/0x30
[Tue Jan 24 19:21:11 2023] mutex_lock+0x60/0x70
[Tue Jan 24 19:21:11 2023] nfs_scan_commit.part.0.isra.0+0x2c/0xc4
[Tue Jan 24 19:21:11 2023] __nfs_commit_inode+0xa8/0x1a0
[Tue Jan 24 19:21:11 2023] nfs_write_inode+0x44/0x9c
[Tue Jan 24 19:21:11 2023] nfs4_write_inode+0x24/0x60
[Tue Jan 24 19:21:11 2023] __writeback_single_inode+0x368/0x4b0
[Tue Jan 24 19:21:11 2023] writeback_sb_inodes+0x214/0x49c
[Tue Jan 24 19:21:11 2023] wb_writeback+0xf4/0x3b0
[Tue Jan 24 19:21:11 2023] wb_workfn+0xec/0x5b4
[Tue Jan 24 19:21:11 2023] process_one_work+0x1dc/0x450
[Tue Jan 24 19:21:11 2023] worker_thread+0x154/0x450
[Tue Jan 24 19:21:11 2023] kthread+0x104/0x110
[Tue Jan 24 19:21:11 2023] ret_from_fork+0x10/0x20
[Tue Jan 24 19:21:11 2023] ---[ end trace 0000000000000000 ]---
show-backtrace-all-active-cpus:
[Tue Jan 24 19:33:12 2023] sysrq: Show backtrace of all active CPUs
[Tue Jan 24 19:33:12 2023] BUG: using smp_processor_id() in preemptible [00000000] code: bash/2263
[Tue Jan 24 19:33:12 2023] caller is debug_smp_processor_id+0x20/0x2c
[Tue Jan 24 19:33:12 2023] CPU: 2 PID: 2263 Comm: bash Tainted: G D WC 6.1.5-v8+ #8
[Tue Jan 24 19:33:12 2023] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
[Tue Jan 24 19:33:12 2023] Call trace:
[Tue Jan 24 19:33:12 2023] dump_backtrace.part.0+0xe8/0xf4
[Tue Jan 24 19:33:12 2023] show_stack+0x20/0x30
[Tue Jan 24 19:33:12 2023] dump_stack_lvl+0x8c/0xb8
[Tue Jan 24 19:33:12 2023] dump_stack+0x18/0x34
[Tue Jan 24 19:33:12 2023] check_preemption_disabled+0x118/0x124
[Tue Jan 24 19:33:12 2023] debug_smp_processor_id+0x20/0x2c
[Tue Jan 24 19:33:12 2023] sysrq_handle_showallcpus+0x28/0xc4
[Tue Jan 24 19:33:12 2023] __handle_sysrq+0x94/0x1a0
[Tue Jan 24 19:33:12 2023] write_sysrq_trigger+0x7c/0xa0
[Tue Jan 24 19:33:12 2023] proc_reg_write+0xac/0x100
[Tue Jan 24 19:33:12 2023] vfs_write+0xd8/0x35c
[Tue Jan 24 19:33:12 2023] ksys_write+0x70/0x100
[Tue Jan 24 19:33:12 2023] __arm64_sys_write+0x24/0x30
[Tue Jan 24 19:33:12 2023] invoke_syscall+0x50/0x120
[Tue Jan 24 19:33:12 2023] el0_svc_common.constprop.0+0x68/0x124
[Tue Jan 24 19:33:12 2023] do_el0_svc+0x38/0xdc
[Tue Jan 24 19:33:12 2023] el0_svc+0x30/0x94
[Tue Jan 24 19:33:12 2023] el0t_64_sync_handler+0xbc/0x13c
[Tue Jan 24 19:33:12 2023] el0t_64_sync+0x18c/0x190
[Tue Jan 24 19:33:12 2023] sysrq: CPU2:
[Tue Jan 24 19:33:12 2023] Call trace:
[Tue Jan 24 19:33:12 2023] dump_backtrace.part.0+0xe8/0xf4
[Tue Jan 24 19:33:12 2023] show_stack+0x20/0x30
[Tue Jan 24 19:33:12 2023] sysrq_handle_showallcpus+0x4c/0xc4
[Tue Jan 24 19:33:12 2023] __handle_sysrq+0x94/0x1a0
[Tue Jan 24 19:33:12 2023] write_sysrq_trigger+0x7c/0xa0
[Tue Jan 24 19:33:12 2023] proc_reg_write+0xac/0x100
[Tue Jan 24 19:33:12 2023] vfs_write+0xd8/0x35c
[Tue Jan 24 19:33:12 2023] ksys_write+0x70/0x100
[Tue Jan 24 19:33:12 2023] __arm64_sys_write+0x24/0x30
[Tue Jan 24 19:33:12 2023] invoke_syscall+0x50/0x120
[Tue Jan 24 19:33:12 2023] el0_svc_common.constprop.0+0x68/0x124
[Tue Jan 24 19:33:12 2023] do_el0_svc+0x38/0xdc
[Tue Jan 24 19:33:12 2023] el0_svc+0x30/0x94
[Tue Jan 24 19:33:12 2023] el0t_64_sync_handler+0xbc/0x13c
[Tue Jan 24 19:33:12 2023] el0t_64_sync+0x18c/0x190
[Tue Jan 24 19:33:12 2023] sysrq: CPU1:
[Tue Jan 24 19:33:12 2023] sysrq: CPU3: backtrace skipped as idling
[Tue Jan 24 19:33:12 2023] Call trace:
[Tue Jan 24 19:33:12 2023] dump_backtrace.part.0+0xe8/0xf4
[Tue Jan 24 19:33:12 2023] show_stack+0x20/0x30
[Tue Jan 24 19:33:12 2023] showacpu+0x60/0x9c
[Tue Jan 24 19:33:12 2023] __flush_smp_call_function_queue+0xe4/0x260
[Tue Jan 24 19:33:12 2023] generic_smp_call_function_single_interrupt+0x1c/0x30
[Tue Jan 24 19:33:12 2023] ipi_handler+0x98/0x300
[Tue Jan 24 19:33:12 2023] handle_percpu_devid_irq+0xac/0x240
[Tue Jan 24 19:33:12 2023] generic_handle_domain_irq+0x34/0x50
[Tue Jan 24 19:33:12 2023] gic_handle_irq+0x4c/0xe0
[Tue Jan 24 19:33:12 2023] call_on_irq_stack+0x2c/0x60
[Tue Jan 24 19:33:12 2023] do_interrupt_handler+0xdc/0xe0
[Tue Jan 24 19:33:12 2023] el0_interrupt+0x50/0x100
[Tue Jan 24 19:33:12 2023] __el0_irq_handler_common+0x18/0x24
[Tue Jan 24 19:33:12 2023] el0t_64_irq_handler+0x10/0x20
[Tue Jan 24 19:33:12 2023] el0t_64_irq+0x18c/0x190
[Tue Jan 24 19:33:12 2023] sysrq: CPU0:
[Tue Jan 24 19:33:12 2023] Call trace:
[Tue Jan 24 19:33:12 2023] dump_backtrace.part.0+0xe8/0xf4
[Tue Jan 24 19:33:12 2023] show_stack+0x20/0x30
[Tue Jan 24 19:33:12 2023] showacpu+0x60/0x9c
[Tue Jan 24 19:33:12 2023] __flush_smp_call_function_queue+0xe4/0x260
[Tue Jan 24 19:33:12 2023] generic_smp_call_function_single_interrupt+0x1c/0x30
[Tue Jan 24 19:33:12 2023] ipi_handler+0x98/0x300
[Tue Jan 24 19:33:12 2023] handle_percpu_devid_irq+0xac/0x240
[Tue Jan 24 19:33:12 2023] generic_handle_domain_irq+0x34/0x50
[Tue Jan 24 19:33:12 2023] gic_handle_irq+0x4c/0xe0
[Tue Jan 24 19:33:12 2023] call_on_irq_stack+0x2c/0x60
[Tue Jan 24 19:33:12 2023] do_interrupt_handler+0xdc/0xe0
[Tue Jan 24 19:33:12 2023] el0_interrupt+0x50/0x100
[Tue Jan 24 19:33:12 2023] __el0_irq_handler_common+0x18/0x24
[Tue Jan 24 19:33:12 2023] el0t_64_irq_handler+0x10/0x20
[Tue Jan 24 19:33:12 2023] el0t_64_irq+0x18c/0x190
show-task-states: show-task-states.txt
I might not get the point correctly. Do you mean the first bad thing is this, and all stack traces showed up much later?
The "Unable to handle kernel paging request at virtual address" is a kernel crash and triggers a reboot. The crashes are intermittent but happen often enough that I can reproduce it. I have 6 RPi's and they are all crashing. The crashes seem to happen when the kubelet service first starts and there is a chance that it crashes soon after or doesn't crash at all.
Currently I am using kernel 6.1.5-v8+ built from Git source: https://github.com/raspberrypi/linux
root@artemis:~# uname -a
Linux artemis 6.1.5-v8+ #8 SMP PREEMPT Sat Jan 14 02:27:34 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
root@artemis:~# modinfo aufs
filename: /lib/modules/6.1.5-v8+/kernel/fs/aufs/aufs.ko.xz
alias: fs-aufs
version: 6.1-20230109
description: aufs -- Advanced multi layered unification filesystem
author: Junjiro R. Okajima <aufs-users@lists.sourceforge.net>
license: GPL
srcversion: A880D04257372558DC0FB9B
depends:
intree: Y
name: aufs
vermagic: 6.1.5-v8+ SMP preempt modunload modversions aarch64
parm: brs:use
No LSM installed.
I didn't see a Magic Sysrq (d) command.
Let me know if I forgot anything or if you need anything else.
Cory Coager:
Currently I am using kernel 6.1.5-v8+ built from Git source: https://github.com/raspberrypi/linux
There are many branches and tags in that repository. Which one are you using?
And thanx for uploading many files.
J. R. Okajima
Cory Coager:
Let me know if I forgot anything or if you need anything else.
If you can, try writing a small program to issue 'syncfs("/path/to/your/nfs4/branch")' and see what will happen.
J. R. Okajima
There are many branches and tags in that repository.
Which one are you using?
Branch rpi-6.1.y
Cory Coager:
Branch rpi-6.1.y
Ok, I've git-pull-ed. The latest rpi-6.1.y is linux-v6.1.8 and yours is v6.1.5. Assuming this difference is not a big deal, I took a glance at nfs4 files and found nothing suspicious.
Unfortunately I cannot use nfs4 on my test environment. I don't know why but nfs4 stopped working since around linux-v5.17.
Honestly speaking, I have nothing left to investigate. If there is one thing left, it is to apply this patch and run syncfs(2) on "/path/to/your/nfs4/branch" BEFORE mounting aufs.
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index f41d24b54fd1..f2b42ca1020a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1073,6 +1073,7 @@ nfs_scan_commit(struct inode *inode, struct list_head *dst,
 	if (!atomic_long_read(&cinfo->mds->ncommit))
 		return 0;
J. R. Okajima
Unfortunately I cannot use nfs4 on my test environment. I don't know why
but nfs4 stopped working since around linux-v5.17.
The website says 6.x+ is "supported and fully tested" so I assumed NFSv4 was working. Is NFSv3 supposed to work? Because I am getting the same kernel Oops on NFSv3 (with a slightly different message).
Cory Coager:
The website says 6.x+ is "supported and fully tested" so I assumed NFSv4 was working. Is NFSv3 supposed to work? Because I am getting the same kernel Oops on NFSv3 (with a slightly different message).
I should have updated it, sorry. As you see, it says "SIGIO and nfs4 don't work expectedly" since v5.18 and some v5.1[05] stable versions. For nfs3, I can test on my side. How does the message differ?
J. R. Okajima
For nfs3, I can test on my side.
How does the message differ?
The message is basically the same but doesn't mention nfs4.
I had the same kernel Oops on kernel 5.15 in my previous testing. Where should I go from here? Try the patch?
Cory Coager:
I had the same kernel Oops on kernel 5.15 in my previous testing. Where should I go from here? Try the patch?
The patch I sent doesn't fix anything; it just prints a debug message when nfs[34] misbehaves. But the patch plus syncfs(2) on your nfs branch is what I want to try first.
The writable nfs3 branch always passed my local tests. But I will try rpi-v6.1.y with nfs3 tomorrow.
J. R. Okajima
"J. R. Okajima":
The writable nfs3 branch always passed my local tests. But I will try rpi-v6.1.y with nfs3 tomorrow.
I tried.
- plain linux-v6.1.8 + aufs6.1 + rpi-v6.1.y (9002db7115e4): the kernel didn't boot on my desktop x86_64. It waited for an unknown device to become ready, so I simply dropped rpi-v6.1.y.
- plain linux-v6.1.8 + aufs6.1 + nfs3 (both RO and RW branches): it passed all of my tests, including syncfs(2). Technically speaking it was not quite 'all', since a few tests were skipped: those tests need to modify some files on the RO branch, which means modifying files on the nfs server, and my test environment doesn't support that for the RO-branch nfs server.
So the likelihood that rpi-v6.1.y affects the behaviour of nfs seems high, but I couldn't find any suspicious code in rpi-v6.1.y. If you try syncfs(2) on your nfs branch, it would be good evidence that you had better ask the nfs and rpi people.
J. R. Okajima
I recompiled the kernel (6.1.8-v8+) with the latest changes and the NFS patch you gave me. I did a sync before starting the service. Here are the relevant logs. Let me know if you need anything else.
kernel-oops-sysrq.txt
config-6.1.8-v8+.txt
proc-mounts.txt
sys-module-aufs.txt
sys-fs-aufs.txt
Cory Coager:
I recompiled the kernel (6.1.8-v8+) with the latest changes and the NFS patch you gave me. I did a sync before starting the service. Here are the relevant logs. Let me know if you need anything else.
Thanx for the test, but the patch I sent should not be called an NFS patch. It was just there to produce a warning when NFS handles an inode incorrectly. And you got the warning:
[Thu Jan 26 09:13:28 2023] ------------[ cut here ]------------
[Thu Jan 26 09:13:28 2023] WARNING: CPU: 3 PID: 216 at fs/nfs/write.c:1076 nfs_scan_commit.part.0.isra.0+0xbc/0xc4
So I'd strongly suggest you consult the NFS or Raspberry Pi people. It looks like aufs is unrelated.
J. R. Okajima