Closed JuliaVixen closed 5 years ago
P.S. I do have ECC memory in this computer.
@JuliaVixen
Please do objdump -d zfs.ko
, find dmu_buf_get_blkptr
and post it. You need to post what ever covered under the range of dmu_buf_get_blkptr+0x8b
This exact same bug just occurred again, under similar conditions as the first time. So, I guess it's reproducible. Here's the dump, 0x947E plus 0x71 is 94ef: 8b 40 38 mov 0x38(%rax),%eax
which makes sense if RAX is zero.
000000000000947e <dmu_buf_get_blkptr>:
947e: 55 push %rbp
947f: 48 8b 47 48 mov 0x48(%rdi),%rax
9483: 48 89 e5 mov %rsp,%rbp
9486: 5d pop %rbp
9487: c3 retq
9488: 55 push %rbp
9489: 48 89 e5 mov %rsp,%rbp
948c: 41 54 push %r12
948e: 49 89 fc mov %rdi,%r12
9491: 49 89 b4 24 a0 00 00 mov %rsi,0xa0(%r12)
9498: 00
9499: 53 push %rbx
949a: 48 8b 46 48 mov 0x48(%rsi),%rax
949e: 48 89 f3 mov %rsi,%rbx
94a1: 48 89 47 18 mov %rax,0x18(%rdi)
94a5: 48 89 f7 mov %rsi,%rdi
94a8: e8 00 00 00 00 callq 94ad <dmu_buf_get_blkptr+0x2f>
94ad: 85 c0 test %eax,%eax
94af: 75 12 jne 94c3 <dmu_buf_get_blkptr+0x45>
94b1: 4c 89 e2 mov %r12,%rdx
94b4: 48 c7 c6 00 00 00 00 mov $0x0,%rsi
94bb: 48 89 df mov %rbx,%rdi
94be: e8 00 00 00 00 callq 94c3 <dmu_buf_get_blkptr+0x45>
94c3: 5b pop %rbx
94c4: 41 5c pop %r12
94c6: 5d pop %rbp
94c7: c3 retq
94c8: 55 push %rbp
94c9: 48 89 e5 mov %rsp,%rbp
94cc: 41 54 push %r12
94ce: 49 89 d4 mov %rdx,%r12
94d1: 53 push %rbx
94d2: 48 89 fb mov %rdi,%rbx
94d5: 48 8b 7a 20 mov 0x20(%rdx),%rdi
94d9: e8 00 00 00 00 callq 94de <dmu_buf_get_blkptr+0x60>
94de: 48 89 c7 mov %rax,%rdi
94e1: 49 8b 84 24 f0 00 00 mov 0xf0(%r12),%rax
94e8: 00
94e9: 31 d2 xor %edx,%edx
94eb: 48 8b 4b 50 mov 0x50(%rbx),%rcx
94ef: 8b 40 38 mov 0x38(%rax),%eax
94f2: 48 f7 b3 f0 02 00 00 divq 0x2f0(%rbx)
94f9: 48 89 ca mov %rcx,%rdx
94fc: 48 63 f0 movslq %eax,%rsi
94ff: e8 00 00 00 00 callq 9504 <dmu_buf_get_blkptr+0x86>
9504: 5b pop %rbx
9505: 41 5c pop %r12
9507: 5d pop %rbp
9508: c3 retq
Here is the quite nearly identical crash dump from a few minutes ago....
[177048.712082] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
[177048.712146] IP: [<ffffffffa10714ef>] dmu_buf_get_blkptr+0x71/0x8b [zfs]
[177048.712192] PGD a441d067 PUD a9097067 PMD 0
[177048.712229] Oops: 0000 [#1] SMP
[177048.712261] Modules linked in: nfsd ipv6 binfmt_misc zfs(PO) zunicode(PO) x86_pkg_temp_thermal zcommon(PO) coretemp znvpair(PO) kvm_intel spl(O) zavl(PO) zlib_deflate kvm irqbypass input_leds i2c_i801 joydev i2c_core pcspkr e1000e acpi_cpufreq thermal video fan backlight battery processor button xts gf128mul aes_x86_64 cbc sha512_generic sha256_generic sha1_generic libiscsi scsi_transport_iscsi tg3 libphy e1000 fuse nfs lockd grace sunrpc jfs multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod hid_sunplus hid_sony hid_samsung hid_pl hid_petalynx hid_monterey hid_microsoft hid_logitech hid_gyration hid_ezkey hid_cypress hid_chicony hid_cherry hid_belkin hid_apple hid_a4tech
[177048.712835] sl811_hcd xhci_plat_hcd ohci_pci ohci_hcd uhci_hcd aic94xx libsas lpfc crc_t10dif crct10dif_common qla2xxx megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid sx8 DAC960 cciss 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc mptspi mptscsih mptbase atp870u dc395x qla1280 imm parport dmx3191d sym53c8xx gdth advansys initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma sata_inic162x sata_mv ata_piix sata_qstor sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promise pata_sl82c105 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pata_it8213 pata_pcmcia pcmcia pcmcia_core pata_ns87415 pata_ns87410 pata_serverworks pata_artop pata_it821x
[177048.713380] pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix led_class usbhid usb_storage xhci_pci ehci_pci xhci_hcd ahci ehci_hcd usbcore ptp libahci pps_core usb_common libata
[177048.713612] CPU: 1 PID: 4766 Comm: z_wr_int_0 Tainted: P O 4.4.6-gentoo #1
[177048.713649] Hardware name: Supermicro X10SLL-F/X10SLL-SF/X10SLL-F/X10SLL-SF, BIOS 1.0a 06/11/2013
[177048.713706] task: ffff880805d7e800 ti: ffff88080a8cc000 task.ti: ffff88080a8cc000
[177048.713741] RIP: 0010:[<ffffffffa10714ef>] [<ffffffffa10714ef>] dmu_buf_get_blkptr+0x71/0x8b [zfs]
[177048.713803] RSP: 0018:ffff88080a8cfd18 EFLAGS: 00010246
[177048.713833] RAX: 0000000000000000 RBX: ffff8807084c2d60 RCX: 0000000000b60c0d
[177048.713867] RDX: 0000000000000000 RSI: ffff880787531630 RDI: ffff8800c6c12800
[177048.713902] RBP: ffff88080a8cfd28 R08: 0000000000000010 R09: 0000000000000010
[177048.713936] R10: 00000000005294ac R11: ffff8804c453f030 R12: ffff8807c50362e0
[177048.713970] R13: 0000000000000100 R14: 0000000000000000 R15: ffff880076c48de0
[177048.714005] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
[177048.714041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177048.714072] CR2: 0000000000000038 CR3: 000000009ad53000 CR4: 00000000001406e0
[177048.714107] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[177048.714141] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[177048.714175] Stack:
[177048.714200] ffff8806f8039ab8 ffff880809db8000 ffff88080a8cfd38 ffffffffa1068189
[177048.714246] ffff88080a8cfd68 ffffffffa10f6bc4 00000000005294ac ffff8806f8039ab8
[177048.714292] 0000000000080000 ffff880805d7e800 ffff88080a8cfde8 ffffffffa10fb7bb
[177048.714338] Call Trace:
[177048.714366] [<ffffffffa1068189>] arc_buf_size+0xd3/0x77d [zfs]
[177048.714404] [<ffffffffa10f6bc4>] zio_buf_free+0x3d9/0x8cc [zfs]
[177048.714443] [<ffffffffa10fb7bb>] zio_worst_error+0xd66/0xe73 [zfs]
[177048.714478] [<ffffffff814d1258>] ? mutex_unlock+0x9/0xb
[177048.714514] [<ffffffffa10f652a>] ? zil_fini+0x2ec/0x553 [zfs]
[177048.714552] [<ffffffffa10f7b5b>] zio_execute+0x99/0xcd [zfs]
[177048.714585] [<ffffffffa0aec0d8>] taskq_create+0x54c/0x6a5 [spl]
[177048.714619] [<ffffffff8105a3f1>] ? try_to_wake_up+0x22f/0x22f
[177048.714651] [<ffffffffa0aebeac>] ? taskq_create+0x320/0x6a5 [spl]
[177048.714685] [<ffffffff81055390>] kthread+0xcd/0xd5
[177048.714715] [<ffffffff810552c3>] ? kthread_freezable_should_stop+0x43/0x43
[177048.714749] [<ffffffff814d2e9f>] ret_from_fork+0x3f/0x70
[177048.714780] [<ffffffff810552c3>] ? kthread_freezable_should_stop+0x43/0x43
[177048.714814] Code: 41 5c 5d c3 55 48 89 e5 41 54 49 89 d4 53 48 89 fb 48 8b 7a 20 e8 83 ce 00 00 48 89 c7 49 8b 84 24 f0 00 00 00 31 d2 48 8b 4b 50 <8b> 40 38 48 f7 b3 f0 02 00 00 48 89 ca 48 63 f0 e8 7a 78 02 00
[177048.715099] RIP [<ffffffffa10714ef>] dmu_buf_get_blkptr+0x71/0x8b [zfs]
[177048.715140] RSP <ffff88080a8cfd18>
[177048.715165] CR2: 0000000000000038
[177048.715636] ---[ end trace e1f1622cce51eb6d ]---
I hadn't really looked at that dump before I posted it. The crash isn't in dmu_buf_get_blkptr()
after all, there's some symbols missing... This function...
94c8: 55 push rbp
94c9: 48 89 e5 mov rbp,rsp
94cc: 41 54 push r12
94ce: 49 89 d4 mov r12,rdx
94d1: 53 push rbx
94d2: 48 89 fb mov rbx,rdi
94d5: 48 8b 7a 20 mov rdi,QWORD PTR [rdx+0x20]
94d9: e8 00 00 00 00 call 94de <dmu_buf_get_blkptr+0x60> 94da: R_X86_64_PC32 dmu_objset_pool-0x4
94de: 48 89 c7 mov rdi,rax
94e1: 49 8b 84 24 f0 00 00 00 mov rax,QWORD PTR [r12+0xf0]
94e9: 31 d2 xor edx,edx
94eb: 48 8b 4b 50 mov rcx,QWORD PTR [rbx+0x50]
94ef: 8b 40 38 mov eax,DWORD PTR [rax+0x38]
94f2: 48 f7 b3 f0 02 00 00 div QWORD PTR [rbx+0x2f0]
94f9: 48 89 ca mov rdx,rcx
94fc: 48 63 f0 movsxd rsi,eax
94ff: e8 00 00 00 00 call 9504 <dmu_buf_get_blkptr+0x86> 9500: R_X86_64_PC32 dsl_pool_undirty_space-0x4
9504: 5b pop rbx
9505: 41 5c pop r12
9507: 5d pop rbp
9508: c3 ret
... Which might be in dbuf_undirty()
, just guessing. I suppose I could try rebuilding this module with debugging symbols intact this time. Though I did this through Gentoo's weird git emerge process, so it's going to be some work to get the exact same binaries out of it.
I suppose it might help to see what the last thing on the call stack really was. It's not from arc_buf_size()
, that's just another bogus symbol. It looks like a call to a function pointer... whatever structure that 173: 48 8b 48 10 mov rcx,QWORD PTR [rax+0x10]
is.
00000000000000b6 <arc_buf_size>:
b6: 48 8b 07 mov rax,QWORD PTR [rdi]
b9: 55 push rbp
ba: 48 89 e5 mov rbp,rsp
bd: 48 63 40 2c movsxd rax,DWORD PTR [rax+0x2c]
c1: 5d pop rbp
c2: c3 ret
c3: 48 8b 0d 00 00 00 00 mov rcx,QWORD PTR [rip+0x0] # ca <arc_buf_size+0x14> c6: R_X86_64_PC32 .bss+0x8098c
ca: 48 8b 05 00 00 00 00 mov rax,QWORD PTR [rip+0x0] # d1 <arc_buf_size+0x1b> cd: R_X86_64_PC32 .bss+0x80994
d1: 45 31 c0 xor r8d,r8d
d4: 48 8b 35 00 00 00 00 mov rsi,QWORD PTR [rip+0x0] # db <arc_buf_size+0x25> d7: R_X86_64_PC32 .bss+0x80974
db: 55 push rbp
dc: 48 8b 51 48 mov rdx,QWORD PTR [rcx+0x48]
e0: 48 03 51 40 add rdx,QWORD PTR [rcx+0x40]
e4: 48 89 e5 mov rbp,rsp
e7: 48 8b 0d 00 00 00 00 mov rcx,QWORD PTR [rip+0x0] # ee <arc_buf_size+0x38> ea: R_X86_64_PC32 .bss+0x8097c
ee: 48 03 50 40 add rdx,QWORD PTR [rax+0x40]
f2: 48 03 50 48 add rdx,QWORD PTR [rax+0x48]
f6: 48 8b 46 48 mov rax,QWORD PTR [rsi+0x48]
fa: 48 03 46 40 add rax,QWORD PTR [rsi+0x40]
fe: 48 8b 35 00 00 00 00 mov rsi,QWORD PTR [rip+0x0] # 105 <arc_buf_size+0x4f> 101: R_X86_64_PC32 .data+0x60c
105: 48 03 41 40 add rax,QWORD PTR [rcx+0x40]
109: 48 03 41 48 add rax,QWORD PTR [rcx+0x48]
10d: 48 8b 0d 00 00 00 00 mov rcx,QWORD PTR [rip+0x0] # 114 <arc_buf_size+0x5e> 110: R_X86_64_PC32 .data+0x66c
114: 48 89 cf mov rdi,rcx
117: 48 29 d7 sub rdi,rdx
11a: 49 0f 48 f8 cmovs rdi,r8
11e: 48 39 f7 cmp rdi,rsi
121: 72 05 jb 128 <arc_buf_size+0x72>
123: 48 01 d0 add rax,rdx
126: eb 0a jmp 132 <arc_buf_size+0x7c>
128: 48 29 f1 sub rcx,rsi
12b: 49 0f 48 c8 cmovs rcx,r8
12f: 48 01 c8 add rax,rcx
132: 5d pop rbp
133: c3 ret
134: 48 8b 15 00 00 00 00 mov rdx,QWORD PTR [rip+0x0] # 13b <arc_buf_size+0x85> 137: R_X86_64_PC32 .data+0x5dc
13b: 8b 0d 00 00 00 00 mov ecx,DWORD PTR [rip+0x0] # 141 <arc_buf_size+0x8b> 13d: R_X86_64_PC32 zfs_arc_overflow_shift-0x4
141: 55 push rbp
142: 48 89 d0 mov rax,rdx
145: 48 89 e5 mov rbp,rsp
148: 48 d3 e8 shr rax,cl
14b: b9 00 00 00 01 mov ecx,0x1000000
150: 48 3d 00 00 00 01 cmp rax,0x1000000
156: 48 0f 42 c1 cmovb rax,rcx
15a: 48 01 d0 add rax,rdx
15d: 48 39 05 00 00 00 00 cmp QWORD PTR [rip+0x0],rax # 164 <arc_buf_size+0xae> 160: R_X86_64_PC32 .data+0x66c
164: 5d pop rbp
165: 0f 93 c0 setae al
168: 0f b6 c0 movzx eax,al
16b: c3 ret
16c: 48 8b 87 60 01 00 00 mov rax,QWORD PTR [rdi+0x160] ; <--- from this other structure
173: 48 8b 48 10 mov rcx,QWORD PTR [rax+0x10] ; <----- that'd be a function pointer
177: 48 85 c9 test rcx,rcx
17a: 74 0e je 18a <arc_buf_size+0xd4>
17c: 55 push rbp
17d: 48 8b 70 20 mov rsi,QWORD PTR [rax+0x20] ; arg
181: 48 8b 10 mov rdx,QWORD PTR [rax] ; arg
184: 48 89 e5 mov rbp,rsp
187: ff d1 call rcx ; <------- That's the last call on the stack (0xB6+0xD3)
189: 5d pop rbp
18a: c3 ret
18b: 55 push rbp
18c: ff ce dec esi
18e: 48 8b 97 80 00 00 00 mov rdx,QWORD PTR [rdi+0x80]
195: b8 0d 00 00 00 mov eax,0xd
19a: 48 89 e5 mov rbp,rsp
etc. etc.
The function is dbuf_write_physdone. The NULL pointer is db->db_data_pending
000000000000d470 <dbuf_write_physdone>:
d470: e8 00 00 00 00 callq d475 <dbuf_write_physdone+0x5>
d475: 55 push %rbp
d476: 48 89 e5 mov %rsp,%rbp
d479: 41 54 push %r12
d47b: 53 push %rbx
d47c: 48 89 fb mov %rdi,%rbx
d47f: 48 8b 7a 20 mov 0x20(%rdx),%rdi
d483: 49 89 d4 mov %rdx,%r12
d486: e8 00 00 00 00 callq d48b <dbuf_write_physdone+0x1b>
d48b: 48 89 c7 mov %rax,%rdi
d48e: 49 8b 84 24 e8 00 00 mov 0xe8(%r12),%rax
d495: 00
d496: 31 d2 xor %edx,%edx
d498: 48 8b 4b 50 mov 0x50(%rbx),%rcx
d49c: 8b 40 38 mov 0x38(%rax),%eax
d49f: 48 f7 b3 f0 02 00 00 divq 0x2f0(%rbx)
d4a6: 48 89 ca mov %rcx,%rdx
d4a9: 48 63 f0 movslq %eax,%rsi
d4ac: e8 00 00 00 00 callq d4b1 <dbuf_write_physdone+0x41>
d4b1: 5b pop %rbx
d4b2: 41 5c pop %r12
d4b4: 5d pop %rbp
d4b5: c3 retq
d4b6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
d4bd: 00 00 00
@JuliaVixen Are you still getting this error? If it's easily reproducible, you can try configure zfs with debug enable.
Well, this just happened, and now all writes to this filesystem block forever. I don't know what happened, I had one process writing to this filesystem over NFS, and I was reading from the same filesystem, copying some files to a local USB memory stick.
Reads still work fine, but all writes, anywhere in the pool, all block. (I have atime turned off on this whole pool, so reads shouldn't be causing writes anyway.)
This is the Gentoo ZFS-9999 package, the exact same one as in https://github.com/zfsonlinux/zfs/issues/4582#issuecomment-220190326