
BUG: unable to handle kernel NULL pointer dereference at (null) #214

Closed uclageek closed 13 years ago

uclageek commented 13 years ago

Hi,

I was running cp -aR zpool1/Dataset/"folder" zpool2/Dataset/"folder".

Then I think ZFS crashed. This is on Ubuntu 10.04, kernel version 2.6.32-30-server.

Here are the syslogs:

BUG: unable to handle kernel NULL pointer dereference at (null)
[ 3413.814213] IP: [] balance_pgdat+0x4d4/0x6a0
[ 3413.814229] PGD 216aa7067 PUD 203d24067 PMD 0
[ 3413.814237] Oops: 0002 [#1] SMP
[ 3413.814242] last sysfs file: /sys/devices/platform/it87.3712/temp3_alarm
[ 3413.814248] CPU 1
[ 3413.814251] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate binfmt_misc appletalk ppdev xt_state ipt_REJECT xt_tcpudp kvm_amd kvm quota_v2 quota_tree snd_hda_codec_atihdmi snd_hda_codec_via snd_seq_dummy snd_seq_oss bridge snd_seq_midi snd_hda_intel stp iptable_filter snd_rawmidi snd_hda_codec snd_hwdep ipt_MASQUERADE snd_seq_midi_event snd_pcm_oss snd_mixer_oss iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 snd_seq ip_tables x_tables snd_seq_device snd_pcm fbcon tileblit font dell_wmi bitblit softcursor dcdbas fglrx(P) snd_timer vga16fb vgastate snd edac_core edac_mce_amd psmouse serio_raw it87 hwmon_vid shpchp i2c_piix4 soundcore snd_page_alloc lp parport dm_raid45 usbhid hid usb_storage pata_jmicron ahci r8169 mii xor pata_atiixp [last unloaded: vboxdrv]
[ 3413.814350] Pid: 71, comm: kswapd0 Tainted: P W 2.6.32-30-server #59-Ubuntu Inspiron 546S
[ 3413.814355] RIP: 0010:[] [] balance_pgdat+0x4d4/0x6a0
[ 3413.814366] RSP: 0018:ffff880215559d60 EFLAGS: 00010286
[ 3413.814370] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000002e90c
[ 3413.814374] RDX: 00000000000043bb RSI: 00000000000000d0 RDI: 0000000000003bbd
[ 3413.814378] RBP: ffff880215559e50 R08: 0000000000000002 R09: 0000000000000000
[ 3413.814382] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88000000b000
[ 3413.814386] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[ 3413.814391] FS: 00007fc71c84d700(0000) GS:ffff880008c80000(0000) knlGS:0000000000000000
[ 3413.814396] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 3413.814399] CR2: 0000000000000000 CR3: 0000000216c4d000 CR4: 00000000000006e0
[ 3413.814404] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3413.814408] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3413.814413] Process kswapd0 (pid: 71, threadinfo ffff880215558000, task ffff880216f09700)
[ 3413.814417] Stack:
[ 3413.814419] ffff880215559da0 ffff880215559

behlendorf commented 13 years ago

There's not a lot to go on here, but if I were to make a guess as to the problem, I would say a stack overrun was likely to blame. What version of the code were you using when you hit this? Was this a one-time event, or can you consistently reproduce it?

uclageek commented 13 years ago

What else can I give you to help troubleshoot?

The version is the latest on Launchpad.net, 0.6.0.9.

This has happened a few times, each resulting in a forced reboot. The system stalls at releasing Weak File Systems (which I assume is a result of this). I originally had this pool on a sil3214, but zfs started telling me there was file corruption. After a ton of hardware troubleshooting it turned out to be the card. The pool is no longer attached to the card and the corruption issues have not come back. Now this error keeps popping up, but only when moving a large amount of data (more than 1GB).

The pool is connected via a port multiplier, but this error showed up before I installed that. The system has 8GB of RAM and zfs tends to take 5GB of it. All of the drives have passed long and short SMART tests. I even updated the drives' firmware. I am performing another copy right now and will let you know if it fails again.

If there are any other logs or such I can supply just let me know.

uclageek commented 13 years ago

Sorry, hit the wrong button.

Maraschin commented 13 years ago

Hi,

I'm getting a similar problem, but the error is: Unable to handle kernel paging request at virtual address ...

And some reference to a stack overrun that I can't remember right now. I get it on the main console and I did not find it anywhere in the logs... The computer freezes and I have to reboot it. I'm getting it every time I try to start zfs, so I just can't use zfs at the moment. I'm running rc3 on Ubuntu 11.04 Beta, Linux 2.6.38-8-server #42-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux.

Any tips where I can get more info to submit?

behlendorf commented 13 years ago

I just committed a few stack fixes to zfs/master; they should find their way into the Ubuntu PPA in the next week or so, I'd think. If you don't want to wait that long, please try the latest source from GitHub.

git clone git@github.com:behlendorf/zfs.git
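
If you go that route, the build is the usual autotools flow; roughly something like the following (a sketch only — spl must be built and installed first, and the exact configure flags and install paths depend on your system):

# sketch: build and install spl first, then configure zfs against it (paths/flags are assumptions)
git clone git://github.com/behlendorf/spl.git
(cd spl && ./autogen.sh && ./configure && make && sudo make install)
cd zfs && ./autogen.sh && ./configure --with-spl=../spl && make && sudo make install
sudo modprobe zfs          # reload the rebuilt module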

uclageek commented 13 years ago

I think I might have narrowed down the problem. I had previously installed dmraid because of the card but never used it. I just removed it and the transfers seem to be working. I'm transferring 160GB right now to push it. Will update you when/if it completes.

cyanoacry commented 13 years ago

Hello, I also wanted to confirm that this problem exists on my setup, as of last night.

Here is the syslog (the kernel oops message had already scrolled off the screen by now):


Apr 27 04:44:18 lenin kernel: [25461.564196] PGD 75a6e067 PUD 75a6f067 PMD 0
Apr 27 04:44:18 lenin kernel: [25461.564283] CPU 3
Apr 27 04:44:18 lenin kernel: [25461.564289] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl fuse nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc loop snd_pcsp snd_pcm ioatdma snd_timer snd shpchp i5000_edac edac_core i2c_i801 tpm_tis pci_hotplug evdev i2c_core tpm processor i5k_amb dca serio_raw tpm_bios rng_core soundcore snd_page_alloc thermal_sys button container dm_mod raid1 md_mod sd_mod crc_t10dif ata_generic uhci_hcd ata_piix ahci libahci libata 3w_9xxx ehci_hcd piix scsi_mod ide_core e1000e usbcore nls_base [last unloaded: scsi_wait_scan]
Apr 27 04:44:18 lenin kernel: [25461.564607] 
Apr 27 04:44:18 lenin kernel: [25461.564628] Pid: 48, comm: kswapd0 Tainted: P            2.6.38.4+ #1 Supermicro X7DB8/X7DB8
Apr 27 04:44:18 lenin kernel: [25461.564684] RIP: 0010:[]  [] kswapd+0x58e/0x8cd
Apr 27 04:44:18 lenin kernel: [25461.564731] RSP: 0018:ffff88041e93ddc0  EFLAGS: 00010286
Apr 27 04:44:18 lenin kernel: [25461.564758] RAX: 0000000000000000 RBX: ffff88042fffb000 RCX: 000000000002bb5c
Apr 27 04:44:18 lenin kernel: [25461.564788] RDX: 0000000000000000 RSI: 00000000000000d0 RDI: ffff88041fa6af78
Apr 27 04:44:18 lenin kernel: [25461.564817] RBP: 0000000000000002 R08: 0000000000013de4 R09: 0000000000000000
Apr 27 04:44:18 lenin kernel: [25461.564847] R10: 0000000000000004 R11: 00000000000000b0 R12: 0000000000000000
Apr 27 04:44:18 lenin kernel: [25461.564877] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000c
Apr 27 04:44:18 lenin kernel: [25461.564908] FS:  0000000000000000(0000) GS:ffff8800cfcc0000(0000) knlGS:0000000000000000
Apr 27 04:44:18 lenin kernel: [25461.564953] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 27 04:44:18 lenin kernel: [25461.564981] CR2: 0000000000000000 CR3: 0000000081f85000 CR4: 00000000000406e0
Apr 27 04:44:18 lenin kernel: [25461.565010] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 27 04:44:18 lenin kernel: [25461.565040] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 27 04:44:18 lenin kernel: [25461.565071] Process kswapd0 (pid: 48, threadinfo ffff88041e93c000, task ffff88041fa6af40)
Apr 27 04:44:18 lenin kernel: [25461.565136]  ffff88041fa6af78 ffff88041e93de50 ffff8800cfcd3640 0000000000000002
Apr 27 04:44:18 lenin kernel: [25461.565186]  00000000cfcd3e40 0000000000000000 0000000000000000 ffff88042fffb000
Apr 27 04:44:18 lenin kernel: [25461.565237]  ffff88042ffff0a8 000000000018447c 0000000100000000 0000000000000000
Apr 27 04:44:18 lenin kernel: [25461.565311]  [] ? kswapd+0x0/0x8cd
Apr 27 04:44:18 lenin kernel: [25461.565337]  [] ? kswapd+0x0/0x8cd
Apr 27 04:44:18 lenin kernel: [25461.565366]  [] ? kthread+0x7a/0x82
Apr 27 04:44:18 lenin kernel: [25461.565396]  [] ? kernel_thread_helper+0x4/0x10
Apr 27 04:44:18 lenin kernel: [25461.565425]  [] ? kthread+0x0/0x82
Apr 27 04:44:18 lenin kernel: [25461.565452]  [] ? kernel_thread_helper+0x0/0x10
Apr 27 04:44:18 lenin kernel: [25461.565706]  RSP 
Apr 27 04:44:18 lenin kernel: [25461.566092] ---[ end trace c9d8bb1fc6a6619e ]---

However, the system continued to respond for another 2 hours; I did several tests of heavy copying/writing to our drives. Our setup is particularly metadata intensive, as we use MooseFS ( http://moosefs.org ) on top of 3 zfs pools. The pool version is currently behind at v15, as we're waiting to see if we can go with Linux (we were previously on FreeBSD).

Additionally, I'm currently running a 2.6.38.4+ kernel, with the vmalloc() patch attached. (mentioned at https://bugzilla.kernel.org/attachment.cgi?id=50802 ). Without it we were getting locked processes similar to #184 (I'll open another bug report), under even slight load.

2 hours after the system gave the above oops, it signaled a general protection fault in kswapd and hard locked the system (no processes responding, though the system responded to pings). I don't know if the two (the oops and the GPF) are related, but I'd never gotten a hard lock before; when kswapd/other processes would lock on disk, the system would slowly lock up, and the terminal remained responsive until the foreground task locked.


Message from syslogd@lenin at Apr 27 06:54:44 ...
 kernel:[33287.369562] general protection fault: 0000 [#2] SMP 

Message from syslogd@lenin at Apr 27 06:54:44 ...
 kernel:[33287.369597] last sysfs file: /sys/devices/platform/i5k_amb.0/temp8_alarm

Message from syslogd@lenin at Apr 27 06:54:44 ...
 kernel:[33287.370469] Stack:

Message from syslogd@lenin at Apr 27 06:54:44 ...
 kernel:[33287.370641] Call Trace:

Message from syslogd@lenin at Apr 27 06:54:44 ...
 kernel:[33287.370960] Code: ed 48 c1 e5 04 48 03 6f 10 48 89 ef e8 ef b8 0d e1 48 8b 55 08 eb 0f 44 39 62 ec 75 06 66 ff 45 00 eb 1f 48 89 ca 48 85 d2 74 11 <48> 8b 0a 48 8d 42 e8 0f 18 09 39 5a e8 75 e9 eb db 66 ff 45 00 

was printed on the ssh root terminals that were open at the time. I can't seem to find it in syslog, and the server got rebooted before we could take a look at the console. I'll keep an eye out in case the VGA terminal dumps more.

I'll update to the newest master code and give it another spin and let you guys know if we still have problems. Please let me know if there's any debug information that could be useful.

uclageek commented 13 years ago

Ok, so it looks like everything transferred fine. I really think dmraid was somehow interfering with zfs, but I have no idea how. I will be moving 2 TB of data over the next few days, so fingers crossed. Should I keep this ticket open for cyanoacry?

cyanoacry commented 13 years ago

Please do. The bug still persists in the latest trunk. ZFS remains responsive for me after the kernel dumps the following. It soon turns semi-unresponsive though; please see #215. (The kernel oops happens second, about 90 seconds after the processes lock for 120 seconds in #215.)


[17853.228719] BUG: unable to handle kernel NULL pointer dereference at           (null)
[17853.228778] IP: [] kswapd+0x58e/0x8cd
[17853.228814] PGD 3fa739067 PUD 1783a2067 PMD 0 
[17853.228844] Oops: 0002 [#1] SMP 
[17853.228871] last sysfs file: /sys/module/zfs/parameters/zfs_arc_meta_limit
[17853.228902] CPU 5 
[17853.228908] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl fuse nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc loop snd_pcsp snd_pcm snd_timer snd ioatdma i5000_edac edac_core shpchp soundcore pci_hotplug i2c_i801 evdev snd_page_alloc processor i5k_amb tpm_tis i2c_core serio_raw tpm dca rng_core tpm_bios thermal_sys container button dm_mod raid1 md_mod sd_mod crc_t10dif ata_generic uhci_hcd ata_piix ahci libahci libata ehci_hcd 3w_9xxx usbcore scsi_mod piix ide_core e1000e nls_base [last unloaded: scsi_wait_scan]
[17853.229225] 
[17853.229246] Pid: 48, comm: kswapd0 Tainted: P            2.6.38.4+ #1 Supermicro X7DB8/X7DB8
[17853.229302] RIP: 0010:[]  [] kswapd+0x58e/0x8cd
[17853.229350] RSP: 0018:ffff88041e93ddc0  EFLAGS: 00010286
[17853.229376] RAX: 0000000000000000 RBX: ffff88042fffb000 RCX: 000000000002163d
[17853.229406] RDX: 0000000000000000 RSI: 00000000000000d0 RDI: ffff88041e93dd7c
[17853.229436] RBP: 0000000000000002 R08: 000000000000f2d6 R09: 0000000000000000
[17853.229466] R10: 0000000000000004 R11: 00000000000000b0 R12: 0000000000000000
[17853.229496] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000c
[17853.229526] FS:  0000000000000000(0000) GS:ffff8800cfd40000(0000) knlGS:0000000000000000
[17853.229572] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[17853.229599] CR2: 0000000000000000 CR3: 0000000165dc7000 CR4: 00000000000406e0
[17853.229629] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17853.229659] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[17853.229689] Process kswapd0 (pid: 48, threadinfo ffff88041e93c000, task ffff88041fa6af40)
[17853.229734] Stack:
[17853.229755]  ffff88041fa6af78 ffff88041e93de50 ffff8800cfcd3640 0000000000000002
[17853.229805]  000000001fa6af40 0000000000000000 0000000000000000 ffff88042fffb000
[17853.229855]  ffff88042ffff0a8 00000000001267ba 0000000100000000 0000000000000000
[17853.229906] Call Trace:
[17853.229934]  [] ? kswapd+0x0/0x8cd
[17853.229960]  [] ? kswapd+0x0/0x8cd
[17853.229990]  [] ? kthread+0x7a/0x82
[17853.230024]  [] ? kernel_thread_helper+0x4/0x10
[17853.230053]  [] ? kthread+0x0/0x82
[17853.230081]  [] ? kernel_thread_helper+0x0/0x10
[17853.230108] Code: c1 e2 03 e8 8d 5e ff ff 84 c0 75 15 48 8d 94 24 90 00 00 00 48 8b 74 24 38 44 89 ff e8 7c ed ff ff 48 8b 44 24 58 be d0 00 00 00 <48> c7 00 00 00 00 00 48 8b 54 24 48 48 8b bc 24 90 00 00 00 e8 
[17853.230323] RIP  [] kswapd+0x58e/0x8cd
[17853.230358]  RSP 
[17853.230382] CR2: 0000000000000000
[17853.234504] ---[ end trace f96e8682d86626d6 ]---

behlendorf commented 13 years ago

I'm not sure how the dmraid packages could have caused this, but I'm glad to hear it's working better for you. Let's leave this issue open, however, until we can determine the real root cause.

Maraschin commented 13 years ago

Hi,

I just checked out the latest trunk version of spl and zfs and it still crashes my machine... How can I get some logs to submit? It prints everything to the console terminal but I did not manage to find it in the logs...

Before it started to happen (with rc3) I was running a scrub on a raid-z with 6x 2TB discs and about 45% free; it had done about 33% of the scrub when it crashed...

Maraschin commented 13 years ago

Hello again,

It seems the problem is with the scrub. I just made a script that starts zfs and runs a zpool scrub -s on the pool, and it hasn't crashed yet...

Here is the pool:

config:

    NAME        STATE     READ WRITE CKSUM
    lpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        dm-3    ONLINE       0     0     0
        dm-4    ONLINE       0     0     0
        dm-5    ONLINE       0     0     0
        dm-6    ONLINE       0     0     0
        dm-7    ONLINE       0     0     0
        dm-8    ONLINE       0     0     0

BTW, performance seems a lot worse than it was with rc3, which wasn't good either. Before, I managed to get up to 1.4MB/s when writing; now it's down to 600KB/s.

behlendorf commented 13 years ago

I believe you're still overflowing the kernel stack when running scrub, and that's what's causing the crash. If you're running zfs on top of a dm device that becomes even more likely. So if you're game, the next thing to do is to build your kernel with the CONFIG_STACK_TRACER=y option enabled to get real-time stack checking, which should catch the overflow.

You will need to add the following line to /etc/sysctl.conf.

kernel.stack_tracer_enabled = 1

And mount the kernel debugfs which can be done manually or by adding the following to your /etc/fstab

debugfs                 /sys/kernel/debug       debugfs defaults        0 0

After this you'll be able to see the worst case kernel stack usage on your system by reading /sys/kernel/debug/tracing/stack_trace. We should see it go over 8k, and it will help us identify the functions which are the worst offenders. I believe we've got a handle on most of the stack issues, but layering on top of a dm device makes it just that much harder to stay within the 8k limit.
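
To illustrate, checking the tracer output from a shell looks roughly like this (a sketch; the file names are the standard ftrace ones under debugfs):

sudo mount -t debugfs none /sys/kernel/debug      # only needed if debugfs isn't already mounted
sudo sysctl kernel.stack_tracer_enabled=1         # or set it persistently in /etc/sysctl.conf
cat /sys/kernel/debug/tracing/stack_max_size      # worst case stack depth (bytes) since boot
cat /sys/kernel/debug/tracing/stack_trace         # the call path that produced that maximum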

As for performance that's absolutely horrible! How are you testing performance? I commonly test 8+2 raidz2 sets and see 600+ MB/s.

Maraschin commented 13 years ago

Hello,

As I wrote, as soon as I started the scrub (well, it took 2 days) it crashed... I did what you pointed out above, but I'm not sure I have the information from when the problem happened... The machine totally crashed, so I could not access it to copy anything when it happened, only after the reboot...

Here is what I have in /sys/kernel/debug/tracing/stack_trace:


        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     4224      32   _raw_spin_lock_bh+0x16/0x30
  1)     4192      48   nf_nat_get_offset+0x4c/0x80 [nf_nat]
  2)     4144     176   tcp_in_window+0x3c0/0x670 [nf_conntrack]
  3)     3968     192   tcp_packet+0x11a/0x7b0 [nf_conntrack]
  4)     3776     192   nf_conntrack_in+0x2f6/0x4f0 [nf_conntrack]
  5)     3584      16   ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
  6)     3568      80   nf_iterate+0x84/0xa0
  7)     3488     128   nf_hook_slow+0xb3/0x120
  8)     3360      32   __ip_local_out+0xa3/0xb0
  9)     3328      32   ip_local_out+0x16/0x30
 10)     3296     176   ip_queue_xmit+0x179/0x3f0
 11)     3120     112   tcp_transmit_skb+0x392/0x5e0
 12)     3008     112   tcp_write_xmit+0x1bb/0x3c0
 13)     2896      32   __tcp_push_pending_frames+0x26/0xa0
 14)     2864     176   tcp_sendmsg+0x701/0xc50
 15)     2688      64   inet_sendmsg+0x64/0xb0
 16)     2624     384   sock_sendmsg+0xf3/0x130
 17)     2240      32   kernel_sendmsg+0x41/0x60
 18)     2208     112   xs_send_kvec+0x8e/0xa0 [sunrpc]
 19)     2096      96   xs_sendpages+0x19b/0x220 [sunrpc]
 20)     2000      64   xs_tcp_send_request+0x5d/0x150 [sunrpc]
 21)     1936      64   xprt_transmit+0x83/0x290 [sunrpc]
 22)     1872      32   call_transmit+0xa8/0x130 [sunrpc]
 23)     1840      64   __rpc_execute+0xb2/0x1f0 [sunrpc]
 24)     1776      32   rpc_execute+0x43/0x50 [sunrpc]
 25)     1744      48   rpc_run_task+0xdd/0x130 [sunrpc]
 26)     1696      96   rpc_call_sync+0x43/0x70 [sunrpc]
 27)     1600      48   rpcb_register_call+0x20/0x70 [sunrpc]
 28)     1552     144   rpcb_register+0x9f/0xf0 [sunrpc]
 29)     1408     128   __svc_register+0x195/0x1b0 [sunrpc]
 30)     1280      96   svc_register+0x93/0x120 [sunrpc]
 31)     1184      80   svc_setup_socket+0x144/0x1c0 [sunrpc]
 32)     1104     336   svc_create_socket+0x191/0x2e0 [sunrpc]
 33)      768      16   svc_udp_create+0x1f/0x30 [sunrpc]
 34)      752     160   svc_create_xprt+0x1b8/0x2c0 [sunrpc]
 35)      592      48   create_lockd_listener+0x7b/0x80 [lockd]
 36)      544      32   create_lockd_family+0x31/0x60 [lockd]
 37)      512      48   lockd_up+0x9a/0x210 [lockd]
 38)      464      48   nfsd_svc+0xd5/0x1d0 [nfsd]
 39)      416     112   write_threads+0xc4/0xe0 [nfsd]
 40)      304      48   nfsctl_transaction_write+0x79/0x90 [nfsd]
 41)      256      48   vfs_write+0xc6/0x180
 42)      208      80   sys_write+0x51/0x90
 43)      128     128   system_call_fastpath+0x16/0x1b

Is there anything else to look at? Cheers

behlendorf commented 13 years ago

Unfortunately we need to grab this debugging after the issue occurs; grabbing it after a reboot isn't going to be helpful. We should be able to do that if we rebuild your kernel with larger stacks and run it again. In fact, if this is a stack issue (as I suspect) this will prevent the crash entirely. Can you rebuild your kernel with the following change and try again?

--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -1,7 +1,7 @@
 #ifndef _ASM_X86_PAGE_64_DEFS_H
 #define _ASM_X86_PAGE_64_DEFS_H

-#define THREAD_ORDER   1
+#define THREAD_ORDER   3 
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_ORDER)
 #define CURRENT_MASK (~(THREAD_SIZE - 1))
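
Applying the change and rebuilding would go roughly like this (a sketch; the source location, config step, and install commands are assumptions — use whatever kernel build procedure your distribution normally follows):

cd /usr/src/linux                         # wherever your kernel source tree lives (assumption)
patch -p1 < thread_order.patch            # the one-line THREAD_ORDER change shown above
make oldconfig                            # keep your existing configuration
make -j4 bzImage modules                  # rebuild the kernel and modules
sudo make modules_install install         # install and update the boot entries
sudo reboot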

seriv commented 13 years ago

I tried with zfs rc3 on fedora-14 scrubbing a big zpool:

  pool: storage
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on older software versions.
  scan: scrub repaired 0 in 426h2m with 0 errors on Thu May 5 15:16:33 2011
config:

NAME        STATE     READ WRITE CKSUM
storage     ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    sde     ONLINE       0     0     0
    sdc     ONLINE       0     0     0
  mirror-1  ONLINE       0     0     0
    sdf     ONLINE       0     0     0
    sdb     ONLINE       0     0     0

and it ran for a while and then hung at about 95% done. I was able to boot the system into runlevel 1 and do 'zpool export storage'. Later I tried 'zpool import storage', and each time I tried the system hung without any messages written to syslog. I was going to arrange a serial console and save the kernel panic messages, but did not have a chance. Today I saw a commit with the rc4 tag, tried it, and imported the zpool cleanly; in about 15 minutes it completed the scrub and is online. The scrub finished successfully.

behlendorf commented 13 years ago

That's very good news, thanks for reporting your success. I suspect we are close to resolving Maraschin's issue too, but in his case zfs is layered on top of dm devices, which consume some additional stack and further compound the problem.

behlendorf commented 13 years ago

If this does turn out to be a stack issue we may have to fix it like the traverse_visitbp() recursion issue. That change, 6656bf56216f36805731298ee0f4de87ae6b6b3d, basically took the recursive function and preserved the recursive nature of it by moving the stack variables on to the heap. It's not pretty, and I'd like to avoid it, but if push comes to shove we can get the needed stack by doing this.

Maraschin commented 13 years ago

Sorry, but I was a bit short of time... Here we go... I've downloaded the latest trunk version (Monday, May 9th). I first mounted everything without problems and ran scrub; it was working fine (up to 45%) when I decided to run zpool status ... it crashed big time! Since then I have not been able to start zfs; it crashes immediately every time I try to start the service.

So, I created a virtual machine with the same configuration as the normal server and tried to import the pool. Here is what I get (now I can see the terminal):


[70120.449395] BUG: unable to handle kernel paging request at 0000000000001689
[70120.450046] IP: [] print_context_stack+0xad/0x140
[70120.450486] PGD 96042067 PUD 951e5067 PMD 0
[70120.450856] Thread overran stack, or stack corrupted
[70120.451186] Oops: 0000 [#1] SMP
[70120.451461] last sysfs file: /sys/device/system/cpu/cpu3/cache/index2/shared_cpu_map
[70120.451941] CPU 1
[70120.452049] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate psmouse serio_raw virtio_ballon i2c_piix4 lp parport floppy
[70120.452896]
[70120.452896] Pid: 4214, comm: txg_sync Tainted: P    2.6.38-8-server #42-Ubuntu Bochs Bochs
[70120.452896] RIP: 0010:[ffffffff8100fe5d>]  [] print_context_stack+0xad/0x140v
[70120.452896] RSP: 0018:ffff88007a6c8040  EFLAGS: 0010002
[70120.452896] RAX: 0000000000000001 RBX: ffff88007a6c8148 RCX: 0000000000000000
[70120.452896] RDX: 0000000000000001 RSI: ffffffff81019f2d RDI: ffffffff81a2e600
[70120.452896] RBP: ffff88007a6c80a0 R08: ffffffff81a2e600 R09: 0000000000000000
[70120.452896] R10: ffffffffffffffff R11: 0000000000000000 R12: ffff88007a6c8160
[70120.452896] R13: ffff88007a6c8000 R14: ffffffff816017c0 R15: 0000000000000000
[70120.452896] FS:  00007fd1d4593c20(0000) GS: ffff88009cc80000(0000) kn1GS: 0000000000000000
[70120.452896] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[70120.452896] CR2: 0000000000001689 CR3: 0000000096799000 CR4: 00000000000006e0
[70120.452896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[70120.452896] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[70120.452896] Process txg_sync (pid:4214, threadinfo ffff88007a6c8000, task ffff88007a6d96e0)
[70120.452896] Stack:
[70120.452896]  00000000000107a4 ffff88007a6c9ff8 ffff88007a6c8160 ffffffffffffe000
[70120.452896]  ffffffff81a2e600 ffffffff81019f2d 0000000000001e78 ffff88007a6c8100
[70120.452896] 000000000000cca0 ffffffff816017c0 ffff88009cc83fc0 ffffffff81a2e600
[70120.460179] Call Trace:
[70120.460179]  [] ? save_stack_trace+0x2d/0x50
[70120.460179]  [] dump_trace+0x1f0/0x3a0
[70120.460179]  [] save_stack_trace+0x2d/0x50
[70120.460179]  [] stack_trace_call+0x12e/0x1f0
[70120.460179]  [] ftrace_call+0x5/0x2b
[70120.460179]  [] ? kmem_cache_alloc+0x21/0x110
[70120.460179]  [] ? ftrace_call+0x5/0x2b
[70120.460179]  [] mempool_alloc_slab+0x15/0x20
[70120.460179]  [] mempool_alloc+0x53/0x130
[70120.460179]  [] ? ftrace_call+0x5/0x2b
[70120.460179]  [] get_request+0x226/0x3f0

Would you like me to still recompile the kernel with the extra parameters you sent me, or does this start to help? BTW, I'm still using the same encrypted devices I listed before, mapped in kvm as scsi devices.

Maraschin commented 13 years ago

Hi, I've removed all the less-than and greater-than signs since I did not know how to escape them here...


[70120.449395] BUG: unable to handle kernel paging request at 0000000000001689
[70120.450046] IP: [ffffffff8100fe5d] print_context_stack+0xad/0x140
[70120.450486] PGD 96042067 PUD 951e5067 PMD 0
[70120.450856] Thread overran stack, or stack corrupted
[70120.451186] Oops: 0000 [#1] SMP
[70120.451461] last sysfs file: /sys/device/system/cpu/cpu3/cache/index2/shared_cpu_map
[70120.451941] CPU 1
[70120.452049] Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate psmouse serio_raw virtio_ballon i2c_piix4 lp parport floppy
[70120.452896]
[70120.452896] Pid: 4214, comm: txg_sync Tainted: P    2.6.38-8-server #42-Ubuntu Bochs Bochs
[70120.452896] RIP: 0010:[ffffffff8100fe5d]  [ffffffff8100fe5d] print_context_stack+0xad/0x140v
[70120.452896] RSP: 0018:ffff88007a6c8040  EFLAGS: 0010002
[70120.452896] RAX: 0000000000000001 RBX: ffff88007a6c8148 RCX: 0000000000000000
[70120.452896] RDX: 0000000000000001 RSI: ffffffff81019f2d RDI: ffffffff81a2e600
[70120.452896] RBP: ffff88007a6c80a0 R08: ffffffff81a2e600 R09: 0000000000000000
[70120.452896] R10: ffffffffffffffff R11: 0000000000000000 R12: ffff88007a6c8160
[70120.452896] R13: ffff88007a6c8000 R14: ffffffff816017c0 R15: 0000000000000000
[70120.452896] FS:  00007fd1d4593c20(0000) GS: ffff88009cc80000(0000) kn1GS: 0000000000000000
[70120.452896] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[70120.452896] CR2: 0000000000001689 CR3: 0000000096799000 CR4: 00000000000006e0
[70120.452896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[70120.452896] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[70120.452896] Process txg_sync (pid:4214, threadinfo ffff88007a6c8000, task ffff88007a6d96e0)
[70120.452896] Stack:
[70120.452896]  00000000000107a4 ffff88007a6c9ff8 ffff88007a6c8160 ffffffffffffe000
[70120.452896]  ffffffff81a2e600 ffffffff81019f2d 0000000000001e78 ffff88007a6c8100
[70120.452896]  000000000000cca0 ffffffff816017c0 ffff88009cc83fc0 ffffffff81a2e600
[70120.460179] Call Trace:
[70120.460179]  [ffffffff81019f2d] ? save_stack_trace+0x2d/0x50
[70120.460179]  [ffffffff8100ec30] dump_trace+0x1f0/0x3a0
[70120.460179]  [ffffffff81019f2d] save_stack_trace+0x2d/0x50
[70120.460179]  [ffffffff810efe8e] stack_trace_call+0x12e/0x1f0
[70120.460179]  [ffffffff8100bd56] ftrace_call+0x5/0x2b
[70120.460179]  [ffffffff811551a1] ? kmem_cache_alloc+0x21/0x110
[70120.460179]  [ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[70120.460179]  [ffffffff8110dac5] mempool_alloc_slab+0x15/0x20
[70120.460179]  [ffffffff8110de03] mempool_alloc+0x53/0x130
[70120.460179]  [ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[70120.460179]  [ffffffff812c22f6] get_request+0x226/0x3f0

behlendorf commented 13 years ago

We have good evidence that we still need to reduce stack usage for scrub when your zfs pool is layered on top of dm, md, or encrypted block devices. We'll certainly work on getting this fixed in the next few weeks. In the meantime you can likely avoid the issue by setting the 'no_scrub_io' module option. This will allow you to import the pool and then manually disable the scrub with zpool scrub -s pool. Then you can remove the 'no_scrub_io' module option, and if you avoid doing a scrub until this is fixed you should be OK.

Maraschin commented 13 years ago

Hi, thanks, if you need any other test just let me know. Just a small question: how do I set the no_scrub_io module option? I'm guessing it is not a kernel module that I'm supposed to load, or is it?! I tried googling it but did not find any references...

behlendorf commented 13 years ago

The no_scrub_io option is a kernel module option. You can set it one of two ways.
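
Roughly, the two ways look like this (a sketch using the option name above; the /etc/modprobe.d file name is just an example, and the module has to be loaded fresh either way):

# 1) pass the option on the modprobe command line when loading the module
sudo /sbin/modprobe zfs no_scrub_io=1

# 2) make it persistent with a modprobe configuration file
echo "options zfs no_scrub_io=1" | sudo tee /etc/modprobe.d/zfs.conf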

Maraschin commented 13 years ago

When trying the first option:

sudo /sbin/modprobe zfs no_scrub_io=1
FATAL: Error inserting zfs (/lib/modules/2.6.38-8-server/addon/zfs/zfs/zfs.ko): Unknown symbol in module, or unknown parameter (see dmesg)

terminal window:

zfs: Unknown parameter `no_scrub_io'

With the second option (the file), I get the nice crash screen... here is the call trace:

[ 430.380958]  [ffffffff81019f2d] ? save_stack_trace+0x2d/0x50
[ 430.380958]  [ffffffff8100ec30] dump_trace+0x1f0/0x3a0
[ 430.380958]  [ffffffff81019f2d] save_stack_trace+0x2d/0x50
[ 430.380958]  [ffffffff810efe8e] stack_trace_call+0x12e/0x1f0
[ 430.380958]  [ffffffff8100bd56] ftrace_call+0x5/0x2b
[ 430.380958]  [ffffffff811551a1] ? kmem_cache_alloc+0x21/0x110
[ 430.380958]  [ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ 430.380958]  [ffffffff8110dac5] mempool_alloc_slab+0x15/0x20
[ 430.380958]  [ffffffff8110de03] mempool_alloc+0x53/0x130
[ 430.380958]  [ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ 430.380958]  [ffffffff812c22f6] get_request+0x226/0x3f0
[ 430.380958]  [ffffffff812c2779] get_request_wait+0x29/0x1a0
[ 430.380958]  [ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ 430.380958]  [ffffffff812d11e0] ? deadline_queue_empty+0x0/0x30
[ 430.380958]  [ffffffff812c2fd6] __make_request+0x76/0x4c0
[ 430.380958]  [ffffffff812c04d2] generic_make_request+0x2d2/0x5c0
[ 430.380958]  [ffffffff812c0847] submit_bio+0x87/0x110
[ 430.380958]  [ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ 430.380958]  [ffffffffa03ca13f] __vdev_disk_physio+0x3af/0x440 [zfs]
[ 430.380958]  [ffffffffa03ca23c] vdev_disk_io_start+0x6c/0x120 [zfs]
[ 430.380958]  [ffffffffa0402d7b] zio_vdev_io_start+0xab/0x300 [zfs]
[ 430.380958]  [ffffffffa04031c9] zio_nowait+0x89/0x140 [zfs]
[ 430.380958]  [ffffffffa04031c9] zio_nowait+0x89/0x140 [zfs]
[ 430.380958]  [ffffffffa03cf8cc] vdev_raidz_io_start+0x4ac/0x6f0 [zfs]
[ 430.380958]  [ffffffffa03ce8b0] ? vdev_raidz_child_done+0x0/0x30 [zfs]
[ 430.380958]  [ffffffff815d5f59] ? mutex_unlock+0x9/0x20
[ 430.380958]  [ffffffffa0402d7b] zio_vdev_io_start+0xab/0x300 [zfs]
[ 430.380958]  [ffffffffa04031c9] zio_nowait+0x89/0x140 [zfs]
[ 430.380958]  [ffffffffa03cda56] vdev_mirror_io_start+0x186/0x3d0 [zfs]
[ 430.380958]  [ffffffffa03cd210] ? vdev_mirror_child_done+0x0/0x30 [zfs]
[ 430.380958]  [ffffffffa0402f07] zio_vdev_io_start+0x237/0x300 [zfs]
[ 430.380958]  [ffffffffa04031c9] zio_nowait+0x89/0x140 [zfs]
[ 430.380958]  [ffffffffa0372003] arc_read_nolock+0x523/0x7e0 [zfs]
[ 430.380958]  [ffffffffa037a150] ? dbuf_read_done+0x0/0x120 [zfs]
[ 430.380958]  [ffffffffa0372346] arc_read+0x86/0x150 [zfs]
[ 430.380958]  [ffffffffa037a150] ? dbuf_read_done+0x0/0x120 [zfs]
[ 430.380958]  [ffffffffa03a63c1] dsl_read+0x31/0x40 [zfs]
[ 430.380958]  [ffffffffa0376ff1] dbuf_read+0x251/0x7f0 [zfs]
[ 430.380958]  [ffffffffa03781c1] __dbuf_hold_impl+0x371/0x460 [zfs]
[ 430.380958]  [ffffffffa0376060] ? dbuf_find+0xe0/0x100 [zfs]
[ 430.380958]  [ffffffffa0377fbf] __dbuf_hold_impl+0x16f/0x460 [zfs]
[ 430.380958]  [ffffffffa0376060] ? dbuf_find+0xe0/0x100 [zfs]
[ 430.380958]  [ffffffffa0377fbf] __dbuf_hold_impl+0x16f/0x460 [zfs]
[ 430.380958]  [ffffffffa0378336] dbuf_hold_impl+0x86/0xc0 [zfs]
[ 430.380958]  [ffffffffa0378640] dbuf_hold+0x20/0x30 [zfs]
[ 430.380958]  [ffffffffa037dea7] dmu_buf_hold+0x97/0x1a0 [zfs]
[ 430.380958]  [ffffffffa03d16db] zap_idx_to_blk+0xab/0x140 [zfs]
[ 430.380958]  [ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ 430.380958]  [ffffffffa03d1dc4] zap_deref_leaf+0x54/0x80 [zfs]

I tried on both the real machine and the VM; the call trace is from the VM, but the problem is similar in both cases. I just have problems copying the call trace from the real machine's terminal... Just to clarify: when I try to do the same on the real machine it just crashes, since the disks are already mounted. On the VM I get the crash after trying to import the same pool.

behlendorf commented 13 years ago

Sorry, I got the option slightly wrong. It's 'zfs_no_scrub_io'. Also make sure you're using the latest source from the zfs master branch.

Maraschin commented 13 years ago

No, it doesn't work... here is the command:

sudo /sbin/modprobe zfs zfs_no_scrub_io=1

and I get the following (I could not capture the whole stack trace since it was bigger than my terminal):

...
[ffffffffa0112060] ? dbuf_find+0xe0/0x100 [zfs]
[ffffffffa0113fbf] __dbuf_hold_impl+0x16f/0x460 [zfs]
[ffffffffa0114336] dbuf_hold_impl+0x86/0xc0 [zfs]
[ffffffffa0114640] dbuf_hold+0x20/0x30 [zfs]
[ffffffffa0119ea7] dmu_buf_hold+0x97/0x1a0 [zfs]
[ffffffffa016d6db] zap_idx_to_blk+0xab/0x140 [zfs]
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffffa016ddc4] zap_deref_leaf+0x54/0x80 [zfs]
[ffffffffa006935b] ? kmem_alloc_debug+0xeb/0x130 [spl]
[ffffffffa016ec99] fzap_length+0x39/0xa0 [zfs]
[ffffffffa0171659] ? zap_hash+0x9/0x110 [zfs]
[ffffffffa0172246] ? zap_name_alloc_uint64+0x76/0xa0 [zfs]
[ffffffffa0173c44] zap_length_uint64+0x84/0xb0 [zfs]
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffffa0118d10] ? ddt_zap_lookup+0x0/0xe0 [zfs]
[ffffffffa0118d60] ddt_zap_lookup+0x50/0xe0 [zfs]
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffff815d5f59] ? mutex_unlock+0x9/0x20
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffffa0118d10] ? ddt_zap_lookup+0x0/0xe0 [zfs]
[ffffffffa0118d60] ddt_zap_contains+0x98/0xd0 [zfs]
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffffa006935b] ? kmem_alloc_debug+0xeb/0x130 [spl]
[ffffffffa0140290] ? dsl_scan_check_pause+0x10/0x160 [zfs]
[ffffffffa0141496] dsl_scan_visitbp.clone.2+0x316/0xbe0 [zfs]
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffffa014096d] ? dsl_scan_prefetch+0xd/0xb0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa0141cc7] dsl_scan_visitbp.clone.2+0xb47/0xbe0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa014179e] dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
[ffffffffa010e101] arc_read_nolock+0x621/0x7e0 [zfs]
[ffffffffa01418e5] dsl_scan_visitbp.clone.2+0x765/0xbe0 [zfs]
[ffffffffa0141daf] dsl_scan_visit_rootbp+0x4f/0x60 [zfs]
[ffffffffa0141e56] dsl_scan_visitds+0x96/0x3c0 [zfs]
[ffffffffa01412e81] dsl_scan_sync+0x141/0x700 [zfs]
[ffffffffa019f0b4] ? zio_wait+0xe4/0x170 [zfs]
[ffffffffa015108e] spa_sync+0x40e/0x9b0 [zfs]
[ffffffff8100bd56] ? ftrace_call+0x5/0x2b
[ffffffffa015f5f1] txg_sync_thread+0x241/0x3c0 [zfs]
[ffffffffa015f3b0] ? txg_sync_thread+0x0/0x3c0 [zfs]
[ffffffffa006b108] thread_generic_wrapper+0x78/0x90 [spl]
[ffffffffa006b090] ? thread_generic_wrapper+0x0/0x90 [spl]
[ffffffff810871f6] kthread+0x96/0xa0
[ffffffff8100cde4] kernel_thread_helper+0x4/0x10
[ffffffff81087160] ? kthread+0x0/0xa0
[ffffffff8100cde0] ? kernel_thread_helper+0x0/0x10
Code: 49 07 00 85 c0 74 2c 49 8d 44 24 08 48 39 c3 74 7d 31 d2 48 8b 75 c8 48 8b 7d c0 41 ff 56 10 48 81 7d c8 01 be 00 81 49 8b 45 00 <8b> 90 88 16 00 00 74 1b 48 83 c3 08 4d 85 ff 75 92 4c 39 eb 76
RIP [ffffffff8100fe5d] print_context_stack+0xad/0x140
 RSP <ffff88008bb3e040>
CR2: 0000000000001689
-- [ end trace 314f1433b31e17fa ] ---
kernel panic - not syncing: Fatal exception in interrupt
Pid: 1727, comm: txg_sync Tainted: P    D   2.6.38-8-server #42-Ubuntu
Call trace:

behlendorf commented 13 years ago

That's unfortunate. Well, then you can either rebuild your kernel with 16k stacks or wait for the real fix. My plan is to work on it next week.

behlendorf commented 13 years ago

Can someone who can still reproduce this issue test a proposed fix? I believe I've come up with a fairly clean fix for this stack issue; however, after spending a couple of hours trying to recreate a failing test case here, I haven't been able to. Please give the zio_execute branch a try. I've hopefully resolved the stack issue by rescheduling all scanning I/O to the zio issue threads. This effectively buys us an extra 8k of stack, and because the scheduling latency is far lower than performing actual disk I/O, there shouldn't be an impact on scrub performance. This is confirmed by my testing thus far.

https://github.com/behlendorf/zfs/commits/zio_execute

Maraschin commented 13 years ago

Hello,

I cloned spl yesterday and just got zio_execute... git clone git://github.com/behlendorf/zfs.git zio_execute (Is this correct?)

I tried to import the pool I was having problems with... and it crashed... Since I can't copy and paste the stack trace I will have to type everything, so I will do it tomorrow since it is past midnight on this side of the globe ;) BTW, I tried with /sbin/modprobe zfs zfs_no_scrub_io=1 as well, same result. Sorry, the fix did not work for me...

behlendorf commented 13 years ago

git clone git://github.com/behlendorf/zfs.git
cd zfs
git co -b zio_execute origin/zio_execute

Thanks. I think the patch is flawed; you're the second person to report a hang on import, although I haven't observed this myself. Let me think about how that might happen, and thanks for the feedback.

Maraschin commented 13 years ago

Hi, just some more information:

My problem is with the pool I was running scrub on, which crashed the server the first time. In the past I was able to mount it again, but now it just crashes directly, and that is even before it starts to run the scrub... (at least I think so, since it never completes the import). Then I created a VM server just to run some tests; there I get the same problem when I try to import it, but I have no problem importing another pool (both are about the same size, more than 10TB, but one is a raidz with 6 discs and the other is 4 pairs of mirrored discs).

behlendorf commented 13 years ago

Thanks for the clue. Which type of pool has trouble importing? The raidz or the mirror pairs? Also, do you have any debugging printed to dmesg for the hung zpool import?

Maraschin commented 13 years ago

Pool whose import did work (it wasn't running any scrub previously):

pool  6.82T   719G  6.12T    10%  1.00x  ONLINE  -

  pool: pool
 state: ONLINE
 scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    pool        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        sdm     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
      mirror-2  ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
      mirror-3  ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdl     ONLINE       0     0     0

Pool that will not import after the crash (this one was running a scrub and was at 40% when it crashed). Interestingly, it only crashed after I ran "zpool status".

    NAME        STATE     READ WRITE CKSUM
    lpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        dm-3    ONLINE       0     0     0
        dm-4    ONLINE       0     0     0
        dm-5    ONLINE       0     0     0
        dm-6    ONLINE       0     0     0
        dm-7    ONLINE       0     0     0
        dm-8    ONLINE       0     0     0

Maraschin commented 13 years ago

Details: In the information I just sent, the first status is from a VM, so the discs are really dm's which I remap... the second status was taken from the main machine when it was still working. I can't list the values in the VM since it will not import...

Debug information when I try to run zpool import lpool

BTW, since I need to type every single line below (no way to copy and paste), which information is really needed and what can I just ignore?

Oops, I just crashed the main server trying to run zpool status... now I will have to wait until I can get to it to fix it...

Maraschin commented 13 years ago

Here it is:

dbuf_read+0x40f/0x7f0 [zfs]
__dbuf_hold_impl+0x371/0x460 [zfs]
? dbuf_find+0xe0/0x100 [zfs]
__dbuf_hold_impl+0x16f/0x460 [zfs]
? dbuf_find+0xe0/0x100 [zfs]
__dbuf_hold_impl+0x16f/0x460 [zfs]
dbuf_hold_impl+0x86/0xc0 [zfs]
dbuf_hold+0x20/0x30 [zfs]
dmu_buf_hold+0x97/0x1a0 [zfs]
zap_idx_to_blk+0xab/0x140 [zfs]
? ftrace_call+0x5/0x2b
zap_deref_leaf+0x54/0x80 [zfs]
? kmem_alloc_debug+0xeb/0x130 [spl]
fzap_length+0x39/0xa0 [zfs]
? zap_hash+0x9/0x110 [zfs]
? zap_name_alloc_uint64+0x76/0xa0 [zfs]
zap_length_uint64+0x84/0xb0 [zfs]
? ftrace_call+0x5/0x2b
? ddt_zap_lookup+0x0/0xe0 [zfs]
ddt_zap_lookup+0x50/0xe0 [zfs]
? vmap_pud_range+0x13e/0x1d0  (This seems new if I compare with the ones above; it was ? mutex_unlock+0x9/0x20 before)
? ftrace_call+0x5/0x2b
? ddt_zap_lookup+0x0/0xe0 [zfs]
ddt_class_contains+0x98/0xd0 [zfs]
? ftrace_call+0x5/0x2b
? ftrace_call+0x5/0x2b
? ftrace_call+0x5/0x2b
? kmem_alloc_debug+0xeb/0x130 [spl]
? dsl_scan_check_pause+0x10/0x160 [zfs]
dsl_scan_visitbp.clone.2+0x316/0xbe0 [zfs]
? ftrace_call+0x5/0x2b
? dsl_scan_prefetch+0xd/0xb0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0xb47/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
dsl_scan_visitbp.clone.2+0x61e/0xbe0 [zfs]
arc_read_nolock+0x621/0x7e0 [zfs]
dsl_scan_visitbp.clone.2+0x765/0xbe0 [zfs]
dsl_scan_visit_rootbp+0x4f/0x60 [zfs]
dsl_scan_visitds+0x96/0x3c0 [zfs]
dsl_scan_sync+0x141/0x700 [zfs]
? zio_destroy+0xae/0xf0 [zfs] (This is NEW as well; before it was ? zio_wait+0xe4/0x170 [zfs])
spa_sync+0x40e/0x9b0 [zfs]
? ftrace_call+0x5/0x2b
[ffffffffa015f5f1] txg_sync_thread+0x241/0x3c0 [zfs]
[ffffffffa015f3b0] ? txg_sync_thread+0x0/0x3c0 [zfs]
[ffffffffa006b108] thread_generic_wrapper+0x78/0x90 [spl]
? thread_generic_wrapper+0x0/0x90 [spl]
[ffffffff810871f6] kthread+0x96/0xa0
[ffffffff8100cde4] kernel_thread_helper+0x4/0x10
[ffffffff81087160] ? kthread+0x0/0xa0
[ffffffff8100cde0] ? kernel_thread_helper+0x0/0x10
Code: 49 07 00 85 c0 74 2c 49 8d 44 24 08 48 39 c3 74 7d 31 d2 48 8b 75 c8 48 8b 7d c0 41 ff 56 10 48 81 7d c8 01 be 00 81 49 8b 45 00 <8b> 90 88 16 00 00 74 1b 48 83 c3 08 4d 85 ff 75 92 4c 39 eb 76
RIP [ffffffff8100fe5d] print_context_stack+0xad/0x140
 RSP <ffff88008bb3e040>
CR2: 0000000000001689
-- [ end trace 36a20409a61c3a57 ] ---

behlendorf commented 13 years ago

Ahh, this last stack is exactly what I wanted to see. It shows that the patch worked as expected, but we subsequently hit another stack overflow I hadn't considered. My first patch very specifically targeted just one case. I'll look at this next case now and come up with a new patch.

Maraschin commented 13 years ago

Great! Let me know when I can test it again! ;)

devsk commented 13 years ago

Brian,

I am beginning to think that this is not going to work long term. How long are we going to keep running into these stack issues (a different one every time) and keep fixing them? This is not how a reliable system is supposed to work. We either need a guarantee that the stack will never go beyond 8K, or we need to conclude that we have to increase the stack size to 16KB. Is there any way to traverse the code tree automatically using a tool, measure the worst case depth we will ever reach, and make some assumptions about local variables to estimate how much space we really need?

-devsk

behlendorf commented 13 years ago

My expectation is that once we identify and fix all of these cases this will largely become an issue of the past. Thus far, once a call path has been identified and fixed, it has stayed fixed. However, you're right, we absolutely need to get all the offending call paths, and it would be nice to be able to know when we've got them all. Unfortunately, since the offending call paths are usually recursive, no static analysis tool will be able to catch this. I do run the Coverity static analysis tool over the code occasionally and it does catch a few things.

If you want to improve your stability today, and provide some very valuable debugging, the best thing you can do is increase your default stack size to 16k by defining THREAD_ORDER to 2. Then enable the kernel stack tracer by default in your sysctl.conf.

kernel.stack_tracer_enabled = 1

And finally add a debugfs line to you /etc/fstab.

debugfs                 /sys/kernel/debug       debugfs defaults        0 0

By increasing the stack size to 16k you will be preventing the stack overflow in the first place. Then by enabling the stack tracer and mounting debugfs you will be able to see the worst case stack usage on your system since it was booted. Just cat the file /sys/kernel/debug/tracing/stack_max_size; if the depth ever gets near 8000 you would have had a stack overflow. Ideally, the max depth should be kept below 7000 to leave at least a little headroom; lower is of course better. You can then consult the contents of /sys/kernel/debug/tracing/stack_trace to get the exact offending stack.
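
For example, a small shell check along those lines might look like this (a sketch; the 7000-byte threshold is just the headroom figure mentioned above):

# report the worst case kernel stack usage and warn when it nears the 8k limit
max=$(cat /sys/kernel/debug/tracing/stack_max_size)
echo "worst case kernel stack usage: ${max} bytes"
if [ "$max" -gt 7000 ]; then
    echo "WARNING: within ~1k of the 8k limit, check the offending call path:"
    cat /sys/kernel/debug/tracing/stack_trace
fi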

I'd hope that after running for some time with real usage and never overrunning the stack, we could say pretty confidently that all the overflows have been identified. I really want to get them all. If we don't, people will never be able to use ZFS with a stock Linux kernel. And I personally feel supporting stock kernels is a requirement for getting larger acceptance and use of ZFS. For a long time before I worked on ZFS I was a Lustre filesystem developer. Building Lustre requires you to patch your kernel, and I know for a fact that's one of the major reasons it hasn't been more widely adopted.

Unfortunately, the Linux kernel in particular is a lot more difficult to work in than user space. It's just the way it is. The upside of course is that once things are stable we should be able to achieve considerably better performance numbers. It's just that none of it is easy; if it were, someone else would have done it by now.

Back to the original bug. Using your reproducer I was able to locally recreate this issue. After increasing the kernel stack size and enabling the stack debugging, it looks like the worst case use was 7928 bytes. A little too much after you take into account the task struct overhead. However, as expected, with the larger stack I'm able to resilver the pool just fine.

devsk commented 13 years ago

Brian,

Good that you were able to reproduce the issue.

A follow-up question: what's the side effect of running with 16K stacks? Is it just the additional memory usage? If I am running around 1000 total tasks, I don't care about 8KB*1000 ~ 8MB of extra RAM being used. I don't care if it's even 80MB. Seems like a bargain for the stability that I get. And given that I have 12GB of RAM in my system, I never want to hit this issue again. Stability and reliability are my topmost requirements.

I wonder, if 16KB stacks don't have any side effects, why is it not a configurable option for the kernel? I vaguely remember that stack size used to be a config option, and people used to try to run 4K stacks a few years ago.

-devsk

behlendorf commented 13 years ago

The penalty (aside from the additional memory usage) is that it might slightly impact your fork-exec performance. If your system is low on memory it may take the kernel slightly longer to allocate 4 instead of 2 contiguous pages of memory. This was played up to be a big deal in Linux some time back, since some workloads are fork-exec happy. Cramming everything into 8k gave Linux a performance advantage over several other operating systems. Whether it was worth the cost is still debated.

devsk commented 13 years ago

Oh wow! You are saying that fork-exec performance is affected by the need to allocate 8K of additional memory. I do understand it has to be contiguous, but how much fork-exec traffic is there in a non-server environment? Given my system had 6GB of RAM free (and 8GB in free+caches) when the lock-ups happened, my desktop/home server is way over-provisioned with 12GB of RAM. I think I am going to just go with 16K stacks and see if I run into any performance issues.

behlendorf commented 13 years ago

After thinking about this for a very long time I've had what I believe is a very good idea. Please give the updated zio_execute branch a try; you'll want to destroy your current copy and pull a new one. Basically, it dawned on me there was very little reason not to make the txg_sync_thread zio's all asynchronous. This can effectively double our available stack space in the recursive cases we care about most; I've described my thinking in commit 2973c48adba5af29c51618cc723f2b8e0aac2c85. Thus far this change is working very well for me, but I'd appreciate it if others who have hit this stack issue could test it.
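
For instance, refreshing the working copy might look like this (a sketch mirroring the checkout commands given earlier in this thread):

rm -rf zfs                                       # discard the stale working copy
git clone git://github.com/behlendorf/zfs.git
cd zfs
git checkout -b zio_execute origin/zio_execute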

devsk commented 13 years ago

quick question: you mention reverting 6656bf5. Is that already done in zio_execute branch?

behlendorf commented 13 years ago

@devsk: The fork-exec thing is basically for server workloads, but since working on server issues pays the bills for most Linux kernel developers, that's where the focus is. For a desktop system I doubt you'll notice anything but the extra memory usage. It only really starts seriously hurting your performance when you're low on memory and your memory is heavily fragmented.

Yes, I reverted commit 6656bf56216f36805731298ee0f4de87ae6b6b3d with commit 75424e3bc3b098dbe1b06f8462cf362c65975535 in the zio_execute branch. Hopefully I'm not getting ahead of myself here, but I'll make sure it's well tested before merging to master. If you've already rebuilt your kernel with 16k stacks this change won't impact you either way. But I'd be very interested in what your worst case stack usage is after a few days or weeks of usage. That's an excellent cross check that I got this right. A better cross check would be an automated test suite, but that's still coming!

devsk commented 13 years ago

Does zio_execute branch have everything that's in master at this time?

behlendorf commented 13 years ago

Yes, it's currently based against the latest master so it has all the fixes which are fully baked.

devsk commented 13 years ago

Looks like we have a winner here! I will know for sure once the resilvering completes, but the original issue used to happen within a couple of minutes.

I just compiled the zio_execute branch and did not change the kernel stack size.

behlendorf commented 13 years ago

Excellent! If you did enable the stack tracer and mount the debugfs with the 8k kernel, I'd love to see the worst case stack usage on your system.

Maraschin commented 13 years ago

lol, I managed to import the pool! It is running on a VM, so it is quite slow, but at least I can get the stack tracer output if something goes wrong... now it's wait and see how it goes ;) Thanks and great job!

sudo zpool status
  pool: lpool
 state: ONLINE
 scan: scrub in progress since Sat May  7 10:02:59 2011
    3.13T scanned out of 6.97T at 1.38M/s, (scan is slow, no estimated time)
    0 repaired, 44.86% done
config:

    NAME        STATE     READ WRITE CKSUM
    lpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0