openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs OOM and general protection fault when running bonnie++ #131

Closed fajarnugraha closed 13 years ago

fajarnugraha commented 13 years ago

I was testing bonnie++ with "-s 3g" on a Xen dom0 with 1.5G memory. On the first run I got an OOM error:

crond invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Pid: 8903, comm: crond Tainted: P           2.6.32.28-1.pv_ops.el5.fanxen #1
Call Trace:
 [] ? cpuset_print_task_mems_allowed+0x92/0x9e
 [] oom_kill_process+0x85/0x25b
 [] ? select_bad_process+0xbc/0x102
 [] __out_of_memory+0x3d/0x86
 [] out_of_memory+0xfc/0x195
 [] __alloc_pages_nodemask+0x487/0x595
 [] alloc_pages_current+0x9b/0xa4
 [] __page_cache_alloc+0x5c/0x60
 [] __do_page_cache_readahead+0x81/0x196
 [] ra_submit+0x21/0x25
 [] filemap_fault+0x181/0x30c
 [] __do_fault+0x59/0x37c
 [] ? pte_mfn_to_pfn+0x29/0x43
 [] ? xen_pte_val+0x69/0x73
 [] handle_mm_fault+0x14a/0x776
 [] ? hrtimer_cancel+0x19/0x25
 [] ? do_nanosleep+0x63/0x8e
 [] do_page_fault+0x200/0x26c
 [] page_fault+0x25/0x30
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  60
CPU    1: hi:  186, btch:  31 usd:  69
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:   0
active_anon:10 inactive_anon:10 isolated_anon:0
 active_file:0 inactive_file:19 isolated_file:0
 unevictable:1694 dirty:0 writeback:0 unstable:0
 free:13648 slab_reclaimable:2814 slab_unreclaimable:137300
 mapped:1079 shmem:6 pagetables:1094 bounce:0
Node 0 DMA free:12972kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:12852kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1428 10047 10047
Node 0 DMA32 free:36192kB min:1820kB low:2272kB high:2728kB active_anon:40kB inactive_anon:40kB active_file:0kB inactive_file:76kB unevictable:6776kB isolated(anon):0kB isolated(file):0kB present:1462496kB mlocked:6776kB dirty:0kB writeback:0kB mapped:4316kB shmem:24kB slab_reclaimable:11256kB slab_unreclaimable:549176kB kernel_stack:3048kB pagetables:4376kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:185 all_unreclaimable? no
lowmem_reserve[]: 0 0 8618 8618
Node 0 Normal free:5428kB min:10996kB low:13744kB high:16492kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:8825716kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:24kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 3*4kB 2*8kB 1*16kB 2*32kB 1*64kB 2*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 3*4096kB = 12972kB
Node 0 DMA32: 2918*4kB 1131*8kB 329*16kB 89*32kB 39*64kB 8*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 36192kB
Node 0 Normal: 1*4kB 4*8kB 1*16kB 2*32kB 3*64kB 2*128kB 3*256kB 0*512kB 0*1024kB 2*2048kB 0*4096kB = 5428kB
1116 total pagecache pages
19 pages in swap cache
Swap cache stats: add 19234, delete 19215, find 296/546
Free swap  = 2027288kB
Total swap = 2097144kB
Out of memory: kill process 9143 (xend) score 30732 or a child

Killed process 9147 (xend)

My initial guess was that zfs was somehow eating up all the memory for the ARC and not freeing it when other applications needed it. So I tried setting the zfs_arc_max and zfs_arc_min module parameters to 68157440 (65M). I repeated the test and got this error:

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 5 
Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zlib_deflate zunicode(P) spl lockd sunrpc 8021q garp bridge stp llc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video output sbs sbshc parport_pc lp parport ipmi_devintf bnx2 radeon ttm drm_kms_helper ipmi_si drm ipmi_msghandler hpwdt i2c_algo_bit i2c_core snd_seq_dummy serio_raw snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i5k_amb iTCO_wdt iTCO_vendor_support i5000_edac edac_core snd_timer shpchp snd soundcore snd_page_alloc pcspkr qla2xxx scsi_transport_fc scsi_tgt cciss uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 8606, comm: mount Tainted: P           2.6.32.28-1.pv_ops.el5.fanxen #1 ProLiant BL460c G1
RIP: e030:[]  [] zfs_inode_destroy+0x55/0xa0 [zfs]
RSP: e02b:ffff880026f2faa8  EFLAGS: 00010286
RAX: ffff880024d17ba0 RBX: dead000000200200 RCX: dead000000100100
RDX: dead000000200200 RSI: 0000000000000003 RDI: ffff880036f211d8
RBP: ffff880026f2fac8 R08: 8000000000000000 R09: ffffffff8100fa1f
R10: dead000000100100 R11: ffff880026f2f7f8 R12: ffff880036f211d8
R13: ffff880024d17bc8 R14: ffff880024d17a50 R15: ffffffffa04abf0f
FS:  00007f8d4fd6f760(0000) GS:ffff8800280c9000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f0850a960a0 CR3: 0000000027085000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mount (pid: 8606, threadinfo ffff880026f2e000, task ffff88003513e700)
Stack:
 ffff880024d17bc8 ffff880026f2fb88 0000000000000000 7fffffffffffffff
<0> ffff880026f2fad8 ffffffffa04abfbf ffff880026f2faf8 ffffffff8112ded6
<0> ffff880026f2faf8 ffff880024d17bc8 ffff880026f2fb18 ffffffff8112e089
Call Trace:
 [] zpl_inode_destroy+0xe/0x10 [zfs]
 [] destroy_inode+0x2f/0x44
 [] generic_drop_inode+0x57/0x5b
 [] iput+0x66/0x6a
(XEN) Debugging connection not set up.
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

... and the dom0 kept rebooting and crashing with the same error (zfs_inode_destroy).

I had to boot using openindiana and delete the dataset used by bonnie to be able to access the pool on Linux again.
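For reference, the recovery amounted to roughly the following from the openindiana live environment (a sketch only; the pool and dataset names here are placeholders, not the ones from my setup):

zpool import tank                  # import the pool in the rescue environment
zfs destroy tank/bonnie-test       # remove the dataset bonnie++ had been writing to
zpool export tank                  # export cleanly so Linux can import the pool again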

So the questions are:

- is limiting zfs_arc_max a good practice with zfsonlinux?
- any ideas how to solve general protection fault for zfs_inode_destroy?

This is with 0.6.0-rc1 plus some zpool/zvol-related fixes which I posted earlier.

fajarnugraha commented 13 years ago

Additional info: the test was on a zfs filesystem (not a zvol)

behlendorf commented 13 years ago

> is limiting zfs_arc_max a good practice with zfsonlinux?

I would say it depends. Much like on Solaris, the zfs implementation is tied into the kernel's memory management. When the kernel runs low on memory it will request that zfs free some, and zfs will do this if it can. In addition, the zfs implementation, just like on Solaris, proactively frees memory to the system when memory is low. However, the exact thresholds for this have not yet been tuned for Linux. The proactive behavior is largely controlled by the arc_reclaim_needed() function and is based on the Solaris limits. It would not surprise me if these need to be adjusted for Linux.

You can absolutely limit the arc with zfs_arc_max, and this may be desirable for something like a desktop system. It will prevent the zfs cache from doing things like forcing your browser cache out to the swap device. This is obviously good for interactive performance; for a server config, however, I'd suggest you let zfs use all the memory. In that case you want it caching as much as possible.
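For reference, a minimal sketch of how that cap can be applied as module options (the modprobe.d file name is an assumption and varies by distribution; the value shown is the 65M used above):

# persistent: applied the next time the zfs module is loaded
echo "options zfs zfs_arc_max=68157440 zfs_arc_min=68157440" >> /etc/modprobe.d/zfs.conf
# or one-off, at module load time
modprobe zfs zfs_arc_max=68157440 zfs_arc_min=68157440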

Another thought on how to avoid the oops is to increase Linux's min_free_kbytes. This is the value the kernel uses to determine when it is low on memory and should start reclaiming some. Just set a larger value here and see if it helps with the oops. As we get more experience with how the port behaves under a real workload on Linux, we'll be able to tune it better.

/proc/sys/vm/min_free_kbytes
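For example (the value below is illustrative only; pick something appropriate for the machine):

echo 32768 > /proc/sys/vm/min_free_kbytes    # raise the kernel's low-memory watermark
cat /proc/sys/vm/min_free_kbytes             # confirm the new value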

> any ideas how to solve general protection fault for zfs_inode_destroy?

That's clearly a bug. According to your partial stack trace, the problem occurred in the list_remove() call in zfs_inode_destroy(), probably because we had some sort of memory allocation error during mount which caused us to walk an untested error path. Do you have a full stack trace which shows this? I'd expect it to start in sys_mount().

void
zfs_inode_destroy(struct inode *ip)
{
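        /*
         * Destroy-time hook for a zfs inode: unlink the znode from the
         * filesystem-wide z_all_znodes list (the >>> line below is where the
         * reported fault occurred), free any cached ACL, and return the
         * znode to its kmem cache.
         */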
        znode_t *zp = ITOZ(ip);
        zfs_sb_t *zsb = ZTOZSB(zp);

        mutex_enter(&zsb->z_znodes_lock);
        >>> list_remove(&zsb->z_all_znodes, zp); <<<
        mutex_exit(&zsb->z_znodes_lock);

        if (zp->z_acl_cached) {
                zfs_acl_free(zp->z_acl_cached);
                zp->z_acl_cached = NULL;
        }

        kmem_cache_free(znode_cache, zp);
}
fajarnugraha commented 13 years ago

I'm a bit reluctant to restart the test on the dom0 (since it brought down the guests on top of it), so I repeated the test on a domU instead, this time with only 500MB memory. I also increased /proc/sys/vm/min_free_kbytes to 16384 (from 5690). It didn't make much difference:

------------[ cut here ]------------
kernel BUG at fs/inode.c:1343!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/module/zfs/parameters/zio_delay_max
CPU 3 
Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zlib_deflate zunicode(P) spl autofs4 lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath parport_pc lp parport snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc xen_netfront pcspkr uhci_hcd ohci_hcd ehci_hcd
Pid: 48, comm: kswapd0 Tainted: P           2.6.32.28-1.pv_ops.el5.fanxen #1 
RIP: e030:[]  [] iput+0x20/0x6a
RSP: e02b:ffff8800799c7c70  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880078931958 RCX: ffff880078931988
RDX: ffff880078931988 RSI: 0000000000000003 RDI: ffff880078931958
RBP: ffff8800799c7c80 R08: 0000000000000000 R09: ffffffff8100fa1f
R10: dead000000100100 R11: dead000000200200 R12: ffff880078931958
R13: ffff880006a3dc00 R14: 0000000000000000 R15: ffff8800799c7d54
FS:  00007f55a00016e0(0000) GS:ffff880002499000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff7ecdcc28 CR3: 0000000076844000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 48, threadinfo ffff8800799c6000, task ffff88007a5dc440)
Stack:
 ffff8800799c7c80 ffff880078794e00 ffff8800799c7ca0 ffffffff8112b01d
<0> dead000000100100 ffff880078794e00 ffff8800799c7cc0 ffffffff8112b11b
<0> ffff880078794e00 ffff880001d71a88 ffff8800799c7d30 ffffffff8112b310
Call Trace:
 [] dentry_iput+0xb8/0xca
 [] d_kill+0x49/0x6a
 [] __shrink_dcache_sb+0x1d4/0x272
 [] shrink_dcache_memory+0xf4/0x17a
 [] shrink_slab+0xe1/0x153
 [] kswapd+0x3dd/0x516
 [] ? isolate_pages_global+0x0/0x1ba
 [] ? need_resched+0x23/0x2d
 [] ? autoremove_wake_function+0x0/0x3d
 [] ? _spin_unlock_irqrestore+0x16/0x18
 [] ? kswapd+0x0/0x516
 [] kthread+0x6e/0x76
 [] child_rip+0xa/0x20
 [] ? int_ret_from_sys_call+0x7/0x1b
 [] ? retint_restore_args+0x5/0x6
 [] ? child_rip+0x0/0x20
Code: 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 85 ff 48 89 fb 74 50 48 83 bf 18 02 00 00 40 75 04 <0f> 0b eb fe 48 8d 7f 48 48 c7 c6 e0 9f 9b 81 e8 56 b9 0f 00 85 
RIP  [] iput+0x20/0x6a
 RSP 
---[ end trace bb15de249868b1f5 ]---
Kernel panic - not syncing: Fatal exception
Pid: 48, comm: kswapd0 Tainted: P      D    2.6.32.28-1.pv_ops.el5.fanxen #1
Call Trace:
 [] panic+0xa5/0x164
 [] ? agp_amd64_probe+0x579/0x584
 [] ? xen_restore_fl_direct_end+0x0/0x1
 [] ? _spin_unlock_irqrestore+0x16/0x18
 [] ? release_console_sem+0x194/0x19d
 [] ? console_unblank+0x6a/0x6f
 [] ? print_oops_end_marker+0x23/0x25
 [] oops_end+0xb7/0xc7
 [] die+0x5a/0x63
 [] do_trap+0x115/0x124
 [] do_invalid_op+0x9c/0xa5
 [] ? iput+0x20/0x6a
 [] ? xen_force_evtchn_callback+0xd/0xf
 [] ? check_events+0x12/0x20
 [] invalid_op+0x1b/0x20
 [] ? xen_restore_fl_direct_end+0x0/0x1
 [] ? iput+0x20/0x6a
 [] dentry_iput+0xb8/0xca
 [] d_kill+0x49/0x6a
 [] __shrink_dcache_sb+0x1d4/0x272
 [] shrink_dcache_memory+0xf4/0x17a
 [] shrink_slab+0xe1/0x153
 [] kswapd+0x3dd/0x516
 [] ? isolate_pages_global+0x0/0x1ba
 [] ? need_resched+0x23/0x2d
 [] ? autoremove_wake_function+0x0/0x3d
 [] ? _spin_unlock_irqrestore+0x16/0x18
 [] ? kswapd+0x0/0x516
 [] kthread+0x6e/0x76
 [] child_rip+0xa/0x20
 [] ? int_ret_from_sys_call+0x7/0x1b
 [] ? retint_restore_args+0x5/0x6
 [] ? child_rip+0x0/0x20

And the odd thing is that the crash happened during bonnie++'s stat phase (or immediately after, since the "done" message hadn't been printed yet):

# bonnie++ -u nobody
Using uid:99, gid:99.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...

I'm inclined to agree with you about a possible memory allocation error. What happens in zfsonlinux when it needs to allocate memory but zfs_arc_max is already exceeded?

fajarnugraha commented 13 years ago

Some more follow up.

Repeating the test using 500MB memory and 65M zfs_arc_max, but with zvol+ext4 (instead of zfs), caused an OOM.

After increasing this domU's memory to 1000MB (zfs_arc_max still limited to 65M), bonnie++ (both on zfs and on zvol+ext4) was able to run correctly. So memory matters a lot here, and a low-memory situation can cause unpredictable crashes.

It looks like the earlier test on the dom0 failed because, even though it had 1.5G of memory, some of it was already used by other services. WRT zfs_inode_destroy in the first post, that's all that was on the serial console when the crash happened; no other stack information was available.

This worries me a little, though. Even when zfs_arc_max is limited to 65M, we need to "reserve" 1G of memory to make sure intensive I/O operations don't cause a system crash.

A short-term workaround seems to be to set a lower arc_c_max in module/zfs/arc.c (currently set to the greater of 3/4 of all memory or all but 1G), at least until we can be sure that zfs can free memory back to the system fast enough to avoid OOM.
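As a quick sanity check (a sketch; assuming this build exposes the arcstats kstat), the effective ARC limits and current size can be read back without recompiling:

grep -E '^(c_min|c_max|size) ' /proc/spl/kstat/zfs/arcstats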

behlendorf commented 13 years ago

Can you clarify the exact configuration you were testing? My understanding is you were running zfs in dom0 and serving up zvols as block devices for domUs. You then ran bonnie++ in the domU (against the backing zvol), which triggered an OOM in dom0 which killed xend. Do I have that right?

If so, I may have some ideas why that might happen. The OOM indicates some process running in a taskset with limited memory failed an order-0 allocation (4k of memory). The OOM message makes this fairly clear because __GFP_HARDWALL is set, which limits user-space processes' memory consumption within the taskset. So even though there were 13648 4k pages available globally, they were not in this taskset.

And therein, I think, may be part of the problem. The zfs code all runs in the kernel and as such doesn't honor taskset limits. So, because there was still quite a bit of free memory on the system outside the taskset, the memory reclaim code never ran. It's also not quite clear to me how exactly the taskset memory accounting is done and whether kernel anonymous pages count towards the memory limit you set. More investigation is needed.

As for the crashes, those are going to be easier to hunt down. Limiting the ARC to 65M is amazingly small, but it should be good for finding any memory handling problems. In your case I have a guess where things went wrong, but you'll need to rebuild the spl+zfs code with the --enable-debug configure option. Enabling this option will result in lots of assertions being compiled into the code. In particular, I'm hoping the following one gets hit, because it would show that the issue is a failure to allocate the inode. That would explain the above inode accounting issues. Although if we hit a different assert, that will be information too.

ASSERT(*zpp != NULL);
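A rough sketch of the rebuild (assuming the usual autotools flow in the spl and zfs source trees; the --with-spl path is whatever your spl checkout is):

cd spl && ./configure --enable-debug && make && make install
cd ../zfs && ./configure --enable-debug --with-spl=/path/to/spl && make && make install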

fajarnugraha commented 13 years ago

There were two different environments.

The first environment was bonnie++ using zfs (not a zvol) on the dom0. This configuration first caused an OOM (which killed xend), and later a kernel panic (when I limited zfs_arc_max). No domU was involved at all in this test. This is what I reported in the first post.

The second one was bonnie++ on a domU, using a zvol-backed device from the dom0. The domU uses zfs on top of this block device. In the domU's zpool I created a zvol (which I then formatted as ext4) and a zfs filesystem. In this environment I ran bonnie++ on both zfs and zvol+ext4. There were OOMs and panics in this environment, but all happened in the domU. The dom0 (which still has zfs_arc_max set) was fine during this test.

I'll try using the debug version sometime later.

behlendorf commented 13 years ago

You know, bugs aside, it's really cool that all of that basically works. :)

fajarnugraha commented 13 years ago

Yes it is :D

I've been working to get zfs and Xen to play nicely. In the past the efforts include:

So if reserving an extra 1G and using a small arc size is enough to get it working, it's more than enough for dev purposes.

behlendorf commented 13 years ago

Hopefully, given a few weeks, we'll get these issues sorted out. This also helps explain why you're so interested in the zvols working well. :)

behlendorf commented 13 years ago

With the changes to the direct reclaim path in the latest master code (the GFP_NOFS branch merge), I think we have a handle on the crashes and OOM issues described here. Fajar, if you agree, please go ahead and close this bug. Performance/memory issues still aren't resolved but can be worked in a different bug.

fajarnugraha commented 13 years ago

Closing this bug. Further memory/OOM issues are better tracked in issue #154, since using zfs as root is more general and receives more testing compared to bonnie++.