fajarnugraha closed this issue 13 years ago
Additional info: the test was on a zfs filesystem (not a zvol)
Is limiting zfs_arc_max a good practice with zfsonlinux?
I would say it depends. Much like on Solaris, the zfs implementation is tied into the kernel's memory management. When the kernel runs low on memory it will request that zfs free some, and zfs will do so if it can. In addition, just like on Solaris, the zfs implementation proactively frees memory to the system when memory is low. However, the exact thresholds for this have not yet been tuned for Linux. The proactive behavior is largely controlled by the arc_reclaim_needed() function and is based on the Solaris limits. It would not surprise me if these should be adjusted for Linux.
You can absolutely limit the arc with zfs_arc_max, and this may be desirable for something like a desktop system. It will prevent the zfs cache from doing things like forcing your browser cache out to the swap device. That is obviously good for interactive performance; for a server config, however, I'd suggest you let zfs use all the memory, since in that case you want it caching as much as possible.
Another thought on how to avoid the oops is to increase Linux's min_free_kbytes. This is the value the kernel uses to determine when it is low on memory and it should start reclaiming some. Just set a larger value here and see if it helps with the oops. As we get more experience with how the port behaves under a real workload on Linux we'll be able to tune it better.
/proc/sys/vm/min_free_kbytes
Any ideas how to solve the general protection fault in zfs_inode_destroy?
That's clearly a bug. According to your partial stack trace the problem occurred in the list_remove in zfs_inode_destroy(), probably because we hit some sort of memory allocation error during mount which caused us to walk an untested error path. Do you have a full stack trace which shows this? I'd expect it to start in sys_mount().
void
zfs_inode_destroy(struct inode *ip)
{
        znode_t *zp = ITOZ(ip);
        zfs_sb_t *zsb = ZTOZSB(zp);

        mutex_enter(&zsb->z_znodes_lock);
>>>     list_remove(&zsb->z_all_znodes, zp);    <<<
        mutex_exit(&zsb->z_znodes_lock);

        if (zp->z_acl_cached) {
                zfs_acl_free(zp->z_acl_cached);
                zp->z_acl_cached = NULL;
        }

        kmem_cache_free(znode_cache, zp);
}
I'm a bit reluctant to restart the test on dom0 (since it brought down the guests on top of it), so I repeated the test on a domU instead, this time with only 500MB of memory. I also increased /proc/sys/vm/min_free_kbytes to 16384 (from 5690). It didn't make much difference:
------------[ cut here ]------------
kernel BUG at fs/inode.c:1343!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/module/zfs/parameters/zio_delay_max
CPU 3
Modules linked in: zfs(P) zcommon(P) znvpair(P) zavl(P) zlib_deflate zunicode(P) spl autofs4 lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath parport_pc lp parport snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc xen_netfront pcspkr uhci_hcd ohci_hcd ehci_hcd
Pid: 48, comm: kswapd0 Tainted: P 2.6.32.28-1.pv_ops.el5.fanxen #1
RIP: e030:[] [ ] iput+0x20/0x6a
RSP: e02b:ffff8800799c7c70  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880078931958 RCX: ffff880078931988
RDX: ffff880078931988 RSI: 0000000000000003 RDI: ffff880078931958
RBP: ffff8800799c7c80 R08: 0000000000000000 R09: ffffffff8100fa1f
R10: dead000000100100 R11: dead000000200200 R12: ffff880078931958
R13: ffff880006a3dc00 R14: 0000000000000000 R15: ffff8800799c7d54
FS:  00007f55a00016e0(0000) GS:ffff880002499000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff7ecdcc28 CR3: 0000000076844000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 48, threadinfo ffff8800799c6000, task ffff88007a5dc440)
Stack:
 ffff8800799c7c80 ffff880078794e00 ffff8800799c7ca0 ffffffff8112b01d
<0> dead000000100100 ffff880078794e00 ffff8800799c7cc0 ffffffff8112b11b
<0> ffff880078794e00 ffff880001d71a88 ffff8800799c7d30 ffffffff8112b310
Call Trace:
 [ ] dentry_iput+0xb8/0xca
 [ ] d_kill+0x49/0x6a
 [ ] __shrink_dcache_sb+0x1d4/0x272
 [ ] shrink_dcache_memory+0xf4/0x17a
 [ ] shrink_slab+0xe1/0x153
 [ ] kswapd+0x3dd/0x516
 [ ] ? isolate_pages_global+0x0/0x1ba
 [ ] ? need_resched+0x23/0x2d
 [ ] ? autoremove_wake_function+0x0/0x3d
 [ ] ? _spin_unlock_irqrestore+0x16/0x18
 [ ] ? kswapd+0x0/0x516
 [ ] kthread+0x6e/0x76
 [ ] child_rip+0xa/0x20
 [ ] ? int_ret_from_sys_call+0x7/0x1b
 [ ] ? retint_restore_args+0x5/0x6
 [ ] ? child_rip+0x0/0x20
Code: 5b 41 5c 41 5d 41 5e 41 5f c9 c3 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 85 ff 48 89 fb 74 50 48 83 bf 18 02 00 00 40 75 04 <0f> 0b eb fe 48 8d 7f 48 48 c7 c6 e0 9f 9b 81 e8 56 b9 0f 00 85
RIP [ ] iput+0x20/0x6a
 RSP
---[ end trace bb15de249868b1f5 ]---
Kernel panic - not syncing: Fatal exception
Pid: 48, comm: kswapd0 Tainted: P      D 2.6.32.28-1.pv_ops.el5.fanxen #1
Call Trace:
 [ ] panic+0xa5/0x164
 [ ] ? agp_amd64_probe+0x579/0x584
 [ ] ? xen_restore_fl_direct_end+0x0/0x1
 [ ] ? _spin_unlock_irqrestore+0x16/0x18
 [ ] ? release_console_sem+0x194/0x19d
 [ ] ? console_unblank+0x6a/0x6f
 [ ] ? print_oops_end_marker+0x23/0x25
 [ ] oops_end+0xb7/0xc7
 [ ] die+0x5a/0x63
 [ ] do_trap+0x115/0x124
 [ ] do_invalid_op+0x9c/0xa5
 [ ] ? iput+0x20/0x6a
 [ ] ? xen_force_evtchn_callback+0xd/0xf
 [ ] ? check_events+0x12/0x20
 [ ] invalid_op+0x1b/0x20
 [ ] ? xen_restore_fl_direct_end+0x0/0x1
 [ ] ? iput+0x20/0x6a
 [ ] dentry_iput+0xb8/0xca
 [ ] d_kill+0x49/0x6a
 [ ] __shrink_dcache_sb+0x1d4/0x272
 [ ] shrink_dcache_memory+0xf4/0x17a
 [ ] shrink_slab+0xe1/0x153
 [ ] kswapd+0x3dd/0x516
 [ ] ? isolate_pages_global+0x0/0x1ba
 [ ] ? need_resched+0x23/0x2d
 [ ] ? autoremove_wake_function+0x0/0x3d
 [ ] ? _spin_unlock_irqrestore+0x16/0x18
 [ ] ? kswapd+0x0/0x516
 [ ] kthread+0x6e/0x76
 [ ] child_rip+0xa/0x20
 [ ] ? int_ret_from_sys_call+0x7/0x1b
 [ ] ? retint_restore_args+0x5/0x6
 [ ] ? child_rip+0x0/0x20
And the odd thing is that the crash happens during bonnie's stat phase (or immediately after, since the "done" message hadn't been printed yet):
# bonnie++ -u nobody
Using uid:99, gid:99.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...
I'm inclined to agree with you about possible memory allocation error. What happens in zfsonlinux when it needs to allocate memory but zfs_arc_max is already exceeded?
Some more follow up.
Repeating the test using 500MB of memory and a 65M zfs_arc_max, but with zvol+ext4 (instead of zfs), caused an OOM.
After increasing this domU's memory to 1000MB (zfs_arc_max still limited to 65M), bonnie (both on zfs and on zvol+ext4) was able to run correctly. So memory does matter a lot here, and low-memory situations can cause unpredictable crashes.
It looks like the earlier dom0 test failed because, even though it had 1.5G of memory, some of it was already used by other services. WRT zfs_inode_destroy in the first post: that's all that was on the serial console when the crash happened; no other stack information was available.
This kinda worries me a little, though. Even when zfs_arc_max is limited to 65M, we need to "reserve" 1G of memory to make sure intensive I/O operations don't cause a system crash.
A short-term workaround seems to be to set a lower arc_c_max in module/zfs/arc.c (currently set to the greater of 3/4 of all memory or all but 1G), at least until we can be sure that zfs can free memory back to the system fast enough to avoid an OOM.
Can you clarify the exact configuration you were testing? My understanding is you were running zfs in dom0 and serving up zvols as block devices for the domU's. You then ran bonnie++ in the domU (against the backing zvol), which triggered an OOM in dom0 which killed xend. Do I have that right?
If so, I may have some ideas about why that might happen. The OOM indicates that some process running in a taskset with limited memory failed an order-0 allocation (4k of memory). The OOM message makes this fairly clear because __GFP_HARDWALL is set, which limits a user space process's memory consumption to within the taskset. So even though there were 13648 4k pages available globally, they were not in this taskset.
And therein, I think, may be part of the problem. The zfs code all runs in the kernel and as such doesn't honor taskset limits. So because there was still quite a bit of free memory on the system outside the taskset, the memory reclaim code never ran. It's also not quite clear to me how exactly the taskset memory accounting is done, or whether kernel anonymous pages count towards the memory limit you set. More investigation is needed.
As for the crashes, those are going to be easier to hunt down. Limiting the ARC to 65M is amazingly small, but it should be good for finding any memory handling problems. In your case I have a guess where things went wrong, but you'll need to rebuild the spl+zfs code with the --enable-debug configure option. Enabling this option will result in lots of assertions being compiled into the code. In particular, I'm hoping the following one gets hit, because it will show that the issue is that we failed to allocate the inode. That would explain the above inode accounting issues. Although if we hit a different assert, that will be information too.
ASSERT(*zpp != NULL);
There were two different environments.
The first environment was bonnie++ using zfs (not a zvol) on dom0. This configuration first caused an OOM (which killed xend), and later a kernel panic (when I limited zfs_arc_max). domU is not involved at all in this test. This is what I reported in the first post.
The second one was bonnie++ on a domU, using a zvol-backed device from dom0. The domU uses zfs on top of this block device. In the domU's zpool I created a zvol (which I then formatted as ext4) and a zfs filesystem. In this environment I ran bonnie++ on both zfs and zvol+ext4. There were OOMs and panics in this environment, but all happened in the domU. dom0 (which still has zfs_arc_max limited) was fine during this test.
I'll try using debug version sometime later.
You know, bugs aside, it's really cool that all of that basically works. :)
Yes it is :D
I've been working to get zfs and Xen to play nicely. Past efforts include:
So if reserving an extra 1G and using a small arc size is enough to get it working, that's more than enough for dev purposes.
Hopefully, given a few weeks, we'll get these issues sorted out. This also helps explain why you're so interested in the zvols working well. :)
With the changes to the direct reclaim path in the latest master code (the GFP_NOFS branch merge), I think we have a handle on the crashes and OOM issues described here. Fajar, if you agree, please go ahead and close this bug. Performance/memory issues still aren't resolved, but they can be worked in a different bug.
Closing this bug. Further memory/OOM issues are better tracked in issue #154, since using zfs as root is more general and receives more testing compared to bonnie++.
I was testing bonnie++ with "-s 3g" on a Xen dom0 with 1.5G of memory. On the first run I got an OOM error.
My initial guess was that somehow zfs was eating up all the memory for the arc and not freeing it when other applications need it. So I tried setting the zfs_arc_max and zfs_arc_min module parameters to 68157440 (65M). I repeated the test, and got this error
... and the dom0 keeps on rebooting and crashing on the same error (zfs_inode_destroy).
I had to boot using openindiana and delete the dataset used by bonnie to be able to access the pool on Linux again.
So the questions are:
This is with 0.6.0-rc1 plus some zpool/zvol-related fixes which I posted earlier.