Closed: madpenguin closed this issue 11 years ago
This is the first report I've seen of problems in the tsd code. Is there a particular workload you're using to trigger this issue?
No, I'm afraid I've not been able to pin it down. It seems random, maybe once every three months per machine .. I mainly use AMD chips, but I've seen it on a Dell R710 (Intel/16 core) too ... [I have 1 machine thus far I've "not" seen it on, but that's only been up for 6 weeks ..]
The only thing I can think of is that they're all running KVM instances to varying degrees .. this particular machine is serving data via two VPSs running Gluster. Other machines are running KVM off QCOW2 images sat directly on ZFS partitions.
All machines use SSD root disks .. machines are all between 1 and 3 years old, mostly single chip / 6-core, although the Dell was dual chip / 8-core.
Nothing particularly exotic: all have multiple NICs of different makes, a mixture of different motherboards and disk controllers .. a mixture of disks .. all are running Ubuntu 11.04 or 12.04, which means it spans different kernels ..
Don't know if any of that helps ??
FYI: previously most of these machines were running software RAID10 for 1 year+ without incident ...
Ok, another one .. different box with a similar specification ... frequency seems to be after ~20 days of uptime ...
Oct 10 15:05:47 data2 kernel: [1823996.336021] BUG: soft lockup - CPU#3 stuck for 22s! [kvm:14086]
Oct 10 15:05:47 data2 kernel: [1823996.337467] Modules linked in: nf_conntrack_ftp nf_conntrack_irc ip6table_filter ip6_tables ebtable_nat ebtable_broute ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables kvm_amd kvm nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge serio_raw edac_core edac_mce_amd k10temp i2c_piix4 asus_atk0110 mac_hid wmi vesafb zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl(O) zlib_deflate r8169 pata_atiixp megaraid_sas 8021q garp stp sky2
Oct 10 15:05:47 data2 kernel: [1823996.340000] CPU 3
Oct 10 15:05:47 data2 kernel: [1823996.340000] Modules linked in: nf_conntrack_ftp nf_conntrack_irc ip6table_filter ip6_tables ebtable_nat ebtable_broute ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables kvm_amd kvm nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge serio_raw edac_core edac_mce_amd k10temp i2c_piix4 asus_atk0110 mac_hid wmi vesafb zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl(O) zlib_deflate r8169 pata_atiixp megaraid_sas 8021q garp stp sky2
Oct 10 15:05:47 data2 kernel: [1823996.340000]
Oct 10 15:05:47 data2 kernel: [1823996.340000] Pid: 14086, comm: kvm Tainted: P D O 3.2.0-30-generic #48-Ubuntu System manufacturer System Product Name/M4A89GTD-PRO
Oct 10 15:05:47 data2 kernel: [1823996.340000] RIP: 0010:[
Hey guys,
As directed on the mailing list, here are the details of my case, which seems to be related to this same issue:
Recently we've been playing around with a new hardware configuration here, and as part of the evaluation we decided to give ZFS on Linux a try and see how it works.
I have an easily reproducible kernel panic/oops which I can trigger within 10-15 minutes of stress testing. Here are the gory details:
Kernel oops:
[ 905.545397] general protection fault: 0000 [#1] SMP
[ 905.545562] Modules linked in:
[ 905.545674] CPU 11
[ 905.545743] Pid: 10258, comm: sysbench Not tainted 3.7.0 #1 Supermicro X9DRE-TF+/X9DR7-TF+/X9DRE-TF+/X9DR7-TF+
[ 905.545867] RIP: 0010:[<ffffffff81103fbd>] [<ffffffff81103fbd>] tsd_hash_search+0xd3/0x14d
[ 905.545990] RSP: 0018:ffff881768b95dc8 EFLAGS: 00010286
[ 905.546050] RAX: dead000000100100 RBX: dead0000001000e8 RCX: 0000000000000037
[ 905.546113] RDX: 000000000000001a RSI: 0000000000000009 RDI: ffff8818124b5bf0
[ 905.546186] RBP: ffff881768b95e08 R08: 4090000000000000 R09: 0000000000002812
[ 905.546259] R10: 0000000040e6e758 R11: 0000000000000293 R12: 0000000000002812
[ 905.546323] R13: 0000000000008001 R14: ffff8818124b5bf0 R15: 0000000000000003
[ 905.546388] FS: 00007f31706e9700(0000) GS:ffff88307fd40000(0000) knlGS:0000000000000000
[ 905.546457] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 905.546768] CR2: 0000000000648944 CR3: 00000017f00a4000 CR4: 00000000000407e0
[ 905.547066] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 905.547363] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 905.547669] Process sysbench (pid: 10258, threadinfo ffff881768b94000, task ffff8818010a3f20)
[ 905.548200] Stack:
[ 905.554842] ffff881768b95e18 ffffffff81103f68 ffff881768b95e48 ffff883000c7a180
[ 905.555568] ffff8818137d5ac0 0000000000002812 0000000000000000 0000000000000004
[ 905.556325] ffff881768b95e88 ffffffff81104605 ffff881768b95e88 ffffffff81104ccf
[ 905.557069] Call Trace:
[ 905.557368] [<ffffffff81103f68>] ? tsd_hash_search+0x7e/0x14d
[ 905.557658] [<ffffffff81104605>] tsd_set+0x139/0x5d5
[ 905.557975] [<ffffffff81104ccf>] ? tsd_exit+0x133/0x183
[ 905.558265] [<ffffffff811c42e0>] zfs_fsync+0x2e/0xf3
[ 905.558572] [<ffffffff811d5556>] zpl_fsync+0x4e/0x69
[ 905.558887] [<ffffffff81140b67>] vfs_fsync_range+0x1e/0x20
[ 905.559162] [<ffffffff81140b80>] vfs_fsync+0x17/0x19
[ 905.559477] [<ffffffff81140c17>] do_fsync+0x3e/0x5a
[ 905.559769] [<ffffffff81140ee1>] sys_fsync+0xb/0xf
[ 905.560075] [<ffffffff81752112>] system_call_fastpath+0x16/0x1b
[ 905.560359] Code: 48 89 5c 24 18 48 89 5c 24 10 48 89 5c 24 08 48 c7 04 24 66 ed 9f 81 41 b9 77 00 00 00 eb 55 48 8b 00 48 85 c0 74 0c 48 8d 58 e8 <44> 39 68 e8 75 ee eb a9 41 80 06 01 31 db f6 05 16 14 a4 00 01
[ 905.564513] RIP [<ffffffff81103fbd>] tsd_hash_search+0xd3/0x14d
[ 905.564870] RSP <ffff881768b95dc8>
[ 905.565158] ---[ end trace 8d253ed150ed5e7a ]---
[ 906.856676] general protection fault: 0000 [#2] SMP
[ 906.857035] Modules linked in:
[ 906.857339] CPU 5
[ 906.857393] Pid: 10260, comm: sysbench Tainted: G D 3.7.0 #1 Supermicro X9DRE-TF+/X9DR7-TF+/X9DRE-TF+/X9DR7-TF+
[ 906.858130] RIP: 0010:[<ffffffff81103fbd>] [<ffffffff81103fbd>] tsd_hash_search+0xd3/0x14d
[ 906.858656] RSP: 0018:ffff881768ba1dc8 EFLAGS: 00010286
[ 906.858920] RAX: dead000000100100 RBX: dead0000001000e8 RCX: 0000000000000037
[ 906.859190] RDX: 0000000000000016 RSI: 0000000000000009 RDI: ffff8818124b4380
[ 906.859456] RBP: ffff881768ba1e08 R08: 40a0000000000000 R09: 0000000000002814
[ 906.859727] R10: 0000000040e5de1e R11: 0000000000000293 R12: 0000000000002814
[ 906.860003] R13: 0000000000008001 R14: ffff8818124b4380 R15: 0000000000000003
[ 906.860276] FS: 00007f31706d9700(0000) GS:ffff88181fd40000(0000) knlGS:0000000000000000
[ 906.860769] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 906.861041] CR2: 00007f317069fcd8 CR3: 00000017f00a4000 CR4: 00000000000407e0
[ 906.861316] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 906.861593] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 906.861867] Process sysbench (pid: 10260, threadinfo ffff881768ba0000, task ffff8818010a12f0)
[ 906.862360] Stack:
[ 906.862625] 0000000040e5de1e 0000000000000001 00007f31706fb000 000000000067b088
[ 906.863303] ffff8818137d5ac0 0000000000002814 0000000000000000 0000000000000004
[ 906.863978] ffff881768ba1e88 ffffffff81104605 ffff881768ba1e88 ffffffff81104ccf
[ 906.864657] Call Trace:
[ 906.864931] [<ffffffff81104605>] tsd_set+0x139/0x5d5
[ 906.865204] [<ffffffff81104ccf>] ? tsd_exit+0x133/0x183
[ 906.865483] [<ffffffff811c42e0>] zfs_fsync+0x2e/0xf3
[ 906.865758] [<ffffffff811d5556>] zpl_fsync+0x4e/0x69
[ 906.866031] [<ffffffff81140b67>] vfs_fsync_range+0x1e/0x20
[ 906.866309] [<ffffffff81140b80>] vfs_fsync+0x17/0x19
[ 906.866582] [<ffffffff81140c17>] do_fsync+0x3e/0x5a
[ 906.866855] [<ffffffff81140ee1>] sys_fsync+0xb/0xf
[ 906.867136] [<ffffffff81752112>] system_call_fastpath+0x16/0x1b
[ 906.867409] Code: 48 89 5c 24 18 48 89 5c 24 10 48 89 5c 24 08 48 c7 04 24 66 ed 9f 81 41 b9 77 00 00 00 eb 55 48 8b 00 48 85 c0 74 0c 48 8d 58 e8 <44> 39 68 e8 75 ee eb a9 41 80 06 01 31 db f6 05 16 14 a4 00 01
[ 906.871147] RIP [<ffffffff81103fbd>] tsd_hash_search+0xd3/0x14d
[ 906.871461] RSP <ffff881768ba1dc8>
[ 906.871738] ---[ end trace 8d253ed150ed5e7b ]---
Setup:
SuperMicro X9DRE-TF+ motherboard, 2 x E5-2620, 24 x 8G Kingston DDR3, LSI 9207-8i HBA connected to a 24-port 2.5" SAS-multiplexer backplane populated with 8 SSDs (OCZ Agility 3 256G) and 8 SATA disks (Seagate 500G).
I have 2 raidz2 pools: ssd1, made up of all 8 SSDs, and sata1, made up of all 8 SATA disks. The OS is on a separate drive.
Steps to reproduce:
I run sysbench fileio random read/write tests for about 10-15 minutes and I get the above oops. Here, in short, is what I do:
cd /ssd1
sysbench --num-threads=12 --max-requests=50000 --test=fileio --file-num=4096 --file-block-size=64K --file-test-mode=rndrw --file-fsync-freq=1 prepare
sysbench --num-threads=12 --max-requests=50000 --test=fileio --file-num=4096 --file-block-size=64K --file-test-mode=rndrw --file-fsync-freq=1 run
You might have to execute the second (run) line a few times in a row to get the oops; it doesn't always happen the first time. Between runs I do zpool export ssd1 && zpool import ssd1 to clear the cache.
We tested the following kernels with ZFS built into the kernel (not as a module): 3.7.0 with the ZFS sources from git (master branch).
With the ZFS 0.6.0-rc12 sources we tried kernels 3.7.0, 3.6.10, and 2.6.32-279.14.1.el6.
The crash seems to be the same regardless of the kernel.
Another issue I noticed while running this comes from the test itself. It complains with this:
FATAL: Incorrect file discovered in request
(last message repeated 3 times)
FATAL: Failed to fsync file! file: 0 errno = 22 (8?{zh )
FATAL: Incorrect file discovered in request
It could be the test itself, but I decided to post it here since it might be related. Notice the garbage in the brackets! We also disabled NUMA in the BIOS and repeated the test - same result.
I would appreciate it if someone could take a look at this. I am ready to provide further details if needed.
Regards,
Rumen Telbizov Unix Systems Administrator
@telbizov I've been unable to reproduce the issue locally, even with your test case. However, the above patch should resolve it, since it effectively retires the offending code, which isn't needed. Can you please apply the patch and verify the GPF is resolved?
If the FATAL: Failed to fsync file! and FATAL: Incorrect file discovered in request errors also aren't normal for sysbench on other file systems (ext4, xfs), then go ahead and open a new issue on the ZFS tracker for them. At least it will give us a place to start investigating.
@behlendorf
Hi Brian,
Thank you very much for providing this patch. We recompiled kernel 3.7.0 against the git branch issue-174 (correct me if that wasn't the right thing to do) and then I ran the same tests I did before overnight. I am really happy to announce that this morning, after about 12 hours of running the test, the machine is still happily trucking. Yay!
Due to the semi-intermittent nature of the problem, we decided to run a second, longer test over the weekend before declaring success. So we have now built ZFS into 2.6.32, since we intend to use that kernel if possible. This is the setup I'll leave the test running on over the weekend.
I'll report back here on Monday. Hopefully the news will be just as good.
I wonder if other participants involved in this issue might want to give your branch a try and see how it goes for them?
Once again, thank you very much for your quick response and help. Highly appreciated.
Regards,
Rumen Telbizov Unix Systems Administrator
@behlendorf
Brian,
Here's the update that I promised. After two days of non-stop sysbench on 2.6.32 patched with the issue-174 branch, I am happy to report that the system is still up and running. No kernel panics or any side effects that I can see. I think this change has indeed fixed the issue.
It would be nice to see other people's experience with this patch but as far as I can tell your changes fixed this issue.
Brian, do you intend to merge this change into the master branch?
Thank you once again for your prompt response and the patch.
Rumen Telbizov, Unix Systems Administrator
@telbizov That's good news. A little more digging has turned up that this code was originally added back in '07 to reduce lock contention on the zl_lock from zil_commit(). A lot has changed since then, so it's not entirely clear how much this still helps.
But the original test workload was running the filebench varmail tests and observing very slow fsync() times. Would it be possible for you to test this workload on your system with and without the patch, so we know what the impact of this will be?
Original Sun issue: 6535160 lock contention on zl_lock from zil_commit
Merged into master as commit zfsonlinux/zfs@31f2b5abdf95d8426d8bfd66ca7f62ec70215e3c. I wasn't able to observe a significant difference in fsync(2) performance with or without the patch. If others notice a significant performance regression, we can revisit this issue then.
@behlendorf
Sorry for my late response. I see that you've closed the issue, but I did run the suggested test against both my ssd and sata zpools, using kernels with and without your patch. Here are the details:
Filebench Version 1.4.9.1
110045: 0.000: Allocated 170MB of shared memory
110045: 0.001: Varmail Version 3.0 personality successfully loaded
110045: 0.001: Creating/pre-allocating files and filesets
110045: 0.003: Fileset bigfileset: 1000 files, 0 leafdirs, avg dir width = 1000000, avg dir depth = 0.5, 14.959MB
110045: 0.005: Removed any existing fileset bigfileset in 1 seconds
110045: 0.005: making tree for filset /sata1/bigfileset
110045: 0.005: Creating fileset bigfileset...
110045: 0.083: Preallocated 805 of 1000 of fileset bigfileset in 1 seconds
110045: 0.083: waiting for fileset pre-allocation to finish
110048: 0.083: Starting 1 filereader instances
110049: 0.084: Starting 16 filereaderthread threads
110045: 1.107: Running...
110045: 301.131: Run took 300 seconds...
110045: 301.132: Per-Operation Breakdown
closefile4 41114ops 137ops/s 0.0mb/s 0.0ms/op 585us/op-cpu [0ms - 0ms]
readfile4 41114ops 137ops/s 1.2mb/s 0.0ms/op 790us/op-cpu [0ms - 0ms]
openfile4 41114ops 137ops/s 0.0mb/s 0.0ms/op 1083us/op-cpu [0ms - 0ms]
closefile3 41114ops 137ops/s 0.0mb/s 0.0ms/op 1391us/op-cpu [0ms - 0ms]
fsyncfile3 41114ops 137ops/s 0.0mb/s 54.2ms/op 13220us/op-cpu [11ms - 168ms]
appendfilerand3 41115ops 137ops/s 1.1mb/s 0.0ms/op 800us/op-cpu [0ms - 1ms]
readfile3 41115ops 137ops/s 1.6mb/s 0.0ms/op 747us/op-cpu [0ms - 0ms]
openfile3 41115ops 137ops/s 0.0mb/s 0.0ms/op 1051us/op-cpu [0ms - 0ms]
closefile2 41115ops 137ops/s 0.0mb/s 0.0ms/op 1391us/op-cpu [0ms - 0ms]
fsyncfile2 41115ops 137ops/s 0.0mb/s 53.3ms/op 13466us/op-cpu [10ms - 162ms]
appendfilerand2 41130ops 137ops/s 1.1mb/s 0.0ms/op 488us/op-cpu [0ms - 7ms]
createfile2 41130ops 137ops/s 0.0mb/s 0.2ms/op 1209us/op-cpu [0ms - 0ms]
deletefile1 41129ops 137ops/s 0.0mb/s 0.2ms/op 1308us/op-cpu [0ms - 1ms]
110045: 301.132: IO Summary: 534534 ops, 1781.642 ops/s, (274/274 r/w), 4.9mb/s, 868us cpu/op, 27.0ms latency
110045: 301.132: Shutting down processes
Filebench Version 1.4.9.1
110070: 0.000: Allocated 170MB of shared memory
110070: 0.001: Varmail Version 3.0 personality successfully loaded
110070: 0.001: Creating/pre-allocating files and filesets
110070: 0.003: Fileset bigfileset: 1000 files, 0 leafdirs, avg dir width = 1000000, avg dir depth = 0.5, 14.959MB
110070: 0.005: Removed any existing fileset bigfileset in 1 seconds
110070: 0.005: making tree for filset /ssd1/bigfileset
110070: 0.005: Creating fileset bigfileset...
110070: 0.081: Preallocated 805 of 1000 of fileset bigfileset in 1 seconds
110070: 0.081: waiting for fileset pre-allocation to finish
110073: 0.081: Starting 1 filereader instances
110074: 0.082: Starting 16 filereaderthread threads
110070: 1.101: Running...
110070: 301.122: Run took 300 seconds...
110070: 301.123: Per-Operation Breakdown
closefile4 414791ops 1383ops/s 0.0mb/s 0.0ms/op 1152us/op-cpu [0ms - 0ms]
readfile4 414792ops 1383ops/s 21.3mb/s 0.0ms/op 1213us/op-cpu [0ms - 1ms]
openfile4 414793ops 1383ops/s 0.0mb/s 0.0ms/op 1147us/op-cpu [0ms - 2ms]
closefile3 414793ops 1383ops/s 0.0mb/s 0.0ms/op 1118us/op-cpu [0ms - 0ms]
fsyncfile3 414793ops 1383ops/s 0.0mb/s 0.8ms/op 3684us/op-cpu [0ms - 75ms]
appendfilerand3 414795ops 1383ops/s 10.8mb/s 0.0ms/op 1288us/op-cpu [0ms - 2ms]
readfile3 414797ops 1383ops/s 21.5mb/s 0.0ms/op 1218us/op-cpu [0ms - 0ms]
openfile3 414797ops 1383ops/s 0.0mb/s 0.0ms/op 1180us/op-cpu [0ms - 2ms]
closefile2 414798ops 1383ops/s 0.0mb/s 0.0ms/op 1128us/op-cpu [0ms - 0ms]
fsyncfile2 414799ops 1383ops/s 0.0mb/s 0.9ms/op 3621us/op-cpu [0ms - 76ms]
appendfilerand2 414802ops 1383ops/s 10.8mb/s 0.0ms/op 1256us/op-cpu [0ms - 2ms]
createfile2 414803ops 1383ops/s 0.0mb/s 0.1ms/op 1459us/op-cpu [0ms - 21ms]
deletefile1 414807ops 1383ops/s 0.0mb/s 0.1ms/op 1536us/op-cpu [0ms - 21ms]
110070: 301.123: IO Summary: 5392360 ops, 17973.315 ops/s, (2765/2765 r/w), 64.4mb/s, 553us cpu/op, 0.5ms latency
110070: 301.123: Shutting down processes
Filebench Version 1.4.9.1
4223: 0.000: Allocated 170MB of shared memory
4223: 0.001: Varmail Version 3.0 personality successfully loaded
4223: 0.001: Creating/pre-allocating files and filesets
4223: 0.003: Fileset bigfileset: 1000 files, 0 leafdirs, avg dir width = 1000000, avg dir depth = 0.5, 14.959MB
4223: 0.005: Removed any existing fileset bigfileset in 1 seconds
4223: 0.005: making tree for filset /sata1/bigfileset
4223: 0.006: Creating fileset bigfileset...
4223: 0.105: Preallocated 805 of 1000 of fileset bigfileset in 1 seconds
4223: 0.105: waiting for fileset pre-allocation to finish
4226: 0.106: Starting 1 filereader instances
4227: 0.107: Starting 16 filereaderthread threads
4223: 1.128: Running...
4223: 301.148: Run took 300 seconds...
4223: 301.149: Per-Operation Breakdown
closefile4 366251ops 1221ops/s 0.0mb/s 0.0ms/op 1216us/op-cpu [0ms - 0ms]
readfile4 366251ops 1221ops/s 18.2mb/s 0.0ms/op 1286us/op-cpu [0ms - 0ms]
openfile4 366251ops 1221ops/s 0.0mb/s 0.0ms/op 1235us/op-cpu [0ms - 0ms]
closefile3 366252ops 1221ops/s 0.0mb/s 0.0ms/op 1213us/op-cpu [0ms - 0ms]
fsyncfile3 366255ops 1221ops/s 0.0mb/s 0.9ms/op 2299us/op-cpu [0ms - 153ms]
appendfilerand3 366256ops 1221ops/s 9.5mb/s 0.0ms/op 1367us/op-cpu [0ms - 4ms]
readfile3 366256ops 1221ops/s 18.9mb/s 0.0ms/op 1279us/op-cpu [0ms - 0ms]
openfile3 366256ops 1221ops/s 0.0mb/s 0.0ms/op 1231us/op-cpu [0ms - 2ms]
closefile2 366256ops 1221ops/s 0.0mb/s 0.0ms/op 1230us/op-cpu [0ms - 0ms]
fsyncfile2 366256ops 1221ops/s 0.0mb/s 1.1ms/op 2272us/op-cpu [0ms - 153ms]
appendfilerand2 366257ops 1221ops/s 9.5mb/s 0.0ms/op 1322us/op-cpu [0ms - 6ms]
createfile2 366260ops 1221ops/s 0.0mb/s 0.1ms/op 1524us/op-cpu [0ms - 17ms]
deletefile1 366265ops 1221ops/s 0.0mb/s 0.1ms/op 1620us/op-cpu [0ms - 15ms]
4223: 301.149: IO Summary: 4761322 ops, 15870.028 ops/s, (2442/2442 r/w), 56.2mb/s, 581us cpu/op, 0.6ms latency
4223: 301.149: Shutting down processes
Filebench Version 1.4.9.1
4254: 0.000: Allocated 170MB of shared memory
4254: 0.001: Varmail Version 3.0 personality successfully loaded
4254: 0.001: Creating/pre-allocating files and filesets
4254: 0.003: Fileset bigfileset: 1000 files, 0 leafdirs, avg dir width = 1000000, avg dir depth = 0.5, 14.959MB
4254: 0.005: Removed any existing fileset bigfileset in 1 seconds
4254: 0.005: making tree for filset /ssd1/bigfileset
4254: 0.005: Creating fileset bigfileset...
4254: 0.095: Preallocated 805 of 1000 of fileset bigfileset in 1 seconds
4254: 0.095: waiting for fileset pre-allocation to finish
4257: 0.096: Starting 1 filereader instances
4258: 0.097: Starting 16 filereaderthread threads
4254: 1.113: Running...
4254: 301.133: Run took 300 seconds...
4254: 301.133: Per-Operation Breakdown
closefile4 407559ops 1358ops/s 0.0mb/s 0.0ms/op 1230us/op-cpu [0ms - 0ms]
readfile4 407559ops 1358ops/s 19.8mb/s 0.0ms/op 1285us/op-cpu [0ms - 5ms]
openfile4 407562ops 1358ops/s 0.0mb/s 0.0ms/op 1231us/op-cpu [0ms - 0ms]
closefile3 407562ops 1358ops/s 0.0mb/s 0.0ms/op 1211us/op-cpu [0ms - 0ms]
fsyncfile3 407562ops 1358ops/s 0.0mb/s 0.4ms/op 2285us/op-cpu [0ms - 60ms]
appendfilerand3 407564ops 1358ops/s 10.6mb/s 0.0ms/op 1360us/op-cpu [0ms - 26ms]
readfile3 407564ops 1358ops/s 20.8mb/s 0.0ms/op 1285us/op-cpu [0ms - 0ms]
openfile3 407567ops 1358ops/s 0.0mb/s 0.0ms/op 1242us/op-cpu [0ms - 2ms]
closefile2 407568ops 1358ops/s 0.0mb/s 0.0ms/op 1218us/op-cpu [0ms - 2ms]
fsyncfile2 407570ops 1358ops/s 0.0mb/s 0.5ms/op 2252us/op-cpu [0ms - 59ms]
appendfilerand2 407573ops 1358ops/s 10.6mb/s 0.0ms/op 1338us/op-cpu [0ms - 32ms]
createfile2 407574ops 1358ops/s 0.0mb/s 0.1ms/op 1534us/op-cpu [0ms - 15ms]
deletefile1 407575ops 1358ops/s 0.0mb/s 0.1ms/op 1625us/op-cpu [0ms - 15ms]
4254: 301.133: IO Summary: 5298359 ops, 17660.065 ops/s, (2717/2717 r/w), 61.8mb/s, 561us cpu/op, 0.3ms latency
4254: 301.133: Shutting down processes
So it does look like your issue-174 branch reverts some sort of "performance fix" related to bug ID 6535160. At least that's how it seems to me.
fsyncfile3 366255ops 1221ops/s 0.0mb/s 0.9ms/op 2299us/op-cpu [0ms - 153ms]
fsyncfile2 366256ops 1221ops/s 0.0mb/s 1.1ms/op 2272us/op-cpu [0ms - 153ms]
fsyncfile3 41114ops 137ops/s 0.0mb/s 54.2ms/op 13220us/op-cpu [11ms - 168ms]
fsyncfile2 41115ops 137ops/s 0.0mb/s 53.3ms/op 13466us/op-cpu [10ms - 162ms]
The SSD zpool performance doesn't really change much between tests, but on the other hand, without your patch (issue-174) the SATA zpool performs almost as fast as the SSD pool! This is strange and seems unreal, as if there is some heavy caching effect taking place before your patch.
Let me know what you think. It definitely looks like a regression, although I am not sure if it's for better or worse. I'll run additional tests if needed. Maybe we need to reopen this ticket?
Thank you, Brian.
Rumen Telbizov Unix Systems Administrator
@telbizov Thanks for running these benchmarks; you're absolutely right, there's a regression here, so I'm reopening the issue. I've posted a new patch which re-adds the optimization per-file instead of per-thread. Can you please verify it fixes the performance regression?
@behlendorf
Thank you for reopening the issue and continuing to look into it. I will recompile a new kernel against the latest issue-174 branch tomorrow, repeat the performance tests that I did yesterday, and report back here. I hope this latest change will not bring things back to the previous broken state which caused kernel panics. I'll run stability tests over the weekend too, and I'll report on Monday.
Another thing I keep wondering: can you explain the insane performance boost that I see when testing the SATA disks? How could it possibly be anywhere close to SSD, unless there's heavy caching involved?
Cheers,
Rumen Telbizov Unix Systems Administrator
@telbizov Due to the performance regression, I'm going to revert the original fix for this issue from the master branch. Once we're happy that the revised fix doesn't suffer from a similar problem, we'll merge it. Until then you'll want to run with a locally patched version to avoid the panic, which your workload seems particularly good at hitting.
will recompile a new kernel against the latest issue-174 branch tomorrow and I'll repeat the performance tests that I did yesterday and report back here.
Thank you.
I hope this latest change will not bring things back to the previous broken state which caused kernel panics.
It won't. The panic you were hitting was in the thread-specific data implementation itself. With the new patch applied that code is no longer used, so we can't hit that issue. Things may in fact perform slightly better, because the new approach sheds some additional locking overhead.
Another thing I keep wondering is: can you explain this insane performance boost that I see when testing SATA disks?
The issue here is really lock contention in zil_commit() on the zl_lock, not disk speed. This optimization is designed to minimize that contention by falling back to synchronous write I/Os when it detects a thread which is calling fsync(2) constantly. This effectively reduces the contention and improves throughput. The updated patch simply changes the existing logic to detect contention on a per-file instead of a per-thread basis.
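The heuristic described above can be sketched roughly as follows. This is an illustrative model only, not the actual ZFS code: the names (znode_t, z_sync_cnt, SYNC_THRESHOLD) and the threshold value are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the per-file fsync-contention heuristic.
 * A counter on each open file tracks consecutive fsync(2) calls; once
 * it crosses a threshold the file is treated as a heavy-fsync consumer
 * and writes are issued synchronously, sidestepping zil_commit() lock
 * contention.  All identifiers here are illustrative. */
#define SYNC_THRESHOLD 2

typedef struct znode {
    int z_sync_cnt;   /* consecutive fsync(2) calls seen on this file */
} znode_t;

/* Called on each fsync(2); returns true once this file looks like it
 * is being fsync'd constantly, i.e. sync I/O is now the better path. */
static bool fsync_prefers_sync_io(znode_t *zp)
{
    return ++zp->z_sync_cnt > SYNC_THRESHOLD;
}

/* Reset when the file is closed or the fsync storm subsides. */
static void fsync_reset(znode_t *zp)
{
    zp->z_sync_cnt = 0;
}
```

Tracking the counter on the file rather than in thread-specific data is what lets the revised patch retire the TSD code entirely while keeping the optimization.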
@behlendorf
Thanks for your continuing efforts Brian. Here are the results of my tests.
The kernel is 2.6.32-279.14.1.el6-v2, built against the latest version of the issue-174 branch, i.e. the per-file fsync(2) optimization.
Filebench Version 1.4.9.1
2572: 0.000: Allocated 170MB of shared memory
2572: 0.001: Varmail Version 3.0 personality successfully loaded
2572: 0.001: Creating/pre-allocating files and filesets
2572: 0.003: Fileset bigfileset: 1000 files, 0 leafdirs, avg dir width = 1000000, avg dir depth = 0.5, 14.959MB
2572: 0.005: Removed any existing fileset bigfileset in 1 seconds
2572: 0.005: making tree for filset /sata1/bigfileset
2572: 0.005: Creating fileset bigfileset...
2572: 0.081: Preallocated 805 of 1000 of fileset bigfileset in 1 seconds
2572: 0.081: waiting for fileset pre-allocation to finish
2575: 0.081: Starting 1 filereader instances
2576: 0.082: Starting 16 filereaderthread threads
2572: 1.109: Running...
2572: 301.132: Run took 300 seconds...
2572: 301.133: Per-Operation Breakdown
closefile4 52183ops 174ops/s 0.0mb/s 0.0ms/op 576us/op-cpu [0ms - 0ms]
readfile4 52183ops 174ops/s 1.7mb/s 0.0ms/op 795us/op-cpu [0ms - 30ms]
openfile4 52183ops 174ops/s 0.0mb/s 0.0ms/op 1124us/op-cpu [0ms - 0ms]
closefile3 52183ops 174ops/s 0.0mb/s 0.0ms/op 1354us/op-cpu [0ms - 0ms]
fsyncfile3 52183ops 174ops/s 0.0mb/s 41.8ms/op 12371us/op-cpu [10ms - 168ms]
appendfilerand3 52198ops 174ops/s 1.4mb/s 0.0ms/op 772us/op-cpu [0ms - 2ms]
readfile3 52198ops 174ops/s 2.1mb/s 0.0ms/op 787us/op-cpu [0ms - 0ms]
openfile3 52198ops 174ops/s 0.0mb/s 0.0ms/op 1046us/op-cpu [0ms - 0ms]
closefile2 52198ops 174ops/s 0.0mb/s 0.0ms/op 1393us/op-cpu [0ms - 0ms]
fsyncfile2 52198ops 174ops/s 0.0mb/s 40.9ms/op 12332us/op-cpu [9ms - 167ms]
appendfilerand2 52199ops 174ops/s 1.4mb/s 0.0ms/op 498us/op-cpu [0ms - 0ms]
createfile2 52199ops 174ops/s 0.0mb/s 0.1ms/op 1085us/op-cpu [0ms - 0ms]
deletefile1 52199ops 174ops/s 0.0mb/s 0.2ms/op 1357us/op-cpu [0ms - 5ms]
2572: 301.133: IO Summary: 678502 ops, 2261.503 ops/s, (348/348 r/w), 6.5mb/s, 849us cpu/op, 20.8ms latency
2572: 301.133: Shutting down processes
So it seems to me that yes, there is some performance improvement, but it's not on par with the initial "plain" kernel without your patches. Here are the extracts:
fsyncfile3 366255ops 1221ops/s 0.0mb/s 0.9ms/op 2299us/op-cpu [0ms - 153ms]
fsyncfile2 366256ops 1221ops/s 0.0mb/s 1.1ms/op 2272us/op-cpu [0ms - 153ms]
fsyncfile3 52183ops 174ops/s 0.0mb/s 41.8ms/op 12371us/op-cpu [10ms - 168ms]
fsyncfile2 52198ops 174ops/s 0.0mb/s 40.9ms/op 12332us/op-cpu [9ms - 167ms]
fsyncfile3 41114ops 137ops/s 0.0mb/s 54.2ms/op 13220us/op-cpu [11ms - 168ms]
fsyncfile2 41115ops 137ops/s 0.0mb/s 53.3ms/op 13466us/op-cpu [10ms - 162ms]
Filebench Version 1.4.9.1
2633: 0.000: Allocated 170MB of shared memory
2633: 0.001: Varmail Version 3.0 personality successfully loaded
2633: 0.001: Creating/pre-allocating files and filesets
2633: 0.002: Fileset bigfileset: 1000 files, 0 leafdirs, avg dir width = 1000000, avg dir depth = 0.5, 14.959MB
2633: 0.005: Removed any existing fileset bigfileset in 1 seconds
2633: 0.005: making tree for filset /ssd1/bigfileset
2633: 0.005: Creating fileset bigfileset...
2633: 0.077: Preallocated 805 of 1000 of fileset bigfileset in 1 seconds
2633: 0.077: waiting for fileset pre-allocation to finish
2636: 0.077: Starting 1 filereader instances
2637: 0.078: Starting 16 filereaderthread threads
2633: 1.092: Running...
2633: 301.112: Run took 300 seconds...
2633: 301.113: Per-Operation Breakdown
closefile4 418051ops 1393ops/s 0.0mb/s 0.0ms/op 1127us/op-cpu [0ms - 0ms]
readfile4 418052ops 1393ops/s 21.6mb/s 0.0ms/op 1192us/op-cpu [0ms - 0ms]
openfile4 418052ops 1393ops/s 0.0mb/s 0.0ms/op 1127us/op-cpu [0ms - 2ms]
closefile3 418054ops 1393ops/s 0.0mb/s 0.0ms/op 1093us/op-cpu [0ms - 0ms]
fsyncfile3 418056ops 1393ops/s 0.0mb/s 0.9ms/op 3727us/op-cpu [0ms - 59ms]
appendfilerand3 418057ops 1393ops/s 10.9mb/s 0.0ms/op 1270us/op-cpu [0ms - 2ms]
readfile3 418059ops 1393ops/s 21.7mb/s 0.0ms/op 1207us/op-cpu [0ms - 0ms]
openfile3 418059ops 1393ops/s 0.0mb/s 0.0ms/op 1157us/op-cpu [0ms - 2ms]
closefile2 418059ops 1393ops/s 0.0mb/s 0.0ms/op 1107us/op-cpu [0ms - 0ms]
fsyncfile2 418060ops 1393ops/s 0.0mb/s 0.9ms/op 3672us/op-cpu [0ms - 61ms]
appendfilerand2 418062ops 1393ops/s 10.9mb/s 0.0ms/op 1225us/op-cpu [0ms - 1ms]
createfile2 418062ops 1393ops/s 0.0mb/s 0.1ms/op 1423us/op-cpu [0ms - 27ms]
deletefile1 418066ops 1393ops/s 0.0mb/s 0.1ms/op 1540us/op-cpu [0ms - 27ms]
2633: 301.113: IO Summary: 5434749 ops, 18114.622 ops/s, (2787/2787 r/w), 65.0mb/s, 542us cpu/op, 0.5ms latency
2633: 301.113: Shutting down processes
So the SSD test is as good as always.
Over the weekend I'll run my original sysbench test, which was causing the kernel panics in the first place, and make sure that we haven't regressed on that front either. I'll report my results on Monday the 24th.
Regards.
Rumen Telbizov Unix Systems Administrator
That's surprising, but that's why we run the tests. OK, so there's clearly more work to do here. Perhaps I will look into just fixing the TSD issue.
@behlendorf
Brian,
Just wanted to report on the second part of my last test - stability. It seems like on this front we're still OK. I've been running my sysbench stability test over the weekend and the machine has been just fine - no kernel panics. So I guess at this point I'll wait for you to evaluate what further options there are and advise if there is any need for future tests. If there is - feel free to ask; if I'm around I'll be happy to perform them.
Merry Christmas
Rumen Telbizov Unix Systems Administrator
Hello
I can confirm that without that patch my 2 servers are crashing, but only under heavy workload. Most of the time it's a kind of deadlock (no output, no dmesg) - a silent death.
Once, I managed to get a stack trace (included below)
The problem is easy to reproduce: just start dbench with 300-400 processes and wait 15-20 minutes.
Both servers are exactly the same: openSUSE 12.2 with kernel 3.4.11-2.16-default.
System Hardware: Processor: Intel Xeon X5660 @ 2.80GHz (12 Cores), Motherboard: Dell 0DPRKF, Chipset: Intel 5500 I/O + ICH10R, Memory: 4 x 8192 MB DDR3-1066MHz,
kernel: [ 1359.439585] general protection fault: 0000 [#1] SMP
kernel: [ 1359.439595] CPU 5
kernel: [ 1359.439598] Modules linked in: loop bridge stp bonding nls_iso8859_1 nls_cp437 vfat fat pciehp pci_hotplug zfs(PO) zcommon(PO) znvpai
kernel: [ 1359.439689]
kernel: [ 1359.439695] Pid: 8248, comm: dbench Tainted: P O 3.4.11-2.16-default #1 Dell Inc. PowerEdge R510/0DPRKF
kernel: [ 1359.439707] RIP: 0010:[ ] [ ] tsd_hash_search.isra.1+0xa8/0x1b0 [spl]
kernel: [ 1359.439733] RSP: 0018:ffff8807a3bf1e08 EFLAGS: 00010286
kernel: [ 1359.439738] RAX: dead000000100100 RBX: 0000000000008001 RCX: 0000000000000037
kernel: [ 1359.439745] RDX: 000000000000711b RSI: 01c0000000000000 RDI: ffff880805c5b380
kernel: [ 1359.439752] RBP: ffff880805c5b380 R08: ffffbf8f8f3c2038 R09: 696c632f73746e65
kernel: [ 1359.439758] R10: 7e2f383032746e65 R11: 0000000000000246 R12: 0000000000002038
kernel: [ 1359.439765] R13: ffff880803bbb570 R14: ffff880803bbb564 R15: ffff880803bbb570
kernel: [ 1359.439773] FS: 00007f80e4256700(0000) GS:ffff88082f2a0000(0000) knlGS:0000000000000000
kernel: [ 1359.439780] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 1359.439788] CR2: 00007f6aeb747000 CR3: 000000078bf63000 CR4: 00000000000007e0
kernel: [ 1359.439795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: [ 1359.439803] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
kernel: [ 1359.439812] Process dbench (pid: 8248, threadinfo ffff8807a3bf0000, task ffff8807d84267c0)
kernel: [ 1359.439819] Stack:
kernel: [ 1359.439824] ffff880806c8b7a0 ffff8807450ea440 ffff8807b2d76340 ffffffffa04a0f9d
kernel: [ 1359.439837] 0000000000000002 ffff880803bbb560 0000000000002038 0000000000000004
kernel: [ 1359.439849] ffff880803bbb564 ffffffffa032c317 0000000000000004 ffffffff81163954
kernel: [ 1359.439863] Call Trace:
kernel: [ 1359.439926] [ ] tsd_set+0x1a7/0x720 [spl]
kernel: [ 1359.440033] [ ] zfs_fsync+0x3b/0x110 [zfs]
kernel: [ 1359.440368] [ ] zpl_fsync+0x76/0xe0 [zfs]
kernel: [ 1359.440685] [ ] do_fsync+0x4d/0x80
kernel: [ 1359.440699] [ ] sys_fdatasync+0xe/0x20
kernel: [ 1359.440714] [ ] system_call_fastpath+0x1a/0x1f
kernel: [ 1359.440730] [<00007f80e3d8d150>] 0x7f80e3d8d14f
kernel: [ 1359.440735] Code: ed 48 c1 e5 04 49 03 6d 00 48 89 ef e8 92 11 21 e1 48 8b 45 08 48 85 c0 75 11 eb 57 0f 1f 80 00 00 00 00 48 8b 00 4
kernel: [ 1359.440800] RIP [ ] tsd_hash_search.isra.1+0xa8/0x1b0 [spl]
kernel: [ 1359.440820] RSP
kernel: [ 1359.456921] ---[ end trace 5895a75b9961ffb5 ]---
@telbizov @azaroth Thanks for the dbench reproducer. I was able to recreate the issue fairly easily with the right options: dbench -F -t 3600 500. The root cause was a race between tsd_set() and tsd_exit(), plus some missing locking. The following patch fixes the issue, and as expected I'm no longer able to hit the bug after several hours of testing. However, I'd appreciate it if you guys could verify the patch as well.
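The class of bug described here can be sketched in miniature. This is an illustrative model, not the actual SPL code: the idea is that the hash-bin walk in the lookup and the entry unlink in the exit path must hold the same lock, otherwise a walker can dereference a half-unlinked entry. (Note that RAX in the oopses above is dead000000100100, which is Linux's LIST_POISON1 - the poisoned next pointer of an entry already removed from its list.) All names below are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Simplified model of a TSD hash bin: a singly linked list of
 * (key, value) entries guarded by one lock.  The fix for the race is
 * that BOTH the search walk and the removal take this lock, so a
 * walker can never observe an entry mid-unlink. */
typedef struct tsd_entry {
    struct tsd_entry *next;
    unsigned int key;
    void *value;
} tsd_entry_t;

typedef struct tsd_bin {
    pthread_mutex_t lock;
    tsd_entry_t *head;
} tsd_bin_t;

/* Lookup: hold the bin lock for the entire list walk. */
void *tsd_search(tsd_bin_t *bin, unsigned int key)
{
    void *value = NULL;
    pthread_mutex_lock(&bin->lock);
    for (tsd_entry_t *e = bin->head; e != NULL; e = e->next) {
        if (e->key == key) {
            value = e->value;
            break;
        }
    }
    pthread_mutex_unlock(&bin->lock);
    return value;
}

/* Removal (the tsd_exit() path): takes the same lock, so unlinking
 * serializes against any concurrent search. */
void tsd_remove(tsd_bin_t *bin, unsigned int key)
{
    pthread_mutex_lock(&bin->lock);
    for (tsd_entry_t **pp = &bin->head; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->key == key) {
            *pp = (*pp)->next;
            break;
        }
    }
    pthread_mutex_unlock(&bin->lock);
}
```

Without the shared lock, a thread in the search loop could load a next pointer at the exact moment another thread poisons it during removal, which is consistent with the GPF on a LIST_POISON1 address seen in these traces.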
1d6e830 Fix tsd_get/set() race with tsd_exit/destroy()
This fix has been merged in to master.
6ef94aa Fix tsd_get/set() race with tsd_exit/destroy()
@behlendorf Thanks for reproducing the issue yourself and, even better, fixing it. Unfortunately, at this time I no longer have the test setup that I was playing with before, so I cannot reproduce it right now. Soon enough, though, I might get a similar machine and will be able to test the master branch, which now contains your fix.
Once again, thank you for your efforts.
Regards,
Rumen Telbizov Unix Systems Administrator
Hi, I've now seen this issue on 4 different servers. It seems to happen at random, with no pattern/frequency that I can spot - other than that they all see the problem.
Using Ubuntu 11.04 and Ubuntu 12.04 with the PPA, spl-dkms (0.6.0.71).
Tends to render the system "unhappy": it keeps running, but all ZFS volumes seem to die ...