hhoffstaette / kernel-patches

Custom Linux kernel patches

slow btrfs balance / hanging balance over dm+lvm #2

Closed - disaster123 closed this 9 years ago

disaster123 commented 9 years ago

Running the current git head on top of 4.1.10 results in an extremely slow btrfs balance.

I've seen your post here too: http://www.spinics.net/lists/linux-btrfs/msg47474.html

I had a 50TB volume that I wasn't even able to mount, because it was hanging waiting on the current transaction (soft lockup).

Mounting on plain 4.1.10 works in seconds.

Greets, Stefan

hhoffstaette commented 9 years ago

This turned out to be a bug which I found and fixed in commit 3971782c84fe89fbaa8da4d127e61a59be7a15e9. Josef also sent a v2 to the list. However that patch is not yet in upstream git, so whatever you are experiencing has nothing to do with this.

I can balance just fine. :)

disaster123 commented 9 years ago

Sad - but good that you can.

Starting a balance with -dusage=5 -musage=5 results in a lot of traces and an unusable fs (all I/O is blocked).
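
For reference, the balance invocation in question would look like this; the mount point is a placeholder for the actual 50TB volume:

    # balance only chunks whose data/metadata usage is at or below 5%
    btrfs balance start -dusage=5 -musage=5 /mnt/volume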

INFO: task btrfs-transacti:8616 blocked for more than 120 seconds.
Tainted: G W O 4.1.10+36-ph #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
btrfs-transacti D ffff880070c6fd68 0 8616 2 0x00000000
 ffff880070c6fd68 ffff88085c2e1d70 ffff88085ab15850 0000000000000246
 ffff880070c70000 ffff88105bfc79f0 ffff88105bfc79f0 0000000000000000
 ffff88085f3e8d20 ffff880070c6fd88 ffffffff9263b827 ffff88105bfc79f0
Call Trace:
 [] schedule+0x37/0x90
 [] wait_current_trans.isra.34+0xa7/0x110 [btrfs]
 [] ? finish_wait+0x80/0x80
 [] start_transaction+0x240/0x540 [btrfs]
 [] btrfs_attach_transaction+0x17/0x20 [btrfs]
 [] transaction_kthread+0x16b/0x240 [btrfs]
 [] ? open_ctree+0x2390/0x2390 [btrfs]
 [] kthread+0xc9/0xe0
 [] ? kthread_create_on_node+0x1a0/0x1a0
 [] ret_from_fork+0x42/0x70
 [] ? kthread_create_on_node+0x1a0/0x1a0

hhoffstaette commented 9 years ago

Couple of questions:

disaster123 commented 9 years ago

1.) yes
2.) build head was from yesterday, commit 0d75ec897e2a91039ae391c5d04fdb4faab78b00
3.) no

disaster123 commented 9 years ago

Something I recognized too is block relocation speed. Running 4.1.10:

Oct 13 21:37:09 server kernel: [ 3014.990673] BTRFS info (device dm-0): relocating block group 10278422970368 flags 1
Oct 13 21:37:09 server kernel: [ 3015.112688] BTRFS info (device dm-0): found 143 extents
Oct 13 21:37:10 server kernel: [ 3016.091895] BTRFS info (device dm-0): found 143 extents
Oct 13 21:37:10 server kernel: [ 3016.178879] BTRFS info (device dm-0): relocating block group 10277349228544 flags 1
Oct 13 21:37:10 server kernel: [ 3016.270042] BTRFS info (device dm-0): found 17 extents
Oct 13 21:37:11 server kernel: [ 3017.267539] BTRFS info (device dm-0): found 17 extents
Oct 13 21:37:11 server kernel: [ 3017.405318] BTRFS info (device dm-0): relocating block group 10276275486720 flags 1
Oct 13 21:37:12 server kernel: [ 3017.824346] BTRFS info (device dm-0): found 2134 extents
Oct 13 21:37:13 server kernel: [ 3019.478046] BTRFS info (device dm-0): found 2134 extents
Oct 13 21:37:13 server kernel: [ 3019.619306] BTRFS info (device dm-0): relocating block group 10275201744896 flags 1
Oct 13 21:37:14 server kernel: [ 3019.828894] BTRFS info (device dm-0): found 73 extents
Oct 13 21:37:15 server kernel: [ 3021.377389] BTRFS info (device dm-0): found 73 extents
Oct 13 21:37:15 server kernel: [ 3021.503394] BTRFS info (device dm-0): relocating block group 10274128003072 flags 1
Oct 13 21:37:16 server kernel: [ 3021.772331] BTRFS info (device dm-0): found 512 extents
Oct 13 21:37:17 server kernel: [ 3023.075138] BTRFS info (device dm-0): found 512 extents
Oct 13 21:37:17 server kernel: [ 3023.191875] BTRFS info (device dm-0): relocating block group 10273054261248 flags 1
Oct 13 21:37:17 server kernel: [ 3023.353029] BTRFS info (device dm-0): found 349 extents
Oct 13 21:37:18 server kernel: [ 3024.724881] BTRFS info (device dm-0): found 349 extents
Oct 13 21:37:19 server kernel: [ 3024.871419] BTRFS info (device dm-0): relocating block group 10270906777600 flags 1
Oct 13 21:37:19 server kernel: [ 3025.209894] BTRFS info (device dm-0): found 544 extents
Oct 13 21:37:21 server kernel: [ 3026.914781] BTRFS info (device dm-0): found 544 extents
Oct 13 21:37:21 server kernel: [ 3027.055448] BTRFS info (device dm-0): relocating block group 10269833035776 flags 1
Oct 13 21:37:21 server kernel: [ 3027.323139] BTRFS info (device dm-0): found 656 extents
Oct 13 21:37:23 server kernel: [ 3029.124579] BTRFS info (device dm-0): found 656 extents
Oct 13 21:37:23 server kernel: [ 3029.298675] BTRFS info (device dm-0): relocating block group 10265001197568 flags 1
Oct 13 21:37:23 server kernel: [ 3029.514228] BTRFS info (device dm-0): found 1 extents
Oct 13 21:37:24 server kernel: [ 3030.699826] BTRFS info (device dm-0): found 1 extents
Oct 13 21:37:25 server kernel: [ 3030.837214] BTRFS info (device dm-0): relocating block group 10260706230272 flags 1
Oct 13 21:37:25 server kernel: [ 3031.039128] BTRFS info (device dm-0): found 281 extents
Oct 13 21:37:27 server kernel: [ 3032.957891] BTRFS info (device dm-0): found 281 extents
Oct 13 21:37:27 server kernel: [ 3033.189209] BTRFS info (device dm-0): relocating block group 10220977782784 flags 1
Oct 13 21:37:27 server kernel: [ 3033.504665] BTRFS info (device dm-0): found 77 extents
Oct 13 21:37:29 server kernel: [ 3035.264047] BTRFS info (device dm-0): found 77 extents
Oct 13 21:37:29 server kernel: [ 3035.434820] BTRFS info (device dm-0): relocating block group 10172659400704 flags 1
Oct 13 21:37:30 server kernel: [ 3035.940158] BTRFS info (device dm-0): found 1 extents
Oct 13 21:37:32 server kernel: [ 3038.308445] BTRFS info (device dm-0): found 1 extents
Oct 13 21:37:32 server kernel: [ 3038.514463] BTRFS info (device dm-0): relocating block group 10156553273344 flags 1
Oct 13 21:37:32 server kernel: [ 3038.704062] BTRFS info (device dm-0): found 3 extents
Oct 13 21:37:35 server kernel: [ 3041.039652] BTRFS info (device dm-0): found 3 extents
Oct 13 21:37:35 server kernel: [ 3041.161175] BTRFS info (device dm-0): relocating block group 10136152178688 flags 1
Oct 13 21:37:35 server kernel: [ 3041.493013] BTRFS info (device dm-0): found 1 extents
Oct 13 21:37:38 server kernel: [ 3043.776984] BTRFS info (device dm-0): found 1 extents
Oct 13 21:37:38 server kernel: [ 3044.023700] BTRFS info (device dm-0): relocating block group 10118972309504 flags 1
Oct 13 21:37:38 server kernel: [ 3044.350797] BTRFS info (device dm-0): found 33 extents
Oct 13 21:37:40 server kernel: [ 3045.850969] BTRFS info (device dm-0): found 33 extents
Oct 13 21:37:40 server kernel: [ 3045.975143] BTRFS info (device dm-0): relocating block group 10115751084032 flags 1
Oct 13 21:37:40 server kernel: [ 3046.110044] BTRFS info (device dm-0): found 3 extents
Oct 13 21:37:41 server kernel: [ 3046.839285] BTRFS info (device dm-0): found 3 extents
Oct 13 21:37:41 server kernel: [ 3046.915783] BTRFS info (device dm-0): relocating block group 10065285218304 flags 1
Oct 13 21:37:41 server kernel: [ 3047.121301] BTRFS info (device dm-0): found 80 extents
Oct 13 21:37:42 server kernel: [ 3048.683889] BTRFS info (device dm-0): found 80 extents

Have a look at the timestamps to see how often it relocates.
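
A quick way to extract those intervals, assuming the syslog format shown above (the log file name is an example):

    # pull out the seconds-since-boot timestamp of each relocation
    # and print the delta to the previous one
    grep 'relocating block group' /var/log/kern.log \
      | sed 's/.*\[ *\([0-9.]*\)\].*/\1/' \
      | awk 'NR > 1 { printf "%.1fs\n", $1 - prev } { prev = $1 }'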

Running with your patches (btrfs balance start runs at 100% CPU):

Oct 13 11:05:55 server kernel: [43819.204669] BTRFS info (device dm-0): found 4119 extents
Oct 13 11:05:59 server kernel: [43822.209693] BTRFS info (device dm-0): relocating block group 31130489192448 flags 1
Oct 13 11:06:01 server kernel: [43824.873894] BTRFS info (device dm-0): found 3167 extents
Oct 13 11:26:47 server kernel: [45071.067252] BTRFS info (device dm-0): found 3167 extents
Oct 13 11:26:50 server kernel: [45074.521574] BTRFS info (device dm-0): relocating block group 31129415450624 flags 1
Oct 13 11:26:51 server kernel: [45075.207566] BTRFS info (device dm-0): found 67 extents
Oct 13 11:28:09 server kernel: [45153.253372] BTRFS info (device dm-0): found 67 extents
Oct 13 11:28:09 server kernel: [45153.487092] BTRFS info (device dm-0): relocating block group 30863127478272 flags 1
Oct 13 11:28:10 server kernel: [45154.693602] BTRFS info (device dm-0): found 532 extents
Oct 13 11:59:01 server kernel: [47005.552345] BTRFS info (device dm-0): found 532 extents
Oct 13 11:59:04 server kernel: [47008.421635] BTRFS info (device dm-0): relocating block group 30862053736448 flags 1
Oct 13 11:59:07 server kernel: [47012.114997] BTRFS info (device dm-0): found 210 extents
Oct 13 12:27:48 server kernel: [48733.392930] BTRFS info (device dm-0): found 210 extents
Oct 13 12:27:53 server kernel: [48738.682130] BTRFS info (device dm-0): relocating block group 30859906252800 flags 1
Oct 13 12:27:56 server kernel: [48741.766478] BTRFS info (device dm-0): found 366 extents
Oct 13 13:34:50 server kernel: [52756.680890] BTRFS info (device dm-0): found 366 extents
Oct 13 13:34:50 server kernel: [52756.914649] BTRFS info (device dm-0): relocating block group 30224251092992 flags 1
Oct 13 13:34:51 server kernel: [52758.008142] BTRFS info (device dm-0): found 3141 extents
Oct 13 13:54:25 server kernel: [53932.854512] BTRFS info (device dm-0): found 3141 extents
Oct 13 13:54:25 server kernel: [53933.005472] BTRFS info (device dm-0): relocating block group 30081443430400 flags 1
Oct 13 13:54:27 server kernel: [53934.236204] BTRFS info (device dm-0): found 240 extents
Oct 13 14:02:09 server kernel: [54396.818045] BTRFS info (device dm-0): found 240 extents
Oct 13 14:02:10 server kernel: [54397.742376] BTRFS info (device dm-0): relocating block group 30080369688576 flags 1
Oct 13 14:02:11 server kernel: [54398.470810] BTRFS info (device dm-0): found 139 extents
Oct 13 14:06:17 server kernel: [54644.867652] BTRFS info (device dm-0): found 139 extents

hhoffstaette commented 9 years ago

Thanks - that rules out a few things. Just asking because I had to rebase/clean up heavily over the last few days and wanted to suggest a clean clone instead of maybe an update to an old (possibly confused) copy, but if you made a clean clone yesterday you should be good.

The fact that it's running over dm is interesting (I don't use dm - only plain devices here), and the delays are very strange indeed. Right off the top of my head the only thing I can suggest is that you try again with BFS disabled (it's enabled by default when patched in), i.e. use the stock CFS scheduler. If that helps I might have an idea what's going on. Other than that I don't know what to do from here without access to the hardware.
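
One way to double-check which CPU scheduler a running kernel was actually built with - a sketch assuming CONFIG_IKCONFIG_PROC is enabled and that the BFS patch uses its usual CONFIG_SCHED_BFS symbol:

    # BFS normally shows up as CONFIG_SCHED_BFS in the build config
    zgrep -i 'sched_bfs' /proc/config.gz
    # alternatively, check the installed config:
    grep -i 'sched_bfs' /boot/config-"$(uname -r)"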

disaster123 commented 9 years ago

Yes, I'm running two md raids (RAID 5) with LVM striping on top, to have an expandable "RAID 0". I already removed all BFS patches to use CFS, and the logs above are from running CFS.
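
For readers following along, such a layout would be built roughly like this; device and volume names are made up:

    # two md RAID5 arrays as physical volumes, striped into one logical volume
    pvcreate /dev/md0 /dev/md1
    vgcreate vg_data /dev/md0 /dev/md1
    lvcreate -i 2 -I 256 -l 100%FREE -n lv_data vg_data   # -i: stripes, -I: stripe size in KiB
    mkfs.btrfs /dev/vg_data/lv_data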

Strange that you don't see this problem.

hhoffstaette commented 9 years ago

So it's not what I suspected (IO plugging, which is a constant source of confusion even for mainline devs). Strange.

That only leaves btrfs-* or something in dm+lvm. It's not impossible to track this down, but I don't know how much time you have or are willing to spend on this, since obviously I can't do it alone. If you want to, send me a private email (address is on the btrfs list) and we can arrange a Skype or phone call. I also have an idea about your never-ending ENOSPC issues, but I need more info.

hhoffstaette commented 9 years ago

One more suggestion: I'm assuming you're running with the single data profile over your dm+lvm setup. If so, can you please retry without btrfs-20150203-use-asynchronous-submit-for-large-DIO-io-in-single-profile.patch?
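
One way to back out just that patch from an already-patched tree - a sketch assuming the series was applied with plain patch(1) and the patch repo sits next to the kernel source (paths are examples):

    cd linux-4.1.10
    patch -R -p1 < ../kernel-patches/btrfs-20150203-use-asynchronous-submit-for-large-DIO-io-in-single-profile.patch
    make -j"$(nproc)" bzImage modules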

disaster123 commented 9 years ago

I'm on holidays right now. Thanks for all your tips and help. I'll try removing the mentioned patch first - I just need a chance to reboot the node, which currently hangs at balance.

hhoffstaette commented 9 years ago

Just to verify: I just ran a balance with -dusage=5 -musage=5 on my 1TB SSD here and it completed in ~1s. If this were "more obviously broken" I would have noticed it myself already.

disaster123 commented 9 years ago

Yes, but I have a lot of data to balance - so it never really starts balancing in your case. Maybe my use case is too crazy.

All of the above happens for 300 files each day (that's why I need 50TB).

hhoffstaette commented 9 years ago

I know I can balance more and that would take longer, obviously. Please stay on topic and let's not confuse the problem(s).

disaster123 commented 9 years ago

Removing the patch does not help.

The msg also stays the same. The last resort would be bisecting - but testing this would take a lot of time.

[ 1194.788641] BTRFS info (device dm-0): relocating block group 19853515685888 flags 1
[ 1196.729661] BTRFS info (device dm-0): found 187 extents
[ 1441.388043] INFO: task btrfs-transacti:7678 blocked for more than 120 seconds.
[ 1441.390113] Tainted: G O 4.1.10+39-ph #1
[ 1441.391698] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1441.393914] btrfs-transacti D ffff88081bfffd68 0 7678 2 0x00000000
[ 1441.396008] ffff88081bfffd68 ffff88085c338000 ffff880850648000 0000000000000246
[ 1441.398295] ffff88081c000000 ffff88085bd619f0 ffff88085bd619f0 0000000000000000
[ 1441.400583] ffff88105537e1f8 ffff88081bfffd88 ffffffffb463b797 ffff88085bd619f0
[ 1441.402874] Call Trace:
[ 1441.403574] [] schedule+0x37/0x90
[ 1441.405001] [] wait_current_trans.isra.34+0xa7/0x110 [btrfs]
[ 1441.407073] [] ? finish_wait+0x80/0x80
[ 1441.408614] [] start_transaction+0x240/0x540 [btrfs]
[ 1441.410497] [] btrfs_attach_transaction+0x17/0x20 [btrfs]
[ 1441.412502] [] transaction_kthread+0x16b/0x240 [btrfs]
[ 1441.414440] [] ? open_ctree+0x2390/0x2390 [btrfs]
[ 1441.416253] [] kthread+0xc9/0xe0
[ 1441.417633] [] ? kthread_create_on_node+0x1a0/0x1a0
[ 1441.419486] [] ret_from_fork+0x42/0x70
[ 1441.421014] [] ? kthread_create_on_node+0x1a0/0x1a0
[ 1481.612924] BTRFS info (device dm-0): found 187 extents

hhoffstaette commented 9 years ago

OK, thanks for testing - that patch never made it upstream and could interfere with dm/lvm. If I'm not mistaken the timeout comes from an attempt to replay the log and continue the previous balance, so if you just want to get up and running, try to mount with skip_balance. If that doesn't work either (with the timeout again) then I'm afraid I don't know. Maybe the free-space cache, but who knows. :( Note that this is not a crash, just a lack of progress. It still seems insane that mounting a fs takes >2 minutes.
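
A minimal sketch of that mount; device and mount point are placeholders:

    # do not resume the interrupted balance at mount time
    mount -o skip_balance /dev/dm-0 /mnt/volume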

disaster123 commented 9 years ago

This also happens if I cancel the balance and start a new one (same trace). I don't need the skip_balance mount option, as just booting the vanilla 4.1.10 kernel fixed it - it balances pretty fast and gives I/O to other processes on this volume.
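
For completeness, canceling a running balance looks like this (mount point is a placeholder); pause/resume are the gentler alternatives:

    btrfs balance cancel /mnt/volume
    # or suspend it across a reboot instead:
    btrfs balance pause /mnt/volume
    btrfs balance resume /mnt/volume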

hhoffstaette commented 9 years ago

OK, understood. Have you tried 4.3-rc5, just to see what happens? Does it have the same slow balance symptom?

disaster123 commented 9 years ago

Yes, same symptom.

hhoffstaette commented 9 years ago

Excellent! No, really. This means I didn't mess up somehow and it's a generic upstream bug which we can triage. Can you try 4.2.3? Sorry to be so insistent, but the more information we have the better, and the easier it will be to raise this on the list. I understand why you need balance in the first place (for the ENOSPC/allocator behaviour over time with your rsync) - sorry that this didn't work out.

disaster123 commented 9 years ago

Will try tomorrow. Have to go to bed now. Thanks for your help.

disaster123 commented 9 years ago

Bisecting your btrfs patches tells me it's the following (a bisect sketch is included after the quoted commit message):

126317b5dc4810381700a739ab0fdc2ed3ea95df is the first bad commit
commit 126317b5dc4810381700a739ab0fdc2ed3ea95df
Author: Mark Fasheh <mfasheh@suse.de>
Date: Tue Sep 22 13:15:48 2015 -0700

btrfs: qgroup: account shared subtree during snapshot delete

Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during
btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
going bad shortly after a snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651 (btrfs: qgroup:
account shared subtrees during snapshot delete).

The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.

The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
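
A minimal sketch of such a bisect over the patched branch, using commit ids mentioned earlier in this thread; note that bisecting a rebased patch stack only works if the intermediate commits still build:

    git bisect start
    git bisect bad 0d75ec897e2a91039ae391c5d04fdb4faab78b00   # patched head with the slow balance
    git bisect good v4.1.10                                   # vanilla baseline that balances fine
    # build, boot and test each step, then mark the result:
    git bisect good   # or: git bisect bad
    git bisect reset  # when finished
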
hhoffstaette commented 9 years ago

Thanks for the bisect. I wanted you to try 4.2.x for a very specific reason. :/

The following things don't add up:

If you really want to be sure you are on the right track, then try with all btrfs-* patches except the following four (a sketch of applying the series this way follows below):

btrfs-20150910-use-btrfs_get_fs_root-in-resolve_indirect_ref
btrfs-20150915-keep-dropped-roots-in-cache-until-transaction-commit
btrfs-20150922-account-shared-subtree-during-snapshot-delete
btrfs-20150922-add-qgroup-tracing

If that fixes the slow balance (which would be surprising but not impossible!) try to re-add btrfs-20150915-keep-dropped-roots-in-cache-until-transaction-commit and see what happens.
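
A sketch of applying the whole btrfs-* series minus those four, assuming the patches live in a sibling checkout of this repo; paths and exclude patterns are examples:

    EXCLUDE='20150910-use-btrfs_get_fs_root|20150915-keep-dropped-roots|20150922-account-shared-subtree|20150922-add-qgroup-tracing'
    for p in ../kernel-patches/btrfs-*.patch; do
        echo "$p" | grep -Eq "$EXCLUDE" && continue   # skip the four under suspicion
        patch -p1 < "$p"
    done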

disaster123 commented 9 years ago

4.2.3 works fine, at least for balance. I have to retry 4.3-rc5 - maybe I did something wrong.

disaster123 commented 9 years ago

OK, retested 4.3-rc5: it works fine too. No idea why my last test failed - maybe I was too tired and made a mistake.

disaster123 commented 9 years ago

Both ideas work fine: removing those 4 patches, and re-adding the keep-dropped-roots-in-cache patch.

hhoffstaette commented 9 years ago

This is great! I'll get in touch with Mark and see what he thinks.

disaster123 commented 9 years ago

I've found one more issue - should I create a new issue? On one system I see a permanent kworker process consuming 100% CPU.

hhoffstaette commented 9 years ago

This is really not the btrfs-debugtracker. Unless you can bisect and point out a specific patch that causes the problem, please try to run with an upstream kernel first and post to the list instead.

disaster123 commented 9 years ago

Sorry, sure - I never expected you to be the bugtracker. Sadly there is none.

disaster123 commented 9 years ago

Just FYI: I was able to fix this 2nd problem too by using the ssd allocator in btrfs instead of the default one.
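
In btrfs the ssd allocator can be forced via a mount option even on rotational storage; a sketch with placeholder paths:

    # force the ssd allocation scheme instead of relying on autodetection
    mount -o ssd /dev/dm-0 /mnt/volume
    # or toggle it on an already-mounted fs:
    mount -o remount,ssd /mnt/volume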

hhoffstaette commented 9 years ago

Haven't gotten a reply from Mark yet, but after further review of his patches I'm reasonably sure that whatever you have been seeing had nothing to do with them specifically. Therefore closing this with ¯\_(ツ)_/¯

hhoffstaette commented 9 years ago

Just FYI: Mark got my mail but has not yet had time to look into it. :-} I've removed the patch(es) recently in commit 08c46e9205220eadbc6fb859c665e606fc0ad0ce and will commit the updated version when it becomes available. Apparently - just as I suspected above but then missed, just like Mark - there really was a shortcut missing.

hhoffstaette commented 9 years ago

This should be fixed with the v2 update in c9d4d7404116c312feb51042620193cd154a0385.