hhoffstaette / kernel-patches

Custom Linux kernel patches

Oops with latest master while mounting #6

Closed disaster123 closed 8 years ago

disaster123 commented 8 years ago

Hi,

I get this Oops with latest master (654011f518b23ea6198e272d5c26f15d16f76c16) while trying to mount my volume.

    [ 499.744427] IP: [] qgroup_fix_relocated_data_extents+0x37/0x290 [btrfs]
    [ 499.821720] PGD 13c0b067 PUD 13c0d067 PMD 0
    [ 499.897724] Oops: 0000 [#1] SMP
    [ 499.971888] Modules linked in: usbhid coretemp loop ehci_pci ehci_hcd i40e(O) sb_edac vxlan i2c_i801 usbcore ipmi_si ip6_udp_tunnel shpchp edac_core i2c_core udp_tunnel usb_common ipmi_msghandler button btrfs dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 md_mod sg ixgbe mdio sd_mod ahci ptp aacraid libahci pps_core
    [ 500.210789] CPU: 7 PID: 4351 Comm: mount Tainted: G O 4.4.19+50-ph #1
    [ 500.292250] Hardware name: Supermicro X10DRH/X10DRH-IT, BIOS 1.0c 02/18/2015
    [ 500.374453] task: ffff88017c3f0000 ti: ffff881055480000 task.ti: ffff881055480000
    [ 500.457499] RIP: 0010:[] [] qgroup_fix_relocated_data_extents+0x37/0x290 [btrfs]
    [ 500.542738] RSP: 0018:ffff8810554838f8 EFLAGS: 00010246
    [ 500.628202] RAX: ffff88085a3eb800 RBX: 0000000000000000 RCX: 0000000000000000
    [ 500.714229] RDX: ffff88017fb36130 RSI: ffff880856d98000 RDI: ffff88017fb360a0
    [ 500.800816] RBP: ffff881055483978 R08: 000060ef80006d60 R09: ffff88017fb360a0
    [ 500.886848] R10: 000000001a240c01 R11: ffffea00216ee900 R12: ffff880856db2800
    [ 500.973393] R13: ffff88085766c000 R14: 0000000000000000 R15: ffff880856d98000
    [ 501.059455] FS: 00007faf522bb7e0(0000) GS:ffff88085fce0000(0000) knlGS:0000000000000000
    [ 501.145334] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 501.228042] CR2: fffffffffffffe10 CR3: 000000085ac45000 CR4: 00000000001406e0
    [ 501.310558] Stack:
    [ 501.390012] ffff881055483968 ffffffffc02133b3 ffff881055483938 ffff88017fb360a0
    [ 501.469982] 00ff881055483978 0000000000000000 0000000000000000 ffff88017c3f0000
    [ 501.548102] ffff881055483938 ffff88081a384000 ffff880856db2800 ffff88081a384000
    [ 501.624543] Call Trace:
    [ 501.700253] [] ? start_transaction+0x93/0x490 [btrfs]
    [ 501.777879] [] btrfs_recover_relocation+0x367/0x410 [btrfs]
    [ 501.855888] [] open_ctree+0x218c/0x25d0 [btrfs]
    [ 501.935233] [] btrfs_mount+0xbda/0xdf0 [btrfs]
    [ 502.014666] [] ? find_next_zero_bit+0x1d/0x20
    [ 502.094322] [] ? pcpu_next_unpop+0x4b/0x60
    [ 502.171407] [] mount_fs+0x20/0xb0
    [ 502.246182] [] vfs_kern_mount+0x74/0x130
    [ 502.321479] [] btrfs_mount+0x1b0/0xdf0 [btrfs]
    [ 502.396489] [] ? find_next_zero_bit+0x1d/0x20
    [ 502.471876] [] ? pcpu_next_unpop+0x4b/0x60
    [ 502.545970] [] mount_fs+0x20/0xb0
    [ 502.619673] [] vfs_kern_mount+0x74/0x130
    [ 502.692394] [] do_mount+0x218/0xda0
    [ 502.765371] [] ? alloc_pages_current+0x96/0x110
    [ 502.839144] [] ? __get_free_pages+0xe/0x40
    [ 502.912810] [] ? copy_mount_options+0x3a/0x150
    [ 502.986691] [] SyS_mount+0x7a/0xc0
    [ 503.061035] [] entry_SYSCALL_64_fastpath+0x12/0x71
    [ 503.135531] Code: 80 4c 89 75 f0 48 89 5d d8 45 31 f6 4c 89 65 e0 4c 89 6d e8 4c 89 7d f8 48 8b 46 08 48 89 7d 98 48 8b 5e 10 4c 8b a8 f0 01 00 00 <4c> 8b a3 10 fe ff ff 41 f6 85 98 0d 00 00 01 74 09 80 be b0 05
    [ 503.292318] RIP [] qgroup_fix_relocated_data_extents+0x37/0x290 [btrfs]
    [ 503.371008] RSP
    [ 503.447396] CR2: fffffffffffffe10
    [ 503.520935] ---[ end trace 67f0b219193ca330 ]---

hhoffstaette commented 8 years ago

I just used qgroups in various ways (set limits to subvols, exceeded limits, cleared them..), mounted/unmounted a few times and cannot reproduce this at all. The 3 patches with qgroup_fix_relocated_data_extents() are not new to .19, so this should have happened with previous versions as well (since .17) and likely means that your volume is corrupted, or at least the quota information.

For now the easiest thing to try is a kernel without those qgroup patches, to see whether the volume mounts again.

disaster123 commented 8 years ago

Yes you're right. The first working kernel is one without all those qgroup patches. Do you know how to clean qgroups? Just mount with noquota?

hhoffstaette commented 8 years ago

I usually don't use qgroups at all but don't think noquota is a valid btrfs option. You will probably have to mount with a working kernel and then issue btrfs quota disable <volume>. If you really want to be sure you can also delete all the qgroups with btrfs qgroup destroy <id>.
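
For reference, the whole sequence would look something like this (the mountpoint and the qgroup id are placeholders):

    # disable quotas entirely on the mounted volume
    btrfs quota disable /mnt/vol
    # or list and remove individual qgroups
    btrfs qgroup show /mnt/vol
    btrfs qgroup destroy 0/257 /mnt/vol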

Btw since I can use qgroups just fine with those patches, it indicates that your volume is hosed, or that the patches somehow fail with your particular volume state. I've sent an email to Qu Wenruo, maybe he can comment on possible causes/fixes.

disaster123 commented 8 years ago

Thanks, but neither works, since quota and qgroups were never in use. btrfs just reports that there is no such file or directory when trying to show qgroups.

hhoffstaette commented 8 years ago

In that case please try removing those 3 patches; that should work for now. Depending on what feedback I get from Qu I can just revert them, so let's wait a day or two first. Does btrfs check say anything meaningful?

The strange part (and somewhat valuable information) is that qgroup_fix_relocated_data_extents() does not do anything when quota is disabled - see right at the beginning:

    if (!fs_info->quota_enabled)
        return 0;

So whatever went wrong must have happened before that. Strange, but maybe it gives Qu a hint.
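
One more observation that may help: the faulting address in the trace (CR2: fffffffffffffe10) is exactly -0x1f0, and a small negative address like that is the classic signature of reading a member through a container_of()-style macro that was handed a NULL pointer, i.e. a dereference in the function prologue before the quota check even runs. A minimal userspace sketch of that pattern (all struct and field names here are hypothetical, not the actual btrfs code):

    #include <inttypes.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical layout: 'member' sits 0x1f0 bytes into 'outer'. */
    struct inner { int x; };
    struct outer {
        char pad[0x1f0];
        struct inner member;
    };

    /* Minimal container_of(), as in the kernel. */
    #define container_of(ptr, type, field) \
        ((type *)((char *)(ptr) - offsetof(type, field)))

    int main(void)
    {
        struct inner *in = NULL;  /* a pointer that was never set up */
        struct outer *out = container_of(in, struct outer, member);

        /* 'out' is now (struct outer *)-0x1f0, i.e. 0xfffffffffffffe10
         * on x86-64, the same value as CR2 above. Any read through it
         * faults before a later quota_enabled check could bail out. */
        printf("0x%" PRIxPTR "\n", (uintptr_t)out);
        return 0;
    }

If that is what happens here, the real problem is whatever left that pointer NULL, not the quota code itself.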

adam900710 commented 8 years ago

Would you please show the code of "qgroup_fix_relocated_data_extents+0x37"?
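
For example, with a btrfs.ko built with debug info (the module path is just an example):

    gdb /lib/modules/$(uname -r)/kernel/fs/btrfs/btrfs.ko
    (gdb) list *(qgroup_fix_relocated_data_extents+0x37)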

On my desktop, it's the last line of that function, which is definitely not correct.

Also, are you using balance with qgroups enabled? The trace shows that it's doing balance recovery. I could try to build such a case to see if I can reproduce it.

Thanks

adam900710 commented 8 years ago

I built an image which leads to the same call trace.

It's just an image with a balance interrupted halfway by power loss. And it does enter btrfs_recover_relocation() and then qgroup_fix_data_extents().

But just as Holger said, in the qgroup-disabled case qgroup_fix_data_extents() will just exit without doing anything, so the bug is not triggered.

Would you please try the following kernel to see if it can reproduce the bug? https://github.com/adam900710/linux/tree/qgroup_fix_for_testing
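
You can fetch it with (branch name taken from the URL above; then build with your usual config):

    git clone -b qgroup_fix_for_testing https://github.com/adam900710/linux.git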

Maybe there are some other btrfs patches needed to cooperate with the fix?

Thanks

hhoffstaette commented 8 years ago

@adam900710 Thank you for taking a look. As you can see from the list of commits at https://github.com/hhoffstaette/kernel-patches/commits/master there may be several qgroups-related suspects that are not yet in mainline (but part of David's last pull request for Chris), though it's not obvious to me how they might cause problems:

f79526e268d41cd0c6f0353e101a9d39d26c3ce3 "btrfs: waiting on qgroup rescan should not always be interruptible"
bd7cd6e0d26c932f5b597474f2dcafa1894cf2a6 "btrfs: properly track when qgroup rescan worker is running"
fb23f4b8706404a6a088a8ff3ca843ee60a0dc9f "btrfs: don't create or leak aliased root while cleaning up orphans"

Could there be something wrong with the root/fs_info or some of the other values that are initialized before the qgroups-enabled check?

adam900710 commented 8 years ago

@hhoffstaette Yes, I also suspect that. In that case it's fs_info that got corrupted, which should cause more problems, and then it's not just the qgroup patches: any btrfs patch could be the cause.

The best method is to do a bisect to locate the problem, but since neither you nor I can reproduce it, I'm afraid we can only ask @disaster123 to do the bisect.
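
Roughly like this, assuming the kernel with the patches applied is a git tree (the good point below is just an example; use the last build that mounted cleanly):

    git bisect start
    git bisect bad HEAD           # the build that oopses on mount
    git bisect good v4.4.17       # example: last version known to work
    # then for each step: build, boot, retry the mount, and mark the result
    git bisect good               # ...or: git bisect bad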

Thanks

disaster123 commented 8 years ago

@adam900710 Hi, I booted the system with an older kernel without the patches and was able to mount the volume. It then immediately started a balance, which seems to have been started by a cronjob before the crash.

After that I was able to boot into the new kernel with your patches successfully.

hhoffstaette commented 8 years ago

@disaster123 I really recommend that you always mount with skip_balance to prevent a balance from interfering with the mount. Automatically continuing a balance should never have been the default, and has caused really bad problems in other cases as well.
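
For example, as a one-off mount or permanently in /etc/fstab (device and mountpoint below are placeholders):

    mount -o skip_balance /dev/sdb1 /mnt/vol
    # or in /etc/fstab:
    /dev/sdb1  /mnt/vol  btrfs  defaults,skip_balance  0  0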

disaster123 commented 8 years ago

Thanks, didn't know that. Why not just change the default?

hhoffstaette commented 8 years ago

> Why not just change the default?

Lack of interest, "compatibility" (always a good excuse), "someone might be surprised".. IMHO these are all nonsense and the flag should just be nuked. Last time I checked disabling it (= not continuing) was just removing a single line of code. The full solution is probably a bit more work.

hhoffstaette commented 8 years ago

So now that it seems the problem was some unrelated corruption, I think I can close this. Depending on how it works out, the next release I'll start using is either 4.8 or 4.9, which apparently will be the next LTS.

disaster123 commented 8 years ago

I got this one today even with skip_balance:

    [ 423.047327] BUG: unable to handle kernel paging request at fffffffffffffe10
    [ 423.049408] IP: [] qgroup_fix_relocated_data_extents+0x27/0x270 [btrfs]
    [ 423.051879] PGD dc0a067 PUD dc0c067 PMD 0
    [ 423.053182] Oops: 0000 [#1] SMP
    [ 423.054240] Modules linked in: binfmt_misc netconsole ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables x_tables 8021q garp bonding x86_pkg_temp_thermal coretemp kvm_intel kvm i40e(O) vxlan ip6_udp_tunnel irqbypass udp_tunnel i2c_i801 crc32_pclmul sb_edac edac_core i2c_core shpchp ipmi_si ipmi_msghandler button loop fuse btrfs dm_mod raid10 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq ixgbe mdio raid1 usbhid md_mod sg ehci_pci sd_mod ehci_hcd ahci usbcore ptp aacraid usb_common libahci pps_core
    [ 423.071586] CPU: 6 PID: 2006 Comm: mount Tainted: G O 4.4.19+51-ph #1
    [ 423.073726] Hardware name: Supermicro X10DRH/X10DRH-IT, BIOS 1.0c 02/18/2015
    [ 423.075718] task: ffff880072aa0000 ti: ffff8808582b8000 task.ti: ffff8808582b8000
    [ 423.077842] RIP: 0010:[] [] qgroup_fix_relocated_data_extents+0x27/0x270 [btrfs]
    [ 423.080899] RSP: 0018:ffff8808582bb9d8 EFLAGS: 00010246
    [ 423.082400] RAX: 0000000000000000 RBX: ffff8807b0b7e000 RCX: 0000000000000000
    [ 423.084428] RDX: ffff880072490090 RSI: ffff88085e6fa800 RDI: ffff880072490000
    [ 423.086452] RBP: ffff8808582bba40 R08: ffffea0001c92400 R09: ffff880072490000
    [ 423.088470] R10: 0000000000000000 R11: 0000000000019280 R12: ffff880858298000
    [ 423.090488] R13: ffff8808582bba68 R14: ffff88085e6fa800 R15: 0000000000000000
    [ 423.092506] FS: 00007f9a89f6c840(0000) GS:ffff88085fcc0000(0000) knlGS:0000000000000000
    [ 423.094793] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 423.096425] CR2: fffffffffffffe10 CR3: 0000000857a35000 CR4: 00000000001406e0
    [ 423.098442] Stack:
    [ 423.099008] ffff8808582bba30 ffffffffc03bd00a 00ff8808582bba40 0000000000000000
    [ 423.101296] ffff880072aa0000 0000000000000000 ffff8807b0b7e000 ffff88085af03800
    [ 423.103584] ffff8807b0b7e000 ffff88085af03800 ffff8808582bba68 ffff88085e6fa800
    [ 423.105872] Call Trace:
    [ 423.106579] [] ? start_transaction+0x9a/0x4f0 [btrfs]
    [ 423.183594] [] btrfs_recover_relocation+0x393/0x420 [btrfs]
    [ 423.261243] [] open_ctree+0x229b/0x2720 [btrfs]
    [ 423.339182] [] btrfs_mount+0xc0c/0xd00 [btrfs]
    [ 423.416076] [] ? find_next_zero_bit+0x1a/0x20
    [ 423.491618] [] mount_fs+0x15/0x90
    [ 423.566430] [] vfs_kern_mount+0x67/0x110
    [ 423.640715] [] btrfs_mount+0x1dd/0xd00 [btrfs]
    [ 423.714554] [] mount_fs+0x15/0x90
    [ 423.787384] [] vfs_kern_mount+0x67/0x110
    [ 423.859799] [] do_mount+0x1dc/0xcd0
    [ 423.931151] [] SyS_mount+0x8b/0xd0
    [ 424.002243] [] entry_SYSCALL_64_fastpath+0x12/0x71
    [ 424.074026] Code: 0f 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 ec 40 48 8b 46 08 4c 8b 7e 10 4c 8b a0 f0 01 00 00 31 c0 <4d> 8b af 10 fe ff ff 41 f6 84 24 70 0d 00 00 01 74 09 80 be b0
    [ 424.227298] RIP [] qgroup_fix_relocated_data_extents+0x27/0x270 [btrfs]
    [ 424.304467] RSP
    [ 424.381426] CR2: fffffffffffffe10
    [ 424.457940] ---[ end trace 706e612d9f2cd389 ]---

disaster123 commented 8 years ago

As I don't use qgroups at all, I will simply revert those three patches.

hhoffstaette commented 8 years ago

With this and the other balance problem it's fairly clear that there is something else going on, and that this particular problem on mount is just a side effect. However, since I don't care about quota I've just reverted the three patches.