naota / linux

Linux kernel source tree
Other
3 stars 1 forks source link

transaction abort in __btrfs_cow_block with -22 EINVAL #45

Closed kdave closed 3 years ago

kdave commented 3 years ago

The test was a git random checkout, on 2 devices (2G), with compression zstd:15 (with the recent compression patches applied, no problems), then followed by scrub, remount with compress=no, some subvolume creation and copying files.

[ 2563.190460] BTRFS info (device nullb0): use no compression
[ 2563.196209] BTRFS info (device nullb0): using free space tree
[ 2635.925391] BTRFS info (device nullb0): reclaiming chunk 1879048192 with 0% used
[ 2635.934448] BTRFS info (device nullb0): relocating block group 1879048192 flags data
[ 2636.165805] BTRFS info (device nullb0): found 49 extents, stage: move data extents
[ 2636.489277] BTRFS info (device nullb0): found 49 extents, stage: update data pointers
[ 2637.449847] BTRFS info (device nullb0): reclaiming chunk 2147483648 with 13% used
[ 2637.458188] BTRFS info (device nullb0): relocating block group 2147483648 flags data
[ 2639.922304] BTRFS info (device nullb0): found 1143 extents, stage: move data extents
[ 2640.188939] ------------[ cut here ]------------
[ 2640.193915] BTRFS: Transaction aborted (error -22)
[ 2640.198912] WARNING: CPU: 1 PID: 1777 at fs/btrfs/ctree.c:491 __btrfs_cow_block+0x43b/0x5f0 [btrfs]
[ 2640.208365] Modules linked in: null_blk af_packet btrfs i2c_algo_bit drm_ttm_helper ttm bridge drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm blake2b_generic libcrc32c xor zstd_decompress stp 
zstd_compress llc tpm_infineon tg3 iscsi_ibft tpm_tis tpm_tis_core iscsi_boot_sysfs tpm libphy xxhash raid6_pq mptctl i2c_piix4 serio_raw k10temp acpi_cpufreq button ext4 mbcache jbd2 ohci_pci ehci_pci mptsas ohci
_hcd ata_generic scsi_transport_sas ehci_hcd mptscsih mptbase usbcore sata_svw pata_serverworks sg scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 2640.208572] CPU: 1 PID: 1777 Comm: kworker/u16:2 Not tainted 5.13.0-rc2-git+ #761
[ 2640.208579] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
[ 2640.208586] Workqueue: events_unbound btrfs_reclaim_bgs_work [btrfs]
[ 2640.208693] RIP: 0010:__btrfs_cow_block+0x43b/0x5f0 [btrfs]
[ 2640.208766] Code: 1c 00 00 03 8b 44 24 04 72 26 83 f8 fb 0f 84 54 a2 0d 00 83 f8 e2 0f 84 4b a2 0d 00 89 c6 48 c7 c7 d0 92 9a c0 e8 13 e4 04 fa <0f> 0b 8b 44 24 04 89 44 24 04 89 c1 ba eb 01 00 00 48 c7 c6 d0 a
4
[ 2640.208772] RSP: 0018:ffffae930024fac8 EFLAGS: 00010282
[ 2640.208780] RAX: 0000000000000000 RBX: 0000000000000047 RCX: 0000000000000000
[ 2640.208785] RDX: ffff9208e71ea2c0 RSI: ffff9208e71d9970 RDI: ffff9208e71d9970
[ 2640.208789] RBP: ffff9208358b19d0 R08: 0000000000000001 R09: 0000000000000001
[ 2640.208793] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9208110dcfd0
[ 2640.208797] R13: ffff9207d7d54000 R14: ffff9207cec66910 R15: ffff92081393d2f8
[ 2640.208802] FS:  0000000000000000(0000) GS:ffff9208e7000000(0000) knlGS:0000000000000000
[ 2640.208807] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2640.208811] CR2: 00000000006a7188 CR3: 0000000121d2d000 CR4: 00000000000006e0
[ 2640.208816] Call Trace:
[ 2640.208849]  btrfs_cow_block+0x10a/0x160 [btrfs]
[ 2640.208938]  do_relocation+0x518/0x6f0 [btrfs]
[ 2640.209046]  ? lock_is_held_type+0x9a/0x110
[ 2640.209076]  ? select_one_root+0x3d/0x140 [btrfs]
[ 2640.388539]  ? _raw_spin_unlock+0x1f/0x40
[ 2640.392880]  relocate_tree_block+0x1a6/0x270 [btrfs]
[ 2640.393061]  relocate_tree_blocks+0xe8/0x260 [btrfs]
[ 2640.393211]  relocate_block_group+0x200/0x580 [btrfs]
[ 2640.408920]  btrfs_relocate_block_group+0x15f/0x300 [btrfs]
[ 2640.409028]  btrfs_relocate_chunk+0x38/0x120 [btrfs]
[ 2640.409130]  btrfs_reclaim_bgs_work.cold+0x63/0x147 [btrfs]
[ 2640.409297]  process_one_work+0x262/0x5f0
[ 2640.409337]  worker_thread+0x4e/0x300
[ 2640.409344]  ? process_one_work+0x5f0/0x5f0
[ 2640.409358]  kthread+0x141/0x160
[ 2640.409366]  ? __kthread_bind_mask+0x60/0x60
[ 2640.409380]  ret_from_fork+0x1f/0x30
[ 2640.409424] irq event stamp: 4393501
[ 2640.409428] hardirqs last  enabled at (4393507): [<ffffffffba10f1d5>] console_trylock_spinning+0x185/0x1a0
[ 2640.409437] hardirqs last disabled at (4393512): [<ffffffffba10f1ac>] console_trylock_spinning+0x15c/0x1a0
[ 2640.409444] softirqs last  enabled at (4393458): [<ffffffffbac00307>] __do_softirq+0x307/0x50f
[ 2640.409451] softirqs last disabled at (4393449): [<ffffffffba08ad56>] irq_exit_rcu+0xb6/0xf0
[ 2640.409459] ---[ end trace 947a79fb8c194609 ]---
[ 2640.409565] BTRFS: error (device nullb0) in __btrfs_cow_block:491: errno=-22 unknown
[ 2640.409575] BTRFS info (device nullb0): forced readonly
[ 2640.418851] BTRFS error (device nullb0): error relocating chunk 2147483648
$ btrfs fi df
Data, single: total=768.00MiB, used=262.49MiB, zone_unusable=188.62MiB
System, single: total=256.00MiB, used=16.00KiB, zone_unusable=288.00KiB
Metadata, single: total=256.00MiB, used=8.11MiB, zone_unusable=161.33MiB
GlobalReserve, single: total=3.25MiB, used=0.00B
$ blkzoned report device1
  start: 0x000000000, len 0x080000, cap 0x080000, wptr 0x0001a0 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000080000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000100000, len 0x080000, cap 0x080000, wptr 0x000240 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000180000, len 0x080000, cap 0x080000, wptr 0x053380 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x080000 reset:0 non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000280000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x080000 reset:0 non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000380000, len 0x080000, cap 0x080000, wptr 0x04cec8 reset:0 non-seq:0, zcond: 2(oi) [type: 2(SEQ_WRITE_REQUIRED)]
$ blkzoned report device2
  start: 0x000000000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000080000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000100000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000180000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000200000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000280000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000300000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000380000, len 0x080000, cap 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
kdave commented 3 years ago
 486         if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {                                                                                                                                                     
 487                 ret = btrfs_reloc_cow_block(trans, root, buf, cow);                                                                                                                                             
 488                 if (ret) {                                                                                                                                                                                      
 489                         btrfs_tree_unlock(cow);                                                                                                                                                                 
 490                         free_extent_buffer(cow);                                                                                                                                                                
 491                         btrfs_abort_transaction(trans, ret);                                                                                                                                                    
 492                         return ret;                                                                                                                                                                             
 493                 }                                                                                                                                                                                               
 494         } 
morbidrsa commented 3 years ago
 486         if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {                                                                                                                                                     
 487                 ret = btrfs_reloc_cow_block(trans, root, buf, cow);                                                                                                                                             
 488                 if (ret) {                                                                                                                                                                                      
 489                         btrfs_tree_unlock(cow);                                                                                                                                                                 
 490                         free_extent_buffer(cow);                                                                                                                                                                
 491                         btrfs_abort_transaction(trans, ret);                                                                                                                                                    
 492                         return ret;                                                                                                                                                                             
 493                 }                                                                                                                                                                                               
 494         }

This then probably is:

1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {                                                                                                          1035                 ret = -EINVAL;                                                                                                                                                  
1036                 goto out;                                                                                                                                                       
1037         }       

Via

btrfs_reloc_cow_block() 
`-> replace_file_extents() 
       `-> get_new_location()
fdmanana commented 3 years ago

This is very likely due to the ordered extent splitting done in zoned filesystems.

Relocation does not work well if the original extent is split into two or more.

Similar problem to what was fixed in:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=432cd2a10f1c10cead91fe706ff5dc52f06d642a

But this was due to during writeback having to fallback to COW and due to fragmented free space getting multiple ordered extents / bios submitted instead of a single one.

kdave commented 3 years ago

I've hit it again, same stack trace. It could be related to the compress=zstd:1 -> compress=no switch, with compression it was ok for a few minutes and then in a few seconds after it crashed. (It could be a false hunch too.)

[67043.421620] BTRFS info (device nullb0): use no compression
[67043.427268] BTRFS info (device nullb0): using free space tree
[67043.433910] BTRFS info (device nullb0): reclaiming chunk 2147483648 with 2% used
[67043.442983] BTRFS info (device nullb0): relocating block group 2147483648 flags data
[67044.327969] BTRFS info (device nullb0): found 470 extents, stage: move data extents
[67044.549314] ------------[ cut here ]------------
[67044.554250] BTRFS: Transaction aborted (error -22)
[67044.559267] WARNING: CPU: 0 PID: 15775 at fs/btrfs/ctree.c:491 __btrfs_cow_block+0x43b/0x5f0 [btrfs]
...
morbidrsa commented 3 years ago

This can be reproduced with fstests btrfs/004 as well.

morbidrsa commented 3 years ago

OK here's what I have so far (in the case of btrfs/004):

  1. It seems as if this is timing dependent, i.e. adding debug printk()s mitigate the error.
  2. Inode 257 is constant
  3. disk_num_bytes and disk_bytenr are variable
  4. The ordered extent is NOT getting split (i.e. disk_num_bytes == bio->bi_iter.bi_size in extract_ordered_extent())

Adding debug tracing I get:

    rapido1:/home/johannes/src/xfstests-dev# cat /sys/kernel/debug/tracing/trace| grep 1722478592
        kworker/u4:1-55      [001] ....    23.305236: cow_file_range: cow_file_range: ino: 257, disk_bytenr: 1722478592, disk_num_bytes: 65536, min_alloc_size: 65536, start: 18866176, end: 18931711
        kworker/u4:1-55      [001] ....    23.305315: extract_ordered_extent: extract_ordered_extent: ino: 257, oe->disk_bytenr:  1722478592, oe->disk_num_bytes: 65536, len: 65536
        kworker/u4:3-701     [001] ....    23.305460: btrfs_finish_ordered_io.isra.0: insert_ordered_extent_file_extent: inode: 257, disk_bytenr: 1722478592, disk_num_bytes: 65536
        kworker/u4:2-61      [000] ....    23.932027: replace_file_extents: get_new_location: ino: 257, bytenr: 18866176 (1092608000), num_bytes: 126976, fi->disk_bytenr: 1722478592, fi->disk_num_bytes: 65536
    rapido1:/home/johannes/src/xfstests-dev# echo $((18931711 - 18866176 + 1))
    65536

This looks all normal and expected, but get_new_location()'s num_bytes parameter doesn't match the on-disk num_bytes from the newly written file_extent_item, which is why it breaks.

fdmanana commented 3 years ago

OK here's what I have so far (in the case of btrfs/004):

  1. It seems as if this is timing dependent, i.e. adding debug printk()s mitigate the error.
  2. Inode 257 is constant
  3. disk_num_bytes and disk_bytenr are variable
  4. The ordered extent is NOT getting split (i.e. disk_num_bytes == bio->bi_iter.bi_size in extract_ordered_extent())

Adding debug tracing I get:

  rapido1:/home/johannes/src/xfstests-dev# cat /sys/kernel/debug/tracing/trace| grep 1722478592
      kworker/u4:1-55      [001] ....    23.305236: cow_file_range: cow_file_range: ino: 257, disk_bytenr: 1722478592, disk_num_bytes: 65536, min_alloc_size: 65536, start: 18866176, end: 18931711
      kworker/u4:1-55      [001] ....    23.305315: extract_ordered_extent: extract_ordered_extent: ino: 257, oe->disk_bytenr:  1722478592, oe->disk_num_bytes: 65536, len: 65536
      kworker/u4:3-701     [001] ....    23.305460: btrfs_finish_ordered_io.isra.0: insert_ordered_extent_file_extent: inode: 257, disk_bytenr: 1722478592, disk_num_bytes: 65536
      kworker/u4:2-61      [000] ....    23.932027: replace_file_extents: get_new_location: ino: 257, bytenr: 18866176 (1092608000), num_bytes: 126976, fi->disk_bytenr: 1722478592, fi->disk_num_bytes: 65536
  rapido1:/home/johannes/src/xfstests-dev# echo $((18931711 - 18866176 + 1))
  65536

This looks all normal and expected, but get_new_location()'s num_bytes parameter doesn't match the on-disk num_bytes from the newly written file_extent_item, which is why it breaks.

If the zoned code is not doing the ordered extent splitting I can only think of one thing right now: when running delalloc we limit the maximum range to 128M iirc - maybe the first 65536 of the extent are at the end of such range and the remaining would be considered for the next delalloc range. Or maybe there's some other thing causing the delalloc range to be split into several - maybe zone specific stuff there too?

morbidrsa commented 3 years ago

If the zoned code is not doing the ordered extent splitting I can only think of one thing right now: when running delalloc we limit the maximum range to 128M iirc - maybe the first 65536 of the extent are at the end of such range and the remaining would be considered for the next delalloc range. Or maybe there's some other thing causing the delalloc range to be split into several - maybe zone specific stuff there too?

There is zoned specific code in brtfs_run_delalloc_range() but run_delalloc_zoned() basically is a wrapper around cow_file_range() + extent_write_locked_range()

fdmanana commented 3 years ago

Ok, so it seems what needs to be figured out is why find_lock_delalloc_range() is not returning the whole expected range at once, and instead returning first a sub-range that gets run before the remaining.

I also noticed in the meanwhile for zoned mode, relocation does not preallocate extents in the relocation inode.

Maybe extent state records are missing when find_lock_delalloc_range() is called, or maybe that don't have the DELALLOC bit set, etc.

morbidrsa commented 3 years ago

Ok, so it seems what needs to be figured out is why find_lock_delalloc_range() is not returning the whole expected range at once, and instead returning first a sub-range that gets run before the remaining.

I also noticed in the meanwhile for zoned mode, relocation does not preallocate extents in the relocation inode.

Yes as preallocation would break our sequential write requirement.

Maybe extent state records are missing when find_lock_delalloc_range() is called, or maybe that don't have the DELALLOC bit set, etc.

Good idea, thanks for the tip.

morbidrsa commented 3 years ago

Here's something I noticed as well:

  rapido1:/home/johannes/src/xfstests-dev# cat /sys/kernel/debug/tracing/trace| grep 1722478592
      kworker/u4:1-55      [001] ....    23.305236: cow_file_range: cow_file_range: ino: 257, disk_bytenr: 1722478592, disk_num_bytes: 65536, min_alloc_size: 65536, start: 18866176, end: 18931711
      kworker/u4:2-61      [000] ....    23.932027: replace_file_extents: get_new_location: ino: 257, bytenr: 18866176 (1092608000), num_bytes: 126976, fi->disk_bytenr: 1722478592, fi->disk_num_bytes: 65536

Another thing I've just discovered:

In cow_file_range for disk_bytenr: 1722478592 I see we have a 'start' of 18866176 which corresponds to bytenr in get_new_location().

Not sure yet if this is expected or not.

fdmanana commented 3 years ago

That's expected IIRC, the offsets of file extent items in the data reloc inode should match the logical addresses of the data extents, so that we can do a lookup in the reloc tree for a file extent item that contains the relocated extent.

Either we are running delalloc for a shorter range than expected, due to some splitting, missing extent state, missing the delalloc bit on some extent state, etc, or the relocation code is not setting a part of the extent's range as delalloc. I also noticed that for the zoned case, because it does not preallocates extents, it has to setup the i_size of the data reloc inode - maybe it's not setting the size correctly in some case and we end up with an i_size that is smaller then "extent disk_bytenr + expected extent length"?

morbidrsa commented 3 years ago

Either we are running delalloc for a shorter range than expected, due to some splitting, missing extent state, missing the delalloc bit on some extent state, etc, or the relocation code is not setting a part of the extent's range as delalloc. I also noticed that for the zoned case, because it does not preallocates extents, it has to setup the i_size of the data reloc inode - maybe it's not setting the size correctly in some case and we end up with an i_size that is smaller then "extent disk_bytenr + expected extent length"?

This seems to be the case yes. If I add more debug information to prealloc_file_extent_cluster:

rapido1:/home/johannes/src/xfstests-dev# grep  -e 10878976 -e 1928560640 /host/trace.txt 
    kworker/u4:2-61      [001] ....    21.487117: prealloc_file_extent_cluster: prealloc_file_extent_cluster: ino: 257, prealloc_start: 10878976, prealloc_end: 11005951, end: 11005952, end - prealloc_start + 1: 126976
    kworker/u4:4-701     [000] ....    21.487841: cow_file_range: cow_file_range: ino: 257, disk_bytenr: 1928560640, disk_num_bytes: 12288, min_alloc_size: 12288, start: 10878976, end: 10891263
    kworker/u4:4-701     [000] ....    21.487996: extract_ordered_extent: extract_ordered_extent: ino: 257, oe->disk_bytenr:  1928560640, oe->disk_num_bytes: 12288, len: 12288
    kworker/u4:5-702     [001] ....    21.488490: btrfs_finish_ordered_io.isra.0: insert_ordered_extent_file_extent: inode: 257, disk_bytenr: 1928560640, disk_num_bytes: 12288
    kworker/u4:2-61      [000] ....    22.138421: replace_file_extents: get_new_location: ino: 257, bytenr: 10878976 (1084620800), num_bytes: 126976, fi->disk_bytenr: 1928560640, fi->disk_num_bytes: 12288

I get the desired number of bytes for the range starting at 'prealloc_start'. But changing the i_size_write() results in -ENOENT returns from __btrfs_cow_block().

morbidrsa commented 3 years ago

Maybe extent state records are missing when find_lock_delalloc_range() is called, or maybe that don't have the DELALLOC bit set, etc.

OK verified this as well, in cow_file_range() we have all of the start, end range with the EXTENT_DELALLOC bit set, but OTOH it already is the too small range.

Next one will be find_lock_delalloc_range() why we're not getting the full range.

morbidrsa commented 3 years ago

Maybe extent state records are missing when find_lock_delalloc_range() is called, or maybe that don't have the DELALLOC bit set, etc.

OK verified this as well, in cow_file_range() we have all of the start, end range with the EXTENT_DELALLOC bit set, but OTOH it already is the too small range.

Next one will be find_lock_delalloc_range() why we're not getting the full range.

rapido1:/home/johannes/src/xfstests-dev# grep get_new_location /sys/kernel/debug/tracing/trace
    kworker/u4:4-699     [000] ....    25.068999: replace_file_extents: get_new_location: ino: 257, bytenr: 2289664 (1076031488), num_bytes: 118784, fi->disk_bytenr: 1688629248, fi->disk_num_bytes: 28672
rapido1:/home/johannes/src/xfstests-dev# grep -e 1688629248 -e 2289664  /sys/kernel/debug/tracing/trace
    kworker/u4:4-699     [000] ....    24.336347: prealloc_file_extent_cluster: prealloc_file_extent_cluster: ino: 257, prealloc_start: 2289664, prealloc_end: 2408447, end: 2408448, end - prealloc_start: 118784
    kworker/u4:3-698     [000] ....    24.338055: find_lock_delalloc_range: find_lock_delalloc_range: start: 2289664, delalloc_start: 2289664, delalloc_end: 2318335
    kworker/u4:3-698     [000] ....    24.338200: cow_file_range: cow_file_range: ino: 257, disk_bytenr: 1688629248, disk_num_bytes: 28672, min_alloc_size: 28672, start: 2289664, end: 2318335, EXTENT_DELALLOC: 1
    kworker/u4:3-698     [000] ....    24.338345: extract_ordered_extent: extract_ordered_extent: ino: 257, oe->disk_bytenr:  1688629248, oe->disk_num_bytes: 28672, len: 28672
    kworker/u4:0-7       [000] ....    24.338505: btrfs_finish_ordered_io.isra.0: insert_ordered_extent_file_extent: inode: 257, disk_bytenr: 1688629248, disk_num_bytes: 28672
    kworker/u4:4-699     [000] ....    25.068999: replace_file_extents: get_new_location: ino: 257, bytenr: 2289664 (1076031488), num_bytes: 118784, fi->disk_bytenr: 1688629248, fi->disk_num_bytes: 28672
rapido1:/home/johannes/src/xfstests-dev# echo $((2318335 - 2289664 + 1))
28672

OK, the too small range is coming from btrfs_find_delalloc_range(). So I suspect we're missing the extent state record for the desired range.

morbidrsa commented 3 years ago

OK, the too small range is coming from btrfs_find_delalloc_range(). So I suspect we're missing the extent state record for the desired range.

The appropriate extent_state should come from the loop in relocate_file_extent_cluster(). It's calling btrfs_set_extent_delalloc() for every page in the cluster. But according to my instrumentation these calls to btrfs_set_extent_delalloc() are interleaved with find_delalloc_range() calls to subranges, that are not completely locked yet:

kworker/u4:2-61      [000] ....    23.653232: prealloc_file_extent_cluster: prealloc_file_extent_cluster: ino: 257, prealloc_start: 10719232, prealloc_end: 11128831, end: 11128832, end - prealloc_start: 409600, cluster->nr: 7
    kworker/u4:2-61      [000] ....    23.653264: relocate_file_extent_cluster: relocate_file_extent_cluster: cluster_start: 1084461056, cluster_end: 1084870655, index: 2617, last_index: 2716
    kworker/u4:2-61      [000] ....    23.654001: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10719232, end: 10723327 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654025: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10723328, end: 10727423 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654045: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10727424, end: 10731519 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654065: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10731520, end: 10735615 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654085: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10735616, end: 10739711 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654105: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10739712, end: 10743807 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654125: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10743808, end: 10747903 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654145: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10747904, end: 10751999 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654166: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10752000, end: 10756095 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654186: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10756096, end: 10760191 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654208: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10760192, end: 10764287 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654229: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10764288, end: 10768383 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654250: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10768384, end: 10772479 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654270: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10772480, end: 10776575 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654290: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10776576, end: 10780671 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654311: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10780672, end: 10784767 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654351: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10784768, end: 10788863 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654371: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10788864, end: 10792959 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654378: find_lock_delalloc_range: find_lock_delalloc_range: start: 10719232, delalloc_start: 10719232, delalloc_end: 10792959
    kworker/u4:2-61      [000] .N..    23.654397: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10792960, end: 10797055 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654438: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10797056, end: 10801151 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654461: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10801152, end: 10805247 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654481: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10805248, end: 10809343 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654502: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10809344, end: 10813439 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654523: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10813440, end: 10817535 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654544: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10817536, end: 10821631 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654565: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10821632, end: 10825727 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654570: cow_file_range: cow_file_range: ino: 257, disk_bytenr: 1712738304, disk_num_bytes: 73728, min_alloc_size: 73728, start: 10719232, end: 10792959, EXTENT_DELALLOC: 1
    kworker/u4:2-61      [000] .N..    23.654591: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10825728, end: 10829823 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654610: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10829824, end: 10833919 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654630: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10833920, end: 10838015 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654650: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10838016, end: 10842111 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654673: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10842112, end: 10846207 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654703: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10846208, end: 10850303 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654724: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10850304, end: 10854399 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654725: extract_ordered_extent: extract_ordered_extent: ino: 257, oe->disk_bytenr:  1712738304, oe->disk_num_bytes: 73728, len: 73728
    kworker/u4:2-61      [000] .N..    23.654745: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10854400, end: 10858495 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654765: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10858496, end: 10862591 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654785: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10862592, end: 10866687 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654806: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10866688, end: 10870783 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654827: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10870784, end: 10874879 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654847: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10874880, end: 10878975 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654868: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10878976, end: 10883071 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654889: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10883072, end: 10887167 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654909: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10887168, end: 10891263 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654929: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10891264, end: 10895359 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654953: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10895360, end: 10899455 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654979: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10899456, end: 10903551 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654982: btrfs_finish_ordered_io.isra.0: insert_ordered_extent_file_extent: inode: 257, disk_bytenr: 1712738304, disk_num_bytes: 73728
    kworker/u4:2-61      [000] .N..    23.655017: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10903552, end: 10907647 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655038: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10907648, end: 10911743 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655061: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10911744, end: 10915839 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655083: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10915840, end: 10919935 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655105: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10919936, end: 10924031 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655127: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10924032, end: 10928127 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655149: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10928128, end: 10932223 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655171: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10932224, end: 10936319 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655193: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10936320, end: 10940415 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655216: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10940416, end: 10944511 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655244: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10944512, end: 10948607 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655264: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10948608, end: 10952703 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655284: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10952704, end: 10956799 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655307: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10956800, end: 10960895 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655329: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10960896, end: 10964991 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655352: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10964992, end: 10969087 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655373: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10969088, end: 10973183 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655396: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10973184, end: 10977279 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655418: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10977280, end: 10981375 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655445: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10981376, end: 10985471 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655465: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10985472, end: 10989567 (len: 4096), extra_bits: 0x00000000
kworker/u4:2-61      [000] .N..    23.655485: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10989568, end: 10993663 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655507: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10993664, end: 10997759 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655530: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10997760, end: 11001855 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655553: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11001856, end: 11005951 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655576: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11005952, end: 11010047 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655598: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11010048, end: 11014143 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655626: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11014144, end: 11018239 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655652: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11018240, end: 11022335 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655672: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11022336, end: 11026431 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655692: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11026432, end: 11030527 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655715: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11030528, end: 11034623 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655737: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11034624, end: 11038719 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655759: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11038720, end: 11042815 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655780: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11042816, end: 11046911 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655802: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11046912, end: 11051007 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655825: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11051008, end: 11055103 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655847: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11055104, end: 11059199 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655870: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11059200, end: 11063295 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655891: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11063296, end: 11067391 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655913: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11067392, end: 11071487 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655935: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11071488, end: 11075583 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655956: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11075584, end: 11079679 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655977: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11079680, end: 11083775 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655999: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11083776, end: 11087871 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656025: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11087872, end: 11091967 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656044: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11091968, end: 11096063 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656064: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11096064, end: 11100159 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656086: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11100160, end: 11104255 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656107: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11104256, end: 11108351 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656129: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11108352, end: 11112447 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656150: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11112448, end: 11116543 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656181: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11116544, end: 11120639 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656201: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11120640, end: 11124735 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656223: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11124736, end: 11128831 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656247: find_lock_delalloc_range: find_lock_delalloc_range: start: 10792960, delalloc_start: 10792960, delalloc_end: 10825727

Are we missing a lock here?

fdmanana commented 3 years ago

OK, the too small range is coming from btrfs_find_delalloc_range(). So I suspect we're missing the extent state record for the desired range.

The appropriate extent_state should come from the loop in relocate_file_extent_cluster(). It's calling btrfs_set_extent_delalloc() for every page in the cluster. But according to my instrumentation these calls to btrfs_set_extent_delalloc() are interleaved with find_delalloc_range() calls to subranges, that are not completely locked yet:

kworker/u4:2-61      [000] ....    23.653232: prealloc_file_extent_cluster: prealloc_file_extent_cluster: ino: 257, prealloc_start: 10719232, prealloc_end: 11128831, end: 11128832, end - prealloc_start: 409600, cluster->nr: 7
    kworker/u4:2-61      [000] ....    23.653264: relocate_file_extent_cluster: relocate_file_extent_cluster: cluster_start: 1084461056, cluster_end: 1084870655, index: 2617, last_index: 2716
    kworker/u4:2-61      [000] ....    23.654001: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10719232, end: 10723327 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654025: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10723328, end: 10727423 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654045: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10727424, end: 10731519 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654065: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10731520, end: 10735615 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654085: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10735616, end: 10739711 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654105: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10739712, end: 10743807 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654125: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10743808, end: 10747903 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654145: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10747904, end: 10751999 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654166: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10752000, end: 10756095 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654186: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10756096, end: 10760191 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654208: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10760192, end: 10764287 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654229: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10764288, end: 10768383 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654250: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10768384, end: 10772479 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654270: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10772480, end: 10776575 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654290: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10776576, end: 10780671 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] ....    23.654311: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10780672, end: 10784767 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654351: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10784768, end: 10788863 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654371: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10788864, end: 10792959 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654378: find_lock_delalloc_range: find_lock_delalloc_range: start: 10719232, delalloc_start: 10719232, delalloc_end: 10792959
    kworker/u4:2-61      [000] .N..    23.654397: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10792960, end: 10797055 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654438: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10797056, end: 10801151 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654461: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10801152, end: 10805247 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654481: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10805248, end: 10809343 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654502: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10809344, end: 10813439 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654523: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10813440, end: 10817535 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654544: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10817536, end: 10821631 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654565: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10821632, end: 10825727 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654570: cow_file_range: cow_file_range: ino: 257, disk_bytenr: 1712738304, disk_num_bytes: 73728, min_alloc_size: 73728, start: 10719232, end: 10792959, EXTENT_DELALLOC: 1
    kworker/u4:2-61      [000] .N..    23.654591: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10825728, end: 10829823 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654610: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10829824, end: 10833919 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654630: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10833920, end: 10838015 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654650: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10838016, end: 10842111 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654673: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10842112, end: 10846207 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654703: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10846208, end: 10850303 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654724: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10850304, end: 10854399 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654725: extract_ordered_extent: extract_ordered_extent: ino: 257, oe->disk_bytenr:  1712738304, oe->disk_num_bytes: 73728, len: 73728
    kworker/u4:2-61      [000] .N..    23.654745: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10854400, end: 10858495 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654765: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10858496, end: 10862591 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654785: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10862592, end: 10866687 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654806: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10866688, end: 10870783 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654827: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10870784, end: 10874879 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654847: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10874880, end: 10878975 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654868: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10878976, end: 10883071 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654889: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10883072, end: 10887167 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654909: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10887168, end: 10891263 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654929: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10891264, end: 10895359 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654953: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10895360, end: 10899455 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.654979: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10899456, end: 10903551 (len: 4096), extra_bits: 0x00000000
    kworker/u4:1-55      [001] ....    23.654982: btrfs_finish_ordered_io.isra.0: insert_ordered_extent_file_extent: inode: 257, disk_bytenr: 1712738304, disk_num_bytes: 73728
    kworker/u4:2-61      [000] .N..    23.655017: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10903552, end: 10907647 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655038: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10907648, end: 10911743 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655061: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10911744, end: 10915839 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655083: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10915840, end: 10919935 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655105: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10919936, end: 10924031 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655127: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10924032, end: 10928127 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655149: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10928128, end: 10932223 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655171: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10932224, end: 10936319 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655193: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10936320, end: 10940415 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655216: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10940416, end: 10944511 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655244: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10944512, end: 10948607 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655264: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10948608, end: 10952703 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655284: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10952704, end: 10956799 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655307: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10956800, end: 10960895 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655329: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10960896, end: 10964991 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655352: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10964992, end: 10969087 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655373: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10969088, end: 10973183 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655396: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10973184, end: 10977279 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655418: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10977280, end: 10981375 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655445: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10981376, end: 10985471 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655465: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10985472, end: 10989567 (len: 4096), extra_bits: 0x00000000
kworker/u4:2-61      [000] .N..    23.655485: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10989568, end: 10993663 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655507: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10993664, end: 10997759 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655530: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 10997760, end: 11001855 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655553: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11001856, end: 11005951 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655576: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11005952, end: 11010047 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655598: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11010048, end: 11014143 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655626: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11014144, end: 11018239 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655652: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11018240, end: 11022335 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655672: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11022336, end: 11026431 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655692: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11026432, end: 11030527 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655715: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11030528, end: 11034623 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655737: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11034624, end: 11038719 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655759: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11038720, end: 11042815 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655780: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11042816, end: 11046911 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655802: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11046912, end: 11051007 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655825: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11051008, end: 11055103 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655847: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11055104, end: 11059199 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655870: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11059200, end: 11063295 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655891: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11063296, end: 11067391 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655913: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11067392, end: 11071487 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655935: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11071488, end: 11075583 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655956: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11075584, end: 11079679 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655977: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11079680, end: 11083775 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.655999: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11083776, end: 11087871 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656025: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11087872, end: 11091967 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656044: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11091968, end: 11096063 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656064: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11096064, end: 11100159 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656086: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11100160, end: 11104255 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656107: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11104256, end: 11108351 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656129: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11108352, end: 11112447 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656150: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11112448, end: 11116543 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656181: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11116544, end: 11120639 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656201: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11120640, end: 11124735 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656223: btrfs_set_extent_delalloc: btrfs_set_extent_delalloc: start: 11124736, end: 11128831 (len: 4096), extra_bits: 0x00000000
    kworker/u4:2-61      [000] .N..    23.656247: find_lock_delalloc_range: find_lock_delalloc_range: start: 10792960, delalloc_start: 10792960, delalloc_end: 10825727

Are we missing a lock here?

We aren't. That's a consequence of not preallocating the extents like in the non-zoned case. In between setting up pages, writeback can kick in at any time, resulting in those partial ranges/extents.

Locking the full range won't solve the problem, as finding the delalloc range only locks the range after finding it. The writeback path can't also take the inode's vfs lock, as that could deadlock (setting up the pages can trigger writeback). This works pretty much the same as the buffered write path for regular inodes. I'm not seeing a simple fix right away. Maybe taking the vfs inode lock before setting up all pages, moving the call to balance_dirty_pages_ratelimited() out of that critical section, and making writeback of the data reloc inode take the vfs inode lock might work - not entirely sure if it will.

morbidrsa commented 3 years ago

Locking the full range won't solve the problem, as finding the delalloc range only locks the range after finding it. The writeback path can't also take the inode's vfs lock, as that could deadlock (setting up the pages can trigger writeback). This works pretty much the same as the buffered write path for regular inodes. I'm not seeing a simple fix right away. Maybe taking the vfs inode lock before setting up all pages, moving the call to balance_dirty_pages_ratelimited() out of that critical section, and making writeback of the data reloc inode take the vfs inode lock might work - not entirely sure if it will.

Yup I just spotted both the call to balance_dirty_pages_ratelimited() and the different CPU number on the find_lock_delalloc_range() calls. I'll try implementing your idea and report back.

morbidrsa commented 3 years ago

This proposal ran 50 times without exposing the error (but ENOSPC issues which might be a different problem or real ENOSPC):

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index fc831597cb22..e536865a59de 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2921,11 +2921,14 @@ static int relocate_file_extent_cluster(struct inode *inode,

        index = (cluster->start - offset) >> PAGE_SHIFT;
        last_index = (cluster->end - offset) >> PAGE_SHIFT;
+       btrfs_inode_lock(inode, 0);
        while (index <= last_index) {
                ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode),
                                PAGE_SIZE);
-               if (ret)
+               if (ret) {
+                       btrfs_inode_unlock(inode, 0);
                        goto out;
+               }

                page = find_lock_page(inode->i_mapping, index);
                if (!page) {
@@ -2939,6 +2942,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
                                                        PAGE_SIZE, true);
                                btrfs_delalloc_release_extents(BTRFS_I(inode),
                                                        PAGE_SIZE);
+                               btrfs_inode_unlock(inode, 0);
                                ret = -ENOMEM;
                                goto out;
                        }
@@ -2950,6 +2954,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
                        btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
                        unlock_page(page);
                        put_page(page);
+                       btrfs_inode_unlock(inode, 0);
                        goto out;
                }

@@ -2969,6 +2974,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
                                                        PAGE_SIZE, true);
                                btrfs_delalloc_release_extents(BTRFS_I(inode),
                                                               PAGE_SIZE);
+                               btrfs_inode_unlock(inode, 0);
                                ret = -EIO;
                                goto out;
                        }
@@ -3000,6 +3006,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
                        clear_extent_bits(&BTRFS_I(inode)->io_tree,
                                          page_start, page_end,
                                          EXTENT_LOCKED | EXTENT_BOUNDARY);
+                       btrfs_inode_unlock(inode, 0);
                        goto out;

                }
@@ -3011,14 +3018,16 @@ static int relocate_file_extent_cluster(struct inode *inode,
                put_page(page);

                index++;
-               btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
-               balance_dirty_pages_ratelimited(inode->i_mapping);
-               btrfs_throttle(fs_info);
                if (btrfs_should_cancel_balance(fs_info)) {
+                       btrfs_inode_unlock(inode, 0);
                        ret = -ECANCELED;
                        goto out;
                }
        }
+       btrfs_inode_unlock(inode, 0);
+       btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
+       balance_dirty_pages_ratelimited(inode->i_mapping);
+       btrfs_throttle(fs_info);
        WARN_ON(nr != cluster->nr);
        if (btrfs_is_zoned(fs_info) && !ret)
                ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
fdmanana commented 3 years ago

It misses the inode lock in the writeback path (extent_io.c) for the data reloc inode. As it is, mm or something else outside btrfs might still trigger writeback at any time.

Is the enospc always coming from the same place? Does it cause a transaction abort?

morbidrsa commented 3 years ago

It misses the inode lock in the writeback path (extent_io.c) for the data reloc inode. As it is, mm or something else outside btrfs might still trigger writeback at any time.

Ah I thought the write back path already takes the inode lock. I wonder why the error is gone then? Changed timing?

Is the enospc always coming from the same place? Does it cause a transaction abort?

Yes it's coming out if btrfs_free_extent() (haven't had time to investigate it further):

       } else if (WARN_ON(ret == -ENOENT)) {                                    
                btrfs_print_leaf(path->nodes[0]);                                
                btrfs_err(info,                                                  
                        "unable to find ref byte nr %llu parent %llu root %llu  owner %llu offset %llu",
                        bytenr, parent, root_objectid, owner_objectid,           
                        owner_offset);                                           
                btrfs_abort_transaction(trans, ret);                             
                goto out;                                                        
        } else {                                                                 
                btrfs_abort_transaction(trans, ret);                             <<--
                goto out;                                                        
        }                                                                        

        leaf = path->nodes[0];                                                   
        item_size = btrfs_item_size_nr(leaf, extent_slot);                       
fdmanana commented 3 years ago

It misses the inode lock in the writeback path (extent_io.c) for the data reloc inode. As it is, mm or something else outside btrfs might still trigger writeback at any time.

Ah I thought the write back path already takes the inode lock. I wonder why the error is gone then? Changed timing?

Likely because writeback was being triggered only by the call to balance_dirty_pages_ratelimited()

Writeback path doesn't take the inode's lock (rwsem), as that could lead to deadlock. And IIRC this is true for at least all the major filesystems. We just to have to be sure that while holding the inode locked we never trigger writeback (like calling balance_dirty_pages_ratelimited() or anything that does allocations with GFP_KERNEL for e.g.).

Is the enospc always coming from the same place? Does it cause a transaction abort?

Yes it's coming out if btrfs_free_extent() (haven't had time to investigate it further):

       } else if (WARN_ON(ret == -ENOENT)) {                                    
                btrfs_print_leaf(path->nodes[0]);                                
                btrfs_err(info,                                                  
                        "unable to find ref byte nr %llu parent %llu root %llu  owner %llu offset %llu",
                        bytenr, parent, root_objectid, owner_objectid,           
                        owner_offset);                                           
                btrfs_abort_transaction(trans, ret);                             
                goto out;                                                        
        } else {                                                                 
                btrfs_abort_transaction(trans, ret);                             <<--
                goto out;                                                        
        }                                                                        

        leaf = path->nodes[0];                                                   
        item_size = btrfs_item_size_nr(leaf, extent_slot);                       
morbidrsa commented 3 years ago

Likely because writeback was being triggered only by the call to balance_dirty_pages_ratelimited()

But uncommenting it didn’t change the behavior.

Writeback path doesn't take the inode's lock (rwsem), as that could lead to deadlock. And IIRC this is true for at least all the major filesystems. We just to have to be sure that while holding the inode locked we never trigger writeback (like calling balance_dirty_pages_ratelimited() or anything that does allocations with GFP_KERNEL for e.g.).

There are two more options though:

  1. Creating a new lock for this case
  2. Checking if we can do in memory only preallocation of files so we’re on par with the non-zoned case.
morbidrsa commented 3 years ago

After adding a call to btrfs_inode_lock() into run_delalloc_zoned() I got this lockdep splat:

btrfs/004       [   18.754754]               
[   18.754997] ============================================
[   18.755704] WARNING: possible recursive locking detected
[   18.756403] 5.13.0-rc7-btrfs-004 #1105 Not tainted
[   18.757037] --------------------------------------------
[   18.757731] fsstress/697 is trying to acquire lock:
[   18.758368] ffff888106ddd2c8 (&sb->s_type->i_mutex_key#11){++++}-{3:3}, at: run_delalloc_zoned+0x41/0xb0 [btrfs]
[   18.759803]                                            
[   18.759803] but task is already holding lock: 
[   18.760562] ffff888106ddd2c8 (&sb->s_type->i_mutex_key#11){++++}-{3:3}, at: btrfs_inode_lock+0x38/0x70 [btrfs]      
[   18.761924]                                                                                                         
[   18.761924] other info that might help us debug this:
[   18.762780]  Possible unsafe locking scenario:
[   18.762780]                                          
[   18.763581]        CPU0
[   18.763909]        ---- 
[   18.764240]   lock(&sb->s_type->i_mutex_key#11);
[   18.764854]   lock(&sb->s_type->i_mutex_key#11);
[   18.765474]                               
[   18.765474]  *** DEADLOCK ***                     
[   18.765474]                               
[   18.766255]  May be due to missing lock nesting notation
[   18.766255]                                                                                                         
[   18.767154] 3 locks held by fsstress/697:                                                                           
[   18.767691]  #0: ffff88810e0b1430 (sb_writers#9){.+.+}-{0:0}, at: __do_sys_copy_file_range+0xb1/0x1e0
[   18.768902]  #1: ffff888106ddd2c8 (&sb->s_type->i_mutex_key#11){++++}-{3:3}, at: btrfs_inode_lock+0x38/0x70 [btrfs]
[   18.770314]  #2: ffff888106ddd158 (&ei->i_mmap_lock){+.+.}-{3:3}, at: btrfs_inode_lock+0x49/0x70 [btrfs]
[   18.771606]                             
[   18.771606] stack backtrace:                            
[   18.772181] CPU: 0 PID: 697 Comm: fsstress Not tainted 5.13.0-rc7-btrfs-004 #1105
[   18.773154] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[   18.774614] Call Trace:
[   18.774955]  dump_stack+0x6d/0x89                                                                                                                                                                                                          
[   18.775410]  __lock_acquire.cold+0xad/0x2ab            
[   18.775964]  lock_acquire+0xab/0x340      
[   18.776445]  ? run_delalloc_zoned+0x41/0xb0 [btrfs]                                                                 
[   18.777140]  ? lock_is_held_type+0x98/0x110                                                                         
[   18.777698]  ? find_held_lock+0x2b/0x80
[   18.778205]  down_write+0x25/0x60
[   18.778647]  ? run_delalloc_zoned+0x41/0xb0 [btrfs]
[   18.779334]  run_delalloc_zoned+0x41/0xb0 [btrfs]
[   18.780004]  btrfs_run_delalloc_range+0x12b/0x540 [btrfs]
[   18.780764]  ? find_lock_delalloc_range+0x1d5/0x200 [btrfs]
[   18.781546]  writepage_delalloc+0x9e/0x150 [btrfs]                                                                  
[   18.782229]  __extent_writepage+0xfa/0x3a0 [btrfs]
[   18.782908]  extent_write_cache_pages+0x24e/0x3f0 [btrfs]
[   18.783677]  extent_writepages+0x44/0xa0 [btrfs]
[   18.784336]  do_writepages+0x28/0xa0                                                                                
[   18.784810]  filemap_flush+0x76/0xa0                                                                                
[   18.785285]  btrfs_remap_file_range+0x145/0x4d0 [btrfs]                                                             
[   18.786041]  vfs_copy_file_range+0x267/0x550                                                                        
[   18.786607]  __do_sys_copy_file_range+0xb1/0x1e0                                                                    
[   18.787213]  do_syscall_64+0x45/0x80                                                                                
[   18.787705]  entry_SYSCALL_64_after_hwframe+0x44/0xae                                                               
[   18.788373] RIP: 0033:0x7fa8a5868e4d
[   18.788849] Code: c8 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f3 7f 0c 00 f7 d8 64 89 01 48
[   18.791282] RSP: 002b:00007ffd4dabd458 EFLAGS: 00000202 ORIG_RAX: 0000000000000146
[   18.792276] RAX: ffffffffffffffda RBX: 00000000004144cb RCX: 00007fa8a5868e4d                                       
[   18.793200] RDX: 0000000000000004 RSI: 00007ffd4dabd490 RDI: 0000000000000003                                       
[   18.794129] RBP: 0000000000000003 R08: 00000000000144cb R09: 0000000000000000                                       
[   18.795061] R10: 00007ffd4dabd498 R11: 0000000000000202 R12: 0000000000000027                                       
[   18.796003] R13: 0000000000000004 R14: 00000000000144cb R15: 000000000034d472                                       

This comes from the btrfs_inode_lock() calls in btrfs_remap_file_range().

But for other callpaths that end up in btrfs_writepages() don't take the inode lock. So this is getting tricky.

fdmanana commented 3 years ago

That's likely because you made writeback lock any inode, right? With any inode, you'll get the issue on clone/dedupe, fsync, and many other places. That's the issue I mentioned earlier about a deadlock. But since we can't have clone/fsync/etc called for the data reloc inode...

If you make writeback only lock the data reloc inode, it shouldn't happen unless lockdep gets confused somehow.

naota commented 3 years ago

I think adding btrfs_inode_lock() into run_delalloc_zoned() won't solve the case anyway. Because run_delalloc_zoned() is called after we decide the delalloc range, we can't stop the short allocation.

How about keeping the pages locked while in the loop in relocate_file_extent_cluster()? Or, can't we btrfs_set_extent_delalloc() after the loop?

fdmanana commented 3 years ago

I think adding btrfs_inode_lock() into run_delalloc_zoned() won't solve the case anyway. Because run_delalloc_zoned() is called after we decide the delalloc range, we can't stop the short allocation.

Why not? The idea was to lock the inode before setting up the ranges and make running delalloc also lock the inode early on. That way writeback will always have to wait for the full range to be setup.

How about keeping the pages locked while in the loop in relocate_file_extent_cluster()? Or, can't we btrfs_set_extent_delalloc() after the loop?

Keeping the pages locked should be equivalent to using the inode lock (or any new lock) as mentioned before. Though we would still need to move out the call to balance_dirty_pages_ratelimited() from the critical section, and make sure nothing we call inside the critical section does memory allocations that can trigger reclaim.

naota commented 3 years ago

I think adding btrfs_inode_lock() into run_delalloc_zoned() won't solve the case anyway. Because run_delalloc_zoned() is called after we decide the delalloc range, we can't stop the short allocation.

Why not? The idea was to lock the inode before setting up the ranges and make running delalloc also lock the inode early on. That way writeback will always have to wait for the full range to be setup.

I admit the idea itself. I doubt the implementation Johaness said. As far as I understand, we need to block delalloc goes on while in the loop in relocate_file_extent_cluster(). But since run_delalloc_zoned() is called after deciding the delalloc range, we can still see a non-full range to be delalloced without hitting the lock.

How about keeping the pages locked while in the loop in relocate_file_extent_cluster()? Or, can't we btrfs_set_extent_delalloc() after the loop?

Keeping the pages locked should be equivalent to using the inode lock (or any new lock) as mentioned before. Though we would still need to move out the call to balance_dirty_pages_ratelimited() from the critical section, and make sure nothing we call inside the critical section does memory allocations that can trigger reclaim.

By keeping the pages locked, we can naturally block the writeback process before deciding delalloc range (at lock_page() in extent_write_cache_pages())

I'm currently testing a patch to call btrfs_set_extent_delalloc() after the loop.

morbidrsa commented 3 years ago

I've submitted a patch reliably fixing the problem on my side: https://github.com/btrfs/linux/issues/306

morbidrsa commented 3 years ago

Fixes are in misc-next.