openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs_bclone_wait_dirty=1 broken for files with unallocated blocks at the end #15994

Closed · rrevans closed this issue 7 months ago

rrevans commented 8 months ago

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 39
Kernel Version 6.6.8-200.fc39.x86_64
Architecture x86_64
OpenZFS Version Built from head (8f2f6cd2ac6 DEBUG)

Describe the problem you're observing

With zfs_bclone_enabled=1 and zfs_bclone_wait_dirty=1, copying a file with unallocated blocks at the end hangs in the kernel forever. While it is stuck, the kernel also forces txg syncs in an infinite loop.

This behavior is also observed in #15933.

(The same underlying issue also means that bclone always fails for files with unallocated blocks at the end if zfs_bclone_wait_dirty=0.)

Describe how to reproduce the problem

$ echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled
$ echo 1 > /sys/module/zfs/parameters/zfs_bclone_wait_dirty
$ cd /testpool
$ truncate --size=256M src
$ zpool sync
$ cp --reflink=always src dst

cp never finishes and is stuck in an uninterruptible state, unresponsive to both SIGINT and SIGQUIT.

Setting zfs_bclone_wait_dirty=0 while cp is still running causes cp to fail immediately with the error: cp: failed to clone 'dst' from 'src': Resource temporarily unavailable
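
For what it's worth, the hang does not depend on cp itself: on Linux, cp --reflink=always clones via the FICLONE ioctl, so calling that ioctl directly blocks the same way. A minimal standalone reproducer (a sketch of mine, assuming the same module parameters are set as above):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return (1);
	}
	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return (1);
	}
	/*
	 * With zfs_bclone_wait_dirty=1 this never returns for a fully
	 * sparse source; with =0 it fails with EAGAIN instead.
	 */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("FICLONE");
		return (1);
	}
	return (0);
}

Compile with cc -o clone clone.c and run ./clone src dst from /testpool.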

Include any warning/errors/backtraces from the system logs

Nothing shows up immediately in dmesg or dbgmsg, but during the failure /proc/spl/kstat/zfs/testpool/txgs shows that ZFS is generating a steady stream of empty txgs (tail skips the kstat header line; the five zeroed columns are ndirty, nread, nwritten, reads, and writes):

$ tail -30 /proc/spl/kstat/zfs/testpool/txgs
1571017  393315256830     C     0            0            0            0        0        701860       6090         25220        77480       
1571018  393315958690     C     0            0            0            0        0        704610       6090         25140        77340       
1571019  393316663300     C     0            0            0            0        0        695860       6800         25400        106120      
1571020  393317359160     C     0            0            0            0        0        719080       6120         25310        88020       
1571021  393318078240     C     0            0            0            0        0        710381       6100         25260        78150       
1571022  393318788621     C     0            0            0            0        0        710010       6100         25780        78240       
1571023  393319498631     C     0            0            0            0        0        693820       6130         25420        79990       
1571024  393320192451     C     0            0            0            0        0        701620       6090         24460        77400       
1571025  393320894071     C     0            0            0            0        0        701870       6100         25250        78390       
1571026  393321595941     C     0            0            0            0        0        696431       6080         25560        77490       
1571027  393322292372     C     0            0            0            0        0        699960       6070         25330        77490       
1571028  393322992332     C     0            0            0            0        0        702420       6220         25210        77350       
1571029  393323694752     C     0            0            0            0        0        695030       6160         25730        97120       
1571030  393324389782     C     0            0            0            0        0        710290       6080         25340        87720       
1571031  393325100072     C     0            0            0            0        0        709801       6070         25430        78120       
1571032  393325809873     C     0            0            0            0        0        697140       6240         24630        77370       
1571033  393326507013     C     0            0            0            0        0        696710       6080         25130        78150       
1571034  393327203723     C     0            0            0            0        0        700170       6110         24730        77050       
1571035  393327903893     C     0            0            0            0        0        703480       6140         25691        78570       
1571036  393328607373     C     0            0            0            0        0        698401       6110         25590        78960       
1571037  393329305774     C     0            0            0            0        0        699180       6090         25290        77460       
1571038  393330004954     C     0            0            0            0        0        703890       6090         25230        77220       
1571039  393330708844     C     0            0            0            0        0        694670       6080         25520        96280       
1571040  393331403514     C     0            0            0            0        0        707651       6080         25380        87750       
1571041  393332111165     C     0            0            0            0        0        711130       6250         25450        77950       
1571042  393332822295     C     0            0            0            0        0        705860       6120         24780        83430       
1571043  393333528155     C     0            0            0            0        0        745090       6140         34510        96930       
1571044  393334273245     C     0            0            0            0        0        790700       6180         32650        106500      
1571045  393335063945     S     0            0            0            0        0        784211       6140         32690        0           
1571046  393335848156     O     0            0            0            0        0        0            0            0            0           

Root cause

#15842 adds logic to wait for dirty blocks to sync, implemented by forcing a txg sync whenever dmu_read_l0_bps returns EAGAIN, but the logic is broken. cc @behlendorf

Normally dmu_read_l0_bps returns EAGAIN for dirty blocks. However, it also returns EAGAIN whenever db->db_blkptr == NULL. That normally happens for newly written blocks that have not yet been allocated, but it also happens for sparse, unallocated blocks beyond the end of a fully-synced object. (More specifically, it happens for any of the conditions that cause dbuf_findbp to return ENOENT when holding the dbuf.)

In this situation, zfs_clone_range tries to force a sync when zfs_bclone_wait_dirty=1, but syncing does not allocate any blocks since none are actually dirty. The next attempt then hits the same condition and syncs again, in an infinite loop. Setting zfs_bclone_wait_dirty=0 breaks the loop and returns the error to cp.
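
To make the failure mode concrete, here is a toy userspace model of the retry loop (illustrative only; this is not OpenZFS code, and every name in it is made up). It shows why the forced sync can never clear EAGAIN for a clean, never-allocated hole:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct l0_block {
	bool dirty;      /* pending write: a txg sync will allocate a BP */
	bool has_blkptr; /* false for a never-allocated hole */
};

/*
 * Models dmu_read_l0_bps: EAGAIN whenever there is no block pointer,
 * whether the block is dirty or simply never existed on disk.
 */
static int
read_l0_bps(const struct l0_block *b)
{
	return (b->has_blkptr ? 0 : EAGAIN);
}

/*
 * Models the sync forced by zfs_clone_range under
 * zfs_bclone_wait_dirty=1: it only helps dirty blocks.
 */
static void
txg_sync(struct l0_block *b)
{
	if (b->dirty) {
		b->dirty = false;
		b->has_blkptr = true;
	}
}

int
main(void)
{
	/* Trailing block of a fully-synced sparse file: clean, no BP. */
	struct l0_block hole = { .dirty = false, .has_blkptr = false };

	for (int attempt = 1; attempt <= 5; attempt++) {
		if (read_l0_bps(&hole) == 0) {
			puts("clone succeeded");
			return (0);
		}
		printf("attempt %d: EAGAIN, forcing txg sync\n", attempt);
		txg_sync(&hole); /* no-op: nothing is dirty */
	}
	puts("still EAGAIN; in the kernel this loop never exits");
	return (0);
}

Because the sync is a no-op when nothing is dirty, every retry observes the same db_blkptr == NULL state, which matches the endless stream of empty txgs in the kstat output above.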

This is trivially reproducible by creating an empty sparse file, as seen by zdb:

$ zdb -vvvvvv -O testpool src
obj=128 dataset=testpool path=/src type=19 bonustype=44

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       128    1   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                               184   bonus  System attributes
    dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 0
    uid     1000
    gid     1000
    atime   Wed Mar 13 23:17:42 2024
    mtime   Wed Mar 13 23:17:42 2024
    ctime   Wed Mar 13 23:17:42 2024
    crtime  Wed Mar 13 23:17:42 2024
    gen 2368119
    mode    100644
    size    268435456
    parent  34
    links   1
    pflags  840800000004
    xattr   129
Indirect blocks:

Note that despite the ZPL metadata reporting a 256 MiB file size, the actual on-disk object is still a 1-level object with dn_maxblkid == 0 and no indirect blocks, which is sufficient to trigger the db_blkptr == NULL case upon dbuf_hold.
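
As an extra sanity check (a sketch of mine, assuming the usual SEEK_DATA semantics), userspace can confirm that the object has no allocated data at all: lseek(fd, 0, SEEK_DATA) fails with ENXIO when the entire file is a single hole.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("src", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}
	off_t off = lseek(fd, 0, SEEK_DATA);
	if (off < 0 && errno == ENXIO)
		puts("no allocated data: the file is one big hole");
	else
		printf("first data at offset %lld\n", (long long)off);
	return (0);
}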

amotin commented 8 months ago

This seems to be an issue completely unrelated to #15933. #16007 should fix this.