openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

some copied files are still corrupted (chunks replaced by zeros) #15933

Closed thulle closed 6 months ago

thulle commented 7 months ago

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version (rolling)
Kernel Version 6.7.6-gentoo-x86_64
Architecture amd64
OpenZFS Version 2.2.3

Describe the problem you're observing

After seeing 2.2.3 released and #15526 closed, I enabled block cloning again and tried my unfailing reproducer: compiling go. It still seems to trigger the bug, or some related one.

Worth noting is that I'm running on top of LUKS. Scrub reports no issues.
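(For reference, the scrub check amounts to something like the following sketch; the pool name is a placeholder:)

# zpool scrub <pool>
# zpool status -v <pool>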

Describe how to reproduce the problem

# echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled 

# emerge --buildpkg=n --usepkg=n --quiet-build=y dev-lang/go 
>>> Completed (1 of 1) dev-lang/go-1.22.0::gentoo

# file /usr/lib/go/pkg/tool/linux_amd64/*|grep data$
/usr/lib/go/pkg/tool/linux_amd64/asm:       data
/usr/lib/go/pkg/tool/linux_amd64/cgo:       data
/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/usr/lib/go/pkg/tool/linux_amd64/link:      data
/usr/lib/go/pkg/tool/linux_amd64/vet:       data

# hexdump -C /usr/lib/go/pkg/tool/linux_amd64/asm
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fa0  00 00 00 00 00 00 00 00  00 00 00 00 53 43 51 58  |............SCQX|
00000fb0  63 56 34 74 49 75 66 65  59 67 64 4a 32 54 54 4e  |cV4tIufeYgdJ2TTN|
00000fc0  2f 51 78 63 67 68 50 6b  43 33 67 48 31 54 66 64  |/QxcghPkC3gH1Tfd|
00000fd0  70 59 48 39 33 2f 43 36  32 31 50 69 70 58 45 71  |pYH93/C621PipXEq|
00000fe0  30 6f 5f 39 4c 57 65 71  49 4d 2f 48 32 52 64 56  |0o_9LWeqIM/H2RdV|
00000ff0  50 64 46 45 4c 77 6d 77  4a 37 42 31 51 33 47 00  |PdFELwmwJ7B1Q3G.|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
002346c0  53 43 51 58 63 56 34 74  49 75 66 65 59 67 64 4a  |SCQXcV4tIufeYgdJ|
002346d0  32 54 54 4e 2f 51 78 63  67 68 50 6b 43 33 67 48  |2TTN/QxcghPkC3gH|
002346e0  31 54 66 64 70 59 48 39  33 2f 43 36 32 31 50 69  |1TfdpYH93/C621Pi|
002346f0  70 58 45 71 30 6f 5f 39  4c 57 65 71 49 4d 2f 48  |pXEq0o_9LWeqIM/H|
00234700  32 52 64 56 50 64 46 45  4c 77 6d 77 4a 37 42 31  |2RdVPdFELwmwJ7B1|
00234710  51 33 47 00 00 00 00 00  00 00 00 00 00 00 00 00  |Q3G.............|
00234720  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
003a88d0  00 00 00 00 00 00 00 00  00 00 00 00 00           |.............|
003a88dd

# zfs -V
zfs-2.2.3-r0-gentoo
zfs-kmod-2.2.3-r0-gentoo
thesamesam commented 7 months ago

cc @robn

thulle commented 7 months ago
\ thulle: also, it matters as to when the corruption occurs. is it when `cp` calls stuff in src_install via Go (is it corrupted already in the image dir?) or is it OK in the image dir, but corrupted when it gets merged to the livefs?

Ran the steps manually first without issue, then made a script that passed the first loop, but stopped on the 2nd loop:

#!/bin/bash
function checkfiles {
    if [ "`file ${1}/*|grep data$`" != "" ];then
        echo Corrupted files in ${1}:
        file ${1}/*|grep data$
        exit
    fi
}
cd /usr/portage/dev-lang/go
while :;do
    ebuild go-1.22.0.ebuild clean
    ebuild go-1.22.0.ebuild compile
    checkfiles /var/tmp/portage/dev-lang/go-1.22.0/work/go/pkg/tool/linux_amd64
    ebuild go-1.22.0.ebuild install
    checkfiles /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64
    ebuild go-1.22.0.ebuild merge
    checkfiles /usr/lib/go/pkg/tool/linux_amd64
done

On the 2nd loop I got:

Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet: data
thulle commented 7 months ago

\ neat. the next question i'd have, is if you have brt enabled, and set zfs_bclone_wait_dirty=1, does the issue also go away?

# echo 1 > /sys/module/zfs/parameters/zfs_bclone_wait_dirty

Started the script again, but it stalled, and something seemed to be stuck writing at ~10MB/s to the NVMe drive. I killed the script and saw this in syslog:

Feb 26 15:35:27 soma kernel: INFO: task kworker/u65:1:22874 blocked for more than 122 seconds.
Feb 26 15:35:27 soma kernel:       Tainted: P        W  OE   T  6.7.6-gentoo-x86_64 #2
Feb 26 15:35:27 soma kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 15:35:27 soma kernel: task:kworker/u65:1   state:D stack:0     pid:22874 tgid:22874 ppid:2      flags:0x00004000
Feb 26 15:35:27 soma kernel: Workqueue: writeback wb_workfn (flush-zfs-5)
Feb 26 15:35:27 soma kernel: Call Trace:
Feb 26 15:35:27 soma kernel:  <TASK>
Feb 26 15:35:27 soma kernel:  __schedule+0x277/0x680
Feb 26 15:35:27 soma kernel:  schedule+0x31/0xe0
Feb 26 15:35:27 soma kernel:  __cv_broadcast+0x154/0x190 [spl]
Feb 26 15:35:27 soma kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Feb 26 15:35:27 soma kernel:  zfs_rangelock_exit+0x815/0x930 [zfs]
Feb 26 15:35:27 soma kernel:  zfs_putpage+0x128/0x7f0 [zfs]
Feb 26 15:35:27 soma kernel:  zio_do_crypt_abd+0x21fd/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel:  write_cache_pages+0x13e/0x3b0
Feb 26 15:35:27 soma kernel:  ? zio_do_crypt_abd+0x21c0/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  zio_do_crypt_abd+0x1bb1/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel:  do_writepages+0xce/0x1f0
Feb 26 15:35:27 soma kernel:  __writeback_single_inode+0x3d/0x290
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  writeback_sb_inodes+0x206/0x4e0
Feb 26 15:35:27 soma kernel:  __writeback_inodes_wb+0x4c/0xf0
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  wb_writeback+0x29e/0x310
Feb 26 15:35:27 soma kernel:  wb_do_writeback+0x238/0x2c0
Feb 26 15:35:27 soma kernel:  wb_workfn+0x5b/0x270
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  ? queue_delayed_work_on+0x8d/0xa0
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  process_one_work+0x173/0x340
Feb 26 15:35:27 soma kernel:  worker_thread+0x28b/0x3b0
Feb 26 15:35:27 soma kernel:  ? preempt_count_add+0x7a/0xb0
Feb 26 15:35:27 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel:  ? __pfx_worker_thread+0x10/0x10
Feb 26 15:35:27 soma kernel:  kthread+0xf6/0x130
Feb 26 15:35:27 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 15:35:27 soma kernel:  ret_from_fork+0x30/0x50
Feb 26 15:35:27 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 15:35:27 soma kernel:  ret_from_fork_asm+0x1b/0x30
Feb 26 15:35:27 soma kernel:  </TASK>

Checking the process list, I suspect this straggler is the culprit; it's the only remaining child of the ebuild Python process:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      9953 33.3  0.0  10136  2008 ?        RN   15:31  12:42 cp -R api bin doc lib pkg misc src test /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go

edit: As soon as I disabled zfs_bclone_wait_dirty, the write load disappeared and the cp process died.

thulle commented 7 months ago

Rebooted and tried the same thing again, but this time without killing the script, instead just disabling zfs_bclone_wait_dirty. Waited for the trace first:

Feb 26 16:37:51 soma kernel: INFO: task kworker/u66:2:2121 blocked for more than 122 seconds.
Feb 26 16:37:51 soma kernel:       Tainted: P        W  OE   T  6.7.6-gentoo-x86_64 #2
Feb 26 16:37:51 soma kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 16:37:51 soma kernel: task:kworker/u66:2   state:D stack:0     pid:2121  tgid:2121  ppid:2      flags:0x00004000
Feb 26 16:37:51 soma kernel: Workqueue: writeback wb_workfn (flush-zfs-5)
Feb 26 16:37:51 soma kernel: Call Trace:
Feb 26 16:37:51 soma kernel:  <TASK>
Feb 26 16:37:51 soma kernel:  __schedule+0x277/0x680
Feb 26 16:37:51 soma kernel:  schedule+0x31/0xe0
Feb 26 16:37:51 soma kernel:  __cv_broadcast+0x154/0x190 [spl]
Feb 26 16:37:51 soma kernel:  ? __pfx_autoremove_wake_function+0x10/0x10
Feb 26 16:37:51 soma kernel:  zfs_rangelock_exit+0x815/0x930 [zfs]
Feb 26 16:37:51 soma kernel:  zfs_putpage+0x128/0x7f0 [zfs]
Feb 26 16:37:51 soma kernel:  zio_do_crypt_abd+0x21fd/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel:  write_cache_pages+0x13e/0x3b0
Feb 26 16:37:51 soma kernel:  ? zio_do_crypt_abd+0x21c0/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  zio_do_crypt_abd+0x1bb1/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel:  do_writepages+0xce/0x1f0
Feb 26 16:37:51 soma kernel:  __writeback_single_inode+0x3d/0x290
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  writeback_sb_inodes+0x206/0x4e0
Feb 26 16:37:51 soma kernel:  __writeback_inodes_wb+0x4c/0xf0
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  wb_writeback+0x29e/0x310
Feb 26 16:37:51 soma kernel:  wb_do_writeback+0x238/0x2c0
Feb 26 16:37:51 soma kernel:  wb_workfn+0x5b/0x270
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel:  ? try_to_wake_up+0x9d/0x690
Feb 26 16:37:51 soma kernel:  process_one_work+0x173/0x340
Feb 26 16:37:51 soma kernel:  worker_thread+0x28b/0x3b0
Feb 26 16:37:51 soma kernel:  ? __pfx_worker_thread+0x10/0x10
Feb 26 16:37:51 soma kernel:  kthread+0xf6/0x130
Feb 26 16:37:51 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 16:37:51 soma kernel:  ret_from_fork+0x30/0x50
Feb 26 16:37:51 soma kernel:  ? __pfx_kthread+0x10/0x10
Feb 26 16:37:51 soma kernel:  ret_from_fork_asm+0x1b/0x30
Feb 26 16:37:51 soma kernel:  </TASK>

With the script left running and zfs_bclone_wait_dirty disabled, it completes instead, with no corruption detected. I let it loop like this another 6 times, surprisingly without issue, until I realized my CPU was limited to 3GHz: that's the default limit applied by a script of mine when it doesn't detect any of its listed processes (ones needing more speed, like portage) running. After uncapping the CPU speed, it fails in the next loop:

Loop 8 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

Two theories:

\ tiny window that's only hit w higher frequencies?
\ possible. might be something more indirect like the speed change causing things to briefly stall and skewing timing that way.

Stabilizing the core speeds a bit:

# cpupower frequency-set -g performance -d 4000000 -u 5200000

Loop 3 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

Letting it vary more:

# cpupower frequency-set -g schedutil -d 550000 -u 5200000

Loop 2 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data
thulle commented 7 months ago

\ Please be sure to include the output of "zfs get all [DATASET THIS HAPPENS ON]"

NAME                     PROPERTY              VALUE                  SOURCE
kc3000/EPHEMERAL/usrtmp  type                  filesystem             -
kc3000/EPHEMERAL/usrtmp  creation              fre nov 24 16:51 2023  -
kc3000/EPHEMERAL/usrtmp  used                  7.31G                  -
kc3000/EPHEMERAL/usrtmp  available             449G                   -
kc3000/EPHEMERAL/usrtmp  referenced            7.31G                  -
kc3000/EPHEMERAL/usrtmp  compressratio         2.40x                  -
kc3000/EPHEMERAL/usrtmp  mounted               yes                    -
kc3000/EPHEMERAL/usrtmp  quota                 none                   default
kc3000/EPHEMERAL/usrtmp  reservation           none                   default
kc3000/EPHEMERAL/usrtmp  recordsize            128K                   default
kc3000/EPHEMERAL/usrtmp  mountpoint            /usr/tmp               local
kc3000/EPHEMERAL/usrtmp  sharenfs              off                    default
kc3000/EPHEMERAL/usrtmp  checksum              blake3                 inherited from kc3000
kc3000/EPHEMERAL/usrtmp  compression           zstd-3                 inherited from kc3000
kc3000/EPHEMERAL/usrtmp  atime                 on                     default
kc3000/EPHEMERAL/usrtmp  devices               on                     default
kc3000/EPHEMERAL/usrtmp  exec                  on                     default
kc3000/EPHEMERAL/usrtmp  setuid                on                     default
kc3000/EPHEMERAL/usrtmp  readonly              off                    default
kc3000/EPHEMERAL/usrtmp  zoned                 off                    default
kc3000/EPHEMERAL/usrtmp  snapdir               hidden                 default
kc3000/EPHEMERAL/usrtmp  aclmode               discard                default
kc3000/EPHEMERAL/usrtmp  aclinherit            restricted             default
kc3000/EPHEMERAL/usrtmp  createtxg             15                     -
kc3000/EPHEMERAL/usrtmp  canmount              on                     default
kc3000/EPHEMERAL/usrtmp  xattr                 sa                     inherited from kc3000
kc3000/EPHEMERAL/usrtmp  copies                1                      default
kc3000/EPHEMERAL/usrtmp  version               5                      -
kc3000/EPHEMERAL/usrtmp  utf8only              on                     -
kc3000/EPHEMERAL/usrtmp  normalization         formD                  -
kc3000/EPHEMERAL/usrtmp  casesensitivity       sensitive              -
kc3000/EPHEMERAL/usrtmp  vscan                 off                    default
kc3000/EPHEMERAL/usrtmp  nbmand                off                    default
kc3000/EPHEMERAL/usrtmp  sharesmb              off                    default
kc3000/EPHEMERAL/usrtmp  refquota              none                   default
kc3000/EPHEMERAL/usrtmp  refreservation        none                   default
kc3000/EPHEMERAL/usrtmp  guid                  10378968521923463382   -
kc3000/EPHEMERAL/usrtmp  primarycache          all                    default
kc3000/EPHEMERAL/usrtmp  secondarycache        all                    default
kc3000/EPHEMERAL/usrtmp  usedbysnapshots       0B                     -
kc3000/EPHEMERAL/usrtmp  usedbydataset         7.31G                  -
kc3000/EPHEMERAL/usrtmp  usedbychildren        0B                     -
kc3000/EPHEMERAL/usrtmp  usedbyrefreservation  0B                     -
kc3000/EPHEMERAL/usrtmp  logbias               latency                default
kc3000/EPHEMERAL/usrtmp  objsetid              398                    -
kc3000/EPHEMERAL/usrtmp  dedup                 off                    default
kc3000/EPHEMERAL/usrtmp  mlslabel              none                   default
kc3000/EPHEMERAL/usrtmp  sync                  standard               default
kc3000/EPHEMERAL/usrtmp  dnodesize             legacy                 default
kc3000/EPHEMERAL/usrtmp  refcompressratio      2.40x                  -
kc3000/EPHEMERAL/usrtmp  written               7.31G                  -
kc3000/EPHEMERAL/usrtmp  logicalused           13.9G                  -
kc3000/EPHEMERAL/usrtmp  logicalreferenced     13.9G                  -
kc3000/EPHEMERAL/usrtmp  volmode               default                default
kc3000/EPHEMERAL/usrtmp  filesystem_limit      none                   default
kc3000/EPHEMERAL/usrtmp  snapshot_limit        none                   default
kc3000/EPHEMERAL/usrtmp  filesystem_count      none                   default
kc3000/EPHEMERAL/usrtmp  snapshot_count        none                   default
kc3000/EPHEMERAL/usrtmp  snapdev               hidden                 default
kc3000/EPHEMERAL/usrtmp  acltype               posix                  inherited from kc3000
kc3000/EPHEMERAL/usrtmp  context               none                   default
kc3000/EPHEMERAL/usrtmp  fscontext             none                   default
kc3000/EPHEMERAL/usrtmp  defcontext            none                   default
kc3000/EPHEMERAL/usrtmp  rootcontext           none                   default
kc3000/EPHEMERAL/usrtmp  relatime              on                     inherited from kc3000
kc3000/EPHEMERAL/usrtmp  redundant_metadata    all                    default
kc3000/EPHEMERAL/usrtmp  overlay               on                     default
kc3000/EPHEMERAL/usrtmp  encryption            off                    default
kc3000/EPHEMERAL/usrtmp  keylocation           none                   default
kc3000/EPHEMERAL/usrtmp  keyformat             none                   default
kc3000/EPHEMERAL/usrtmp  pbkdf2iters           0                      default
kc3000/EPHEMERAL/usrtmp  special_small_blocks  0                      default
thulle commented 7 months ago

\ the one other experiment would be to try zfs_dmu_offset_next_sync
\ it defaults to 1, if you change it to 0, does this still mangle your bits
\ in a case where it would have mangled your bits otherwise, e.g. changing clock speeds

# echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

Loop 2 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data
rincebrain commented 7 months ago

Yeah this reproduces really easily on 6.7.6 and 2.2.3 with a dataset with all default settings and the tunable flipped to enable BRT.

Gross.
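(For anyone else trying this, the setup amounts to roughly the following sketch: a dataset created with all default properties, the tunable flipped, and then the go build or the script above run against it. The dataset name here is a placeholder.)

# zfs create rpool/bclone-test
# echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled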

AllKind commented 7 months ago

How about https://github.com/openzfs/zfs/pull/15588 - I wonder if that changes the outcome?

rincebrain commented 7 months ago

I tried, and no.

AllKind commented 7 months ago

Thanks for trying - and damn!

DocJr90 commented 6 months ago

Yeah this reproduces really easily on 6.7.6 and 2.2.3 with a dataset with all default settings and the tunable flipped to enable BRT.

Gross.

"Really easily" as in "just as easily as before"?

numinit commented 6 months ago

Rebased and running this all night... https://github.com/numinit/nixpkgs/commit/c0712aad80c9e89a5004ec5133f74748787755f9

Kept the block size at 16k (not 256k as originally stated) and added more logging:

zfs_2_2_3_stock # [zhammer::1266] Uname: Linux zfs223stock 6.7.9 #1-NixOS SMP PREEMPT_DYNAMIC Wed Mar  6 14:54:01 UTC 2024 x86_64 GNU/Linux
zfs_2_2_3_stock # [zhammer::1266] Cmdline: initrd=\EFI\nixos\zi31gj6mb1lmp3w92wxl0n93s36kszq9-initrd-linux-6.7.9-initrd.efi init=/nix/store/gs2xw1yis46qx221h1jy0z4jv7z8w5sd-nixos-syst
em-zfs_2_2_3_stock-test/init console=ttyS0 panic=1 boot.panic_on_fail clock=acpi_pm zfs.zfs_bclone_enabled=1 nohibernate loglevel=7 net.ifnames=0
zfs_2_2_3_stock # [zhammer::1266]   - zfs_bclone_enabled: 1
zfs_2_2_3_stock # [zhammer::1266]   - zfs_bclone_wait_dirty: 0
zfs_2_2_3_stock # [zhammer::1266]   - zfs_dmu_offset_next_sync: 1
zfs_2_2_3_stock # [zhammer::1266] ZFS userspace: zfs-2.2.3-1
zfs_2_2_3_stock # [zhammer::1266] ZFS kernel: zfs-kmod-2.2.3-1
zfs_2_2_3_stock # [zhammer::1266] Module: /run/booted-system/kernel-modules/lib/modules/6.7.9/extra/zfs.ko.xz
zfs_2_2_3_stock # [zhammer::1266] Srcversion: 62EA6EC3953A8524022765E
zfs_2_2_3_stock # [zhammer::1266] SHA256: 9f3dc6469b41a73ab0cf0c6c76c7ddce6e9ca0b1bc4da64f0e09206ee5d70f2b
zfs_2_2_3_stock # [zhammer::1266] ===
zfs_2_2_3_stock # [zhammer::1266] Work dir: /test
zfs_2_2_3_stock # [zhammer::1266] Count: 10000000 files
zfs_2_2_3_stock # [zhammer::1266] Block size: 16k
zfs_2_2_3_stock # [zhammer::1266] Check every: 5000 files
zfs_2_2_3_stock # [zhammer::1266] writing 5000 files at iteration 0

I'll update in case anything happens, but it didn't repro immediately, which is a good sign.

rrevans commented 6 months ago

@thulle Are you able to reproduce this with zfs_bclone_enabled=0?

rrevans commented 6 months ago

See #15994 for a possible explanation for cp stuckness when zfs_bclone_wait_dirty=1.

numinit commented 6 months ago

10,000,000 iterations of 16k blocks on each of 8 cores, and... no trouble.

(finished: must succeed: parallel --lb --halt-on-error now,fail=1 zhammer /test 10000000 16k 5000 ::: $(seq $(nproc)), in 81147.21 seconds)
(finished: run the VM test script, in 81156.74 seconds)
test script finished in 81156.79s
cleanup
kill machine (pid 9)
qemu-kvm: terminating on signal 15 from pid 6 (/nix/store/3v2ch16fkl50i85n05h5ckss8pxx6836-python3-3.11.8/bin/python3.11)
(finished: cleanup, in 0.08 seconds)
kill vlan (pid 7)
/nix/store/vxxi7mfvynzqpyavjszgqmjd5ni0l1xh-vm-test-run-zfs-zfs_2_2_3_stock

Is this another problem?

rincebrain commented 6 months ago

Yes.

numinit commented 6 months ago

So, not racing holes; probably very specifically pointing toward something else related to block cloning, or at least a second-order issue thereof, then. Maybe something ARC related?

rrevans commented 6 months ago

I think this is zpl_clone_file_range_impl not flushing mapped pages before attempting to perform the clone.

The file is miscopied if it has been synced out once, pages are dirty, and the kernel has not flushed any of those pages yet. (See zfs_holey_common for another code path that needs to flush dirty pages before examining the file.)

Here is a trivial reproducer:

#!/bin/bash
echo 1 >/sys/module/zfs/parameters/zfs_bclone_enabled
dd if=/dev/zero of=x bs=1M count=1 status=none
zpool sync
python -c "
from mmap import mmap
with open('x', 'r+b') as fh:
    with mmap(fh.fileno(), 0) as m:
        m[4000:5000] = b'z' * 1000
"
cp --reflink=always x y
diff -q x y

Output:

Files x and y differ

What's happening here?

  1. The zpool sync flushes out the file and its block pointers
  2. Some pages get written with mmap
  3. Reflink-copy clones the zero blocks from (1) not the pages written in (2)

This happens because none of the pages have been written back yet, so the file looks clean to the clone code. But the contents are the last-synced rather than the latest mmap'ed state. (Also note that merely reading the file without using mmap or clone will flush dirty pages hiding the bug.)

golang build is a reproducer because it builds the tool chain multiple times in the work directory and overwrites the contents with mmap. This reliably triggers the sequence above.

The script above succeeds with no differences for zfs_bclone_enabled=0 or cp --reflink=never.
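To illustrate the parenthetical above (a plain read flushing the dirty pages and hiding the bug), here is a sketch that adds one read to the same reproducer; the cat line is an illustrative addition, not part of the original script, and with it in place diff should report no difference:

#!/bin/bash
echo 1 >/sys/module/zfs/parameters/zfs_bclone_enabled
dd if=/dev/zero of=x bs=1M count=1 status=none
zpool sync
python -c "
from mmap import mmap
with open('x', 'r+b') as fh:
    with mmap(fh.fileno(), 0) as m:
        m[4000:5000] = b'z' * 1000
"
cat x > /dev/null  # plain (non-mmap) read; per the note above, this flushes the dirty pages and hides the bug
cp --reflink=always x y
diff -q x y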

rincebrain commented 6 months ago

That tracks with what I privately found, which was that turning the conditional check for cached pages triggering a flush in zfs_holey_common into an unconditional flush made this bug vanish.

My theory when I last looked at this was that because we didn't hold the range lock until after we did this check, there was a nasty window for us to have passed that check, not trigger the flush, and then go bang. But attempting to move that check inside the range lock (in addition to causing fun with nested locks sometimes) seemed to still break where unconditional flushing did not, suggesting that there was more to it.

Then life got busy for the past few days and I haven't looked again since.

Another interesting question is whether this reproduces on FreeBSD. My theory for the issue I was observing (which was that, in my testing at least, I ended up with everything past a certain point zeroed where it shouldn't have been) was that the file size was changing between when we read it and used it for the check, or between the check and when we actually did the probe, resulting in comedy. But since FreeBSD's mapping of the check doesn't actually use the range arguments, it might not have that problem, if that was indeed the problem.

Anyway, if someone wants to just slap a FIXME on this, I imagine you could do something like turning

        /* Flush any mmap()'d data to disk */
        if (zn_has_cached_data(zp, 0, file_sz - 1))
                zn_flush_cached_data(zp, B_FALSE);

        lr = zfs_rangelock_enter(&zp->z_rangelock, 0, UINT64_MAX, RL_READER);
        error = dmu_offset_next(ZTOZSB(zp)->z_os, zp->z_id, hole, &noff);
        zfs_rangelock_exit(lr);

into

        /* Flush any mmap()'d data to disk */
        if (zn_has_cached_data(zp, 0, file_sz - 1) || (zfs_brt_force_flush))
                zn_flush_cached_data(zp, B_FALSE);

        lr = zfs_rangelock_enter(&zp->z_rangelock, 0, UINT64_MAX, RL_READER);
        error = dmu_offset_next(ZTOZSB(zp)->z_os, zp->z_id, hole, &noff);
        zfs_rangelock_exit(lr);

with a new tunable plumbed in, and have a nice day for now. But this feels like the problem there is that a clone should be triggering that before it tries it, not flushing before anything that might reference it is hitting it later. (cf. #15772).

So similar to how zfs_clone_range added:

                error = dmu_brt_clone(outos, outzp->z_id, outoff, size, tx,
                    bps, nbps);
                if (error != 0) {
                        dmu_tx_commit(tx);
                        break;
                }

                if (zn_has_cached_data(outzp, outoff, outoff + size - 1)) {
                        update_pages(outzp, outoff, size, outos);
                }

after the clone, we probably want something like (untested):

                if (zn_has_cached_data(inzp, inoff, inoff + size - 1)) {
                        update_pages(inzp, inoff, size, inos);
                }

                nbps = maxblocks;
                last_synced_txg = spa_last_synced_txg(dmu_objset_spa(inos));
                error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps,
                    &nbps);


rincebrain commented 6 months ago
$ for i in `seq 1 50`;do sudo bash repro.sh;done;
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
^C
$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_brt_fix_this_mess
1
$ for i in `seq 1 50`;do sudo bash repro.sh;done;
$

Where that tunable is an if guard around the above patch.

rincebrain commented 6 months ago

Don't try it, probably.

This ameliorates the reproducer script listed, but trying the original report, of emerge go on Gentoo, produces mangled results more often with that proposed change, fascinatingly.

Wonder if what I'm seeing is this fix turning this test case into #15994.

rincebrain commented 6 months ago

Nah, just me rushing in trying things.

What I actually wanted was:

                if (zn_has_cached_data(inzp, inoff, inoff + size - 1)) {
                        zn_flush_cached_data(inzp, B_FALSE);
                }

                nbps = maxblocks;
                last_synced_txg = spa_last_synced_txg(dmu_objset_spa(inos));
                error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps,
                    &nbps);

But I didn't think carefully about what update_pages did. I wanted to flush cached pages, not rewrite them with the buffer's contents, that's the bad place.

...and if I do that in there, I deadlock on the rangelock. So let's move that around...

rrevans commented 6 months ago

rrevans/zfs@ae5efbed360e9ba12ce38e5a66dfe14e0b12a77b fixes both emerge go (using @thulle's step-at-a-time testing script) and the reproducer script for me. I've run both in a loop for 30 minutes with no failures so far.

The dirty page flush happens before the rangelock in zfs_holey_common, and I'm just copying that approach.

I am not sure this is correct or sufficient or if there are other bclone/mmap gaps. (And, it flushes the whole file when it could maybe flush only the cloned range.) That said, it does pass ZTS sanity.run and bclone tests for me.

rincebrain commented 6 months ago

That doesn't fix it, for me, as I already said above.

rincebrain commented 6 months ago

Sorry, to be less terse, I wound up with https://github.com/rincebrain/zfs/commit/e55d3b03643f92500610e8e6066b59304e09b93b because you can't flush that under the rangelock without deadlocking sometimes, so there's going to be a window where it might change between the flush and the lock again.

So I added a case for the check inside the lock as well, told it to retry once to hopefully catch more of the trivial cases of it being dirtied with inconvenient timing infrequently, and then fall back to not reflinking in the event that it's still dirty.

I confirmed that it sometimes did still reflink with that code (since I was slightly worried I might be missing some detail of how it worked that implied the check would always be true), and then pushed it to see whether the CI blew up.

It can probably be improved to avoid that window, but similarly, I could not produce incorrect behavior after hammering it for some time, and anything simpler still failed some of the time.

(Not, admittedly, very often, but "one in 50 runs" is still "not zero times".)

(Also, to be clear, I'm redoing my tests with your patch to see if my memory is faulty of having tried that, because I would certainly prefer a simpler solution...)

amotin commented 6 months ago

I'll need to look closer at this area, but a first thought: if it is impossible to reliably flush the caches, we could abort cloning in the zn_has_cached_data() case. I.e. we could flush the caches before taking the lock, but if somebody is still writing the mmap()'ed region when we get the lock, then give up completely. Though since we'd need to commit the txg before we'd be able to clone the file, we may as well give up from the beginning and not even bother flushing the cache, leaving that to the fall-back copy routine.

rrevans commented 6 months ago

@amotin +1

Looking more at this, I'm suspecting background writeback is a confounding factor (and then in turn also system load).

Staring at zn_flush_cached_data and zfs_putpage, if the pages are already in writeback by the time this is called, then it won't wait if sync == B_FALSE. Then it's possible it returns before any pages get copied to the DMU/ARC state, and the file state looks clean. (@rincebrain's version of the patch uses sync == B_TRUE which is more reliable vs. background WB.)

But note the fallback copy routine might seek holes and end up calling zn_flush_cached_data with sync == FALSE which could in turn have the same problem? Maybe this explains why unconditional flushing in zfs_holey_common has some effect on the outcome?

p.s. I don't think it's correct to use rangelocks to coordinate with mmap writes since page locking is entirely separate and mmap writes are inherently racy. IOW if there are racing page writes there's no way to guarantee any outcome anyway, right? This is true even for filesystems that use the pagecache directly. That said, copying should always reliably observe page writes guaranteed to be dirtied before clone starts (e.g. the build tool process wait is done before clone begins).

thulle commented 6 months ago

@thulle Are you able to reproduce this with zfs_bclone_enabled=0?

It seems so, or at least something similar. I ran the go-building script overnight, and after 13h and ~500 compiles it couldn't loop anymore:

+ /usr/lib/go/bin/go build -o cmd/dist/dist ./cmd/dist
00x0x0x0x0x0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x<0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x00x0x0x0x0x0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x<0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x 
* ERROR: dev-lang/go-1.22.1::gentoo failed (compile phase):

It seems like half the go-binary got zeroed out:

# hexdump -C /usr/lib/go/bin/go|tail 
0065ffa0  16 28 99 00 00 00 00 00  20 f5 a0 00 00 00 00 00  |.(...... .......|
0065ffb0  20 00 00 00 00 00 00 00  23 64 99 00 00 00 00 00  | .......#d......|
0065ffc0  a0 f4 a0 00 00 00 00 00  30 00 00 00 00 00 00 00  |........0.......|
0065ffd0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0065ffe0  40 00 00 00 00 00 00 00  28 00 00 00 00 00 00 00  |@.......(.......|
0065fff0  39 af b4 e3 07 08 08 19  00 00 00 00 00 00 00 00  |9...............|
00660000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00c18c20  00 00 00 00 00 00 00 00                           |........|
00c18c28
rincebrain commented 6 months ago

Yeah, I don't think in the general case you can guarantee anything with true mmap direct-access semantics. Because we play games with indirection precisely to try and avoid things like users modifying a "live" buffer and causing exciting outcomes, we might be able to do better, but I haven't looked hard at how that code works, and it might only be true on Linux, since FreeBSD is more or less a thin shim around UMA.

I haven't been able to reproduce what I said about it occasionally failing again with that patch so far, interestingly, but I wasn't kidding about how uncommonly it reproduced for me before, and I've had a number of changes where it looked like it helped and then "oops".

So personally, even if we don't opt for the "try twice" approach in my commit, I would encourage us to have the bailout case I have there, because I'd rather have a "this should never happen" check that's not used and we discard in a few years, than have to play games like this again in the next 6-12 months.

I would agree that having looked at this, it's almost certainly true that B_FALSE is the wrong thing in zfs_holey_common, even if in practice the window to demonstrate that is vanishingly small, but maybe @robn will have opinions on that.

e: Amusingly, thulle, that's the way I've seen it fail sometimes with bclone_enabled=1, so perhaps I was just hitting the case you're seeing now.

How very...frustrating.

amotin commented 6 months ago

Looking deeper, it does look like zn_flush_cached_data(B_FALSE) may be unreliable; it's just hard to say whether that is the cause. It would be good to check whether setting it to B_TRUE fixes the go builds (either just in zfs_holey_common() with bclone_enabled = 0, or also in zfs_clone_range() as in https://github.com/rrevans/zfs/commit/ae5efbed360e9ba12ce38e5a66dfe14e0b12a77b). On the other hand, it seems ZPL translates all B_TRUE/WB_SYNC_ALL flushes into ZIL commits, which would be much more expensive than we need for zfs_holey_common() and zfs_clone_range(). It would be nice to have something in between, but I don't see anything so far.

Another issue I see is in the use of zn_has_cached_data() for zfs_holey_common() and zfs_clone_range() purposes: it only reports that some pages are cached, not that they are dirty. I suspect that if we use it to abort cloning, we may get many unneeded aborts. Using it for zn_flush_cached_data(B_TRUE) should be safe, but I wonder if it may also cause ZIL commits even when there are no dirty pages. While looking through the Linux code I spotted filemap_range_needs_writeback(), which seems to report specifically whether we have any dirty or in-writeback pages; that is what we need for zfs_clone_range(), or even for zfs_holey_common() if zfs_dmu_offset_next_sync = 0.

Gendra13 commented 6 months ago

@thulle Are you able to reproduce this with zfs_bclone_enabled=0?

It seems like so, or at least something similar. I ran the go-building script overnight and after 13h and ~500 compiles it couldn't loop anymore:

[...]

It seems like half the go-binary got zeroed out:

Just to make sure I understood you correctly: you were able to trigger silent data corruption even with the default setting of zfs_bclone_enabled=0? If yes, is it possible to reproduce this behaviour on older versions like 2.1.x, too? Because then we'll have last year all over again.

thulle commented 6 months ago

Just to make sure I understood you correctly: you were able to trigger silent data corruption even with the default setting of zfs_bclone_enabled=0? If yes, is it possible to reproduce this behaviour on older versions like 2.1.x, too? Because then we'll have last year all over again.

Yes, though it took way longer. Haven't tried it on 2.1, and I can't downgrade on the system I'm testing on since pool features are activated. I can try in a VM tonight, but considering the time it took to trigger with zfs_bclone_enabled=0, if just being in a VM makes it slightly rarer it might take weeks.

rrevans commented 6 months ago

@Gendra13 I have been able to also reproduce with zfs_bclone_enabled=0 AND either sysctl vm.dirty_writeback_centisecs=100 vm.dirty_expire_centisecs=100 OR zfs_dmu_offset_next_sync=0. These tweaks trigger the underlying writeback race more often, but it's possible the default Linux writeback settings are enough to prevent the problem. @thulle if you do 2.1 testing in a VM, maybe try tweaking the writeback sysctls?
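For reference, applying those writeback tweaks before running the reproducer looks roughly like this (values as above; they only make the underlying race more likely to be hit):

# sysctl -w vm.dirty_writeback_centisecs=100
# sysctl -w vm.dirty_expire_centisecs=100

or, alternatively:

# echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync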

I can also reproduce the problem now with zfs_bclone_enabled=1 even with rrevans@ae5efbed360e9ba12ce38e5a66dfe14e0b12a77b via writeback race.

I'm still trying to reproduce it with default writeback and ZFS settings without bclone. With defaults, writeback is infrequent, and ZFS txg syncs due to lseek(SEEK_HOLE/SEEK_DATA) interfere with setting up the race.

(Unrelated TIL: cp --debug will print how it copied the file, which is handy for trying to repro this class of bug.)
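For example, something along these lines, where the summary line is the interesting part (its format is taken from the test runs further down in this thread; exact wording may vary between coreutils versions):

$ cp --debug x x.2
copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE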

rincebrain commented 6 months ago

I'm just going to add a tunable to kill non-dense seek results for now, as a mitigation.

It's obviously exceedingly rare in most use cases, but really, "exceedingly rare" doesn't mean "this is okay", to me, and that seems like a better plan than waiting for us to figure out all the edge cases.

thulle commented 6 months ago

@rincebrain I tried with the patch you posted

On loop 791 after ~22h:

# hexdump -C /usr/lib/go/bin/go|tail
0071ffa0  bb 5f 01 00 20 03 00 00  5a 00 00 00 00 00 00 00  |._.. ...Z.......|
0071ffb0  e5 5f 01 00 20 03 00 00  08 00 00 00 00 00 00 00  |._.. ...........|
0071ffc0  bb 5f 01 00 8c 00 00 00  5a 00 00 00 00 00 00 00  |._......Z.......|
0071ffd0  e5 5f 01 00 8c 00 00 00  08 00 00 00 00 00 00 00  |._..............|
0071ffe0  2d c1 01 00 16 00 00 00  72 00 00 00 00 00 00 00  |-.......r.......|
0071fff0  bd c0 01 00 51 00 00 00  22 00 00 00 00 00 00 00  |....Q...".......|
00720000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00c18c20  00 00 00 00 00 00 00 00                           |........|
00c18c28
rincebrain commented 6 months ago

Yeah, I'm not completely astonished. Would you care to try with #16000?

thulle commented 6 months ago

I just realized that I've been running an unsupported kernel for the past 4 days: currently 6.8.0-gentoo-x86 + https://github.com/openzfs/zfs/pull/15931

I assumed it's just a new kernel interface, but it's worth noting. Building and trying #16000 now.

thulle commented 6 months ago

Would you care to try with #16000?

That resulted in some other issue: the initramfs hands over to the root-on-ZFS pool, OpenRC starts, it creates /run/openrc and /run/lock, and then it freezes on "* /run/lock: correcting owner". This happens on both 6.8.0 and 6.7.9.

rincebrain commented 6 months ago

Cute. Maybe I broke the world.

That'd be kind of weird though since the test suite seems to have mostly passed...

Regardless, I'll try and figure out why that's happening, assuming I can reproduce it here.

rincebrain commented 6 months ago

It doesn't seem broken for me on 6.7.6 in any way I've found so far, so I'll try building 6.8 and see.

rincebrain commented 6 months ago

Ah, yes, SEEK_SET. One moment.

rincebrain commented 6 months ago

Nah, I just missed the required semantics for SEEK_DATA. Silly me. I'll run the updated patch through more testing then push it.

rincebrain commented 6 months ago

...wild. I still got a mangled one with this patch.

rrevans commented 6 months ago

Here's my reproducer:

#!/bin/bash

# kill background jobs on exit
trap 'trap - SIGTERM && kill -- -$$' SIGINT SIGTERM EXIT

tool="""
import mmap, random, sys
blk = 1<<17
with open(sys.argv[1], 'r+b') as fh, mmap.mmap(fh.fileno(), 0) as m:
        for o in range(0, len(m), blk):
                # N.B. last blk must be filled for bclone (see #15993)
                if random.randint(0, 99) < 10 or o+blk >= len(m):
                        m[o:o+blk] = random.randbytes(blk)
                else:
                        m[o:o+blk] = b'\0' * blk
"""

# write, copy, compare, and print an error on mismatch
write() {
        truncate --size=4M "${f?}"
        python3 -c "${tool}" "${f?}" &>/dev/null
        cpout=$(cp --debug "${f?}" "${f?}.2" 2>&1)
        diff=$(diff -q "${f?}" "${f?}.2")
        if (($?)); then
                echo "$diff" "($(echo "$cpout" | tail -1))"
        fi
}

# do copies for the same file in a loop
loop() {
        f="$1"
        while sleep 0.1; do
                write "${f?}"
        done
}

# run many copy loops at once
for i in $(seq 2000); do
        loop "x.${i?}" &
done

# trigger writeback flusher in a loop
while sleep 1; do
        sync -f .
done

This triggers the writeback flavor of this issue within 30 seconds for me most of the time regardless of tunables.

With zfs_bclone_enabled=0, zfs_dmu_offset_next_sync=1, I get this after 10 seconds:

Files x.495 and x.495.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.147 and x.147.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1132 and x.1132.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1387 and x.1387.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)

With zfs_bclone_enabled=0, zfs_dmu_offset_next_sync=0, I get this after about 20 seconds:

Files x.1121 and x.1121.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.393 and x.393.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)

With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=0, zfs_dmu_offset_next_sync=1, I get this after about 20 seconds:

Files x.950 and x.950.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.457 and x.457.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1018 and x.1018.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.320 and x.320.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1604 and x.1604.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.554 and x.554.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.859 and x.859.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1031 and x.1031.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1470 and x.1470.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1187 and x.1187.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.271 and x.271.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)

With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=0, zfs_dmu_offset_next_sync=0, I get this after about 20 seconds:

Files x.1794 and x.1794.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1750 and x.1750.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1103 and x.1103.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1858 and x.1858.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1014 and x.1014.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.572 and x.572.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.544 and x.544.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1856 and x.1856.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)

With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=1, zfs_dmu_offset_next_sync=1, I get this almost immediately:

Files x.1507 and x.1507.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.900 and x.900.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1613 and x.1613.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.83 and x.83.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.312 and x.312.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)

With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=1, zfs_dmu_offset_next_sync=0, I get this almost immediately:

Files x.725 and x.725.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.651 and x.651.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.409 and x.409.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.545 and x.545.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1191 and x.1191.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.63 and x.63.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)

Note that the cp debug part is telling us exactly which path is getting it wrong for each case. reflink: yes failures are bclone getting it wrong, and reflink: unsupported, sparse detection: SEEK_HOLE failures are zfs_holey_common getting it wrong on the fallback leg. (Edit: copy offload: yes cases are still using bclone when enabled.)

So this is direct evidence that bclone without wait_dirty hits both cases as this thread has already hinted. (It also seems that wait_dirty makes this worse somehow with a lot more cases hit per second when I run those tests.)

This seems once again timing dependent, so try tuning the number of parallel writes if this doesn't work for you. Also blk is matching my dataset's recordsize of 128k, and I didn't try other values or recordsizes. YMMV. My exact setup is ZFS at rrevans@ae5efbe (DEBUG), local Fedora 39 VM (6.6.8-200.fc39.x86_64) with 32 vcpu, 64 GiB RAM, and NVMe disk. I'm using default kernel writeback settings for these tests (vm.dirty_writeback_centisecs = 500, vm.dirty_expire_centisecs = 3000).

Lastly note that the use of python3 -c in the reproducer seems to be required to trigger this; placing this script in a file on disk in /tmp will not reproduce the issue as readily despite /tmp being a tmpfs on my system. This is a really sensitive case to try to reproduce.

rincebrain commented 6 months ago

Wonder if Python trying to compile the bytecode and write it out is doing something funny to timing without that.

I've got an improved version of the killswitch patch that I haven't broken yet that I'll push after giving it another 100 or 200 more rounds of "did it explode", as well as some experimenting with that reproducer.

rincebrain commented 6 months ago

What I think is happening, based on seeing failures even with that patch, is that we're racing the znode's file size being updated from the inode's file size. Since the hole logic always uses the znode's size, returning a stale one opens a narrow window where, externally, the file has been widened but that hasn't been flushed back yet, so ZFS is out of sync with what the VFS layer thinks reality is.

That would also, to some extent, explain this being very finicky and not reproducing on FreeBSD, since we don't have this heavy pile of indirection between us and FreeBSD.

So I think doing i_size_read in zfs_holey would avoid this, but since that's a Linuxism, I'm currently testing with just moving the bailout logic under the zfs_enter and calling zfs_znode_update_vfs before we do anything, which should do something sane, I think?

If not, it'll certainly break more excitingly, and I'll evaluate different solutions.

rrevans commented 6 months ago

FWIW, my repro is not changing the file size or unlinking nodes between passes. So the file size never changes after the first truncate, which also creates the file.

Also, I've been examining the diffs: they are 1) not only at the end, and 2) not recordsize-aligned (they are page-aligned, though).

Meanwhile, flushing with sync=B_TRUE is sufficient to stop the repro triggering for me without bclone, but bclone still triggers even with sync forced. (However, I can't repro if I change to bailing out of the clone with EAGAIN in the presence of any mappings, dirty or otherwise. Doing that for both paths seems reasonable as a mitigation.)

I think that non-bclone is harder to hit because it would take all pages in a record hitting this at the same time to get seek to skip over it. And non-bclone reads are served from pagecache on mapped objects, so as long as cp tries to read the record it will be correct regardless of writeback.

Contrast with bclone which will copy the record directly via L0 bps. If we miss an in-progress writeback that should dirty the record, then contents will be wrong.

Long story short, I think this is Linux writeback vs. zpl pagecache sync failing to reliably collect all dirty pages at sync time when bclone flushes, but I don't know why. I think it only happens to work without bclone because pages are smaller than records making it harder to hit. Otherwise there's no reason sync=TRUE fixes one path but not the other; IOW I think it's not a sufficient fix for either despite apparently fixing the non-bclone case.

Which means the culprit lies in writeback syncing. The best theory I have is that some page gets background writeback during the time userland is updating the mapped pages and so is immediately redirtied, but this sequence somehow causes the later sync flush to fail to block on the redirtied page.

From a cursory read of Linux writeback, I don't yet see how that could happen. write_cache_pages won't wait for pages it didn't tag on the sync leg, but __writeback_single_inode will do fdatawait on the whole mapping, which should block on any writeback started by that point. But this code is very difficult to follow because ZFS writeback has a lot of special handling for Linux. Are there any pagecache writeback experts who can help look here?

Entirely separately, neither bclone nor seek really needs to flush. This is just how ZFS is trying to convey dirty page state downward to the DMU so that the record dirty state matches, and it's really tricky to lean on synchronous writeback for this. IMO ZFS should maybe either inspect the mm state to see if any mapped pages for a specific record are dirty, or track the dirtied pages with page_mkwrite etc., instead of asking for sync on those pages just to detect dirtiness.

But then again writeback sync being unreliable is probably a differently serious bug because then it calls into question whether or not msync and friends are actually reliable for data integrity writes over mappings in the presence of this writeback race...

rincebrain commented 6 months ago

I was trying to work with what's in front of me, and it was reproducing at recordsize granularity on the end for me, so I was trying to ameliorate that consistently and see what was left.

I've since moved on to seeing whether filemap_write_and_wait_range will make it stop. And yes, rearchitecting how it handles this to not be indirect might work, though at least for block cloning that could be fun to reason about, since you're cloning things that aren't necessarily synced out yet and the whole design is around "clone existing block".

Gendra13 commented 6 months ago

I did a few test runs with the reproducer from @rrevans on Ubuntu 22.04. All the ZFS kernel parameters were left at their default settings (e.g. zfs_dmu_offset_next_sync=1).

TL;DR: Yes, this bug can be triggered on 2.1.x if you are using the new coreutils. EDIT2: Unfortunately, after more tests, I was able to trigger this with both zfs_dmu_offset_next_sync=0 and =1. See my later post. (This previously said: in my additional tests I wasn't able to trigger the bug with zfs_dmu_offset_next_sync=0 on v2.1.x.)

  1. With a vanilla non-HWE install and the old coreutils 8.32, the script just fails out. ZFS version: zfs-kmod-2.1.5-1ubuntu6~22.04.2 (the old one without the dnode fix, since Ubuntu's bug fixes are slooooow). uname -r: 5.15.0-100-generic. cp --version: cp (GNU coreutils) 8.32
diff: x.12.2: No such file or directory
 (Try 'cp --help' for more information.)
diff: x.267.2: No such file or directory
diff: x.1143.2: No such file or directory
diff: x.747.2: No such file or directory
  2. With the same setup and the new coreutils, I can reproduce the error. ZFS version: zfs-kmod-2.1.5-1ubuntu6~22.04.2 (the old one without the dnode fix, since Ubuntu's bug fixes are slooooow). uname -r: 5.15.0-100-generic. cp --version: cp (GNU coreutils) 9.4
Files x.206 and x.206.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.244 and x.244.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.97 and x.97.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.504 and x.504.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.286 and x.286.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
  3. Next, a run with the dnode fix (#15571) backported to Ubuntu. ZFS version: zfs-kmod-2.1.5-1ubuntu6~22.04.3 (this has the dirty-dnode fix backported). uname -r: 5.15.0-100-generic. cp --version: cp (GNU coreutils) 9.4
Files x.657 and x.657.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.201 and x.201.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.374 and x.374.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
  4. And finally, after updating to ZFS version 2.1.15. ZFS version: zfs-kmod-2.1.15-1. uname -r: 5.15.0-100-generic. cp --version: cp (GNU coreutils) 9.4
Files x.1016 and x.1016.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.346 and x.346.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.415 and x.415.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.616 and x.616.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.450 and x.450.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)

So in my tests I was able to reproduce the bug even with a rather old kernel version (5.15) and running 2.1.x without any of the block-cloning code baked in.

EDIT:

I did an additional test with 2.1.15 but with zfs_dmu_offset_next_sync=0, and I could then run the reproducer for at least 20 minutes without triggering the bug (the other cases failed within a few seconds). ZFS version: zfs-kmod-2.1.15-1. uname -r: 5.15.0-100-generic. cp --version: cp (GNU coreutils) 9.4

EDIT2:

See my later post where I was able to reproduce this even with zfs_dmu_offset_next_sync=0 on 2.1.15.

rincebrain commented 6 months ago

I'm not surprised, but thanks for the confirmation.

My current plan, having tried a bunch of other things, is: