Data corruption when copying files via samba [b0788c47d9]

Dadido3 commented 1 year ago

Version

bcachefs: b0788c47d9 bcachefs-tools: f3976e3733

Generic info

# bcachefs fs usage /mnt/storage/
Filesystem: 98fe7f9a-71e8-4c77-9d1b-a522902aa31b
Size:                 31282032353280
Used:                 16417132694528
Online reserved:                   0

Data type       Required/total  Devices
btree:          1/2             [sdf sdb]                28018999296
btree:          1/2             [sdf sdc]                53045886976
btree:          1/2             [sdc sdb]                28044689408
user:           1/1             [sdd]                  3434805006336
user:           1/2             [sdf sdb]                  843243520
user:           1/2             [sde sdd]              9402095144448
user:           1/1             [sde]                  3434826240000
user:           1/2             [sdf sdc]                 6791978496
user:           1/2             [sdc sdb]                 1020256256
cached:         1/1             [sdf]                    11251107840
cached:         1/1             [sdc]                    11646914048

hdd (device 6):                  sdd              rw
                                data         buckets    fragmented
  free:                            0         7477925
  sb:                        3149824               4       1044480
  journal:                8589934592            8192
  btree:                           0               0
  user:                8135757598208         7773527   15392159744
  cached:                          0               0
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:                    0               0
  erasure coded:                   0               0
  capacity:           16000900661248        15259648

hdd (device 5):                  sde              rw
                                data         buckets    fragmented
  free:                            0         7478041
  sb:                        3149824               4       1044480
  journal:                8589934592            8192
  btree:                           0               0
  user:                8135968792576         7773411   15059378176
  cached:                          0               0
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:                    0               0
  erasure coded:                   0               0
  capacity:           16000900661248        15259648

ssd (device 2):                  sdb              rw
                                data         buckets    fragmented
  free:                            0         1797359
  sb:                        3149824               7        520192
  journal:                4294967296            8192
  btree:                 28031844352          100359   24585175040
  user:                    940449792            1821      14278656
  cached:                          0               0
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:                    0               1
  erasure coded:                   0               0
  capacity:            1000204664832         1907739

ssd (device 1):                  sdc              rw
                                data         buckets    fragmented
  free:                            0          811630
  sb:                        3149824               7        520192
  journal:                3999793152            7629
  btree:                 40545288192          125644   25328353280
  user:                   3900916224            7542      53429760
  cached:                11646914048           24105
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:                    0               4
  erasure coded:                   0               0
  capacity:             511999213568          976561

ssd (device 0):                  sdf              rw
                                data         buckets    fragmented
  free:                            0         1639664
  sb:                        3149824              13        258048
  journal:                2147483648            8192
  btree:                 40532443136          154619
  user:                   3814112256           14630      21182464
  cached:                11251107840           45228
  parity:                          0               0
  stripe:                          0               0
  need_gc_gens:                    0               0
  need_discard:                    0               4
  erasure coded:                   0               0
  capacity:             488203878400         1862350

# bcachefs show-super /dev/sdc
External UUID:                              98fe7f9a-71e8-4c77-9d1b-a522902aa31b
Internal UUID:                              2d4ba985-eace-4418-a195-4aa3e253a535
Device index:                               1
Label:
Version:                                    1.2: deleted_inodes
Version upgrade complete:                   1.2: deleted_inodes
Oldest version on disk:                     1.1: snapshot_skiplists
Created:                                    Sat Mar 11 22:09:52 2023
Sequence number:                            686
Superblock size:                            7064
Clean:                                      0
Devices:                                    5
Sections:                                   members,replicas_v0,disk_groups,clean,journal_seq_blacklist,journal_v2,counters
Features:                                   lz4,zstd,ec,journal_seq_blacklist_v3,reflink,new_siphash,inline_data,new_extent_overwrite,btree_ptr_v2,extents_above_btree_updates,btree_updates_journalled,reflink_inline_data,new_varint,journal_no_flush,alloc_v2,extents_across_btree_nodes
Compat features:                            alloc_info,alloc_metadata,extents_above_btree_updates_done,bformat_overflow_done

Options:
  block_size:                               4.00 KiB
  btree_node_size:                          256 KiB
  errors:                                   continue [ro] panic
  metadata_replicas:                        2
  data_replicas:                            2
  metadata_replicas_required:               1
  data_replicas_required:                   1
  encoded_extent_max:                       64.0 KiB
  metadata_checksum:                        none [crc32c] crc64 xxhash
  data_checksum:                            none [crc32c] crc64 xxhash
  compression:                              zstd
  background_compression:                   none
  str_hash:                                 crc32c crc64 [siphash]
  metadata_target:                          ssd
  foreground_target:                        ssd
  background_target:                        hdd
  promote_target:                           none
  erasure_code:                             0
  inodes_32bit:                             1
  shard_inode_numbers:                      1
  inodes_use_key_cache:                     1
  gc_reserve_percent:                       8
  gc_reserve_bytes:                         0 B
  root_reserve_percent:                     0
  wide_macs:                                0
  acl:                                      1
  usrquota:                                 0
  grpquota:                                 0
  prjquota:                                 0
  journal_flush_delay:                      1000
  journal_flush_disabled:                   0
  journal_reclaim_delay:                    100
  journal_transaction_names:                1
  version_upgrade:                          [compatible] incompatible none
  nocow:                                    0

members (size 400):
  Device:                                   0
    UUID:                                   73f66016-b61a-44d0-b8f2-827e1be5a9f5
    Size:                                   455 GiB
    Bucket size:                            256 KiB
    First bucket:                           0
    Buckets:                                1862350
    Last mount:                             Sat Aug 12 11:41:01 2023
    State:                                  rw
    Label:                                  ssd (0)
    Data allowed:                           journal,btree,user
    Has data:                               journal,btree,user,cached
    Discard:                                1
    Freespace initialized:                  1
  Device:                                   1
    UUID:                                   18584a22-a534-4dc9-af21-8c1f867acdec
    Size:                                   477 GiB
    Bucket size:                            512 KiB
    First bucket:                           0
    Buckets:                                976561
    Last mount:                             Sat Aug 12 11:41:01 2023
    State:                                  rw
    Label:                                  ssd (0)
    Data allowed:                           journal,btree,user
    Has data:                               journal,btree,user,cached
    Discard:                                1
    Freespace initialized:                  1
  Device:                                   2
    UUID:                                   048033fa-712d-4e03-8aaf-4f868141811d
    Size:                                   932 GiB
    Bucket size:                            512 KiB
    First bucket:                           0
    Buckets:                                1907739
    Last mount:                             Sat Aug 12 11:41:01 2023
    State:                                  rw
    Label:                                  ssd (0)
    Data allowed:                           journal,btree,user
    Has data:                               journal,btree,user
    Discard:                                1
    Freespace initialized:                  1
  Device:                                   5
    UUID:                                   ccc59878-1a44-4db1-bf8c-f5faf9ec71d2
    Size:                                   14.6 TiB
    Bucket size:                            1.00 MiB
    First bucket:                           0
    Buckets:                                15259648
    Last mount:                             Sat Aug 12 11:41:01 2023
    State:                                  rw
    Label:                                  hdd (4)
    Data allowed:                           journal,btree,user
    Has data:                               journal,user
    Discard:                                0
    Freespace initialized:                  1
  Device:                                   6
    UUID:                                   6c33e844-4cd7-4755-944b-ac388928da45
    Size:                                   14.6 TiB
    Bucket size:                            1.00 MiB
    First bucket:                           0
    Buckets:                                15259648
    Last mount:                             Sat Aug 12 11:41:01 2023
    State:                                  rw
    Label:                                  hdd (4)
    Data allowed:                           journal,btree,user
    Has data:                               journal,user
    Discard:                                0
    Freespace initialized:                  1

Kernel was compiled with:

CONFIG_PREEMPT=y
CONFIG_BCACHEFS_DEBUG=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_DEBUG_FS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE=y

dmesg didn't provide any useful information.

Problem

When i move files from my PC to my server with bcachefs via SMB/CIFS, files get corrupted sometimes. This happens irregularly and is hard to reproduce, as sometimes it may just not happen.

I've been testing for weeks now, shuffling around tens of TB back and forth trying to pinpoint the exact cause. Now I'm at a point where I'm pretty sure this is either a bug in bcachefs or in SMB, but I'm clueless and I wasn't able to fully rule out either of the two. Down below are some notes and details I made during testing.

Notes and details

The bcachefs kernel is running in a QEMU VM, and the disks are passed through as block devices. The distribution is Debian bookworm.
Happens regardless of the compression setting, even with no compression. ~~The compression was changed several times while files were actively transferred.~~ ~~It happens with zstd, and before it also happened with lz4.~~
Happens on a freshly formatted bcachefs FS via bcachefs format /dev/sd[gh], so default values and no further modifications. ~~Some devices were evacuated and then removed (2023-07-29, 2023-07-30) after the files have been transferred.~~ ~~Also happened when no devices were changed/evacuated.~~ ~~But a background target was set, so data was probably also rebalanced when it was copied.~~
~~There were some old striped buckets in the filesystem that were only fully deleted once the before mentioned devices have been evacuated.~~ ~~This was from some old (few months before) testing with erasure coding.~~
The corrupted files read without any errors/warnings. The FS seems to consider them to be correct.
It's unlikely to be the RAM, as both machines were tested with Memtest86+ for 3-4 rounds. Also, the corruption is not random bit flips. There is always a really specific mapping/shifting of bytes going on, see below.
It's either a bug in the filesystem, or in the CIFS/SMB server. I haven't been able to rule either out: I couldn't reproduce the error when using NFS or FTP. On the other hand SMB works fine with ext4 file systems. This needs more testing.
It's not the server's network adapter, as it happens with different adapters (2.5Gb/s RTL8125 and some other 1Gb/s Intel NIC)
Sometimes there is a broken file every few 100 GB of copied files, other times several TB of data copy just fine. Corruption seems to happen in clumps, sometimes several occurrences per file.
Other direction is fine (NAS/server --> PC)

Corruption example

Let's say we have the original file X, and we copy it to the NAS and it got corrupted (we call the corrupted file Y):

Most of the data of Y is identical to X. But there will be one (and rarely more) corrupted chunks, mostly with a size of ~1MB or less. The corrupted chunks in file Y always contain data of X, but shifted by some random offset. Here is an illustration that should make it more clear:

[Image: corruption-mapping]

An example corruption pattern:

Corrupted chunk in file Y: 0xB32D000-0xB3FF9EF (862704 bytes)
Corrupted chunk in file Y contains data that can be found in X, but with an offset of 1552 bytes.
To be more exact, file Y at 0xB32D000-0xB3FF9EF corresponds exactly with data of file X at 0xB32D610-0xB3FFFFF.

Also of note: the start of the corrupted range in Y is always aligned to a multiple of 4096 bytes, and the end of the "source" range in X (the byte after its last byte) is also always aligned to 4096 bytes.
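The numbers in this example can be checked with a few lines of Python. This is only a sanity check of the arithmetic quoted above; the variable names are made up for illustration.

```python
# Sanity check of the example above (all names here are illustrative).
PAGE = 4096

y_start, y_last = 0xB32D000, 0xB3FF9EF   # corrupted chunk in Y (inclusive bounds)
x_start, x_last = 0xB32D610, 0xB3FFFFF   # matching data in X (inclusive bounds)

size_y = y_last - y_start + 1            # length of the corrupted chunk
size_x = x_last - x_start + 1            # length of the source range
shift = x_start - y_start                # how far the data was shifted

print(size_y, size_x, shift)             # 862704 862704 1552
print(y_start % PAGE == 0)               # True: chunk starts on a 4096-byte boundary
print((x_last + 1) % PAGE == 0)          # True: source range ends on a 4096-byte boundary
```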

Every corrupted file i've seen so far has shown this pattern, just with different values. No idea if this information helps, but the pattern is too unique to exclude it from this issue.
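For anyone who wants to look for the same pattern in their own files, here is a rough sketch of how the corrupted ranges and the shift could be located. It is not the tooling used for this report; `mismatch_ranges` and `find_shift` are hypothetical helpers, and the sketch assumes both files fit in memory and that the shift is at most one 4096-byte block.

```python
BLOCK = 4096

def mismatch_ranges(x: bytes, y: bytes, block: int = BLOCK):
    """Return [start, end) byte ranges, rounded to whole blocks, where x and y differ."""
    ranges = []
    for off in range(0, max(len(x), len(y)), block):
        if x[off:off + block] != y[off:off + block]:
            if ranges and ranges[-1][1] == off:
                ranges[-1][1] = off + block      # extend the previous range
            else:
                ranges.append([off, off + block])
    return [tuple(r) for r in ranges]

def find_shift(x: bytes, y: bytes, start: int, max_shift: int = BLOCK, probe: int = 64):
    """Return the smallest s > 0 with y[start:start+probe] == x[start+s:start+s+probe],
    or None if no such shift exists within max_shift."""
    needle = y[start:start + probe]
    for s in range(1, max_shift + 1):
        if x[start + s:start + s + probe] == needle:
            return s
    return None
```

Because the comparison works on whole blocks, the reported end of a range is rounded up to the next 4096-byte boundary; the exact end has to be narrowed down byte by byte afterwards.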

jpsollie commented 1 year ago

you're saying you have no idea whether it's CIFS or bcachefs. To narrow down: have you tried both ksmbd and samba?

EDIT: as you say you kept every value as its default one: it is strange you're not hitting any checksum errors ... which may indicate the file was already corrupted before bcachefs takes care of it. So that's why I'd like you to try ksmbd as well

Dadido3 commented 1 year ago

NFS works fine, i have a proxmox backup server running on another machine that mounts to this FS via NFS. Over some months it has probably written and verified a few TB of data without problems.

Thanks for the idea with ksmbd. I just enabled support and recompiled the kernel. It's now copying files, and i'll report back when i get a result.

jpsollie commented 1 year ago

it may be a memory error in samba causing your file corruption. when a ring buffer underflows, the data gets written twice in a way that may actually look like what we're seeing here. increasing the send / receive buffers could help.

koverstreet commented 1 year ago

So this implicates the write path somehow, but not the normal IO paths.

Could you try the bcachefs-splice-disable branch? Let's see if it's a splice bug

Dadido3 commented 1 year ago

I have now copied enough data via ksmbd that i can confidently say the corruption doesn't happen with ksmbd. But this just means that i can, again, not rule out either bcachefs or smb.

> Could you try the bcachefs-splice-disable branch? Let's see if it's a splice bug

I'll revert back to Samba and try that tomorrow.

Dadido3 commented 1 year ago

I tried to compile the bcachefs-splice-disable branch, but i end up with the following error:

...
  CC [M]  drivers/iio/light/vl6180.o
  CC [M]  drivers/iio/temperature/tsys02d.o
  CC [M]  drivers/iio/light/zopt2201.o
  AR      built-in.a
  AR      vmlinux.a
  LD      vmlinux.o
  OBJCOPY modules.builtin.modinfo
  GEN     modules.builtin
  GEN     .vmlinux.objs
  MODPOST Module.symvers
ERROR: modpost: "__SCT__tp_func_contention_begin" [fs/bcachefs/bcachefs.ko] undefined!
ERROR: modpost: "__SCT__tp_func_contention_end" [fs/bcachefs/bcachefs.ko] undefined!
ERROR: modpost: "__tracepoint_contention_begin" [fs/bcachefs/bcachefs.ko] undefined!
ERROR: modpost: "__tracepoint_contention_end" [fs/bcachefs/bcachefs.ko] undefined!
ERROR: modpost: "__SCK__tp_func_contention_end" [fs/bcachefs/bcachefs.ko] undefined!
ERROR: modpost: "__SCK__tp_func_contention_begin" [fs/bcachefs/bcachefs.ko] undefined!
make[1]: *** [scripts/Makefile.modpost:136: Module.symvers] Error 1
make: *** [Makefile:1978: modpost] Error 2

bhzhu203 commented 1 year ago

Maybe you could build it not as a module, but compiled into the kernel.

Dadido3 commented 1 year ago

I have only checked out the git branch and built it the usual way:

cp /boot/config-`uname -r` .config
make menuconfig
make clean
make -j 8
make modules_install
make install

The error happened on make -j 8, no idea what i did wrong.

Anyways, i have now applied the changes of 6dd7cbb1bdd0cad6f9774e1572cee63c1d90db87 on top of the last commit that i have used (b0788c4). This is compiling fine and i'll test with that for now.

Dadido3 commented 1 year ago

Ok, i still get corrupted files, even with the changes of commit 6dd7cbb applied.

koverstreet commented 1 year ago

So we're going to need to figure out what samba is doing different. Could you do an strace of samba while copying some data over? We need to know exactly what IO paths it's going through

Dadido3 commented 1 year ago

Ok, running strace -f -e trace=\!%network -p 2886 -o strace.txt gave me:

[Attachment: strace.txt]

While the trace was running i copied a folder with 3 png files to the bcachefs drive.

Also, right now i'm testing Samba in combination with XFS and BTRFS.

Edit: I could actually just let strace run while i'm copying files over and in case i encounter a corruption check what the actual syscalls were for that particular file and offset.
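A long trace like this can be filtered mechanically instead of being read by hand. The following is a rough sketch, assuming the usual `strace -f` output format (a pid prefix on every line, and `<unfinished ...>` / `resumed` pairs when threads interleave). `short_pwrites` is a hypothetical helper written for illustration, and the regular expressions only cover `pwrite64`.

```python
import re

# Complete or unfinished pwrite64 call: pid, fd, requested count, offset, optional result.
CALL = re.compile(
    r'^(\d+)\s+pwrite64\((\d+), .*, (\d+), (\d+)'
    r'(?:\)\s+=\s+(-?\d+).*| <unfinished \.\.\.>)$'
)
# Second half of a call that was interrupted by output from another thread.
RESUMED = re.compile(r'^(\d+)\s+<\.\.\. pwrite64 resumed>.*\)\s+=\s+(-?\d+)')

def short_pwrites(lines):
    """Yield (pid, fd, requested, offset, returned) for every pwrite64 that
    returned less than requested (failed calls returning -1 are included)."""
    pending = {}  # pid -> (fd, requested, offset) of the unfinished call
    for line in lines:
        line = line.rstrip('\n')
        m = CALL.match(line)
        if m:
            pid, fd, count, offset = (int(m.group(i)) for i in range(1, 5))
            if m.group(5) is None:
                pending[pid] = (fd, count, offset)   # wait for the "resumed" line
                continue
            returned = int(m.group(5))
        else:
            m = RESUMED.match(line)
            if not m or int(m.group(1)) not in pending:
                continue
            pid = int(m.group(1))
            fd, count, offset = pending.pop(pid)
            returned = int(m.group(2))
        if returned < count:
            yield pid, fd, count, offset, returned
```

It could be run over the log with something like `for hit in short_pwrites(open("strace.txt", errors="replace")): print(hit)`, and the reported offsets then compared against the corrupted ranges of a file.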

Dadido3 commented 1 year ago

Alright, I just got some corrupted files again, but this time I logged everything with strace. What I see in the corrupted file corresponds to what's happening in the trace, which I have shortened to the important parts:

...

2886  openat(39, "03 - Propane Nightmares.flac", O_RDWR|O_CREAT|O_EXCL|O_NONBLOCK|O_NOFOLLOW, 0764) = 42

...

2886  ftruncate(42, 42936176)           = 0

...

4118  pwrite64(42, "\322ly\377)7/\307\4+\345\232\216\252\217\21<\304\315\f`$L\27N\222,\"q\322\247D"..., 1048576, 35651584 <unfinished ...>
2886  epoll_wait(5, [{events=EPOLLIN, data={u32=2716893328, u64=94423277928592}}], 1, 1) = 1
2886  epoll_wait(5, [{events=EPOLLIN, data={u32=2716893328, u64=94423277928592}}], 1, 1) = 1
4118  <... pwrite64 resumed>)           = 1044496
4118  pwrite64(42, "\206\361\27\351\301\352\276\f\36\360q|f\276\223\211\231\211\37D7S\371\303\7\236\35\n]\377\237\344"..., 4080, 36696080 <unfinished ...>
2886  epoll_wait(5,  <unfinished ...>
4118  <... pwrite64 resumed>)           = 4080

...

Basically, samba tries to write 1048576 bytes at offset 35651584, but only 1044496 are actually written. Samba then rewrites the missing 4080 bytes at offset 36696080.
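Retrying after a short write is the normal way for an application to handle a `pwrite` that returns less than requested. A minimal Python sketch of such a loop is shown below; it is not Samba's actual code, and `pwrite_all` is a hypothetical helper. It relies on `os.pwrite`, which is only available on Unix.

```python
import os

def pwrite_all(fd: int, data: bytes, offset: int) -> None:
    """Write all of data at the given file offset, retrying after short writes."""
    view = memoryview(data)
    done = 0
    while done < len(view):
        # os.pwrite returns the number of bytes actually written.
        n = os.pwrite(fd, view[done:], offset + done)
        if n == 0:
            raise OSError("pwrite wrote 0 bytes")
        done += n
```

With the values from the trace, a first call that returns 1044496 leads to a second call for the remaining 4080 bytes at offset 35651584 + 1044496 = 36696080, which is exactly what the trace shows. Such a loop is only correct if the bytes reported as written really are the first n bytes of the buffer.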

The corrupted range starts at offset 35786752 and has a size of 909328 bytes. Comparing the original and corrupt files, it seems like bcachefs has swallowed 4080 bytes in the middle of that write operation (at offset 35786752), and then continued to write the rest of the data, but shifted by those missing 4080 bytes.

The data starting at offset 36696080 is correct again, as that was the 4080 bytes that samba has rewritten because it noticed that pwrite64 returned less than it should have.
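The sequence described above can be replayed in memory to check that the numbers fit together. This is only a model of the observed symptom under the stated interpretation (4080 bytes skipped at offset 35786752, the rest written 4080 bytes early); it says nothing about the underlying cause, and random bytes stand in for the real file contents.

```python
import os

WRITE_OFF, WRITE_LEN = 35651584, 1048576   # pwrite64 offset and count from the trace
RETURNED = 1044496                          # what pwrite64 returned
BAD_START, BAD_LEN = 35786752, 909328       # corrupted range found by comparing files

dropped = WRITE_LEN - RETURNED              # 4080 bytes went missing
retry_off = WRITE_OFF + RETURNED            # 36696080, where Samba wrote again

x = os.urandom(WRITE_LEN)                   # stand-in for the source data of this write

# Model: skip `dropped` bytes at BAD_START, write everything after them too early.
rel = BAD_START - WRITE_OFF                 # position of the drop inside the buffer
y = bytearray(WRITE_LEN)
kept = x[:rel] + x[rel + dropped:]
y[:len(kept)] = kept                        # the short write
y[RETURNED:] = x[RETURNED:]                 # Samba's retry of the last 4080 bytes

assert len(kept) == RETURNED
assert BAD_START + BAD_LEN == retry_off     # corrupted range ends where the retry begins
assert y[:rel] == x[:rel]                   # data before the drop is intact
assert y[rel:RETURNED] == x[rel + dropped:] # corrupted range is a shifted copy of X
assert y[RETURNED:] == x[RETURNED:]         # retried tail is correct again
assert BAD_START % 4096 == 0                # corrupted range starts block-aligned
assert (retry_off + dropped) % 4096 == 0    # source range in X ends block-aligned
```

The last two assertions show that this case follows the same alignment pattern as the example in the original report.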

I think that rules out samba, so it's most likely a bug in bcachefs (or something in between bcachefs and the pwrite64 syscall? idk).

Edit:

As the numbers are maybe too abstract, here is an image of what's happening, if that makes more sense:

[Image: grafik]

koverstreet commented 1 year ago

That's some slick debugging work!

I went hunting for a bug that matched, and I believe I found it. Can you try the bcachefs-testing branch?

Dadido3 commented 1 year ago

I've written and verified ~2 TB for now, so i guess it's fixed, and i'll let it run a bit more just to be sure.

Thanks a lot to all involved!

koverstreet commented 1 year ago

That was some slick debugging, thanks for the writeup!

koverstreet / bcachefs

Data corruption when copying files via samba [b0788c47d9] #581