openzfsonwindows / openzfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Copying files to a dataset and then verifying shows mismatches #227

Open EchterAgo opened 1 year ago

EchterAgo commented 1 year ago

I'm using rclone sync to copy ~5TB of data to a ZFS dataset created with these options:

zpool create -o ashift=12 -O atime=off -O compression=lz4 -O recordsize=1MB -O casesensitivity=insensitive -O normalization=formD Data raidz1 PHYSICALDRIVE0 PHYSICALDRIVE1 PHYSICALDRIVE4 PHYSICALDRIVE5 cache PHYSICALDRIVE2 PHYSICALDRIVE3

After the copy is done, I use rclone sync -c to fully verify the contents of the copied files and it always comes up with differences. So far I've only seen it with bigger files, mostly with a 150GB disk image file, but also with some ~30GB image files.

I used HxD to compare the files and it turns out there are sections filled with zeroes instead of the original data.

I'll try to diagnose this further.

EchterAgo commented 1 year ago

The hole in the file seems to start and end at a 4 KiB aligned offset and is 32 KiB in size. The data after the hole is identical again. I'm letting it continue scanning to see if there are further differences.
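
As an aside, a minimal sketch of how such regions could be located automatically (it simply reports every 4 KiB-aligned run of all-zero blocks, including data that is legitimately zero; the command-line handling is just for illustration):

# Scan a file for runs of zero-filled 4 KiB blocks, printing the offset and
# length of each run. Offsets are reported in hex, as HxD shows them.
import sys

BLOCK = 4096

def find_zero_runs(path):
    zero_block = bytes(BLOCK)
    run_start = None
    offset = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            if chunk == zero_block:
                if run_start is None:
                    run_start = offset
            elif run_start is not None:
                print(f'zero run at {run_start:#x}, length {offset - run_start:#x}')
                run_start = None
            offset += len(chunk)
    if run_start is not None:
        print(f'zero run at {run_start:#x}, length {offset - run_start:#x}')

if __name__ == '__main__':
    find_zero_runs(sys.argv[1])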

EchterAgo commented 1 year ago

The hard drives have

LogicalSectorSize  : 512
PhysicalSectorSize : 4096

and the cache SSDs have

LogicalSectorSize  : 512
PhysicalSectorSize : 512

EchterAgo commented 1 year ago

The corruption also remains when I reread the data.

EchterAgo commented 1 year ago

I checked another file and the corruption start and end are again 4 KiB aligned, but this time it is larger: 49152 bytes (48 KiB).

lundman commented 1 year ago

Thanks for looking into this, I'll make it higher priority since we definitely don't want any corruption. If it's sections of nulls, I wonder if it is a seek/hole issue. I will dig deeper. Is there a relatively easy way to make it happen, like close to a one-liner?

EchterAgo commented 1 year ago

I haven't found an easy way to reproduce yet, but it does keep happening when I transfer a large amount of data. I'm currently trying to see if any rclone options help, like --local-no-preallocate and --local-no-sparse. I'll also check different dataset options, maybe disable compression. It takes a while to test due to the large data size and the small amount of data actually zeroed; in ~5TB of data only 80 KiB were actually bad. I will report back here when I have more information.

I also checked for any errors in logs but could not find anything, and SMART values look good. Both systems I copy to/from have ECC memory, and the source system is a Linux system running ZFS. Previous attempts to copy the same data to either NTFS or ReFS worked without issue. I'll also try to get a test running locally that writes random data from a fixed seed and compares the result.

EchterAgo commented 1 year ago

I also tried copying data with Windows Explorer and haven't had any corruption yet. Windows Explorer copies files sequentially in a single thread; rclone copies multiple files simultaneously and can even use multiple streams for single files.

sskras commented 1 year ago

@EchterAgo commented 2 hours ago:

Both systems I copy to/from have ECC memory, the source system is a Linux system running ZFS. Previous attempts to copy the same data to either NTFS / ReFS worked without issue.

It would be nice to check whether this can be reproduced by the Linux or FreeBSD implementations of OpenZFS. At least the Linux one contains some hard-to-reproduce bugs too, e.g. https://github.com/openzfs/zfs/issues/12014

EchterAgo commented 1 year ago

I've been using the Linux ZFS port for a long time already and I've never encountered corruption like this. The pool options are almost the same, except for case sensitivity, RAIDZ2 instead of RAIDZ1, and only one cache device. I also use rclone a lot to copy to the Linux pool and I do sanity check the data; no problems so far.

Yesterday I used Windows Explorer to copy the full 5TB to the Windows pool and it seems everything copied correctly. I also tried rclone with preallocation and sparse files disabled and the corruption remained. I'm not sure I specified the options correctly for the remote though, so I'll retry this now.

lundman commented 1 year ago

I've given up on #223 for now; after 5 weeks, I need a break from it.

Sounds like I should try out rclone on Windows and see if I can figure out what goes wrong. Maybe it is something in seek/write/holes that goes wrong.

sskras commented 1 year ago

Thanks @lundman. Wow, you're a cool digger! Take a rest and take care. :)

And thanks for referencing the FileSpy tool, which I hadn't heard of before. It's a bit sad it isn't open source, but it's impressive anyway! Cool. :)

EchterAgo commented 1 year ago

I've given up on #223 for now; after 5 weeks, I need a break from it.

Sounds like I should try out rclone on Windows and see if I can figure out what goes wrong. Maybe it is something in seek/write/holes that goes wrong.

I have also done more testing: none of --multi-thread-streams 0, --local-no-sparse, --local-no-preallocate and --transfers 1 makes rclone copy correctly, and I'm sure the options are applied correctly. Interestingly, I've done a few more copies with Windows Explorer and so far all those copies worked. I've looked at the WriteFile calls and rclone seems to use 32KB writes while Windows Explorer starts with 256KB first, then increases to 768KB, then 1 MB.

I plan to try some different filesystem options next. I'll also look into adding code to log zero data in some places.

lundman commented 1 year ago

When rclone runs and you watch it with FileSpy, are all the writes in sequence?

Like 0..32KB 32KB..64KB 64KB..96KB. .....

etc.? Or does it seek around a bit? I wonder if the "offset counter update" is wrong, or if it sends the offset each time.

EchterAgo commented 1 year ago

With the options I used it should just write sequentially with a single thread, but I'll verify this using FileSpy.

EchterAgo commented 1 year ago

FileSpy_20230614_rclone_write.zip

This is a capture of rclone writing the biggest file I have in the dataset. It does look sequential in 32KB chunks. There seems to be a FASTIO_WRITE call that fails in there. I'll also get a capture of Windows Explorer writing the same file.

I also wonder if it is feasible to just capture the whole copy, then I could write a small tool to locate the problematic offsets in the capture.

lundman commented 1 year ago

Ignore the FASTIO errors; it tries FASTIO first to see if we support it, and if not, does a regular IRP call. We have the FASTIO patch in a branch, but it complicates things, so I was going for stable first.

2   17:10:10.318        rclone.exe  IRP_MJ_WRITE    00060A00    FFFFBB06C3C12088    D:\data2.crypt  STATUS_SUCCESS  Offset: 00000003-7B3C7000 ToWrite: 8000 Written: 8000 
4   17:10:10.318        rclone.exe  IRP_MJ_WRITE    00060A00    FFFFBB06C3C12088    D:\data2.crypt  STATUS_SUCCESS  Offset: 00000003-7B3CF000 ToWrite: 8000 Written: 8000 
6   17:10:10.318        rclone.exe  IRP_MJ_WRITE    00060A00    FFFFBB06C3C12088    D:\data2.crypt  STATUS_SUCCESS  Offset: 00000003-7B3D7000 ToWrite: 8000 Written: 8000 
8   17:10:10.318        rclone.exe  IRP_MJ_WRITE    00060A00    FFFFBB06C3C12088    D:\data2.crypt  STATUS_SUCCESS  Offset: 00000003-7B3DF000 ToWrite: 8000 Written: 8000 

Hmm, what does IrpSp->Parameters.Write.ByteOffset.HighPart being 3 mean? Another special hardcoded value?

I am aware of these two:

        if (IrpSp->Parameters.Write.ByteOffset.LowPart ==
            FILE_USE_FILE_POINTER_POSITION) {
            byteOffset = fileObject->CurrentByteOffset;
        } else if (IrpSp->Parameters.Write.ByteOffset.LowPart ==
            FILE_WRITE_TO_END_OF_FILE) { // APPEND

#define FILE_WRITE_TO_END_OF_FILE 0xffffffff
#define FILE_USE_FILE_POINTER_POSITION 0xfffffffe
lundman commented 1 year ago

Ah, OK, this is just a snippet of a very large file? Then maybe it is up to 0x000000037B3C7000. So as long as it started at 0, then 1, 2, and now 3, that is fine. All writes should then end with 0x7000 and 0xf000 as long as the writes are 0x8000, and they certainly do in the snippet I have. You can tell FileSpy to monitor the drive letter of ZFS, then at the end, File->Save.
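
A quick sanity check of that reading (illustrative arithmetic only; the values are copied from the first FileSpy line above):

# ByteOffset is a 64-bit value split into HighPart/LowPart, so HighPart == 3 is
# just a large file offset, not another sentinel like the two #defines above.
high, low = 0x00000003, 0x7B3C7000
offset = (high << 32) | low
print(hex(offset), round(offset / 2**30, 1), 'GiB')   # 0x37b3c7000, ~13.9 GiB into the 150GB file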

EchterAgo commented 1 year ago

Yes, the file is 150GB, that is a normal file offset. I checked with smaller files and the high part of the address is zero.

At the rate the FileSpy memory usage is growing when capturing, I doubt I have enough memory to capture a whole rclone transfer. Is there any way to get FileSpy to log directly to disk?

EchterAgo commented 1 year ago

Initially I believed this to only affect larger files, but I've since seen this corruption in 1GB sized files as well.

EchterAgo commented 1 year ago

I can just copy the biggest file over and over again and capture its events.

lundman commented 1 year ago

So the I/O tests in MS test.exe highlight that we have a gap:

Write to file at 0, PASS
Check position hasn't moved, PASS
Set position, FAIL (STATUS_NOT_IMPLEMENTED)
Check position has now moved, FAIL (CurrentByteOffset was 0, expected 6)
Read from file specifying offset, PASS
Check position, FAIL (CurrentByteOffset was 0, expected 6)
Read from file specifying offset as FILE_USE_FILE_POINTER_POSITION, PASS
Write to file specifying offset as FILE_USE_FILE_POINTER_POSITION, PASS
Set position to 0, FAIL (STATUS_NOT_IMPLEMENTED)
Write to file specifying offset as FILE_WRITE_TO_END_OF_FILE, PASS
Check position, PASS

So let me fix those before we test again.

lundman commented 1 year ago

56eca2a

Unlikely to affect things; you would have seen not-implemented in FileSpy.

EchterAgo commented 1 year ago

I tested with the latest changes and still see corruption. I also tested different sources for the copy, including local NTFS drives, with the same result.

So far, any time I copy enough data using rclone sync <src> <dst> and then verify it using rclone check --differ <differ.log> <src> <dst>, I get differences in differ.log, with different files affected each time even though I create a fresh zpool each time. When I examine the files there is a zeroed section whose start and end offsets are 4k aligned. So far I haven't found a file that has multiple such sections, but I haven't examined each of them.

I haven't found any rclone / zfs / zpool options that make a difference yet and there are no reported checksum or R/W errors.

EchterAgo commented 1 year ago

Ok, I have a simple tool to reproduce:

import uuid
import random
import math
import os
from pathlib import Path

ROOTPATH = Path('D:\\testdir')
DATA_BLOCK_SIZE = 32 * 1024
DATA_NUM_BLOCKS = math.floor((1 * 1024 * 1024 * 1024) / DATA_BLOCK_SIZE)
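# 1 GiB per file, written and then verified in 32 KiB blocks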
#DATA_NUM_BLOCKS = 3

def gen_file():
    filename = ROOTPATH / str(uuid.uuid4())
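    # Remember the PRNG state so the exact same byte stream can be regenerated for verification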
    start_state = random.getstate()

    with open(filename, 'wb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            data_block = random.randbytes(DATA_BLOCK_SIZE)
            f.write(data_block)

    random.setstate(start_state)
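    # Rewinding the PRNG lets the verification pass below regenerate the identical byte stream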

    with open(filename, 'rb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            disk_data_block = f.read(DATA_BLOCK_SIZE)
            expected_data_block = random.randbytes(DATA_BLOCK_SIZE)

            if disk_data_block != expected_data_block:
                print(f'found mismatch in {filename} @ {bi * DATA_BLOCK_SIZE}')

def main():
    random.seed()

    os.makedirs(ROOTPATH, exist_ok=True)

    while True:
        gen_file()

if __name__ == "__main__":
    main()

It still needs a while to hit the error; on my first try it wrote 888GB before hitting it. Same issue as with rclone, but now isolated to only writing to ZFS.

EchterAgo commented 1 year ago

You can set a breakpoint on the mismatch print to stop the program when something goes wrong. I'll try to get a trace at that point tomorrow.

EchterAgo commented 1 year ago

I also tested with ZFS default options, same behavior. I'll check without cache devices, with different redundancy options, and with just a single device tomorrow.

EchterAgo commented 1 year ago

I also analyzed the file; it is the same zero-filled, 4k-aligned corruption. I'll try this on a clean Windows install in a VM next.

EchterAgo commented 1 year ago

On another try I hit corruption after 200GB.

lundman commented 1 year ago

Excellent, thanks for the reproducer. I'll also give it a go. It could be useful to save the cbuf soon after it happens (https://openzfsonosx.org/wiki/Windows_BSOD#Debug_Print_Buffer), so I can try to do that today.

lundman commented 1 year ago

OK, so the first thing I noticed was that when I ran out of disk space, it did not fail with an error. It turns out the reproducer's style of writing data advances the file first (a call to zfs_freesp()) and then fills it in with data. The code was ignoring any errors from zfs_freesp(), like ENOSPC.

3346a5b

My pools are like 10G, so it didn't take long.

Now, there's no reason I can think of that your tests would get a failure from zfs_freesp() (you don't run out of space, after all), but you should pull in the commit anyway.

I need to grow my test disks to let it run for longer.

EchterAgo commented 1 year ago

I have everything set up to dump cbuf; now I'm just waiting to reproduce the error again. I ran it overnight and it hit the issue ~10 times while writing 3.6TB of data, so it should be soon.

I'm not sure if the reproducer will also work when I delete the files after writing them; I'll try that after I've dumped the debug buffer.

EchterAgo commented 1 year ago

I attached the buffer from right after I reproduced the issue. There was also more binary data at the end of the buffer, which I cut off; is that needed too?

cbuf_cut.txt

lundman commented 1 year ago

Not at all, it's a circular buffer, so it doesn't always fill the whole way. We could pre-fill it with spaces or something, I guess.

lundman commented 1 year ago

Ah dang, that metaslab flushing thing is annoying. Looks like your zfs_flags isn't what we want:

https://github.com/openzfsonwindows/openzfs/blob/windows/module/zfs/spa_misc.c#L249-L258 https://github.com/openzfsonwindows/openzfs/blob/windows/man/man4/zfs.4#L1329-L1341

So you probably want 513, or maybe just 1. You can set it dynamically in the kernel debugger, or via the registry at \HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\OpenZFS. Or, if easier, change the source and recompile.

The cbuf should contain -EB-, which is the marker for "end of buffer"; you read what's just above it for the last few lines, etc.
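
A minimal sketch of pulling out those last lines automatically (assuming the dump was saved as plain text, like the attached cbuf files; the helper name and line count are made up):

# Print the last lines written before the -EB- ("end of buffer") marker of a
# saved cbuf dump. The most recent messages sit just above that marker.
import sys

def tail_before_eb(path, n=50):
    with open(path, 'r', errors='replace') as f:
        text = f.read()
    end = text.find('-EB-')
    if end == -1:
        end = len(text)   # no marker found; fall back to the end of the file
    return text[:end].splitlines()[-n:]

if __name__ == '__main__':
    for line in tail_before_eb(sys.argv[1]):
        print(line)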

EchterAgo commented 1 year ago

Yes, -EB- was just after the text; I cut it from the file. I'll try setting zfs_flags correctly and getting another log.

EchterAgo commented 1 year ago

This is the result with zfs_flags set to 513:

cbuf.txt

Would it make sense to automate dumping cbuf so it happens directly when my test hits the issue? There were maybe 5 minutes between hitting it and me dumping cbuf.

Edit: There shouldn't have been any other activity on the pool after the test encountered the issue; I stopped everything with a breakpoint.

lundman commented 1 year ago

ok very interesting

-EB-A985AEBB4040:dsl_dir.c:1346:dsl_dir_tempreserve_impl(): error 28

error 28 is ENOSPC; it doesn't have to mean the disk is full, but dsl_dir says something is full. So possibly there is some error happening and we aren't handling it fully. Is this with today's commit https://github.com/openzfsonwindows/openzfs/commit/3346a5b78c2db15801ce54a70a323952fdf67fa5 ?

error 132 is ENOMEM, so it does seem it's getting a bit constrained and we should handle errors better.

Side note: zfs_uiomove(): error 0 should not show; SET_ERROR is supposed to only print when it isn't zero. Odd.

EchterAgo commented 1 year ago

The system has 64GB memory, and when running the test the free memory seems to fluctuate between 13.5GB and 14.5GB. The modified set grows to a gigabyte and then back to zero. I am monitoring the memory usage of the system and it really should not have run out of memory.

I am still using 0f83d31e288d789fb4e10a7e4b12e27887820498 but I will restart the test with 3346a5b78c2db15801ce54a70a323952fdf67fa5 now.

EchterAgo commented 1 year ago

The memory usage also rises quickly until it flattens at some point.

Edit: I took a screenshot but somehow it was all black :\ You can see the memory usage rising at about the rate I am writing to the drive and then stabilizing at around 44GB used. There are other processes on the system, so the usage isn't only ZFS.

lundman commented 1 year ago

Yeah, it's not so much about your machine's memory, since I set a hard ceiling inside ZFS so as not to eat everything; we are not part of the OS memory handler, but separate. So you need to view the kstat to see how cramped it is inside ZFS. But we don't have to go that deep today.

EchterAgo commented 1 year ago

It looks like 3346a5b78c2db15801ce54a70a323952fdf67fa5 did actually change things; I'm getting a kernel crash now:

28: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common BugCheck.  Usually the exception address pinpoints
the driver/function that caused the problem.  Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffff80000003, The exception code that was not handled
Arg2: fffff8048c1ca0c7, The address that the exception occurred at
Arg3: ffff8304fa88b778, Exception Record Address
Arg4: ffff8304fa88afb0, Context Record Address

Debugging Details:
------------------

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 2452

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 3838

    Key  : Analysis.Init.CPU.mSec
    Value: 4515

    Key  : Analysis.Init.Elapsed.mSec
    Value: 50754

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 118

    Key  : WER.OS.Branch
    Value: vb_release

    Key  : WER.OS.Timestamp
    Value: 2019-12-06T14:06:00Z

    Key  : WER.OS.Version
    Value: 10.0.19041.1

FILE_IN_CAB:  MEMORY.DMP

BUGCHECK_CODE:  7e

BUGCHECK_P1: ffffffff80000003

BUGCHECK_P2: fffff8048c1ca0c7

BUGCHECK_P3: ffff8304fa88b778

BUGCHECK_P4: ffff8304fa88afb0

EXCEPTION_RECORD:  ffff8304fa88b778 -- (.exr 0xffff8304fa88b778)
ExceptionAddress: fffff8048c1ca0c7 (OpenZFS!l2arc_write_buffers+0x0000000000000df7)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 1
   Parameter[0]: 0000000000000000

CONTEXT:  ffff8304fa88afb0 -- (.cxr 0xffff8304fa88afb0)
rax=0000000000000001 rbx=ffff938f4d10e040 rcx=1288d5f429410000
rdx=000000000000004b rsi=fffff8048c1c1ae0 rdi=0000000000000000
rip=fffff8048c1ca0c7 rsp=ffff8304fa88b9b0 rbp=0000000000000080
 r8=000000000000004d  r9=0000000000000000 r10=0000000000000000
r11=0000000000000010 r12=0000000000000286 r13=0000000000000000
r14=ffff938f39167180 r15=ffffaa811a50f000
iopl=0         nv up ei ng nz na pe nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00040282
OpenZFS!l2arc_write_buffers+0xdf7:
fffff804`8c1ca0c7 cc              int     3
Resetting default scope

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXPNP: 1 (!blackboxpnp)

BLACKBOXWINLOGON: 1

PROCESS_NAME:  System

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_STR:  0x80000003

STACK_TEXT:  
ffff8304`fa88b9b0 fffff804`8c1c1da1     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : OpenZFS!l2arc_write_buffers+0xdf7 [H:\dev\openzfs\module\zfs\arc.c @ 9445] 
ffff8304`fa88bb70 fffff804`3cd268f5     : ffff938f`4d10e040 fffff804`8c1c1ae0 00000000`00000000 000f8067`bcbbbdff : OpenZFS!l2arc_feed_thread+0x2c1 [H:\dev\openzfs\module\zfs\arc.c @ 9559] 
ffff8304`fa88bc10 fffff804`3ce04c68     : ffffaa81`1a500180 ffff938f`4d10e040 fffff804`3cd268a0 00000000`00000000 : nt!PspSystemThreadStartup+0x55
ffff8304`fa88bc60 00000000`00000000     : ffff8304`fa88c000 ffff8304`fa886000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28

FAULTING_SOURCE_LINE:  H:\dev\openzfs\module\zfs\arc.c

FAULTING_SOURCE_FILE:  H:\dev\openzfs\module\zfs\arc.c

FAULTING_SOURCE_LINE_NUMBER:  9445

FAULTING_SOURCE_CODE:  
  9441:         return (0);
  9442:     }
  9443: 
  9444:     if (!dev->l2ad_first)
> 9445:         ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);
  9446: 
  9447:     ASSERT3U(write_asize, <=, target_sz);
  9448:     ARCSTAT_BUMP(arcstat_l2_writes_sent);
  9449:     ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
  9450: 

SYMBOL_NAME:  OpenZFS!l2arc_write_buffers+df7

MODULE_NAME: OpenZFS

IMAGE_NAME:  OpenZFS.sys

STACK_COMMAND:  .cxr 0xffff8304fa88afb0 ; kb

BUCKET_ID_FUNC_OFFSET:  df7

FAILURE_BUCKET_ID:  0x7E_OpenZFS!l2arc_write_buffers

OS_VERSION:  10.0.19041.1

BUILDLAB_STR:  vb_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {e3cf841f-f24f-b707-1660-16a0d36cb323}

Followup:     MachineOwner
---------

************* Path validation summary **************
Response                         Time (ms)     Location
OK                                             C:\Windows\System32\drivers

I'll have to get a VM set up to do the testing soon; I can't restart this machine on every test failure.

EchterAgo commented 1 year ago

I've also seen a PAGE_FAULT_IN_NON_PAGED_AREA before this, but I rebooted the machine before it finished writing the dump :\

andrewc12 commented 1 year ago

I fixed the script that saves the stack and cbuf from a dump if you want to use it.

lundman commented 1 year ago

Awesome. Yeah, I now detect the error, but I do need to confirm that I do the correct thing when it happens. I'll try it more tomorrow. My pool went "degraded", which is odd, so I don't think it's right yet.

EchterAgo commented 1 year ago

I removed the cache devices and so far everything is running fine.

EchterAgo commented 1 year ago

I've written 3TB without error so far; this never worked with the cache devices.

EchterAgo commented 1 year ago

I think the issue in my last memory dump is actually fixed by these commits:

https://github.com/openzfs/zfs/commit/bcd5321039c3de29c14eac1068d392c15ad7fe2c https://github.com/openzfs/zfs/commit/feff9dfed3df1bbae5dd74959a6ad87d11f27ffb

See https://github.com/openzfs/zfs/issues/14936

I'm going to try with those now.

lundman commented 1 year ago

Yeah, I can do another sync with upstream soon as well.

EchterAgo commented 1 year ago

Ok, with those two upstream commits cherry-picked, the crash has gone away, but I still hit corruption.

It seems like the original issue only affects pools with cache devices.

lundman commented 1 year ago

OK, so corruption still happens, but only if using l2arc cache device(s)? So I can stop looking around fs_write(), and maybe sync up the ARC code.