EchterAgo opened this issue 1 year ago
The hole in the file seems to start and end at a 4 KiB aligned offset and is 32 KiB in size. The data after the hole is identical again. I'm letting it continue scanning to see if there are further differences.
The hard drives have
LogicalSectorSize : 512
PhysicalSectorSize : 4096
and the cache SSDs have
LogicalSectorSize : 512
PhysicalSectorSize : 512
The corruption also remains when I reread the data.
I checked another file and the corruption start and end are again 4 KiB aligned, but this time it is larger: 49152 bytes (48 KiB).
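For reference, the check is basically a block-by-block compare of the copy against the source. A rough sketch of that kind of scan (not necessarily the exact tool I'm using; the paths are placeholders and both copies are assumed to be the same size):

```python
# Sketch: compare two copies of a file in 4 KiB blocks and print the
# byte ranges that differ (assumes both files have the same size).
BLOCK = 4096

def diff_ranges(src_path, dst_path):
    ranges = []
    start = None
    offset = 0
    with open(src_path, 'rb') as src, open(dst_path, 'rb') as dst:
        while True:
            a = src.read(BLOCK)
            b = dst.read(BLOCK)
            if not a and not b:
                break
            if a != b:
                if start is None:
                    start = offset
            elif start is not None:
                ranges.append((start, offset))
                start = None
            offset += len(a)
    if start is not None:
        ranges.append((start, offset))
    return ranges

for start, end in diff_ranges('Z:\\source\\file.img', 'D:\\copy\\file.img'):
    print(f'difference at {start:#x}..{end:#x} ({end - start} bytes)')
```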
Thanks for looking into this, I'll make it higher priority since we definitely don't want any corruption. If it's sections of nulls, I wonder if it is a seek/hole issue. I will dig deeper. Is there a relatively easy way to make it happen, like close to a one-liner?
I haven't found an easy way to reproduce yet, but it does keep happening when I transfer a large amount of data. I'm currently trying to see if any rclone options help, like --local-no-preallocate and --local-no-sparse. I'll also check different dataset options, maybe disable compression. It takes a while to test due to the large data size and the small amount of data actually zeroed: in ~5TB of data only 80 KiB were actually bad. I will report back here when I have more information.
I also checked for any errors in logs, but could not find anything, SMART values look good. Both systems I copy to/from have ECC memory, the source system is a Linux system running ZFS. Previous attempts to copy the same data to either NTFS / ReFS worked without issue. I'll also try to get some test running locally that writes random data from a fixed seed and compares the result.
I also tried copying data with Windows Explorer and haven't had any corruption yet. Windows Explorer copies files sequentially in a single thread, rclone copies multiple files simultaneously and can even use multiple streams for single files.
@EchterAgo commented 2 hours ago:
Both systems I copy to/from have ECC memory, the source system is a Linux system running ZFS. Previous attempts to copy the same data to either NTFS / ReFS worked without issue.
It would be nice to check whether it can be reproduced with the Linux or FreeBSD implementations of OpenZFS. At least the Linux one contains some hard-to-reproduce bugs too, e.g.: https://github.com/openzfs/zfs/issues/12014
I've been using the Linux ZFS port for a long time and I've never encountered corruption like this. The pool options are almost the same, except for case sensitivity, RAIDZ2 instead of RAIDZ1, and only one cache device. I also use rclone a lot to copy to the Linux pool and I do sanity-check the data; no problems so far.
Yesterday I used Windows Explorer to copy the full 5TB to the Windows pool and it seems everything copied correctly. I also tried rclone with preallocation and sparse files disabled, and the corruption remained. I'm not sure I specified the options correctly for the remote though, so I'll retry this now.
I've given up on #223 for now, after 5 weeks, I need a break from it.
Sounds like I should try out rclone on Windows and see if I can figure out what goes wrong. Maybe it is something in seek/write/holes that goes wrong.
Thanks @lundman. Wow, you're a cool digger! Take a rest and take care. :)
And thanks for referencing the FileSpy tool, which I hadn't heard of before. It's a bit sad that it isn't open source, but it's impressive anyway! Cool. :)
I've given up on #223 for now, after 5 weeks, I need a break from it.
Sounds like I should try out rclone on Windows and see if I can figure out what goes wrong. Maybe it is something in seek/write/holes that goes wrong.
I have also done more testing: none of --multi-thread-streams 0, --local-no-sparse, --local-no-preallocate and --transfers 1 make rclone copy correctly, and I'm sure the options are applied correctly. Interestingly, I've done a few more copies with Windows Explorer and so far all of those copies worked. I've looked at the WriteFile calls and rclone seems to use 32KB writes, while Windows Explorer starts with 256KB, then increases to 768KB, then 1MB.
I plan to try some different filesystem options next. I'll also look into adding code to log zero data in some places.
When rclone runs, and using FileSpy, are all the writes in sequence? Like 0..32KB, 32KB..64KB, 64KB..96KB, etc.? Or does it seek around a bit? I wonder if the "offset counter update" is wrong, or if it sends the offset each time.
With the options I used it should just write sequentially with a single thread, but I'll verify this using FileSpy.
FileSpy_20230614_rclone_write.zip
This is a capture of rclone writing the biggest file I have in the dataset. It does look sequential in 32KB chunks. There seems to be a FASTIO_WRITE call that fails in there. I'll also get a capture of Windows Explorer writing the same file.
I also wonder if it is feasible to just capture the whole copy, then I could write a small tool to locate the problematic offsets in the capture.
Ignore the FASTIO errors; Windows tries FASTIO first to see if we support it, and if not, it does a regular IRP call. We have the FASTIO patch in a branch, but it complicates things, so I was going for stable first.
2 17:10:10.318 rclone.exe IRP_MJ_WRITE 00060A00 FFFFBB06C3C12088 D:\data2.crypt STATUS_SUCCESS Offset: 00000003-7B3C7000 ToWrite: 8000 Written: 8000
4 17:10:10.318 rclone.exe IRP_MJ_WRITE 00060A00 FFFFBB06C3C12088 D:\data2.crypt STATUS_SUCCESS Offset: 00000003-7B3CF000 ToWrite: 8000 Written: 8000
6 17:10:10.318 rclone.exe IRP_MJ_WRITE 00060A00 FFFFBB06C3C12088 D:\data2.crypt STATUS_SUCCESS Offset: 00000003-7B3D7000 ToWrite: 8000 Written: 8000
8 17:10:10.318 rclone.exe IRP_MJ_WRITE 00060A00 FFFFBB06C3C12088 D:\data2.crypt STATUS_SUCCESS Offset: 00000003-7B3DF000 ToWrite: 8000 Written: 8000
Hmm, what does IrpSp->Parameters.Write.ByteOffset.HighPart being 3 mean? Another special hardcoded value?
I am aware of these two:
#define FILE_WRITE_TO_END_OF_FILE      0xffffffff
#define FILE_USE_FILE_POINTER_POSITION 0xfffffffe

if (IrpSp->Parameters.Write.ByteOffset.LowPart ==
    FILE_USE_FILE_POINTER_POSITION) {
        byteOffset = fileObject->CurrentByteOffset;
} else if (IrpSp->Parameters.Write.ByteOffset.LowPart ==
    FILE_WRITE_TO_END_OF_FILE) { // APPEND
Ah ok, so this is just a snippet from a very large file? Then the offset could be up to 0x000000037B3C7000. So as long as the HighPart started at 0, then went to 1, 2, and is now 3, that is fine. All writes should then end with 0x7000 and 0xf000 as long as the writes are 0x8000, and they certainly do in the snippet I have. You can tell FileSpy to monitor the drive letter of ZFS, then at the end, do File->Save.
Yes, the file is 150GB, that is a normal file offset. I checked with smaller files and the high part of the address is zero.
At the rate the FileSpy memory usage is growing when capturing I doubt I have enough memory to capture a whole rclone transfer. Is there any way to get FileSpy to log directly to disk?
Initially I believed this to only affect larger files, but I've since seen this corruption in 1GB sized files as well.
I can just copy the biggest file over and over again and capture its events.
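If I capture it that way, a small script along these lines could check the capture for gaps. This is only a sketch: it assumes FileSpy's text export keeps the IRP_MJ_WRITE lines in the same format as quoted above, and that the capture covers a single file being written sequentially.

```python
# Sketch: scan a FileSpy text export for IRP_MJ_WRITE entries and report
# any jump or gap in the write offsets. Assumes lines of the form
# "... IRP_MJ_WRITE ... Offset: HIGH-LOW ToWrite: SIZE ..." (hex fields)
# and a capture of one file written sequentially.
import re
import sys

WRITE_RE = re.compile(
    r'IRP_MJ_WRITE.*?Offset:\s*([0-9A-Fa-f]{8})-([0-9A-Fa-f]{8})'
    r'\s+ToWrite:\s*([0-9A-Fa-f]+)')

def check_sequential(path):
    expected = None
    with open(path, errors='replace') as f:
        for line in f:
            m = WRITE_RE.search(line)
            if not m:
                continue
            high, low, size = (int(x, 16) for x in m.groups())
            offset = (high << 32) | low  # combine into a 64-bit byte offset
            if expected is not None and offset != expected:
                print(f'non-sequential write: expected {expected:#x}, got {offset:#x}')
            expected = offset + size

if __name__ == '__main__':
    check_sequential(sys.argv[1])
```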
So the io tests in MS test.exe highlight that we have a gap:
Write to file at 0, PASS
Check position hasn't moved, PASS
Set position, FAIL (STATUS_NOT_IMPLEMENTED)
Check position has now moved, FAIL (CurrentByteOffset was 0, expected 6)
Read from file specifying offset, PASS
Check position, FAIL (CurrentByteOffset was 0, expected 6)
Read from file specifying offset as FILE_USE_FILE_POINTER_POSITION, PASS
Write to file specifying offset as FILE_USE_FILE_POINTER_POSITION, PASS
Set position to 0, FAIL (STATUS_NOT_IMPLEMENTED)
Write to file specifying offset as FILE_WRITE_TO_END_OF_FILE, PASS
Check position, PASS
So let me fix those before we test again
56eca2a
Unlikely to affect things, you would have seen not-implemented in FileSpy.
I tested with the latest changes and still see corruption. I also tested different sources for the copy, including local NTFS drives, with the same result.
So far, any time I copy enough data using rclone sync <src> <dst> and then verify it using rclone check --differ <differ.log> <src> <dst>, I get differences in the differ.log, with different files each time even though I create a fresh zpool each time. When I examine the files there is a zeroed section where the start and end offsets are 4k aligned. So far I haven't found a file that has multiple such sections, but I haven't examined each of them.
I haven't found any rclone / zfs / zpool options that make a difference yet, and there are no reported checksum or R/W errors.
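Since the bad sections are always zero-filled and 4k aligned, a quick way to flag candidate files without the source at hand is to list runs of all-zero 4 KiB blocks. A rough sketch (legitimately zero-filled data will show up too, so any hit still needs to be checked against the source):

```python
# Sketch: list runs of all-zero 4 KiB blocks in a file, to spot multiple
# corrupted sections without opening each file in a hex editor.
import sys

BLOCK = 4096

def zero_runs(path):
    runs = []
    start = None
    offset = 0
    with open(path, 'rb') as f:
        while True:
            block = f.read(BLOCK)
            if not block:
                break
            if block.count(0) == len(block):  # block is entirely zeroes
                if start is None:
                    start = offset
            elif start is not None:
                runs.append((start, offset))
                start = None
            offset += len(block)
    if start is not None:
        runs.append((start, offset))
    return runs

if __name__ == '__main__':
    for start, end in zero_runs(sys.argv[1]):
        print(f'zero run: {start:#x}..{end:#x} ({end - start} bytes)')
```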
Ok, I have a simple tool to reproduce:
import uuid
import random
import math
import os
from pathlib import Path

ROOTPATH = Path('D:\\testdir')
DATA_BLOCK_SIZE = 32 * 1024
DATA_NUM_BLOCKS = math.floor((1 * 1024 * 1024 * 1024) / DATA_BLOCK_SIZE)
#DATA_NUM_BLOCKS = 3

def gen_file():
    filename = ROOTPATH / str(uuid.uuid4())
    start_state = random.getstate()
    with open(filename, 'wb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            data_block = random.randbytes(DATA_BLOCK_SIZE)
            f.write(data_block)
    random.setstate(start_state)
    with open(filename, 'rb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            disk_data_block = f.read(DATA_BLOCK_SIZE)
            expected_data_block = random.randbytes(DATA_BLOCK_SIZE)
            if disk_data_block != expected_data_block:
                print(f'found mismatch in {filename} @ {bi * DATA_BLOCK_SIZE}')

def main():
    random.seed()
    os.makedirs(ROOTPATH, exist_ok=True)
    while True:
        gen_file()

if __name__ == "__main__":
    main()
It still needs a while to hit the error. On my first try it wrote 888GB before hitting the error. Same issue as with rclone, but now isolated to only writing to ZFS.
You can set a breakpoint on the mismatch print to stop the program when something goes wrong. I'll try to get a trace at that point tomorrow.
I also tested with ZFS default options: same behavior. I'll check without cache devices, with different redundancy options, and with just a single device tomorrow.
I also analyzed the file; it is the same zero-filled, 4k-aligned corruption. I'll try this on a clean Windows install in a VM next.
On another try I hit corruption after 200GB.
Excellent, thanks for the reproducer. I'll also give it a go. It could be useful to save the cbuf soon after it happens (https://openzfsonosx.org/wiki/Windows_BSOD#Debug_Print_Buffer), so I can try to do that today.
OK, so the first thing I noticed was that when I ran out of disk space, it did not fail with an error. It turns out that with the reproducer's style of writing data, the file is advanced first (a call to zfs_freesp()) and then filled in with data, and the code was ignoring any errors from zfs_freesp(), like ENOSPC.
3346a5b
My pools are only about 10G, so it didn't take long.
Now, there's no reason I can think of that your tests would get a failure from zfs_freesp() (you don't run out of space, after all), but you should pull in the commit anyway.
I need to grow my test disks to let it run for longer.
I have everything set up to dump cbuf; now I'm just waiting to reproduce the error again. I ran it overnight and it hit the issue ~10 times while writing 3.6TB of data, so it should happen again soon.
I'm not sure if the reproducer will also work when I delete the files after writing them; I'll try that after I've dumped the debug buffer.
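For the delete-after-writing variant, the change to the reproducer should be small; roughly this (a sketch that keeps any file showing a mismatch and removes the rest):

```python
# Possible delete-after-writing variant of gen_file() from the reproducer
# above: files that verify cleanly are removed, files with a mismatch are
# kept for later inspection.
def gen_file():
    filename = ROOTPATH / str(uuid.uuid4())
    start_state = random.getstate()
    with open(filename, 'wb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            f.write(random.randbytes(DATA_BLOCK_SIZE))
    random.setstate(start_state)
    mismatch = False
    with open(filename, 'rb') as f:
        for bi in range(0, DATA_NUM_BLOCKS):
            disk_data_block = f.read(DATA_BLOCK_SIZE)
            expected_data_block = random.randbytes(DATA_BLOCK_SIZE)
            if disk_data_block != expected_data_block:
                mismatch = True
                print(f'found mismatch in {filename} @ {bi * DATA_BLOCK_SIZE}')
    if not mismatch:
        os.remove(filename)
```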
I attached the buffer from right after I reproduced the issue. There was also more binary data at the end of the buffer, which I cut off; is that needed too?
Not at all, it's a circular buffer, so it doesn't always fill the whole way. We could pre-fill it with spaces or something, I guess.
Ah dang, that metaslab flushing thing is annoying. Looks like your zfs_flags isn't set to what we want:
https://github.com/openzfsonwindows/openzfs/blob/windows/module/zfs/spa_misc.c#L249-L258
https://github.com/openzfsonwindows/openzfs/blob/windows/man/man4/zfs.4#L1329-L1341
So you probably want 513, or maybe just 1. You can set it dynamically in the kernel debugger, or via the registry key \HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\Services\OpenZFS. Or, if easier, change the source and recompile.
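If scripting it is easier than the kernel debugger, something like this should work from user space. This is a sketch: I'm assuming the tunable shows up as a DWORD value named zfs_flags directly under that service key, it needs an elevated prompt, and it may only take effect after a driver reload or reboot.

```python
# Sketch: set zfs_flags = 513 via the registry (assumes a DWORD value
# named "zfs_flags" directly under the OpenZFS service key; run elevated).
import winreg

KEY_PATH = r'SYSTEM\ControlSet001\Services\OpenZFS'

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, 'zfs_flags', 0, winreg.REG_DWORD, 513)
```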
The cbuf should contain -EB-, which is the marker for "end of buffer"; you read what's just above it for the last few lines etc.
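If you want to grab the interesting part automatically, something like this prints the last lines before the marker (a sketch, assuming the cbuf was saved as a plain text file):

```python
# Sketch: print the last few lines before the -EB- ("end of buffer")
# marker in a saved cbuf dump.
import sys

def tail_before_marker(path, marker=b'-EB-', count=20):
    data = open(path, 'rb').read()
    end = data.find(marker)
    if end == -1:
        end = len(data)  # no marker found: fall back to the end of the file
    return data[:end].splitlines()[-count:]

if __name__ == '__main__':
    for line in tail_before_marker(sys.argv[1]):
        print(line.decode(errors='replace'))
```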
Yes, -EB- was just after the text; I cut it from the file. I'll try setting zfs_flags correctly and getting another log.
This is the result with zfs_flags set to 513.
Would it make sense to automate dumping cbuf so it happens directly when my test hits the issue? There were maybe 5 minutes between hitting it and me dumping cbuf.
Edit: There shouldn't have been any other activity on the pool after the test encountered the issue, I stopped everything with a breakpoint.
OK, very interesting:
-EB-A985AEBB4040:dsl_dir.c:1346:dsl_dir_tempreserve_impl(): error 28
error 28 is ENOSPC; it doesn't have to mean the disk is full, but dsl_dir says something is full. So possibly there is some error happening and we aren't handling it fully. This is with today's commit https://github.com/openzfsonwindows/openzfs/commit/3346a5b78c2db15801ce54a70a323952fdf67fa5 ?
error 132 is ENOMEM, so it does seem it's getting a bit constrained and we should handle errors better.
Side note: zfs_uiomove(): error 0 should not show; SET_ERROR is supposed to only print when the error isn't zero. Odd.
The system has 64GB of memory, and when running the test the free memory seems to fluctuate between 13.5GB and 14.5GB. The modified set grows to a gigabyte and then back to zero. I am monitoring the memory usage of the system and it really should not have run out of memory.
I am still using 0f83d31e288d789fb4e10a7e4b12e27887820498 but I will restart the test with 3346a5b78c2db15801ce54a70a323952fdf67fa5 now.
The memory usage also rises quickly until it flattens at some point.
Edit: I took a screenshot but somehow it was all black :\ You can see the memory usage rising at about the rate I am writing to the drive and then stabilizing at around 44GB used. There are other processes on the system, so the usage isn't only ZFS.
Yeah, it's not so much about your machine's memory, since I set a hard ceiling inside ZFS so as not to eat everything, as we are not part of the OS memory handler, but separate. So you need to view the kstat to see how cramped it is inside ZFS. But we don't have to go that deep today.
It looks like 3346a5b78c2db15801ce54a70a323952fdf67fa5 did actually change things, I'm getting a kernel crash now:
28: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common BugCheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffff80000003, The exception code that was not handled
Arg2: fffff8048c1ca0c7, The address that the exception occurred at
Arg3: ffff8304fa88b778, Exception Record Address
Arg4: ffff8304fa88afb0, Context Record Address
Debugging Details:
------------------
KEY_VALUES_STRING: 1
Key : Analysis.CPU.mSec
Value: 2452
Key : Analysis.DebugAnalysisManager
Value: Create
Key : Analysis.Elapsed.mSec
Value: 3838
Key : Analysis.Init.CPU.mSec
Value: 4515
Key : Analysis.Init.Elapsed.mSec
Value: 50754
Key : Analysis.Memory.CommitPeak.Mb
Value: 118
Key : WER.OS.Branch
Value: vb_release
Key : WER.OS.Timestamp
Value: 2019-12-06T14:06:00Z
Key : WER.OS.Version
Value: 10.0.19041.1
FILE_IN_CAB: MEMORY.DMP
BUGCHECK_CODE: 7e
BUGCHECK_P1: ffffffff80000003
BUGCHECK_P2: fffff8048c1ca0c7
BUGCHECK_P3: ffff8304fa88b778
BUGCHECK_P4: ffff8304fa88afb0
EXCEPTION_RECORD: ffff8304fa88b778 -- (.exr 0xffff8304fa88b778)
ExceptionAddress: fffff8048c1ca0c7 (OpenZFS!l2arc_write_buffers+0x0000000000000df7)
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 1
Parameter[0]: 0000000000000000
CONTEXT: ffff8304fa88afb0 -- (.cxr 0xffff8304fa88afb0)
rax=0000000000000001 rbx=ffff938f4d10e040 rcx=1288d5f429410000
rdx=000000000000004b rsi=fffff8048c1c1ae0 rdi=0000000000000000
rip=fffff8048c1ca0c7 rsp=ffff8304fa88b9b0 rbp=0000000000000080
r8=000000000000004d r9=0000000000000000 r10=0000000000000000
r11=0000000000000010 r12=0000000000000286 r13=0000000000000000
r14=ffff938f39167180 r15=ffffaa811a50f000
iopl=0 nv up ei ng nz na pe nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00040282
OpenZFS!l2arc_write_buffers+0xdf7:
fffff804`8c1ca0c7 cc int 3
Resetting default scope
BLACKBOXBSD: 1 (!blackboxbsd)
BLACKBOXNTFS: 1 (!blackboxntfs)
BLACKBOXPNP: 1 (!blackboxpnp)
BLACKBOXWINLOGON: 1
PROCESS_NAME: System
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE_STR: 80000003
EXCEPTION_PARAMETER1: 0000000000000000
EXCEPTION_STR: 0x80000003
STACK_TEXT:
ffff8304`fa88b9b0 fffff804`8c1c1da1 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : OpenZFS!l2arc_write_buffers+0xdf7 [H:\dev\openzfs\module\zfs\arc.c @ 9445]
ffff8304`fa88bb70 fffff804`3cd268f5 : ffff938f`4d10e040 fffff804`8c1c1ae0 00000000`00000000 000f8067`bcbbbdff : OpenZFS!l2arc_feed_thread+0x2c1 [H:\dev\openzfs\module\zfs\arc.c @ 9559]
ffff8304`fa88bc10 fffff804`3ce04c68 : ffffaa81`1a500180 ffff938f`4d10e040 fffff804`3cd268a0 00000000`00000000 : nt!PspSystemThreadStartup+0x55
ffff8304`fa88bc60 00000000`00000000 : ffff8304`fa88c000 ffff8304`fa886000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x28
FAULTING_SOURCE_LINE: H:\dev\openzfs\module\zfs\arc.c
FAULTING_SOURCE_FILE: H:\dev\openzfs\module\zfs\arc.c
FAULTING_SOURCE_LINE_NUMBER: 9445
FAULTING_SOURCE_CODE:
9441: return (0);
9442: }
9443:
9444: if (!dev->l2ad_first)
> 9445: ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);
9446:
9447: ASSERT3U(write_asize, <=, target_sz);
9448: ARCSTAT_BUMP(arcstat_l2_writes_sent);
9449: ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);
9450:
SYMBOL_NAME: OpenZFS!l2arc_write_buffers+df7
MODULE_NAME: OpenZFS
IMAGE_NAME: OpenZFS.sys
STACK_COMMAND: .cxr 0xffff8304fa88afb0 ; kb
BUCKET_ID_FUNC_OFFSET: df7
FAILURE_BUCKET_ID: 0x7E_OpenZFS!l2arc_write_buffers
OS_VERSION: 10.0.19041.1
BUILDLAB_STR: vb_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {e3cf841f-f24f-b707-1660-16a0d36cb323}
Followup: MachineOwner
---------
************* Path validation summary **************
Response Time (ms) Location
OK C:\Windows\System32\drivers
I'll have to get a VM set up to do the testing soon; I can't restart this machine on every test failure.
I've also seen a PAGE_FAULT_IN_NON_PAGED_AREA before this, but I rebooted the machine before it finished writing the dump :\
I fixed the script that saves the stack and cbuf from a dump if you want to use it.
Awesome - yeah, I now detect the error, but I do need to confirm that I do the correct thing when it happens. I'll try it more tomorrow - my pool went "degraded", which is odd, so it isn't right, I don't think.
I removed the cache devices and so far everything is running fine.
I've written 3TB without error so far; this never worked with the cache devices.
I think the issue in my last memory dump is actually fixed by these commits:
https://github.com/openzfs/zfs/commit/bcd5321039c3de29c14eac1068d392c15ad7fe2c https://github.com/openzfs/zfs/commit/feff9dfed3df1bbae5dd74959a6ad87d11f27ffb
See https://github.com/openzfs/zfs/issues/14936
I'm going to try with those now.
Yeah, I can do another sync with upstream soon as well.
Ok, with those two upstream commits cherry-picked, the crash has gone away, but I still hit corruption.
It seems like the original issue only affects pools with cache devices.
OK, so corruption still happens, but only if using l2arc cache device(s)? So I can stop looking around fs_write(), and maybe sync up the arc code.
I'm using rclone sync to copy ~5TB of data to a ZFS dataset created with these options:
After the copy is done, I use rclone sync -c to fully verify the contents of the copied files, and it always comes up with differences. So far I've only seen it with bigger files, mostly with a 150GB disk image file, but also with some ~30GB image files. I used HxD to compare the files, and it turns out there are sections filled with zeroes instead of the original data.
I'll try to diagnose this further.