Yes, that's a good idea. We'll definitely disable this when tagging the final 0.7.0 if the various issues aren't all wrapped up by then. Disabling it in master also makes sense now that we've started putting out release candidates for wider testing. Opened PR #5099 to change the default.
Can someone watching this bug please review the patch in #5099, which changes the default value?
I have experienced a snapshot silently corrupted by send/receive – the checksum of a single file does not match between source and target, both on the filesystem and in the snapshots. Repeated scrubs of the tested pools do not show any errors. Replicating the source filesystem again results in the modification of a single file on the target pool, in the received filesystem and in all of its snapshots. The source pool is a 6x1TB RAIDZ2 array on Debian 8.2.0 (Jessie, kernel 3.16.0-4-amd64, installed from DVDs, no additional updates) with version 0.6.5.3 of ZFS/SPL built from source (standard configure, no patches).
My source pool (“nas1FS”) was created on a machine with non-ECC RAM and, after being filled with data through a Samba share (standard Samba server; sharesmb=off on the source filesystem), was moved to a different computer with ECC RAM (the same operating system was used on both machines). I could understand non-ECC RAM in the first computer causing permanent corruption visible on both the source and target pools, but in this case the data is silently changed only during the transfer, on the new computer with ECC RAM, and the source pool data seems to be fine. This corruption of data during send/receive is repeatable.
To better explain what I have done: first I created a 6x1TB raidz2 pool (“nas1FS”) on my old computer. After filling this pool with data I moved the array to a new computer and tried to back up the data on the “nas1FS” pool to a different pool (“backup_raidz”).
The “nas1FS” pool contained the following snapshots that are of interest in this issue:
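A command like the following lists them:
# zfs list -t snapshot -r nas1FS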
I have created a “backup_raidz” pool for backup (with compression turned on):
# zpool create -o ashift=12 -O compression=gzip-9 backup_raidz raidz1 /dev/disk/by-id/ata-SAMSUNG_HD103UJ_SerialNo1 /dev/disk/by-id/ata-SAMSUNG_HD103UJ_SerialNo2 /dev/disk/by-id/ata-HGST_HTS721010A9E630_SerialNo3 -f
Afterwards I tried to replicate the “nas1FS” pool:
# zfs send -R -vvvv "nas1FS@20160618" |zfs receive -vvvv -F "backup_raidz/nas1FS_bak"
This command finished successfully without any errors. I then generated a list of file checksums on both the source and the target and compared the resulting files.
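The comparison was of roughly this form (assuming the source and received datasets are mounted at /nas1FS/backup and /backup_raidz/nas1FS_bak/backup; exact paths may differ):
# cd /nas1FS/backup && find . -type f -exec md5sum {} + | sort -k 2 > /tmp/md5_source.txt
# cd /backup_raidz/nas1FS_bak/backup && find . -type f -exec md5sum {} + | sort -k 2 > /tmp/md5_target.txt
# diff /tmp/md5_source.txt /tmp/md5_target.txt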
I have found a checksum mismatch on a single file:
The correct checksum is “3178c03d3205ac148372a71d75a835ec”; it was verified against the source data used to populate the “nas1FS” filesystem.
This checksum mismatch was propagated through all snapshots in which the file was present on the target pool:
The source pool showed the correct checksum in every snapshot in which the offending file was accessible.
Trying to access this file in a snapshot in which it did not exist (“backup@20151121_1”) results in a “No such file or directory” error on the target pool (“backup_raidz” or “backup_raidz_test”).
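The per-snapshot check can be reproduced with a loop of roughly this form (target mountpoint assumed):
# for s in /backup_raidz/nas1FS_bak/backup/.zfs/snapshot/*/; do echo "$s"; md5sum "${s}samba_share/a/home/bak/aa/wx/wxWidgets-2.8.12/additions/lib/vc_lib/wxmsw28ud_propgrid.pdb"; done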
When I tried to access the offending file on “nas1FS” with the command:
# md5sum /nas1FS/backup/.zfs/snapshot/20151121_1/samba_share/a/home/bak/aa/wx/wxWidgets-2.8.12/additions/lib/vc_lib/wxmsw28ud_propgrid.pdb
it resulted in a very hard system lockup: I could not get any reaction to “Ctrl-Alt-SysRq-h” and similar key combinations, all disk I/O stopped completely and immediately, the system stopped responding to ping, and only a hard reset got any reaction out of it. After the hard reset everything was working, and the file checksum results mentioned above were unchanged. I have also tried a send/receive to a different target pool (a single 1TB HGST disk):
# zfs send -R -vvvv "nas1FS/backup@20160618" |zfs receive -vvvv -F "backup_raidz_test/nas1FS_bak"
This resulted in the same md5sum mismatches. When sending only the latest snapshot with:
# zfs send -vvvv "nas1FS/backup@20160618" |zfs receive -vvvv -F "backup_raidz_test/backup"
I get a correct md5sum on the target filesystem. When trying to do an incremental send/receive starting from the first available snapshot on the source pool:
# zfs send -vvvv "nas1FS/backup@20151121_1" |zfs receive -vvvv -F "backup_raidz_test/backup"
the offending file is not present on either the target or the source pool, and trying to access it on the target pool does not cause any issues. Following up with an incremental send/receive:
# zfs send -vvvv -I "nas1FS/backup@20151121_1" "nas1FS/backup@20151124" |zfs receive -vvvv -F "backup_raidz_test/backup"
I again get a checksum mismatch. When doing an incremental send/receive starting from the second available snapshot on the source pool, I get correct checksums in both snapshots on the target pool ...
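That last transfer was of roughly this form (assuming “20151124” is the second available snapshot, as the incremental command above suggests, and “20160618” is the latest):
# zfs send -vvvv "nas1FS/backup@20151124" |zfs receive -vvvv -F "backup_raidz_test/backup"
# zfs send -vvvv -I "nas1FS/backup@20151124" "nas1FS/backup@20160618" |zfs receive -vvvv -F "backup_raidz_test/backup"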
It is interesting to note that only a single 4096-byte block of data is corrupted, at the end of the file (which has a size of 321 x 4096 bytes), and only when the transfer includes the first source snapshot (“nas1FS/backup@20151121_1”).
Binary comparison of the offending file:
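One way to locate the differing block is shown below (paths assumed; cmp -l prints the 1-based offset of every differing byte, so dividing by 4096 gives the affected block number and uniq -c counts the differing bytes per block):
# cmp -l /nas1FS/backup/samba_share/a/home/bak/aa/wx/wxWidgets-2.8.12/additions/lib/vc_lib/wxmsw28ud_propgrid.pdb /backup_raidz_test/backup/samba_share/a/home/bak/aa/wx/wxWidgets-2.8.12/additions/lib/vc_lib/wxmsw28ud_propgrid.pdb | awk '{print int(($1-1)/4096)}' | uniq -c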
I have also run zdb on the source pool; the checks I ran did not find any errors:
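The checks were of roughly this kind (zdb -b traverses the pool looking for leaked or doubly allocated blocks, and -cc additionally verifies the checksums of all blocks, which is very slow on a full pool; the exact options may have differed):
# zdb -b nas1FS
# zdb -cc nas1FS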
To summarize the system configurations used:
System used for the “nas1FS” data fill-up (old computer): Motherboard MSI X58 Pro (MSI MS-7522/MSI X58 Gold) with Intel Quad Core i7 965 3.2GHz and 14GB non-ECC RAM (MB, CPU, RAM and PSU are about 6 years old).
“/boot”: INTEL SSDMCEAW080A4 80GB SSD
“nas1FS” pool: 6x1TB HGST Travelstar 7K1000 in a RAIDZ2 array (HDDs are ~6 months old).
New system to which “nas1FS” was moved (all disks): Motherboard Supermicro A1SA7-2750F (8-core Intel Atom) with 32GB ECC RAM (MB, RAM and PSU are new).
“/boot”: INTEL SSDMCEAW080A4 80GB SSD
“nas1FS” pool: 6x1TB HGST Travelstar 7K1000 in a RAIDZ2 array (moved from the old computer).
“backup_raidz” pool: 2x1TB Samsung HD103UJ + 1x1TB HGST Travelstar 7K1000 (pool used for backup).
“backup_raidz_test” pool: 1TB Samsung HD103UJ (pool with no parity, for additional tests).
Both systems were tested with memtest, cpuburn, etc. without errors. I am using Debian Jessie booted from a ZFS pool (with a separate boot partition); the same operating system was used on both machines that hosted the “nas1FS” pool.
Kernel Command line:
BOOT_IMAGE=/vmlinuz-3.16.0-4-amd64 root=ZFS=/rootFS/1_jessie ro rpool=nas1FS bootfs=nas1FS/rootFS/1_jessie root=ZFS=nas1FS/rootFS/1_jessie rootfstype=zfs boot=zfs quiet
SPL and ZFS were built from source.
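The loaded module versions can be confirmed with, for example:
# cat /sys/module/zfs/version /sys/module/spl/version
# dmesg | grep "Loaded module"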
Some excerpts from dmesg (the “blocked” task messages are not connected to the hard lockup of the system):
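The relevant lines can be pulled from the kernel log with something like:
# dmesg | grep -i -E "blocked for more than|zfs|spl"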
I hope this information will be helpful; please let me know what other tests I can perform to diagnose this issue. I will be happy to provide any other info.