Could this be the same bug as https://github.com/zfsonlinux/zfs/issues/4050? IIRC 6.5.3 was affected and you seem to have the hole_birth feature enabled, plus the problem seems to manifest in the same way: chunks of data at the end of the file that should be zeros are instead filled with other data.
Anyway, all these "ZFS send/recv corrupts data" bug reports are really scary, considering that this is a ZFS feature primarily used for replication and backups.
Thanks for the suggestion @loli10K, it would not have occurred to me to check that issue. I have checked the output of the commands suggested by @bprotopopov on Dec 3, 2015 and it seems that the system is affected by bug #4050.
# zfs create -o recordsize=4k nas1FS/test_fs
# zfs set compression=on nas1FS/test_fs
# truncate -s 1G /nas1FS/test_fs/large_file
# ls /nas1FS/test_fs/large_file -als
1 -rw-r--r-- 1 root root 1073741824 Jun 28 16:25 /nas1FS/test_fs/large_file
# dd if=/dev/urandom of=/nas1FS/test_fs/large_file bs=4k count=$((3*128)) seek=$((1*128))
384+0 records in
384+0 records out
1572864 bytes (1.6 MB) copied, 0.588248 s, 2.7 MB/s
# zfs snapshot nas1FS/test_fs@20160628_1
# truncate -s $((2*128*4*1024)) /nas1FS/test_fs/large_file
# dd if=/dev/urandom of=/nas1FS/test_fs/large_file bs=4k count=128 seek=$((3*128)) conv=notrunc
128+0 records in
128+0 records out
524288 bytes (524 kB) copied, 0.224695 s, 2.3 MB/s
# dd if=/dev/urandom of=/nas1FS/test_fs/large_file bs=4k count=10 seek=$((2*128)) conv=notrunc
10+0 records in
10+0 records out
40960 bytes (41 kB) copied, 0.0144285 s, 2.8 MB/s
# zfs snapshot nas1FS/test_fs@20160628_2
# zfs send nas1FS/test_fs@20160628_1 |zfs recv backup_raidz_test/test_fs_copy
# zfs send -i nas1FS/test_fs@20160628_1 nas1FS/test_fs@20160628_2 |zfs recv backup_raidz_test/test_fs_copy
# ls -als /nas1FS/test_fs/large_file /backup_raidz_test/test_fs_copy/large_file
1593 -rw-r--r-- 1 root root 2097152 Jun 28 16:33 /backup_raidz_test/test_fs_copy/large_file
2223 -rw-r--r-- 1 root root 2097152 Jun 28 16:33 /nas1FS/test_fs/large_file
# md5sum /nas1FS/test_fs/large_file /backup_raidz_test/test_fs_copy/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs_copy/large_file
# md5sum /nas1FS/test_fs/.zfs/snapshot/20160628_1/large_file /nas1FS/test_fs/.zfs/snapshot/20160628_2/large_file /nas1FS/test_fs/large_file /backup_raidz_test/test_fs_copy/.zfs/snapshot/20160628_1/large_file /backup_raidz_test/test_fs_copy/.zfs/snapshot/20160628_2/large_file /backup_raidz_test/test_fs_copy/large_file
0f41e9d5956532d434572b6f06d7b082 /nas1FS/test_fs/.zfs/snapshot/20160628_1/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160628_2/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/large_file
0f41e9d5956532d434572b6f06d7b082 /backup_raidz_test/test_fs_copy/.zfs/snapshot/20160628_1/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs_copy/.zfs/snapshot/20160628_2/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs_copy/large_file
I am not sure, but to trigger this same issue on "nas1FS", don't I need to modify the file? If I remember correctly the file was never modified; it did not exist in the "nas1FS/backup@20151121_1" snapshot and was added to the filesystem just before the "nas1FS/backup@20151124" snapshot. Do you think that upgrading to v0.6.5.7 of spl/zfs would resolve this and have no negative impact on the source pool?
I have no objections to a complete upgrade to 0.6.5.7.
Unfortunately I just do not have enough knowledge about ZFS internals.
I was just curious whether upgrading could make diagnosis harder or harm the source pool in some way.
I was able to crash the system when running md5sum on the problematic file in a snapshot in which it did not exist on the source pool ("nas1FS/backup@20151121_1"), so I thought there could also be some kind of problem with the source pool. Would any more testing be useful before the upgrade? (I will not be able to make a backup copy of the affected disks before upgrading.)
You should probably try to replicate the lockup/crash on 0.6.5.7 and/or master and then report that as a distinct problem if it persists.
Presuming this is bug #4050, it only manifests in the stream when you transmit via incremental send, so the source pool's data is not affected, and upgrading your ZoL version should be safe.
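For a quick post-upgrade sanity check, something along these lines should confirm that a received copy matches its source (the dataset, snapshot and file names below are placeholders, not the reporter's actual ones):
# zfs send nas1FS/somefs@snap1 | zfs recv backup_raidz_test/verify
# zfs send -i nas1FS/somefs@snap1 nas1FS/somefs@snap2 | zfs recv backup_raidz_test/verify
# md5sum /nas1FS/somefs/.zfs/snapshot/snap2/somefile /backup_raidz_test/verify/.zfs/snapshot/snap2/somefile
Matching checksums on both sides, for every snapshot of interest, would indicate that the incremental stream carried the holes correctly.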
@loli10K thanks for pointing out #4050. And I completely agree, these kinds of issues can't be allowed to happen and we should work on increasing our test coverage in order to prevent them.
@pwolny I'd also recommend upgrading to 0.6.5.7 and verifying that you can no longer reproduce the issue.
Thanks, I will upgrade to 0.6.5.7 and test it tonight or tomorrow.
I have upgraded to v0.6.5.7 and the problem is getting more interesting. The crash when accessing the nonexistent file in a snapshot no longer happens. A send/receive of nas1FS/backup (containing the problematic file) now gives the same md5sums on the receiving and sending sides (I compared all files on "nas1FS/backup" and "backup_raidz_test/backup"). When I repeated the commands suggested by bprotopopov on Dec 3, 2015 (with the upgraded spl/zfs) I got the same checksums on the source ("nas1FS/test_fs_A") and target filesystems ("backup_raidz_test/test_fs_copy_A"):
# cat zfs_hole_transmit_test.sh
#!/bin/bash
date_today="20160701"
echo $date_today
zfs create -o recordsize=4k nas1FS/test_fs_A
zfs set compression=on nas1FS/test_fs_A
truncate -s 1G /nas1FS/test_fs_A/large_file
ls /nas1FS/test_fs_A/large_file -als
dd if=/dev/urandom of=/nas1FS/test_fs_A/large_file bs=4k count=$((3*128)) seek=$((1*128))
zfs snapshot nas1FS/test_fs_A@$date_today"_1"
truncate -s $((2*128*4*1024)) /nas1FS/test_fs_A/large_file
dd if=/dev/urandom of=/nas1FS/test_fs_A/large_file bs=4k count=128 seek=$((3*128)) conv=notrunc
dd if=/dev/urandom of=/nas1FS/test_fs_A/large_file bs=4k count=10 seek=$((2*128)) conv=notrunc
zfs snapshot nas1FS/test_fs_A@$date_today"_2"
zfs send nas1FS/test_fs_A@$date_today"_1" |zfs recv backup_raidz_test/test_fs_copy_A
zfs send -i nas1FS/test_fs_A@$date_today"_1" nas1FS/test_fs_A@$date_today"_2" |zfs recv backup_raidz_test/test_fs_copy_A
ls -als /nas1FS/test_fs_A/large_file /backup_raidz_test/test_fs_copy_A/large_file
md5sum /nas1FS/test_fs_A/large_file /backup_raidz_test/test_fs_copy_A/large_file
md5sum /nas1FS/test_fs_A/large_file /nas1FS/test_fs_A/.zfs/snapshot/$date_today"_1"/large_file /nas1FS/test_fs_A/.zfs/snapshot/$date_today"_2"/large_file /backup_raidz_test/test_fs_copy_A/large_file /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/$date_today"_1"/large_file /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/$date_today"_2"/large_file
# bash zfs_hole_transmit_test.sh
20160701
1 -rw-r--r-- 1 root root 1073741824 Jul 1 16:34 /nas1FS/test_fs_A/large_file
384+0 records in
384+0 records out
1572864 bytes (1.6 MB) copied, 0.589256 s, 2.7 MB/s
128+0 records in
128+0 records out
524288 bytes (524 kB) copied, 0.206688 s, 2.5 MB/s
10+0 records in
10+0 records out
40960 bytes (41 kB) copied, 0.0151903 s, 2.7 MB/s
1113 -rw-r--r-- 1 root root 2097152 Jul 1 16:34 /backup_raidz_test/test_fs_copy_A/large_file
2223 -rw-r--r-- 1 root root 2097152 Jul 1 16:34 /nas1FS/test_fs_A/large_file
046b704054d6e84353ec61b8665dfb9a /nas1FS/test_fs_A/large_file
046b704054d6e84353ec61b8665dfb9a /backup_raidz_test/test_fs_copy_A/large_file
046b704054d6e84353ec61b8665dfb9a /nas1FS/test_fs_A/large_file
c45ee26328f1b78079a155d4b04d76d3 /nas1FS/test_fs_A/.zfs/snapshot/20160701_1/large_file
046b704054d6e84353ec61b8665dfb9a /nas1FS/test_fs_A/.zfs/snapshot/20160701_2/large_file
046b704054d6e84353ec61b8665dfb9a /backup_raidz_test/test_fs_copy_A/large_file
c45ee26328f1b78079a155d4b04d76d3 /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/20160701_1/large_file
046b704054d6e84353ec61b8665dfb9a /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/20160701_2/large_file
# for i in $(zfs list -o mountpoint |grep test_fs); do echo $i; for j in $(ls $i/.zfs/snapshot/ -1 ); do md5sum "$i/.zfs/snapshot/$j/large_file" ; done; done;
/backup_raidz_test/test_fs_copy_A
c45ee26328f1b78079a155d4b04d76d3 /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/20160701_1/large_file
046b704054d6e84353ec61b8665dfb9a /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/20160701_2/large_file
/nas1FS/test_fs
0f41e9d5956532d434572b6f06d7b082 /nas1FS/test_fs/.zfs/snapshot/20160628_1/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160628_2/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629_1/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629_2/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629_3/large_file
/nas1FS/test_fs_A
c45ee26328f1b78079a155d4b04d76d3 /nas1FS/test_fs_A/.zfs/snapshot/20160701_1/large_file
046b704054d6e84353ec61b8665dfb9a /nas1FS/test_fs_A/.zfs/snapshot/20160701_2/large_file
When I tried to transfer (on v0.6.5.7) a test filesystem ("nas1FS/test_fs") created with the same commands on v0.6.5.3, I got a checksum mismatch:
# zfs send -R "nas1FS/test_fs@20160629_3" |zfs receive -F "backup_raidz_test/test_fs"
# for i in $(zfs list -o mountpoint |grep test_fs); do echo $i; for j in $(ls $i/.zfs/snapshot/ -1 ); do md5sum "$i/.zfs/snapshot/$j/large_file" ; done; done;
/backup_raidz_test/test_fs
0f41e9d5956532d434572b6f06d7b082 /backup_raidz_test/test_fs/.zfs/snapshot/20160628_1/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs/.zfs/snapshot/20160628_2/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs/.zfs/snapshot/20160629/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs/.zfs/snapshot/20160629_1/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs/.zfs/snapshot/20160629_2/large_file
7a8008b4cc398a0a4ad8a6c686aa0f6f /backup_raidz_test/test_fs/.zfs/snapshot/20160629_3/large_file
/backup_raidz_test/test_fs_copy_A
c45ee26328f1b78079a155d4b04d76d3 /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/20160701_1/large_file
046b704054d6e84353ec61b8665dfb9a /backup_raidz_test/test_fs_copy_A/.zfs/snapshot/20160701_2/large_file
/nas1FS/test_fs
0f41e9d5956532d434572b6f06d7b082 /nas1FS/test_fs/.zfs/snapshot/20160628_1/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160628_2/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629_1/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629_2/large_file
2b1b5792f6a45d45a5399ff977ca6fda /nas1FS/test_fs/.zfs/snapshot/20160629_3/large_file
/nas1FS/test_fs_A
c45ee26328f1b78079a155d4b04d76d3 /nas1FS/test_fs_A/.zfs/snapshot/20160701_1/large_file
046b704054d6e84353ec61b8665dfb9a /nas1FS/test_fs_A/.zfs/snapshot/20160701_2/large_file
How should I proceed? Would providing images of a pool created in v0.6.5.3 exhibiting this behavior, or any other tests, be helpful?
It seems that I had not tested enough. The issue is still present after the upgrade to v0.6.5.7. The crash when accessing the nonexistent file in a snapshot no longer happens, but I can still trigger the send/receive snapshot corruption. Previously I had entered commands by hand or repeated them from console history, which introduced significant delays between commands. Later I automated the testing too aggressively, and because of that the bug was not triggered (even when I tried to trigger it on v0.6.5.3).
I have tested again with the following script:
#!/bin/bash
for i in $(seq 0 1 10); # sleep time loop
do
for j in $(seq 1 1 10); #pool size loop
do
pool_name="zfs_pool_v0.6.5.7_A_"$i"s_$j"
src_fs="$pool_name/test_fs_A"
target_fs="$pool_name/test_fs_B"
date_today="20160702"
echo $pool_name ; echo $src_fs ; echo $target_fs; echo $date_today
dd if=/dev/zero of=/nas1FS/tmp/$pool_name bs=4k count=$(($j*128*128)) # pool size has an impact on bug triggering
zpool create -o ashift=12 $pool_name /nas1FS/tmp/$pool_name
zfs create -o recordsize=4k $src_fs
zfs set compression=on $src_fs
truncate -s 1G /$src_fs/large_file
ls -als /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=$((3*128)) seek=$((1*128))
zfs snapshot $src_fs@$date_today"_1"
truncate -s $((2*128*4*1024)) /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=128 seek=$((3*128)) conv=notrunc
sleep $i; #10; # <=this sleep is critical for triggering the bug when pool is large enough
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=10 seek=$((2*128)) conv=notrunc
zfs snapshot $src_fs@$date_today"_2"
zfs send -R $src_fs@$date_today"_2" |zfs recv $target_fs
ls -als /$src_fs/large_file /$target_fs/large_file
md5sum /$src_fs/.zfs/snapshot/$date_today"_1"/large_file /$target_fs/.zfs/snapshot/$date_today"_1"/large_file
md5sum /$src_fs/.zfs/snapshot/$date_today"_2"/large_file /$target_fs/.zfs/snapshot/$date_today"_2"/large_file
md5sum /$src_fs/large_file /$target_fs/large_file
md5sum /$src_fs/.zfs/snapshot/$date_today"_1"/large_file /$target_fs/.zfs/snapshot/$date_today"_1"/large_file >> /nas1FS/tmp/final_test_md5sums.txt
md5sum /$src_fs/.zfs/snapshot/$date_today"_2"/large_file /$target_fs/.zfs/snapshot/$date_today"_2"/large_file >> /nas1FS/tmp/final_test_md5sums.txt
md5sum /$src_fs/large_file /$target_fs/large_file >> /nas1FS/tmp/final_test_md5sums.txt
zpool export $pool_name
done;
done;
This script repeats the same pool create / file create / snapshot / transfer cycle for different pool sizes and with different delays between file modifications. It seems that the send/receive issue was only triggered when the delay was longer than 4 seconds or when the image file holding the pool was small (2^26 bytes).
Matrix of tested delay values / pool image sizes (the corrupted snapshot transfer bug was triggered at the intersecting values marked in red):
I am only speculating, but is it possible that the issue is somehow connected to syncing data to the disk (i.e. the syncing of data from memory to disk is not consistent)?
@pwolny
what are the S.M.A.R.T. stats for those drives ? anything unusual ?
can you post those somewhere ?
smartctl -x /dev/foo
Is there a possibility you can exchange the drive of the "backup_raidz_test" pool (1TB Samsung HD103UJ, pool with no parity, for additional tests) and/or its cables for different ones?
Are all drives connected to the same harddrive controller?
http://www.supermicro.com/products/motherboard/Atom/X10/A1SA7-2750F.cfm
LSI 2116 controller for 16x SATA3 / SAS2 ports plus 1x SATA3 SMCI SATA DOM
are any issues known with those and ZFS ?
firmware or driver updates ?
that's at least what comes to mind right now ...
@kernelOfTruth
All disks seem to be OK; nothing unusual about them from smartctl.
The nas1FS drives all look very similar in smartctl:
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: HGST Travelstar 7K1000
Device Model: HGST HTS721010A9E630
Firmware Version: JB0OA3J0
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jul 3 08:41:18 2016 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM level is: 254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 45) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 171) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate PO-R-- 100 100 062 - 0
2 Throughput_Performance P-S--- 100 100 040 - 0
3 Spin_Up_Time POS--- 114 114 033 - 2
4 Start_Stop_Count -O--C- 100 100 000 - 16
5 Reallocated_Sector_Ct PO--CK 100 100 005 - 0
7 Seek_Error_Rate PO-R-- 100 100 067 - 0
8 Seek_Time_Performance P-S--- 100 100 040 - 0
9 Power_On_Hours -O--C- 100 100 000 - 208
10 Spin_Retry_Count PO--C- 100 100 060 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 16
191 G-Sense_Error_Rate -O-R-- 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 13
193 Load_Cycle_Count -O--C- 100 100 000 - 57
194 Temperature_Celsius -O---- 222 222 000 - 27 (Min/Max 20/35)
196 Reallocated_Event_Count -O--CK 100 100 000 - 0
197 Current_Pending_Sector -O---K 100 100 000 - 0
198 Offline_Uncorrectable ---R-- 100 100 000 - 0
199 UDMA_CRC_Error_Count -O-R-- 200 200 000 - 0
223 Load_Retry_Count -O-R-- 100 100 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 1 Comprehensive SMART error log
0x03 GPL R/O 1 Ext. Comprehensive SMART error log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x10 GPL R/O 1 SATA NCQ Queued Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 27 Celsius
Power Cycle Min/Max Temperature: 26/27 Celsius
Lifetime Min/Max Temperature: 20/35 Celsius
Lifetime Average Temperature: 28 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/60 Celsius
Min/Max Temperature Limit: -40/65 Celsius
Temperature History Size (Index): 128 (106)
Index Estimated Time Temperature Celsius
107 2016-07-03 06:34 28 *********
... ..(114 skipped). .. *********
94 2016-07-03 08:29 28 *********
95 2016-07-03 08:30 ? -
96 2016-07-03 08:31 26 *******
97 2016-07-03 08:32 26 *******
98 2016-07-03 08:33 27 ********
... ..( 7 skipped). .. ********
106 2016-07-03 08:41 27 ********
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
Device Statistics (GP/SMART Log 0x04) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0009 2 3 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 3 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
All disks are connected to the same controller (LSI) on the A1SA7-2750F; the motherboard is as received (no firmware updates).
I have repeated the tests with the pool backed by a file on a ramdisk, so no bad cables or controllers are involved, and I can still trigger the bug with the script from https://github.com/zfsonlinux/zfs/issues/4809#issuecomment-230123528
I have also retested on a laptop:
Linux 4.5.0-1-amd64 #1 SMP Debian 4.5.1-1 (2016-04-14) x86_64 GNU/Linux
with spl/zfs v0.6.5.7, with the pool backed by a file on a ramdisk, using the above-mentioned script, and I can also trigger it.
I have retested on another (much older) computer:
Linux 2.6.38-2-amd64 #1 SMP Sun May 8 13:51:57 UTC 2011 x86_64 GNU/Linux
with spl/zfs v0.6.0, with the pool backed by a file on a ramdisk, using the above-mentioned script, and I cannot trigger it.
@pwolny I'm half-asleep so I might not be reading this right, but this looks like the same bug, with a somewhat different behavior wrt. size and delay: https://gist.github.com/lnicola/1e069f2abaee1dbaaeafc05437b0777a
@lnicola It seems that your script output exhibits the same checksum mismatch footprint as mine. What kernel / zfs version were you using?
Sorry for not mentioning those. I'm on 4.6.3-1-ARCH with 0.6.5.7_4.6.3_1-1, i.e. the latest versions.
That's pretty concerning; I'm mostly backing up my files only with zfs send nowadays, too, and also running 4.6.3 - ZFS is on top of cryptsetup for me.
@pwolny are you using any encryption or partition "alteration" mechanism (cryptsetup, lvm, etc.) ?
@pwolny @lnicola are your pools running with the latest ZFS feature set ?
what does the zpool upgrade command say ?
(I can't currently provide sample output of my pool since I'm in Windows; I haven't upgraded it in a while, so as far as I know 2 features are still missing on my /home pool)
edit:
I'm currently occupied otherwise but will see if I can reproduce it, too, here later in the day
referencing https://github.com/zfsonlinux/zfs/pull/4754 (6513 partially filled holes lose birth time) which is in master: https://github.com/zfsonlinux/zfs/commit/bc77ba73fec82d37c0b57949ec29edd9aa207677
@kernelOfTruth yes, I do have hole_birth active.
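For reference, the feature state can also be checked per pool directly (shown here for the nas1FS pool; substitute your own pool name); the value reported should be one of disabled, enabled, or active:
# zpool get feature@hole_birth nas1FS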
@pwolny you were showing that issue with the Samsung drives and gzip-9 compression:
I have created a “backup_raidz” pool for backup (with compression turned on):
# zpool create -o ashift=12 -O compression=gzip-9 backup_raidz raidz1 /dev/disk/by-id/ata-SAMSUNG_HD103UJ_SerialNo1 /dev/disk/by-id/ata-SAMSUNG_HD103UJ_SerialNo2 /dev/disk/by-id/ata-HGST_HTS721010A9E630_SerialNo3 -f
does this also happen when you explicitly set lz4 compression ?
or one of the other types?
referencing: https://github.com/zfsonlinux/zfs/issues/4530 Data Corruption During ZFS send/receieve
@pwolny please take a look at comment: https://github.com/zfsonlinux/zfs/issues/4530#issuecomment-211996617 especially the part related to destroying the snapshot
@kernelOfTruth
About the #4530 (comment):
@pwolny What if you replace the use of truncate with dd?
I have retested multiple versions of spl/zfs (built from source) on a virtual machine running Knoppix 7.4.2 with kernel
Linux Microknoppix 3.16.3-64 #10 SMP PREEMPT Fri Sep 26 02:00:22 CEST 2014 x86_64 GNU/Linux
and I can very reliably reproduce this issue (with the script mentioned in https://github.com/zfsonlinux/zfs/issues/4809#issuecomment-230123528 ).
During testing I have seen three possible footprints:
To summarize the tested versions (different software and hardware):
v0.6.0-rc8 - ok
v0.6.3.1-1.3 - ok
v0.6.4 - exhibits the bug partially
v0.6.4.1 - exhibits the bug partially
v0.6.4.2 - exhibits the bug partially
v0.6.5 - exhibits the bug fully
v0.6.5.2 - exhibits the bug fully
v0.6.5.3 - exhibits the bug fully
v0.6.5.7 - exhibits the bug fully
So it seems that the issue (or possibly two issues) was introduced just before v0.6.4 and was possibly aggravated in v0.6.5.
@lnicola If I replace:
truncate -s 1G /$src_fs/large_file
with
dd if=/dev/zero of=/$src_fs/large_file bs=1 count=1 seek=1G
I can still trigger the bug fully.
What about the other truncate usage? My thinking is that it could prove this is not related to the hole_birth feature.
I have tested a few modifications of the testing script. This is the relevant part of this testing:
dd if=/dev/zero of="$pool_file_path/$pool_name" bs=4k count=$(($j*128*128)) # pool size has an impact on bug triggeringl
zpool create -o ashift=12 $pool_name "$pool_file_path/$pool_name"
zfs create -o recordsize=4k $src_fs
zfs set compression=on $src_fs
#========================1st truncate/dd=====================================#
truncate -s 1G /$src_fs/large_file #possible to trigger the bug
#dd if=/dev/zero of=/$src_fs/large_file bs=1G count=1 #possible to trigger the bug
#dd if=/dev/zero of=/$src_fs/large_file bs=1G count=1 conv=notrunc #possible to trigger the bug
#========================1st truncate/dd=====================================#
ls -als /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=$((3*128)) seek=$((1*128))
zfs snapshot $src_fs@$date_today"_1"
#========================2nd truncate/dd=====================================#
truncate -s $((2*128*4*1024)) /$src_fs/large_file #possible to trigger bug
#truncate -s $((2*128*4*1024+1)) /$src_fs/large_file #not triggering the bug
#dd if=/dev/zero of=/$src_fs/large_file bs=1 count=1 seek=$((2*128*4*1024)) #not triggering the bug
#dd if=/dev/zero of=/$src_fs/large_file bs=1 count=1 seek=$((2*128*4*1024-1)) #possible to trigger bug
#dd if=/dev/zero of=/$src_fs/large_file bs=1 count=1 seek=$((2*128*4*1024-1)) conv=notrunc #with this the bug never triggers
#========================2nd truncate/dd=====================================#
ls -als /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=128 seek=$((3*128)) conv=notrunc
sleep $i; #10; # <=this sleep is critical for triggering the bug when pool is large enough
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=10 seek=$((2*128)) conv=notrunc
zfs snapshot $src_fs@$date_today"_2"
It seems that only the second truncate/dd command has any impact on triggering the issue.
Maybe someone will find this version of the script useful: zfs_test3.sh.txt
It will try to create a "test_matrix_output.txt" file in the "/tmp/zfs" directory (the script does not create this directory). If this file contains any "E", it means there was a checksum mismatch between the source and the transferred target snapshot. If the matrix is filled only with "o", everything went OK.
An example "test_matrix_output.txt" with errors looks like this (output with spl/zfs v0.6.5.7):
EEEEEEEEEEE
oooooEEEEEE
oooooEEEEEE
oooooEEEEEE
oooooEEEEEE
Hi, pwolny,
in the interest of clarity, can the bug be reproduced by running one version of one script with reasonable frequency, and if so, could you please publish the shortest version of such a script?
Ideally, the script would include creation and cleanup of a pool based on file vdevs. I am trying to abstract away the site specifics. To my knowledge, the original issues with partly filled holes and reused dnodes (Illumos 6370, Illumos 6513) can be reproduced on such pools.
Best regards, Boris.
Hello @bprotopopov, basically this script contains the commands suggested by you on Dec 3, 2015 in bug report #4050. The only critical additions needed for triggering are a variable time delay and the pool backing file size; the rest was only needed for test automation.
Does it not trigger on your system? Are there any "E" letters in the "test_matrix_output.txt" file in the "/tmp/zfs" directory? For me the script in "zfs_test3.sh.txt" always triggers on affected systems. It can also help differentiate between the two types of triggered behavior (please take a look at the previous posts with images depicting the different footprints generated by this script for different zfs/spl versions).
If you need to minimize test time, just set the sleep time loop to a 6 second delay ("$(seq 6 1 6)") and the pool size loop to the 1st size multiplier ("$(seq 1 1 1)"); it should always trigger on an affected system, but that way you will lose the possibility of differentiating between the triggered effects. A single-run sketch of that minimal configuration is shown below.
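For reference, a minimal single-iteration sketch of what that configuration boils down to (the pool name, backing-file path and snapshot names are placeholders; the individual steps follow the matrix script above):
#!/bin/bash
pool_name="zfs_hole_test"
src_fs="$pool_name/test_fs_A"
target_fs="$pool_name/test_fs_B"
dd if=/dev/zero of=/tmp/$pool_name bs=4k count=$((128*128))      # 64 MiB backing file (smallest size from the matrix)
zpool create -o ashift=12 $pool_name /tmp/$pool_name
zfs create -o recordsize=4k -o compression=on $src_fs
truncate -s 1G /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=$((3*128)) seek=$((1*128))
zfs snapshot $src_fs@snap1
truncate -s $((2*128*4*1024)) /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=128 seek=$((3*128)) conv=notrunc
sleep 6                                                          # the critical delay
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=10 seek=$((2*128)) conv=notrunc
zfs snapshot $src_fs@snap2
zfs send -R $src_fs@snap2 | zfs recv $target_fs
md5sum /$src_fs/large_file /$target_fs/large_file                # a mismatch means the bug was reproduced
zpool destroy $pool_name && rm -f /tmp/$pool_name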
Anyway, the latest script version sets up a pool backed by a file and removes it afterwards; in fact it does that about 55 times in total in the testing loop.
It even returns a non-zero exit status on error.
If you want to use it in an automatic, headless testing environment you will probably need to comment out the 3 lines with "less" at the end.
It may leave some text files in the "/tmp/zfs/" folder, but I wanted them for verification.
Best regards
I have done some regression testing and it seems that the error was introduced in this commit (zfs-0.6.3-27):
b0bc7a84d90dcbf5321d48c5b24ed771c5a128b0 Illumos 4370, 4371 (4370 avoid transmitting holes during zfs send; 4371 DMU code clean up)
It generates the following footprint:
oooooEEEEEE
oooooEEEEEE
oooooEEEEEE
oooooEEEEEE
oooooEEEEEE
While the previous commit (zfs-0.6.3-26)
fa86b5dbb6d33371df344efb2adb0aba026d097c Illumos 4171, 4172
does not generate any errors:
ooooooooooo
ooooooooooo
ooooooooooo
ooooooooooo
ooooooooooo
Both were tested with spl-0.6.3-10, commit:
e3020723dc43af2bc22af0d68571a61daf9b44d0 Linux 3.16 compat: smp_mb__after_clear_bit()
Could someone take a look at those code changes? Unfortunately that is as far as I can go with debugging this.
OK, great,
let me take a quick look.
Boris.
Well that seems appropriate - the commit in question seems to introduce support for the hole_birth feature.
For anyone who's worried, this script might help finding sparse files: http://unix.stackexchange.com/a/86446.
EDIT: While the script works, it will only find sparse files, not files affected by this issue. Depending on your workload, data and compression setting, you might have very few sparse files on a drive. In such cases, it's better to only worry about those specific files.
Also, if compression is off, making a copy of a sparse file will yield a non-sparse file. This can be used to fix the underlying corruption on the source pool. Once the bug/bugs are fixed, making a copy will also work when compression is enabled.
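As a rough illustration of both points, a sketch only (the /tank mount point, file name and the 0.9 threshold are placeholders; it assumes GNU find and coreutils, and, per the note above, that compression is off on the dataset):
# list files whose allocated size is well below their apparent size, i.e. likely sparse
find /tank -type f -printf '%S\t%p\n' | awk '$1 < 0.9'
# rewrite one such file without holes and put it back in place (--sparse=never forces the zero ranges to be written out)
cp --sparse=never /tank/somefile /tank/somefile.tmp && mv /tank/somefile.tmp /tank/somefile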
Hi, @pwolny,
I ran your script (for a while) with a 0.6.3-based ZFS plus the above-mentioned fixes cherry-picked:
commit c183d75b2bf163fa5feef301dfc9c2db885cc652 Author: Paul Dagnelie pcd@delphix.com Date: Sun May 15 08:02:28 2016 -0700
OpenZFS 6513 - partially filled holes lose birth time
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Boris Protopopov <bprotopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6513
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0
If a ZFS object contains a hole at level one, and then a data block is
created at level 0 underneath that l1 block, l0 holes will be created.
However, these l0 holes do not have the birth time property set; as a
result, incremental sends will not send those holes.
Fix is to modify the dbuf_read code to fill in birth time data.
commit a29427227090abdaa2d63bfdb95746362a0504c1 Author: Alex Reece alex@delphix.com Date: Thu Apr 21 11:23:37 2016 -0700
Illumos 6844 - dnode_next_offset can detect fictional holes
6844 dnode_next_offset can detect fictional holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
dnode_next_offset is used in a variety of places to iterate over the
holes or allocated blocks in a dnode. It operates under the premise that
it can iterate over the blockpointers of a dnode in open context while
holding only the dn_struct_rwlock as reader. Unfortunately, this premise
does not hold.
When we create the zio for a dbuf, we pass in the actual block pointer
in the indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the
bp is now inconsistent from the perspective of dnode_next_offset: the bp
will appear to be a hole until zio_dva_allocate finally finishes filling
it in. In the meantime, dnode_next_offset can detect a hole in the dnode
when none exists.
I was able to experimentally demonstrate this behavior with the
following setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the
first L0 block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the
middle of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent
hole.
The fix is to ensure that it is valid to iterate over the indirect
blocks in a dnode while holding the dn_struct_rwlock by passing the zio
a copy of the BP and updating the actual BP in dbuf_write_ready while
holding the lock.
References:
https://www.illumos.org/issues/6844
https://github.com/openzfs/openzfs/pull/82
DLPX-35372
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4548
commit f506a3da14d7ef75752df9311354e4ea4dc3354d Author: Paul Dagnelie pcd@delphix.com Date: Thu Feb 25 20:45:19 2016 -0500
Illumos 6370 - ZFS send fails to transmit some holes
6370 ZFS send fails to transmit some holes
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Stefan Ring <stefanrin@gmail.com>
Reviewed by: Steven Burgess <sburgess@datto.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6370
https://github.com/illumos/illumos-gate/commit/286ef71
In certain circumstances, "zfs send -i" (incremental send) can produce
a stream which will result in incorrect sparse file contents on the
target.
The problem manifests as regions of the received file that should be
sparse (and read as zero-filled) actually contain data from a file that
was deleted (and which happened to share this file's object ID).
Note: this can happen only with filesystems (not zvols, because they do
not free (and thus can not reuse) object IDs).
Note: This can happen only if, since the incremental source (FromSnap),
a file was deleted and then another file was created, and the new file
is sparse (i.e. has areas that were never written to and should be
implicitly zero-filled).
We suspect that this was introduced by 4370 (applies only if hole_birth
feature is enabled), and made worse by 5243 (applies if hole_birth
feature is disabled, and we never send any holes).
The bug is caused by the hole birth feature. When an object is deleted
and replaced, all the holes in the object have birth time zero. However,
zfs send cannot tell that the holes are new since the file was replaced,
so it doesn't send them in an incremental. As a result, you can end up
with invalid data when you receive incremental send streams. As a
short-term fix, we can always send holes with birth time 0 (unless it's
a zvol or a dataset where we can guarantee that no objects have been
reused).
Ported-by: Steven Burgess <sburgess@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4369
Closes #4050
I could not reproduce any issues. All tests passed.
I am running on a Centos-6.7 machine.
I will try the tip of the zfsonlinux master now.
Everything tested OK for ZFS 0.6.5-329_g5c27b29 and the matching SPL 0.6.5-63_g5ad98ad.
@bprotopopov is it possible this is a problem with some code miscompiling on certain platforms? I can't see anyone else in the thread running CentOS.
edit: on further reflection, I find myself thinking it's more likely to be a kernel-specific issue than a miscompilation issue.
Well, there went that theory. Got "NOT ok" for SPL/ZFS 0.6.5.7-1 on CentOS 6.6 (kernel 2.6.32-642.1.1) in a VM (using the {spl,zfs}-dkms packages from zfsonlinux.org).
@rincebrain not sure what you mean by miscompiling. If this is a timing issue of some kind, then we need to understand it better. If you or @pwolny can publish zdb -ddddd pool/fs object output for the files in question, I could get a better idea of the nature of the issue.
@bprotopopov I mean, I can hand you the emitted send streams and the tiny files backing the pools, if that would be of use.
How would you suggest I get the appropriate object reference to hand zdb?
Barring that, since the entire -ddddd output of the demonstration pool's two FSes is tiny, I'll just attach both of those. test_fs_A.txt test_fs_B.txt
And, as a bonus, here's the sparse file containing the entire pool that the output came from, since the compressed file is only 2.5 MB (it unpacks to a 192MB sparse file that occupies about 32MB):
https://www.dropbox.com/s/ir8b6ot07k420mn/zfs_pool_0.6.5.6-1_A_7s_3.xz
Just so other people don't go down the rabbit hole when someone else already got there...
@pcd1193182 concluded this is "just" Illumos 6513.
So, let me expand on that slightly. The fix for illumos 6513 will solve this problem, but not retroactively. In other words, any pools that already have this state will not be fixed; on such pools, the hole BPs already have their birth times set to zero, and zfs has no way of telling which ones are bugged and which are right. Any new pools, created after the fix has been applied, will work.
For people who are stuck with their old pools, a workaround could be created that would basically disable hole birth. Ideally this would be a tunable so that it could be applied selectively. On illumos, the patch looks like this: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48 . I understand that on linux, making variables tunable requires some fiddling. Whether to set the default to TRUE or FALSE, I leave to others.
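If/when such a tunable lands in ZoL, it would presumably be exposed like any other zfs module parameter; the parameter name below is purely hypothetical and only illustrates the usual Linux mechanism:
# hypothetical tunable name - check the actual patch for the real one
echo 1 > /sys/module/zfs/parameters/ignore_hole_birth
# or persistently, via /etc/modprobe.d/zfs.conf:
# options zfs ignore_hole_birth=1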
Are the holes still going to exist (albeit incorrectly sized) on the received filesystem, for those of us who have already done our migrations via 'send|recv' and discarded the original media?
Tim Connors
@pcd1193182
this is a good point to keep in mind; yes, the snapshots created with buggy code will not be cured by the fixed code, because the on-disk block pointers do not contain proper (non-zero) birth epochs.
However, @pwolny has published a script that he claims re-creates the issue with entirely new pools created during the test. I ran it and could not see the issue, but there is a zdb dump of the new pool with the problem above, which I will take a look at.
Hi, @rincebrain, zdb lets you list filesystems such that you can see which file is which dnode number; that number can be used as the 'object' argument to dump just that file. For zvols this is simpler, as the zvol (data) object is always 1, as I recall. BTW, this bug affects zvols just as well.
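For example, with hypothetical pool/dataset names and object number:
# zdb -dddd pool/test_fs_A          # lists each object with its path, so you can find the object (dnode) number of large_file
# zdb -ddddd pool/test_fs_A 8       # then dump just that object (8 is only an example) with full block-pointer detail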
OK, I looked at the zdb output, and assuming that I interpreted the notation used by @rincebrain correctly - A is the source fs (to be sent) and B is the target (received) - I do see that on the source side, the L1 block that is partly filled with 10 L0 (non-hole) block pointers still has its L0 hole block pointers with 0 birth epochs. The range 10a000-180000 is not shown in the zdb output. The 'final' confirmation could be obtained by dumping the L1 block in question using zdb; I can dig up the command to do that, but I think it is not needed, as we have enough evidence.
This causes those holes not to be transmitted by the send, and therefore on the receive side the filesystem still contains non-zero data there. That was the specific scenario created by the test script.
So, here is my recommendation.
Can @rincebrain and @pwolny please amend the script that was used to reproduce the issue with 'sync' commands after every dd/truncate/other update (or, alternatively, at least before taking snapshots), so that we can be reasonably sure this is not a 'linux page cache' issue?
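Something along these lines, i.e. the relevant part of the earlier script with syncs added before each snapshot (a sketch only; the variable names are the ones already used in that script):
truncate -s 1G /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=$((3*128)) seek=$((1*128))
sync
zfs snapshot $src_fs@$date_today"_1"
truncate -s $((2*128*4*1024)) /$src_fs/large_file
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=128 seek=$((3*128)) conv=notrunc
sleep $i
dd if=/dev/urandom of=/$src_fs/large_file bs=4k count=10 seek=$((2*128)) conv=notrunc
sync
zfs snapshot $src_fs@$date_today"_2"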
@bprotopopov the script that he linked above is the one that I used to test. The bug does not appear on illumos, because we have the fix for illumos 6513. Linux does not yet have that fix integrated, so anyone running trunk or older ZoL will encounter the issue.
EDIT: Apparently I'm mistaken; the fix is in trunk, but not in any release yet.
Hi, @pcd1193182, I have already ported the 6513 fix to ZoL, and it is on the master branch as of https://github.com/zfsonlinux/zfs/commit/bc77ba73fec82d37c0b57949ec29edd9aa207677. I assumed that release 0.6.5.7 mentioned above included that fix. If it did not :) then the issue is quite obvious!
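A quick way to check whether a given commit made it into a release tag, assuming a local clone of the zfs repository:
# git tag --contains bc77ba73fec82d37c0b57949ec29edd9aa207677
If the tag of interest (e.g. zfs-0.6.5.7) does not appear in the output, that release does not contain the fix.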
So, @pcd1193182 nailed it :)
@pwolny, @rincebrain, zfs-0.6.5.7 definitely does not include the fix for 6513, and that is why you see the issue.
Another mystery solved, I guess.
Some instructions for dummies about how to deal with the aftermath would be helpful.
I am quite confused about how I should take care of my system; I don't have enough resources to pull my entire pool out and restore it,
which leaves me no way to get out of this until the next time I upgrade the system.
Briefly, for anyone with a pool with hole_birth enabled.
@rincebrain
I have a question:
may I just rewrite in place, or make a copy of, those sparse files after patch #4754, to make them free from the hole_birth issue?
I have experienced a silently corrupted file in snapshots after send/receive: the checksum of a single file on the source and target does not match, both on the filesystem and in the snapshots. Repeated scrubs of the tested pools do not show any errors. Trying to replicate the source filesystem results in the modification of a single file on the target pool, in the received filesystem and all its snapshots. The source pool is a 6x1TB RAIDZ2 array on Debian 8.2.0 (Jessie, kernel 3.16.0-4-amd64, installed from DVDs, no additional updates) with version 0.6.5.3 of ZFS/SPL built from source (standard configure, no patches).
My source pool ("nas1FS") was created on a machine with non-ECC RAM and, after being filled with data through a samba share (standard samba server; on the source fs sharesmb=off), was moved to a different computer with ECC RAM (I used the same operating system on both machines). I could understand non-ECC RAM in the first computer causing permanent corruption visible on both the source and target pools, but in this case the data is silently changed only during the transfer on the new computer with ECC RAM, and the source pool data seems to be fine. This corruption of data during send/receive is repeatable.
To better explain what I have done: first I created a 6x1TB raidz2 pool ("nas1FS") on my old computer. After filling this pool with data I moved the array from the old computer to a new one and tried to back up the data on the "nas1FS" pool to a different pool ("backup_raidz").
The "nas1FS" pool contained the following snapshots that are of interest in this issue:
I have created a “backup_raidz” pool for backup (with compression turned on):
# zpool create -o ashift=12 -O compression=gzip-9 backup_raidz raidz1 /dev/disk/by-id/ata-SAMSUNG_HD103UJ_SerialNo1 /dev/disk/by-id/ata-SAMSUNG_HD103UJ_SerialNo2 /dev/disk/by-id/ata-HGST_HTS721010A9E630_SerialNo3 -f
Afterwards I tried to replicate the "nas1FS" pool:
# zfs send -R -vvvv "nas1FS@20160618" |zfs receive -vvvv -F "backup_raidz/nas1FS_bak"
This command finished successfully without any errors. I executed the following commands to get a list of file checksums on both source and target:
and compared the resulting files
I have found a checksum mismatch on a single file:
The correct checksum is "3178c03d3205ac148372a71d75a835ec"; it was verified against the source used to populate the "nas1FS" filesystem.
This checksum mismatch was propagated through all snapshots in which the file was present on the target pool:
The source pool showed the correct checksum in every snapshot in which the offending file was accessible.
Trying to access this file in a snapshot in which it did not exist ("backup@20151121_1") results in a "No such file or directory" on the target pool ("backup_raidz" or "backup_raidz_test").
When I tried to access the offending file on "nas1FS" with the command:
# md5sum /nas1FS/backup/.zfs/snapshot/20151121_1/samba_share/a/home/bak/aa/wx/wxWidgets-2.8.12/additions/lib/vc_lib/wxmsw28ud_propgrid.pdb
it resulted in a very hard system lockup: I could not get any reaction to "Ctrl-Alt-SysRq-h" and similar key combinations, all IO to the disks stopped completely and immediately, the system stopped responding to ping, and only a hard reset achieved any reaction out of the system. After the hard reset everything was working and the above-mentioned file checksum results were unchanged. I have also tried a send/receive to a different target pool (a single 1TB HGST disk):
# zfs send -R -vvvv "nas1FS/backup@20160618" |zfs receive -vvvv -F "backup_raidz_test/nas1FS_bak"
which resulted in the same md5sum mismatches. When sending only the latest snapshot with:
# zfs send -vvvv "nas1FS/backup@20160618" |zfs receive -vvvv -F "backup_raidz_test/backup"
I get a correct md5sum on the target filesystem. When trying to do an incremental send/receive from the first available snapshot on the source pool:
# zfs send -vvvv "nas1FS/backup@20151121_1" |zfs receive -vvvv -F "backup_raidz_test/backup"
the offending file is not present on either the target or the source pool, and trying to access it on the target pool does not cause any issues. With
# zfs send -vvvv -I "nas1FS/backup@20151121_1" "nas1FS/backup@20151124" |zfs receive -vvvv -F "backup_raidz_test/backup"
I again get a checksum mismatch. When trying to do an incremental send/receive from the second available snapshot on the source pool, I get correct checksums for both snapshots on the target pool ...
It is interesting to note that only a single 4096-byte block of data is corrupted, at the end of the file (which has a size of 321 x 4096 bytes), and only when transferring data starting with the first source snapshot ("nas1FS/backup@20151121_1").
Binary comparison of the offending file:
I have also run zdb on the source pool; the commands I checked did not find any errors:
To summarize the system configurations used:
System used for the "nas1FS" data fill-up (old computer): motherboard MSI X58 Pro (MSI MS-7522/MSI X58 Gold) with an Intel Quad Core i7 965 3.2GHz and 14GB non-ECC RAM (MB, CPU, RAM and PSU are about 6 years old). "/boot": INTEL SSDMCEAW080A4 80GB SSD. "nas1FS" pool: 6x1TB HGST Travelstar 7K1000 in a RAIDZ2 array (HDDs are ~6 months old).
New system to which "nas1FS" was moved (all disks): motherboard Supermicro A1SA7-2750F (8-core Intel Atom) with 32GB ECC RAM (MB, RAM and PSU are new). "/boot": INTEL SSDMCEAW080A4 80GB SSD. "nas1FS" pool: 6x1TB HGST Travelstar 7K1000 in a RAIDZ2 array (moved from the old computer). "backup_raidz" pool: 2x1TB Samsung HD103UJ + 1x1TB HGST Travelstar 7K1000 (pool used for backup). "backup_raidz_test" pool: 1TB Samsung HD103UJ (pool with no parity, for additional tests).
Both systems were tested with memtest, cpuburn, etc. without errors. I am using Debian Jessie booted from a ZFS pool (with a separate boot partition); the same operating system was used on both machines with the "nas1FS" pool.
Kernel command line:
BOOT_IMAGE=/vmlinuz-3.16.0-4-amd64 root=ZFS=/rootFS/1_jessie ro rpool=nas1FS bootfs=nas1FS/rootFS/1_jessie root=ZFS=nas1FS/rootFS/1_jessie rootfstype=zfs boot=zfs quiet
SPL and ZFS were built from source.
Some excerpts from dmesg (the blocked-task messages are not connected to the hard lockup of the system):
I hope this information will be helpful; feel free to let me know what other tests I can perform to diagnose this issue. I will be happy to provide any other info.