openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Data corruption in cluster environment that uses zfs and shared storage. #3577

Closed kolczykMichal closed 6 years ago

kolczykMichal commented 9 years ago

I have issues with data corruption in a cluster environment that uses ZFS and shared storage.

Environment description:

Test description: The test uses bst5 to write data from Windows to iSCSI-attached storage. During the write I remove a SAS cable from the HBA card. When this happens the zpool enters the I/O suspended state and the machine freezes, because the zpool failmode property is set to panic. This is intended behavior: when one machine freezes, the cluster environment (built with corosync and pacemaker) moves the resources (zpool and iSCSI targets) to the second node. The pool is imported on the second node, the iSCSI targets are set up, and bst5 is able to continue writing to the LUNs without breaking the test. Once the resources have moved to the second node, I plug the SAS cable back in and reboot the machine. When it is back online, the resources are moved back to it, again without breaking the bst5 test. I repeat this procedure (cable removal and resource move) several times during the bst5 sequential write. My LUNs are 100 GiB and bst5 is set up to write 40 GiB of data. When the sequential write finishes I stop removing the cable and wait until bst5 reads the data back with compare. If I remove the cable at least 10 times during the test (i.e. the procedure above is repeated 10 times), bst5 reports a data mismatch error during the read-with-compare phase. On the Windows side the test is never interrupted, and when I run a scrub on the zpool it does not report any errors.
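
For context, a minimal sketch of the pool-side setting this behavior relies on and the kind of resource move done once the rebooted node returns; the pool name, resource group name, and node name are placeholders, and the crm command assumes the crmsh cluster shell:

    # suspend I/O and panic the node when the SAS path disappears, so the
    # cluster can fail the resources over to the surviving node
    zpool set failmode=panic Pool-0

    # after the rebooted node is back online, move the resource group
    # (zpool import + iSCSI targets) back to it
    crm resource move zfs-iscsi-group node1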

Other test: I have also performed a different test. Instead of using bst5, I prepared a 40 GiB file filled with the same byte value, 0b00001111. I copied this file onto disks created from the LUNs using the robocopy tool, and while the files were being copied I performed the same SAS cable removal procedure. After the copy finished I examined each file with a script that reads it and calculates an MD5 checksum for every 4 MiB chunk. Using this method I also found data mismatches: in some places the files contained random data instead of the pattern from the source file. These holes of random data look to me as if no data was written there and the space in the file was simply skipped. The corrupted regions were sometimes around 1 MiB long.
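
A minimal sketch of such a chunk-by-chunk check, assuming the copied file is read back on a Linux host; the file names are placeholders:

    # compare 4 MiB chunks of the copied file against the reference file and
    # report the offset of every chunk whose MD5 differs
    CHUNK=$((4 * 1024 * 1024))
    SRC=pattern_source.bin      # reference file filled with 0b00001111 bytes
    DST=/mnt/lun/pattern.bin    # copy read back from the LUN
    SIZE=$(stat -c %s "$DST")
    for ((off = 0; off < SIZE; off += CHUNK)); do
        a=$(dd if="$SRC" bs=$CHUNK skip=$((off / CHUNK)) count=1 2>/dev/null | md5sum | cut -d' ' -f1)
        b=$(dd if="$DST" bs=$CHUNK skip=$((off / CHUNK)) count=1 2>/dev/null | md5sum | cut -d' ' -f1)
        [ "$a" != "$b" ] && echo "mismatch at offset $off"
    done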

I have no idea why the data corruption occurs in these tests. As described above, the test on the Windows side is never interrupted; access to the iSCSI resources is available the whole time, and the data is moved from one node to the other in less time than the Windows initiator timeout. I am wondering whether these data mismatches are caused by ZFS or by some other part of my test environment. Does anybody have an idea what could cause this kind of data corruption? I am also interested in whether there is any way to avoid it. How can I protect my data from being corrupted in a cluster environment that is exposed to the circumstances described in the test?

kernelOfTruth commented 9 years ago

ECC RAM? Any MCE or EDAC errors?

What processor? What mainboard?

What kernel? What distribution?

This needs more info.

Thanks

dswartz commented 9 years ago

On 2015-07-08 09:33, kernelOfTruth aka. kOT, Gentoo user wrote:

> ECC RAM? Any MCE or EDAC errors?
>
> What processor? What mainboard?
>
> What kernel? What distribution?
>
> This needs more info.

Yeah. Sounds like something is 'lying' about being synchronous. Maybe one or more drives aren't really flushing cache when told to?
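
One hedged way to check for that - whether the drives report their volatile write cache as enabled and should be honoring flushes - might be something like the following (the device path is a placeholder):

    # ATA drives: report whether the volatile write cache is enabled
    hdparm -W /dev/sdX
    # SAS/SCSI drives: query the WCE (write cache enable) bit
    sdparm --get=WCE /dev/sdX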

kolczykMichal commented 9 years ago

I have the zpool on Debian Jessie with kernel 3.10. Both cluster nodes use the same hardware:

kernelOfTruth commented 9 years ago

@kolczykMichal is this ECC RAM?

They seem to offer both ECC and non-ECC RAM.

In any case - if you can - run memtest86+ for several hours on your RAM and see if there are any issues with it.

What does

mcelog

and

edac-util -v -r

say?

Any error messages or suspicious things in your log files, dmesg, etc.?

pyavdr commented 9 years ago

Hi,

In 2013 there was a discussion on the list: https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/uEDi2ie3nYg

Basically there is a big chance you hit a split-brain situation, which caused your data corruption. It may be better to try the same with GlusterFS. Corosync and Pacemaker alone may be insufficient for a correct failover.

kolczykMichal commented 9 years ago

@kernelOfTruth The test machines have the following RAM modules: GoodRam W-MEM1333E38G DDR3 1333MHz CL9 ECC 8GB. Judging by the full name of these modules, I think this is ECC memory.

Unfortunately I don't have mcelog and edac-util installed on my test machines. I will consider installing those tools before future tests.

I have examined the logs a couple of times after running the tests, but I didn't see anything suspicious in them.

I think this might not be an issue with this particular hardware, because it was reproduced in a different environment - different hardware but the same cluster setup with shared storage and ZFS, as described at the beginning. Because of that I don't think it is strictly tied to particular hardware, but on the other hand I can't rule out that the issue is at the hardware level.

I will run memtest when I find time to do so - these test machines are used all the time for other tests in my company, and sometimes it is hard to run tests that take several hours. But one way or another, memtest is in my schedule for both test machines.

kolczykMichal commented 9 years ago

@pyavdr I'm almost 100% sure that split-brain isn't the cause of the data corruption I have experienced. While the first machine has its SAS cable unplugged, the second machine doesn't have any access to the zpool. Only after the first machine is rebooted are the resources moved. It is almost impossible that both machines write data to the zpool simultaneously.

pyavdr commented 9 years ago

Since bst5 is a pretty mature utility, there may be some problems with it. I would like to try your config on my boxes too - do you have some more details of the configs for pacemaker and corosync?

kolczykMichal commented 9 years ago

@pyavdr As I described in the first post, I also ran a test with a different tool. I copied a file filled with a pattern using the robocopy tool on Windows. Then I examined the file and found incorrect data that looked as if nothing had been written for a while - there were just random bytes for around 1 MiB.

ab-oe commented 9 years ago

Hello, I did similar tests but I used a Linux machine as the client.

The first test was performed with two ZVOLs exported over iSCSI. Both volumes were zeroed before the tests, then formatted with the XFS filesystem and mounted with the sync option. To copy the test file (10 GB in size) I used the dd command with a block size of 1M and oflag=sync. During the test, failover occurred about 7 times. At some failover points there were 512K gaps filled with zeros. I also monitored transaction groups, but the size of the txgs doesn't match the size of the missing data.
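
A rough sketch of that first test, assuming the zvol appears on the client as /dev/sdc and the pool on the target is named Pool (device names and paths are placeholders):

    # client side: format the iSCSI-attached zvol, mount with sync and copy
    # the 10 GB test file with synchronous 1M writes
    mkfs.xfs /dev/sdc
    mount -o sync /dev/sdc /mnt/test
    dd if=testfile_10G of=/mnt/test/testfile_10G bs=1M oflag=sync

    # target side: watch transaction group history while the copy runs
    watch -n 1 cat /proc/spl/kstat/zfs/Pool/txgs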

txgs history on the first node (where the cable was removed):

cat /proc/spl/kstat/zfs/Pool/txgs
33 0 0x01 4 448 178450175395 230065814305
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
193409   211701381470     C     51109888     8704         56842240     1        494      4999966507   4365         9622         773292312   
193410   216701347977     C     50675712     30208        58258944     1        540      4999962109   4245         20707        954959016   
193411   221701310086     C     51036160     0            56189440     0        493      5000057688   4040         23029        762925573   
193412   226701367774     O     0            0            0            0        0        0            0            0            0 

txgs history on the second node (after the failover):

cat /proc/spl/kstat/zfs/Pool/txgs 
925 0 0x01 34 3808 12388807589177 12446144092330
txg      birth            state ndirty       nread        nwritten     reads    writes   otime        qtime        wtime        stime       
193413   12390668717489   C     0            6656         1912320      1        90       131665978    3727         82180        221865919   
193414   12390800383467   C     0            1536         958976       1        68       222165163    3885         66353        199584529   
193415   12391022548630   C     0            0            513536       0        31       199930280    3718         67110        117607303   
193416   12391222478910   C     0            0            1011712      0        52       117913177    3695         4697         174182938   
193417   12391340392087   C     0            0            516096       0        35       174208764    3443         3945         124485851   
193418   12391514600851   C     0            0            1501184      0        65       168831515    4143         75267        139136411   
193419   12391683432366   C     0            0            469504       0        33       139289843    4027         69966        116596850   
193420   12391822722209   C     0            0            531456       0        22       116676806    3130         18456        116328808   
193421   12391939399015   C     0            0            542720       0        25       128046329    3740         24772        113530525   
193422   12392067445344   C     0            0            516096       0        32       113569627    3760         4210         109180593   
193423   12392181014971   C     0            0            493056       0        37       109303094    3225         112026       108270555   
193424   12392290318065   C     0            0            1524224      0        53       160977358    3682         9175         122536502   
193425   12392451295423   C     0            0            512512       0        35       122559176    3448         4232         108603322   
193426   12392573854599   C     0            0            1517568      0        51       108617596    2750         6587         132965245   
193427   12392682472195   C     0            0            530944       0        29       133580876    2632         9238         99472155    
193428   12392816053071   C     0            0            528896       0        23       99492865     3272         4157         91765254    
193429   12392915545936   C     0            0            510976       0        34       91777886     2680         7132         91584347    
193430   12393007323822   C     17252352     0            17523200     0        158      127121444    4600         56350        231430021   
193431   12393134445266   C     0            541696       519168       8        31       251901828    3414         10321        163034715   
193432   12393386347094   C     0            24576        532480       3        31       163150489    3913         70717        108120890   
193433   12393549497583   C     0            0            514560       0        35       108206378    3557         4177         108027454   
193434   12393657703961   C     15810560     512          15972864     1        172      150433189    4788         87297        232600704   
193435   12393808137150   C     294912       0            724480       0        65       4992035487   4604         68742        124120485   
193436   12398800172637   C     180224       0            633344       0        41       4999890361   4335         104354       116147209   
193437   12403800062998   C     0            0            0            0        0        5000022753   4039         17742        2822382     
193438   12408800085751   C     819200       7680         1137152      1        46       3383161202   3436         15522        133323953   
193439   12412183246953   C     589824       0            1346048      0        68       133353593    3845         8535         168178090   
193441   12417310123696   C     48283648     0            54031360     0        492      4999996642   4343         10584        807213132   
193442   12422310120338   C     51240960     0            58329600     0        542      4999962044   3648         7933         901399455

The next test was performed with two datasets shared over NFS. The shares were exported with the sync option. Initially wsize in the mount options was set to the default of 1MB. The test results were the same, except that the gap of missing data was 1MB instead of 512K. I repeated the test with the wsize mount option set to 8K; this time the data was consistent after the copy.
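
The two NFS variants could be mounted roughly like this, assuming the shares are exported with sync; the server name and paths are placeholders:

    # default-sized writes: 1MB gaps of missing data were observed
    mount -t nfs -o sync,wsize=1048576 server:/Pool-0/share1 /mnt/nfs
    # small writes: data was consistent after the copy
    mount -t nfs -o sync,wsize=8192 server:/Pool-0/share1 /mnt/nfs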

Of course the sync property was set to always on all zvols and datasets.

UPDATE: I performed one more test, just like the first one (over iSCSI), but with a raw device instead of ZFS. The copied data is consistent. There must be something in ZFS that causes the missing data in the copied files. Do you have any ideas? Thank you in advance.

ab-oe commented 9 years ago

A few days ago I tested this with build b39c22b73c0e8016381057c2240570f7af992def. It seems that submit_bio with the WRITE_SYNC flag resolves this, but unfortunately it introduces another issue, #3821.

To reproduce this issue it is sufficient to configure an NFS share on a dataset. The failmode property on the pool is set to panic, and kernel.panic in sysctl is set to -1 so the machine reboots immediately on I/O failure. The share should be mounted on the client side with wsize=1048576 and a large timeo value (to cover the reboot time of the machine). On the client machine run at least two processes that copy the test file to that share. While the copy processes are running, unplug the SAS cable. The system will reboot, and when it boots again the copy processes will continue. Repeat this a few times, then compare the copied files with the original; they should differ.
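
A minimal sketch of that reproduction setup, with the dataset, export, and client paths as placeholder names:

    # target node: share the dataset, panic the pool on I/O failure,
    # reboot immediately on kernel panic
    zfs set sharenfs=on Pool-0/share1
    zpool set failmode=panic Pool-0
    sysctl -w kernel.panic=-1

    # client: large wsize and a timeo long enough to cover the target reboot,
    # then at least two parallel copies of the test file
    mount -t nfs -o wsize=1048576,timeo=6000 server:/Pool-0/share1 /mnt/nfs
    cp testfile /mnt/nfs/copy1 & cp testfile /mnt/nfs/copy2 &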

behlendorf commented 9 years ago

@ab-oe the WRITE_SYNC/READ_SYNC change almost certainly fixed this because it had the side effect of making the writes synchronous. Great for data integrity, bad for performance. The ZIL is designed to solve this issue and give you data integrity AND good performance, but it sounds like some part of the stack is either lying or buggy. Here's where I'd start investigating; maybe someone can do the leg work on this to narrow down the problem.

  1. Does this happen just with ZVOLs, or with NFS-exported filesystems too? Both rely on the same core machinery for the ZIL, so if it's only happening on ZVOLs then the core ZIL code is likely working correctly.
  2. Are log records being written out for ZVOLs correctly? You can check this by running zdb -i -vvvvv pool/dataset prior to re-importing the pool after a failover event (a possible invocation is sketched after this list). It will dump the log records which must be replayed; do they match up with the wrong data?
  3. Are the logs being replayed properly? This one's harder to check since there aren't any existing exported statistics; you might need to add some code to zvol_replay_write() to verify the replay.
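
For item 2, a possible invocation on the surviving node before the pool is re-imported; this assumes zdb's -e option is used to inspect the still-exported pool, and Pool-0/zv0 is a placeholder for the affected zvol:

    # dump the intent log records that would be replayed for the zvol and
    # compare their offsets/lengths with the region where data went missing
    zdb -e -i -vvvvv Pool-0/zv0
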
dweeezil commented 9 years ago

Here's a semi-crazy idea if zvols are involved: could the storage stack possibly be doing BLKDISCARD operations on the zvols? If so, those discards aren't logged. I added logging of zvol discards to the nexenta TRIM patch I've been working on, but it would be pretty simple to add as an independent patch.
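
One way to test that hypothesis might be to trace discard requests reaching the zvol's block device during the test, for example with blktrace (/dev/zd0 is a placeholder for the zvol under test):

    # show only discard requests hitting the zvol; if any arrive around the
    # failover window, unlogged discards could explain the gaps
    blktrace -a discard -d /dev/zd0 -o - | blkparse -i -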

ab-oe commented 9 years ago

@behlendorf

ad. 1. Yes, this also happens for NFS shares.

ad. 2. I'll test today. Currently the output of the zdb command for the whole pool looks as follows:

Dataset mos [META], ID 0, cr_txg 4, 3.60M, 149 objects
Dataset Pool-0/zv1 [ZVOL], ID 55, cr_txg 32, 4.00G, 2 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 733, won't claim, DVA[0]=<0:4024dcd000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=247995L/247995P fill=0 cksum=cd0a8ee63b81ddca:b6fa6ac3f7173b47:37:2dd

Dataset Pool-0/share2 [ZPL], ID 124, cr_txg 194494, 13.8G, 14 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 1, won't claim, DVA[0]=<0:91415c000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=209357L/209357P fill=0 cksum=8913f97cefbe76b5:d3f1a7407ac6e806:7c:1

Dataset Pool-0/zvx [ZVOL], ID 135, cr_txg 242156, 64K, 2 objects
Dataset Pool-0/sys [ZVOL], ID 42, cr_txg 6, 132M, 2 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 58, won't claim, DVA[0]=<0:4012c7c000:3000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=3000L/3000P birth=247911L/247911P fill=0 cksum=ffa4a188978824d6:4939cbbc530be288:2a:3a

Dataset Pool-0/share1 [ZPL], ID 118, cr_txg 194485, 20.0G, 14 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 1, won't claim, DVA[0]=<0:90fe9c000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=209356L/209356P fill=0 cksum=275bdb427e84040c:823c36b0a9dc61a7:76:1

Dataset Pool-0/zvY [ZVOL], ID 141, cr_txg 242344, 260K, 2 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 1, won't claim, DVA[0]=<0:28b54a0000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=242621L/242621P fill=0 cksum=41fe43bb04af9f16:4a3d278ed489213f:8d:1

Dataset Pool-0/test [ZVOL], ID 147, cr_txg 242612, 64K, 2 objects
Dataset Pool-0/zv0 [ZVOL], ID 49, cr_txg 27, 4.01G, 2 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 741, won't claim, DVA[0]=<0:4024b4d000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=247994L/247994P fill=0 cksum=d287155879d1960b:20b171367379e60a:31:2e5

Dataset Pool-0/zv4 [ZVOL], ID 67, cr_txg 48, 25.5G, 2 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 1, won't claim, DVA[0]=<0:ac20246000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=210649L/210649P fill=0 cksum=ff9ade2301f499b4:7e41b1c869d9f88:43:1

Dataset Pool-0/zv3 [ZVOL], ID 61, cr_txg 44, 22.5G, 2 objects

    ZIL header: claim_txg 0, claim_blk_seq 0, claim_lr_seq 0 replay_seq 0, flags 0x0

    Block seqno 1, won't claim, DVA[0]=<0:ac1f396000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=210629L/210629P fill=0 cksum=8807005b9935cb0d:db1b62504bb4d35c:3d:1

Dataset Pool-0 [ZPL], ID 21, cr_txg 1, 1.48M, 10 objects
Verified large_blocks feature refcount is correct (0)
space map refcount mismatch: expected 53 != actual 51

The worrying thing is the space map refcount mismatch.

ad. 3. I think it is currently not needed, because the issue occurs on both zvols and datasets.

ab-oe commented 9 years ago

@behlendorf as you suggested, I performed the test while monitoring the ZVOL log records before the import. The last reported data block is the block just before the missing data (where it ends there is a 512K block of missing data). The reported log records were replayed correctly. The end of the zdb output looks as follows. By the way, wouldn't it be better for analysis if all data were printed in hexadecimal form instead of mixing printable characters with hexadecimal?

    Block seqno 304, will claim, DVA[0]=<0:14086ef000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=2424L/2424P fill=0 cksum=4d1b4d89c9360d1f:6d243ee5f5ce31ab:37:130
        TX_WRITE            len   1696, txg 2424, seq 303
            foid 1, offset 5a28fa20, length 5e0
            A1L h FD9C14j 9 DDK B989D2X A8A00 1910} AAE2C08D1CCEE3A > 16AFB3BF88D4J k 887FP  A 0D5Y 97, 9BC5A593 2C E210; Q J DE87 46 F2BDC9D9\ 1FD5C3D281E35 EE85t 9C: 1BF29018& D6u Y  0F6/ f B78499o ~ D8C9B5F9 E6 5 B5F396[ U I 8E 6871DF9F 6 B2N ~  A87EFE25 19E3A5C0 08 C8C1, C586T 0 5 ] C682+ 90R N D7! A7AE1BB488F4: B6a 9E` E09EM  FCBv X ]  E8DFFABEC` DC9 E586N E417CE14} DDJ N  AC2. E917F4^ & F518BA1 D2B4A2C 97F59FB4  E 9B4  AC91F8 EF 0> 96b # D < ! 87969CBB! AA 6D0A086F5] 1Ed  495 DD8EA# B71EE2E2{ F3A21DD A2F2D3x 82A2BB1A' D2Q C2, 81D3 7) E699< DDBDFED8a ` ] CF 71D8A88C21F7 " C7x - F5l  4131DD2? CFE3E2# E82 DAD1K FBD0 C- 1FC FFo 85E9C69Bf BAp 9E1A` 1 E7C2m CDCC130 F6F4B689EAFCv F0CFV E3C5D3V T ACF6EA 7F7CCAA4 ( g 1D92ED 8D3BEj N C18312] { l = FE8F97F4 1CB| E699v AF2 97H CAT ]  DDBB0D0FF93N F0N [ X y " ; B8A1f 7FP T 1DF4CC92g AFr } 87EB111Ay 8F8AM @ 16( : FBB79F8295w F1F0I BBC5B1ECi 2 c BDz C5{ FBA6CC8C81A8f EF82f 9EE4D6A8 CA0971AB997ABA8B < F1q DBC2% EAB3F79790N A79EA0R 849EAE98L I X F115o b 8B82d S D2 3D9A2F0D8FA88) @  686878E1B_ B6CCFEA1C5h 1 18A7V C4U AACC 82 DD179CA894BF 29FAF80g F1FDAD 01782A385m E9: A5K 131Cg D0n _ 9BCF) C7121889- CBR A0> 8BF5B6D2( b D2] D38C 1CC* n E89B82a j 8E< A9F 2 ACEB8C$ BF80 Ew 15` E3E3 Ft EFy EFFE8BD0> FAg b T 19A8CEDA= C3A5p DC8BP F0B n DCE4] A097D7D7c z V 14818999e A6EE84i 17DF9D14t B0- 1BB2< 8D Dr C5H E8F9b D8A3B080CFC9D6} 1FB918 3C56 T /  CB8ABC7A4G - CAB81794L J / '  C9EF7E58F1 D710C983C2; C4u 9D 8; R FD82199 F7CAD190E4* 81DBo C1o A6D5> | 19 5CBA2( E1f U 83x DE9B Fw DBB7BFD2D8FC88 F@ B2E587B1> B t A398D8( 8 CD8ABD| 9C94` 6 D3i ` C0B4@ 4 \  3} I D383> A u J } FF 8k D1@ 9493M BFAD$ 87A4E0CBB6C3AFB8A 88d A3Z X W / DA0 E5o CDBFK 9A9Cx  CE58A^ x S E0 4E7AAA5| 91 D 88 7FA9C # C381} B4U D3A4D1Y C4DE858E" D181l 97EBE011ECE 1F809CAD16o B9 Ed 84k  2= 90r CEK BDB6;   D614r F2CDD8J 13CA; s F4D3ECB7C38A18V ` B7BFDAv Z 4 82EE+ EE$ B9FDC1' E7? 8A& p j 1 BB 4A \ C  D17 4] E ADCDBAD08 w  117$ 84= D985EAE9149DA8-  5@ I  4CDD28Cz C3# A59D16N D513R O . N 2 90V v ) CB_ ! D7E498  1A9EE1E0; CCF4E5A6AAE6FD9 . ECF7& % 89U C4n 9Ay CBEFEADB8EDA98EFq 18F3A792CB$ A4EBAC> C7C78FFE1AC4FBg 1DE7 CBDG f C 1EC490w F7u 1DBFD1} CFD2q 96876 ! C6e EF 7W 89B3B0W 5 DBAC:  7 Fp 85F8G 8B~ ABI C596 9A51 8 93BDCA81A394C3~ C2 BBC_ v BACFFFE8M = 81Y Q 19EAF79EE 11B884829DCB8C8B86l 8CE817@ 1AE  AH E2! 194 8110p [ A2BCF487l 8ECA( B8ACCC+ t 7F7 z F3 CC BC87958B87B54 j DE DDAD1q , B0@ D6 0r i CAA8A9DBA8ADp 17B39CA691R ^ 19k 92ABEA- B54 K , ? Y CE90G ' j C4 B FEFI Q EAP F8BCF v K U  3D1F91Ci A4D 97 CD2E913>  61AF49Fw E7180 A9878Bv 17A3e B1f 94CDx [ c C5D5U F387; - 8789O C58CN 98D33 [ A7C68 B8FB9D868 1117Q  99Cl E5AA} D0C2EC? BBBBFC183 EB9E. Y D71A/ F 18F1B5E6i ( 83B6D0C2 F Ek 2 8889A06 D5D1D9DB1394P & C3D9" 9CW m ( A487A2A15 F7EDF1A8AFn FF7FC0Q ? AD86D90 978Cm * { ADZ  F< K 85FE97E8CF9116A4I e DBC8$ S BBA7DFE0Q 83x   D6B7~ r 1F 7^ d C6 1FD? + C091DF{ Z  FB2& I E08BAA|  5 EF4  83C4W 87D89C, BC90B5D3. CCE89F85A74 B5$ o N C1[ 8Ft O  2F8A5BDF5" ' p B816C3m B7a 90} A08B- ~ 9FC4M DF- 16
    Block seqno 305, will claim, DVA[0]=<0:140870f000:20000> [L0 ZIL intent log] zilog2 uncompressed LE contiguous unique single size=20000L/20000P birth=2424L/2424P fill=0 cksum=4d1b4d89c9360d1f:6d243ee5f5ce31ab:37:131

        Total               60
        TX_CREATE            0
        TX_MKDIR             0
        TX_MKXATTR           0
        TX_SYMLINK           0
        TX_REMOVE            0
        TX_RMDIR             0
        TX_LINK              0
        TX_RENAME            0
        TX_WRITE            60
        TX_TRUNCATE          0
        TX_SETATTR           0
        TX_ACL_V0            0
        TX_ACL_ACL           0
        TX_CREATE_ACL        0
        TX_CREATE_ATTR       0
        TX_CREATE_ACL_ATTR   0
        TX_MKDIR_ACL         0
        TX_MKDIR_ATTR        0
        TX_MKDIR_ACL_ATTR    0
        TX_WRITE2            0
arturpzol commented 7 years ago

I experienced similar data corruption in a cluster environment (corosync, pacemaker) with shared storage after forcing a power-off of one of the cluster nodes (tested on KVM and VMware).

System information

Type                  Version/Name
Distribution Name     Debian Jessie
Distribution Version  8
Linux Kernel          4.4.45
Architecture          x86_64
ZFS Version           0.7.1-1
SPL Version           0.7.1-1

I have one pool:

zpool status
  pool: Pool-0
 state: ONLINE
  scan: none requested
config:

        NAME                                          STATE     READ WRITE CKSUM
        Pool-0                                        ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-0-4  ONLINE       0     0     0
            scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-0-3  ONLINE       0     0     0

with one zvol (primarycache=metadata, sync=always, logbias=throughput) which is shared to the client host.

It is repeatable only with ZOL 0.7; with ZOL 0.6.5 the corruption did not happen. If I add a separate ZIL (log) device to the pool, the corruption also does not happen.
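
A minimal sketch of the zvol settings and the log-device workaround described above; the zvol name and the log device path are placeholders:

    # zvol configuration used in the failing test (zv0 is a hypothetical name)
    zfs set primarycache=metadata Pool-0/zv0
    zfs set sync=always Pool-0/zv0
    zfs set logbias=throughput Pool-0/zv0

    # workaround that avoided the corruption: add a dedicated log device
    zpool add Pool-0 log scsi-0QEMU_QEMU_HARDDISK_drive-scsi3-0-5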

Summarizing: if the zpool uses only mirror vdevs without a separate ZIL device, something with synchronization is broken.

Do you know which changes have an impact on synchronization in ZOL 0.7?

behlendorf commented 6 years ago

Resolved by #6620.