markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Strange btrfs errors after running duperemove #244

Closed ubenmackin closed 3 years ago

ubenmackin commented 3 years ago

I'm trying to figure out how the btrfs errors below cropped up.

I started with a freshly formatted single-drive btrfs volume. I then copied over a bunch of dd images from SD cards that I have, to test out dedupe. I ran duperemove 7 times, once for each folder of dd images, using the following command (only the last part of the folder path and the hashfile name changed between runs):

duperemove -r -d --dedupe-options=same --hashfile=/mnt2/scratch/dupe_pikvm.hash /mnt2/extbackup/backup/images/pikvm
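
In loop form, the seven runs looked roughly like this (the folder names besides pikvm are placeholders, since I only showed one above):

for dir in pikvm images2 images3; do    # folder names other than pikvm are placeholders
  duperemove -r -d --dedupe-options=same \
    --hashfile=/mnt2/scratch/dupe_${dir}.hash \
    /mnt2/extbackup/backup/images/${dir}
done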

I just happened to look at my dmesg output and came across the following. sdm is the device of my single drive btrfs volume:

[375842.599342] BTRFS critical (device sdm): corrupt leaf: block=16142514962432 slot=129 extent bytenr=17444728877056 len=3149860573819708156 invalid extent data ref hash, item has 0x2bb68d80d28822fb key has 0x2bb68d80d28822fc
[375842.602514] BTRFS info (device sdm): leaf 16142514962432 gen 17494 total ptrs 307 free space 12 owner 2
[375842.602517]     item 0 key (17444728877056 178 3149860573819160371) itemoff 16255 itemsize 28
[375842.602519] extent data backref root 20693 objectid 14180 offset 14488535040 count 1
[375842.602521]     item 1 key (17444728877056 178 3149860573819162033) itemoff 16227 itemsize 28
[375842.602523] extent data backref root 20693 objectid 14308 offset 15242969088 count 1
[375842.602525]     item 2 key (17444728877056 178 3149860573819163414) itemoff 16199 itemsize 28

The above then repeats for 300 item keys, with the numbers changing between each key. Then the following:

[375842.603308] BTRFS error (device sdm): block=16142514962432 write time tree block corruption detected
[375842.605268] BTRFS critical (device sdm): corrupt leaf: block=16142515011584 slot=71 extent bytenr=17444728877056 len=3149860570438861369 invalid extent data ref hash, item has 0x2bb68d8009048238 key has 0x2bb68d8009048239
[375842.607473] BTRFS info (device sdm): leaf 16142515011584 gen 17494 total ptrs 298 free space 489 owner 2
[375842.607474]     item 0 key (17444728877056 178 3149860570438600004) itemoff 16255 itemsize 28
[375842.607474] extent data backref root 20693 objectid 14192 offset 15133229056 count 1
[375842.607475]     item 1 key (17444728877056 178 3149860570438603931) itemoff 16227 itemsize 28
[375842.607476] extent data backref root 20693 objectid 14180 offset 14483685376 count 1

And finally I see the following errors:

[375842.607821] BTRFS error (device sdm): block=16142515011584 write time tree block corruption detected
[375847.447049] BTRFS error (device sdm): parent transid verify failed on 16142514962432 wanted 17494 found 17413
[375847.456069] BTRFS error (device sdm): parent transid verify failed on 16142514962432 wanted 17494 found 17413
[375847.462324] BTRFS error (device sdm): parent transid verify failed on 16142514962432 wanted 17494 found 17413
[375847.463246] BTRFS error (device sdm): parent transid verify failed on 16142514962432 wanted 17494 found 17413
[375847.463945] BTRFS: error (device sdm) in __btrfs_inc_extent_ref:1486: errno=-5 IO failure
[375847.464681] BTRFS info (device sdm): forced readonly
[375847.464684] BTRFS: error (device sdm) in btrfs_run_delayed_refs:2173: errno=-5 IO failure
[393348.989020] Process accounting resumed
[409793.136351] BTRFS error (device sdm): parent transid verify failed on 16142561280000 wanted 17494 found 17414
[409793.150214] BTRFS error (device sdm): parent transid verify failed on 16142561280000 wanted 17494 found 17414
[409793.449358] BTRFS error (device sdm): parent transid verify failed on 16142561280000 wanted 17494 found 17414
[409793.451789] BTRFS error (device sdm): parent transid verify failed on 16142561280000 wanted 17494 found 17414

I'm guessing things are not good at this point, but I wonder what caused this. I haven't had any strange lockups, reboots, or power losses. Something forced the filesystem to readonly after a bunch of other errors, and pretty much the only thing accessing these files was duperemove.

Is there anything that duperemove does that could have caused these issues?

[storage@nas ~]# uname -a
Linux rocknas 5.8.10-1.el7.elrepo.x86_64 #1 SMP Thu Sep 17 10:40:48 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux

lorddoskias commented 3 years ago

These errors are not caused by duperemove. By the looks of it, you have faulty RAM, because the reported keys exhibit bit flips:

"item has 0x2bb68d80d28822fb key has 0x2bb68d80d28822fc" - the lowest 4 bits differ: 1011 vs 1100.

The good news is that this was caught at write time:

[375842.603308] BTRFS error (device sdm): block=16142514962432 write time tree block corruption detected

So btrfs refused to write the bad data to disk. In this case I highly recommend you stress test your RAM. And to reiterate: this was not caused by duperemove per se; rather, the fact that duperemove rewrites files stressed your system, and it so happened that the RAM corrupted data structures which btrfs validates.

ubenmackin commented 3 years ago

Good to know! This is actually a new system build (I just bought the RAM), so I'll run a test to see if there is anything wrong with it.

Any tools you'd recommend to stress test RAM?

lorddoskias commented 3 years ago

I'd advise using http://memtest.org/ (memtest86+). Distributions generally come with it preloaded, so when you boot (at least on Ubuntu it's like that) you can choose to boot into memtest86+.
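
If your distribution doesn't ship it, on Debian/Ubuntu (an example; package names differ elsewhere) it can be installed and added to the boot menu with:

sudo apt install memtest86+
sudo update-grub   # the package installs a GRUB hook, so this adds a memtest86+ boot entry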

ubenmackin commented 3 years ago

So I just finished a 4-pass memtest, and it found no errors.

Could this be an OOM type of issue? I ask because I have recently run into a few scenarios where apps get killed with out-of-memory errors.
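
For reference, a quick way to check whether the OOM killer actually fired is to grep the kernel log, e.g.:

dmesg | grep -iE 'out of memory|oom-killer'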

If it helps, I'm running this on a Ryzen 3400G, with 16 GB of RAM. I just ordered another 32 GB, and will load that in when it arrives on Monday. I'm going to wipe that drive and start the process again once the new memory arrives. I'll report back if I see any more btrfs errors in a new issue.

lorddoskias commented 3 years ago

It's unlikely that OOM could have caused this. What I've seen in the wild is that particular memory corruptions can occur when more than a single component is being stressed, i.e. RAM + CPU or RAM + CPU + IO (which is what duperemove does, really). What I'd suggest is that you report this on the upstream btrfs mailing list, where more people may have ideas on how to debug this further.
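
If you want to generate that kind of combined load yourself, one common tool is stress-ng (my suggestion, not something specific to this report); the worker counts and sizes below are just an illustration, so size them to your machine:

stress-ng --cpu 4 --vm 2 --vm-bytes 8G --hdd 1 --timeout 30m

This exercises CPU, memory, and disk IO at the same time, which is closer to the load duperemove produces than memtest alone.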