Closed ubenmackin closed 3 years ago
These errors are not caused by duperemove. By the looks of it, you have faulty RAM, because the reported keys exhibit bitflips:
"0x2bb68d80d28822fb key has 0x2bb68d80d28822fc"
There are bitflips in the lowest 4 bits: 1011 vs 1100.
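As a quick check of the comparison above, XOR-ing the two key values from the quoted message shows which bit positions differ. This is only a minimal sketch; the variable names are illustrative and not taken from btrfs or duperemove.

```python
# The two key values quoted in the kernel message above.
key_a = 0x2bb68d80d28822fb
key_b = 0x2bb68d80d28822fc

# XOR leaves a 1 in every bit position where the two values differ.
diff = key_a ^ key_b
differing_bits = [i for i in range(64) if (diff >> i) & 1]

print(f"low nibble of key_a: {key_a & 0xF:04b}")  # 1011
print(f"low nibble of key_b: {key_b & 0xF:04b}")  # 1100
print(f"differing bit positions: {differing_bits}")  # [0, 1, 2]
```

All higher bits are identical, so the two values differ only within the lowest nibble.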
The good news is that this was caught at write time:
[375842.603308] BTRFS error (device sdm): block=16142514962432 write time tree block corruption detected
So btrfs refused to write the bad data to disk. In this case I highly recommend you stress test your RAM. To reiterate: this was not caused by duperemove per se. Rather, because duperemove rewrites files, it stressed your system, and it so happened that the RAM corrupted data structures that btrfs validates.
Good to know! This is actually a new system build, just bought the RAM, so I'll do a test of it to see if there is anything wrong.
Any tools you'd recommend to stress test RAM?
I'd advise using http://memtest.org/ (memtest86+). Distributions generally ship with it preinstalled, so at boot (at least on Ubuntu) you can choose to boot into memtest86+.
So I just finished a 4-pass memtest, and it found no errors.
Could this be an OOM type of issue? I ask, because I have run into a few scenarios recently where apps get killed with out of memory errors.
If it helps, I'm running this on a Ryzen 3400G, with 16 GB of RAM. I just ordered another 32 GB, and will load that in when it arrives on Monday. I'm going to wipe that drive and start the process again once the new memory arrives. I'll report back if I see any more btrfs errors in a new issue.
It's unlikely that OOM caused this. What I've seen in the wild is that certain memory corruptions occur only when more than one component is being stressed, i.e. RAM + CPU or RAM + CPU + IO (which is what duperemove is doing, really). I suggest you report this on the upstream btrfs mailing list, where more people may have ideas on how to debug it further.
I'm trying to figure out how the below btrfs errors cropped up.
I started with a freshly formatted single-drive btrfs volume. I then copied over a bunch of DD images from SD cards that I have, to test out dedup. I ran duperemove 7 times, once for each folder of DD images, using the following command (only the last part of the folder path and the hashfile name changed between runs):
duperemove -r -d --dedupe-options=same --hashfile=/mnt2/scratch/dupe_pikvm.hash /mnt2/extbackup/backup/images/pikvm
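For reference, the per-folder runs described above could be generated along these lines. This is only a sketch: the flags and paths are copied from the command shown, while the `dupe_<name>.hash` naming pattern and the helper name `build_duperemove_cmd` are assumptions inferred from that single example.

```python
from pathlib import PurePosixPath

# Paths copied from the command above.
IMAGES_ROOT = PurePosixPath("/mnt2/extbackup/backup/images")
HASH_DIR = PurePosixPath("/mnt2/scratch")


def build_duperemove_cmd(folder_name):
    """Build the argument list for one duperemove run on one image folder.

    Assumption: each run uses a hashfile named dupe_<folder_name>.hash,
    as in the pikvm example. Nothing is executed here.
    """
    hashfile = HASH_DIR / ("dupe_" + folder_name + ".hash")
    return [
        "duperemove",
        "-r",
        "-d",
        "--dedupe-options=same",
        "--hashfile=" + str(hashfile),
        str(IMAGES_ROOT / folder_name),
    ]


# Reproduces the command shown above for the pikvm folder.
print(" ".join(build_duperemove_cmd("pikvm")))
```

The list form can be passed directly to `subprocess.run` if the runs are to be executed rather than printed.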
I just happened to look at my dmesg output and came across the following. sdm is the device of my single drive btrfs volume:
The above then repeats for 300 item keys, the numbers change between each key. Then the following:
And finally I see the following errors:
I'm guessing things are not good at this point. But what I wonder is what caused this. I haven't had any strange lockups, reboots, or power losses. Something switched the drive to read-only after a bunch of other errors. Pretty much the only thing accessing these files was duperemove.
Is there anything that duperemove does that could have caused these issues?