sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

detection of reflinks not working on btrfs #590

Closed samothx closed 1 year ago

samothx commented 1 year ago

I am experimenting with rmlint version 2.10.1 on a btrfs filesystem, running a brand-new kernel 6.0.0. I ran rmlint -g -T "none +df" -o pretty -c sh:reflink . on a folder with some duplicates and then ran the generated rmlint.sh script successfully. I checked some of the former duplicates with a tool I found (https://github.com/pwaller/fienode), and it claims they are reflinked. When I run the above command again, it shows the same results as before, meaning it still reports several duplicates. When I run rmlint --is-reflink on one of the reflinked files, I get exit code 5. So it seems rmlint does not detect the reflinked files because 'fiemaps can't be read'. What could be the reason for that?

cebtenzzre commented 1 year ago

There are a number of outstanding issues with --is-reflink that I have been meaning to release fixes for (see #531 and the issues it references). Could you show me the output of filefrag -vb1 on the files that you ran --is-reflink on, or on files that produce the same result (exit code 5)? Usually exit code 5 indicates that the files are too small to be reflinked ("inline extents"), but on current master it can also indicate some other failure of --is-reflink - those conditions are represented by separate exit codes on the 'develop' branch.

Also, the best you can expect from running rmlint on reflinks right now is skip_reflink in the generated shell script, which will print "Leaving as-is (already reflinked to original)" instead of trying to re-reflink the files. There are plans to potentially improve this in the future.

samothx commented 1 year ago

Thanks for the quick response. I just ran rmlint on a Rust development directory, and the files I looked at were indeed rather small:

filefrag -vb1 assets/index.html 
Filesystem type is: 9123683e
File size of assets/index.html is 1246 (4096 block of 1 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             last,not_aligned,inline,eof
assets/index.html: 1 extent found                             

My main concern was that after running rmlint.sh with the reflink config and then running rmlint again, I was still seeing the same number of duplicate files. So does rmlint still count reflinked files and their content as duplicates? If so, this would be a case of 'works as designed', but the outcome is slightly confusing because the tool does not report the improvement it has created. My expectation was that after running rmlint.sh, rmlint would report zero or at least significantly fewer duplicates.

cebtenzzre commented 1 year ago

Some files are small enough that they cannot be reflinked. This is because small files are stored using inline extents - by default, the threshold is 2KB. Your filefrag output shows inline so it is an example of this situation.
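
For anyone who wants to check many files for this condition, here is a minimal Python sketch that looks for the inline flag in filefrag -v output. It is not part of rmlint; the function names extent_flags and has_inline_extent are made up for illustration, and it assumes the column layout shown in the filefrag output earlier in this thread (extent rows start with a numeric index and end with a comma-separated flags field).

```python
# Output of `filefrag -vb1 assets/index.html`, copied from this thread.
SAMPLE = """\
Filesystem type is: 9123683e
File size of assets/index.html is 1246 (4096 block of 1 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             last,not_aligned,inline,eof
assets/index.html: 1 extent found
"""


def extent_flags(filefrag_output: str) -> list[set[str]]:
    """Return the set of flags for each extent row in `filefrag -v` output."""
    rows = []
    for line in filefrag_output.splitlines():
        head, sep, rest = line.partition(":")
        # Extent rows start with a numeric extent index and have at least
        # four colon-separated columns before the flags field.
        if not sep or not head.strip().isdigit() or line.count(":") < 4:
            continue
        flags_field = rest.rsplit(":", 1)[-1]
        rows.append({f.strip() for f in flags_field.split(",") if f.strip()})
    return rows


def has_inline_extent(filefrag_output: str) -> bool:
    """True if any extent carries the `inline` flag (data stored inline)."""
    return any("inline" in flags for flags in extent_flags(filefrag_output))


if __name__ == "__main__":
    print(has_inline_extent(SAMPLE))
```

A file for which has_inline_extent returns True is one of the small files described above, so an exit code 5 from --is-reflink is the likely result for it.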

As far as rmlint is concerned, reflinked files are still duplicates, because aside from the space savings, they are independent files. Reflinks are not detected early enough to affect any printed statistics, as reflinks should not (normally) prevent matching files from being identified in outputs like CSV or JSON, which have broader uses than freeing up disk space.

You can tell that a previous rmlint run created reflinks by looking for skip_reflink in a new rmlint.sh, running ./rmlint.sh -n (dry run) and looking for mentions of reflinks, or using a tool like compsize and comparing the "Disk Usage" column to the "Referenced" column for a directory that contains at least two files that are reflinked together - lower "Disk Usage" means more reflinks.
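
The first check can be scripted. The following is a rough Python sketch, not an rmlint feature: count_skip_reflink_calls is a hypothetical helper, and the EXAMPLE script text, its paths, and the some_other_handler name are invented for illustration and are not real rmlint.sh output. It assumes each call sits on its own line beginning with skip_reflink followed by a space, which excludes a shell function definition of the form skip_reflink() {.

```python
# Invented stand-in for a generated script; a real rmlint.sh will differ.
EXAMPLE = """\
skip_reflink() {
    echo 'placeholder body'
}
skip_reflink '/data/a.bin' '/data/b.bin'
some_other_handler '/data/c.bin' '/data/d.bin'
skip_reflink '/data/e.bin' '/data/f.bin'
"""


def count_skip_reflink_calls(script_text: str) -> int:
    """Count lines that invoke skip_reflink (assumed form: 'skip_reflink <args>')."""
    count = 0
    for line in script_text.splitlines():
        # The trailing space keeps the definition 'skip_reflink() {' out of the count.
        if line.strip().startswith("skip_reflink "):
            count += 1
    return count


if __name__ == "__main__":
    print(count_skip_reflink_calls(EXAMPLE))
```

To use it on a real script, read the file into a string (for example with pathlib.Path("rmlint.sh").read_text()) and pass that to the function. A non-zero count suggests that an earlier run already reflinked some of the reported duplicates.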

If you are looking for a quick heuristic that lets rmlint skip files that appear to be reflinked, this is tracked in issue #328. There is a work-in-progress implementation of --keep-reflinked available in my fork, but all of that code is experimental right now.

samothx commented 1 year ago

Thanks for the clarification. I will play around with that some more and find out whether it will work for me.

cebtenzzre commented 1 year ago

Alright. Feel free to reopen if you think there is something that should be improved.