pkolaczk / fclones

Efficient Duplicate File Finder

take advantage of already existing in place compression #156

Open kapitainsky opened 1 year ago

kapitainsky commented 1 year ago

Some filesystems (BTRFS, APFS, ZFS) support in-place compression, meaning that some files can be stored using transparent compression. At the moment fclones does not take advantage of it.

Let's say I have three identical files (the example below uses APFS) and some of them are stored using in-place compression:

# ls -lhO
-rw-r--r--  1 kptsky  staff  -           7.5M  4 Sep 12:00 hello.txt
-rw-r--r--  1 kptsky  staff  compressed  7.5M  4 Sep 12:00 hello1.txt
-rw-r--r--  1 kptsky  staff  compressed  7.5M  4 Sep 12:00 hello2.txt

# du -hs **/*.txt   
7.9M    hello.txt
528K    hello1.txt
528K    hello2.txt

# shasum **/*.txt
b240aa0d893b7829faa3eae9197411990f10bb1f  hello.txt
b240aa0d893b7829faa3eae9197411990f10bb1f  hello1.txt
b240aa0d893b7829faa3eae9197411990f10bb1f  hello2.txt

but after I dedupe them with fclones:

# ls -lhO                                                                                           
-rw-r--r--  1 kptsky  staff  -  7.5M  4 Sep 12:00 hello.txt
-rw-r--r--  1 kptsky  staff  -  7.5M  4 Sep 12:00 hello1.txt
-rw-r--r--  1 kptsky  staff  -  7.5M  4 Sep 12:00 hello2.txt

# du -hs **/*.txt                                                                                   
7.5M    hello.txt
7.5M    hello1.txt
7.5M    hello2.txt

Any compression that existed before is gone. Of course they are now all identical clones and, thanks to reflinks, use 7.5M of disk space. But they could use about 500K if the already existing compression had been retained.

What happens is that the first file in a group of identical files is used as the source for creating clones. If that file happens to use in-place compression, then all clones will too:

# ls -lhO              
-rw-r--r--  1 kptsky  staff  compressed  7.5M  4 Sep 12:00 hello.txt
-rw-r--r--  1 kptsky  staff  -           7.5M  4 Sep 12:00 hello1.txt
-rw-r--r--  1 kptsky  staff  -           7.5M  4 Sep 12:00 hello2.txt

# fclones group . | fclones dedupe

# ls -lhO                                                                                           
-rw-r--r--  1 kptsky  staff  compressed  7.5M  4 Sep 12:00 hello.txt
-rw-r--r--  1 kptsky  staff  compressed  7.5M  4 Sep 12:00 hello1.txt
-rw-r--r--  1 kptsky  staff  compressed  7.5M  4 Sep 12:00 hello2.txt

# du -hs **/*.txt                                                                                   
528K    hello.txt
528K    hello1.txt
528K    hello2.txt

In an ideal world, fclones would check whether any of the identical files already uses in-place compression and, if so, use that file as the master for creating clones.
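
For APFS specifically, the "compressed" column that ls -lO prints comes from the UF_COMPRESSED BSD file flag, so the check itself would be cheap. A minimal macOS-only sketch; the helper name is mine, not existing fclones code:

// macOS-only sketch: APFS/HFS+ transparent compression sets the
// UF_COMPRESSED flag in the BSD file flags, which is exactly what
// `ls -lO` displays as "compressed".
#[cfg(target_os = "macos")]
fn is_transparently_compressed(path: &std::path::Path) -> std::io::Result<bool> {
    use std::os::macos::fs::MetadataExt;
    const UF_COMPRESSED: u32 = 0x0000_0020; // from <sys/stat.h>
    let meta = std::fs::metadata(path)?;
    Ok(meta.st_flags() & UF_COMPRESSED != 0)
}

There is no single equivalent flag on Linux that works across filesystems, so a size-based check may be more portable.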

This is a nice-to-have kind of feature: it pushes deduplication gains to the maximum where possible.

pkolaczk commented 1 year ago

Thank you for this idea. Is there a common API on Linux to access compression information?

kapitainsky commented 1 year ago

I am looking into this now, and this is exactly where the challenge might be. Unfortunately it looks like it would be OS/filesystem specific, so it might be more trouble/risk to implement than it is worth. But I thought I would suggest it anyway; it is on the wish list, and maybe somebody will come up with a clever way.

kapitainsky commented 1 year ago

My interest in this project comes from a practical challenge I faced. I am helping a rather messy person with many years of backups spread over multiple mechanical disks (NTFS, ext4, even FAT): many TB of data, millions of files. Finding a specific old document was a nearly impossible task. Thanks to compression and deduplication I managed to squeeze it all onto one SSD (14 TB onto 4 TB, and everything is still there thanks to modern filesystem features), indexed it, and made searching easy. And because transferring data off a 4 TB mechanical disk takes many hours, I had time to look around for the best tools for the job. For deduplication I think fclones is the best right now, and I have tried quite a few.

kapitainsky commented 1 year ago

> Thank you for this idea. Is there a common API on Linux to access compression information?

I only just realized that the answer is in my original question... it is the du command :)

e.g. two identical files, but one is stored using in-place compression:

$ ls -lO test*
-rw-------@ 1 kptsky  staff  -           1611137024 Mar 17 10:28 test1
-rw-------@ 1 kptsky  staff  compressed  1611137024 Mar 17 10:28 test2

$ shasum test*
f5495201dbc0f72d7d3ce2801298d342cd8f59b6  test1
f5495201dbc0f72d7d3ce2801298d342cd8f59b6  test2

but:

$ du test*
3146752 test1
1318064 test2

du displays disk usage statistics and is part of every *nix OS.
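
For what it's worth, the number du prints is just st_blocks from stat(2), which is available on Linux and macOS alike through Rust's standard library, so no filesystem-specific API would be needed to compare on-disk usage. A rough sketch (the function names are mine):

use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only: exposes blocks() == st_blocks
use std::path::Path;

// Bytes actually allocated on disk, i.e. what `du` reports.
// On Linux and macOS st_blocks is counted in 512-byte units.
fn allocated_size(path: &Path) -> io::Result<u64> {
    let meta = fs::metadata(path)?;
    Ok(meta.blocks() * 512)
}

// Heuristic: allocated size clearly below logical size means the file
// is either transparently compressed or sparse.
fn looks_compressed(path: &Path) -> io::Result<bool> {
    let meta = fs::metadata(path)?;
    Ok(meta.blocks() * 512 < meta.len())
}

The one caveat is that a sparse file looks the same as a compressed one under this check, but for choosing the cheapest copy as the dedupe source that difference does not really matter.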

So we could have an option like prefer_compressed. If enabled, fclones would check disk usage and use the file with the least space used as the base for deduplication when running dedupe or link.

In the end, deduplication is usually used to save disk space, and this option would allow pushing that one step further.
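
A minimal sketch of what such an option could do, assuming the group of identical files is already known; the option name prefer_compressed and the function below are hypothetical, not existing fclones APIs:

use std::path::{Path, PathBuf};

// Given a non-empty group of identical files, pick the one occupying the
// least space on disk (e.g. a transparently compressed copy) to act as the
// source of the reflinks/links. Files whose metadata cannot be read sort last.
fn pick_dedupe_source(group: &[PathBuf]) -> &Path {
    #[cfg(unix)]
    {
        use std::os::unix::fs::MetadataExt;
        if let Some(best) = group.iter().min_by_key(|p| {
            std::fs::metadata(p).map(|m| m.blocks()).unwrap_or(u64::MAX)
        }) {
            return best;
        }
    }
    // Non-Unix targets (or an empty iterator reaching here): keep the
    // current behaviour of using the first file in the group.
    &group[0]
}

fclones would then reflink/link the remaining group members to whatever file is picked here, so the group keeps the smallest on-disk footprint available.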

John-Gee commented 11 months ago

You could apply the same logic to fragmentation: it may be best to pick the least fragmented file to keep. Maybe instead of an on/off switch, an option accepting different values would be better.
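
Just as a sketch of how the fragmentation side could be measured on Linux: the extent count is available via the FIEMAP ioctl, or more simply (for prototyping) by parsing the output of filefrag from e2fsprogs; fewer extents means less fragmentation. The function name and the shell-out approach below are mine, purely illustrative:

use std::path::Path;
use std::process::Command;

// Number of extents a file occupies according to `filefrag` (Linux,
// e2fsprogs). Returns None if filefrag is missing or the output cannot
// be parsed. Expected output: "<path>: <N> extents found".
fn extent_count(path: &Path) -> Option<u64> {
    let out = Command::new("filefrag").arg(path).output().ok()?;
    let text = String::from_utf8_lossy(&out.stdout);
    text.rsplit(':')        // take the part after the last ':' ...
        .next()?
        .split_whitespace() // ... whose first token is the extent count
        .next()?
        .parse()
        .ok()
}

A multi-valued option could then pick the group member that minimizes extent count, allocated size, or some combination, instead of a plain on/off switch.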