Open kapitainsky opened 1 year ago
Thank you for this idea. Is there a common API on Linux to access compression information?
I am looking into this now - and that is exactly where the challenge might be. Unfortunately it looks like it would be OS/filesystem specific, so it might be more trouble/risk to implement than there is to gain. But I thought I would suggest it anyway - it is on the wish list - maybe somebody will come up with some clever way.
My interest in this project comes from a practical challenge I faced. I am helping a messy person with many years of backups on multiple mechanical disks - NTFS, ext4, even FAT - many TB of data, millions of files. Finding a specific old document was a close to impossible task. Thanks to compression and deduplication I managed to squeeze it all onto one SSD (14 TB on 4 TB - and everything is still there, thanks to modern filesystem features), indexed it and made search easy. And because a 4 TB mechanical disk data transfer takes many hours, I had time to look around for the best tools to deal with it. For deduplication I think fclones is the best right now - and I tried quite a few.
> Thank you for this idea. Is there a common API on Linux to access compression information?
I only now realized that the answer is in my original question... It is the `du` command :)
e.g. two identical files, but one is stored using in-place compression:

```
$ ls -lO test*
-rw-------@ 1 kptsky staff - 1611137024 Mar 17 10:28 test1
-rw-------@ 1 kptsky staff compressed 1611137024 Mar 17 10:28 test2
$ shasum test*
f5495201dbc0f72d7d3ce2801298d342cd8f59b6 test1
f5495201dbc0f72d7d3ce2801298d342cd8f59b6 test2
```

but:

```
$ du test*
3146752 test1
1318064 test2
```
`du` displays disk usage statistics and is part of every *nix OS.
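As a quick illustration of the allocated-vs-logical gap that `du` exposes (a sparse file stands in for a compressed one here, since the effect on the numbers is the same; assumes GNU coreutils for `--apparent-size`):

```shell
# Demo: du reports allocated space, while --apparent-size reports the
# logical file length. A sparse file shows the same gap that transparent
# compression produces.
tmp=$(mktemp -d)
truncate -s 1M "$tmp/sparse.bin"         # 1 MiB logical, ~0 blocks allocated
du -k "$tmp/sparse.bin"                  # allocated KiB (small)
du -k --apparent-size "$tmp/sparse.bin"  # logical KiB: 1024
```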
So we could have an option `prefer_compressed`. If true, fclones would check disk usage and use the file with the least space used as the base for deduplication when using `dedupe` or `link`.
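A minimal sketch of that selection logic in shell (the `pick_master` helper name is mine, not part of fclones; assumes GNU `stat`):

```shell
# Sketch only: among a group of identical files, pick the one occupying
# the fewest allocated blocks as the master for dedupe/link.
pick_master() {
  # GNU stat: %b = allocated 512 B blocks, %n = file name
  stat -c '%b %n' "$@" | sort -n | head -n 1 | cut -d' ' -f2-
}
```

On macOS/BSD the equivalent format would be `stat -f '%b %N'`; file names containing newlines would need more care in a real implementation.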
In the end, deduplication is usually used to save disk space, and this option would allow pushing that even one step further.
You could apply the same logic to fragmentation; it may be best to keep the least fragmented file. Maybe instead of an on/off switch, a variable with different values would be better.
Some filesystems (BTRFS, APFS, ZFS) allow for in-place compression, meaning that some files can be stored using transparent compression. At the moment fclones does not take advantage of it.
Let's say I have three identical files (the example below uses APFS) and some of them are stored using in-place compression:
but after I dedupe them with fclones:
any compression that existed before is gone. Of course, they are now all identical clones and, thanks to reflinks, use 7.5M of disk space. But they could use 500k if the already existing compression had been retained.
What is happening is that the first file from the group of identical files is used to create the clones. If it happens to be using in-place compression, then all clones will too:
What should happen in an ideal world is to check whether any of the identical files is already using in-place compression and, in that case, use it as the master to create the clones.
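One possible heuristic for that check (my assumption, not an existing fclones mechanism): a file whose allocated size is below its apparent size is either transparently compressed or sparse. In shell (assumes GNU `stat`):

```shell
# Heuristic sketch: allocated bytes < apparent bytes suggests transparent
# compression (or a sparse file).
is_compacted() {
  blocks=$(stat -c '%b' "$1")   # allocated 512 B blocks
  size=$(stat -c '%s' "$1")     # apparent size in bytes
  [ $((blocks * 512)) -lt "$size" ]
}
```

Note that this cannot distinguish compression from sparseness; on macOS the `compressed` flag shown by `ls -lO` earlier in the thread would be the more precise signal.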
It is a nice-to-have category of thing - pushing deduplication gains to the maximum where possible.