sahib / rmlint

Extremely fast tool to remove duplicates and other lint from your filesystem
http://rmlint.rtfd.org
GNU General Public License v3.0

Some thoughts on deduplicating containers (not really related to VMs) #440

Closed vvs- closed 3 years ago

vvs- commented 3 years ago

You can look at this problem from a different angle. Consider a situation where you have several huge, similar files that you want to keep intact while still saving disk space, e.g. software distributions or historical software. This is essentially what Debian does with Jigdo, but Jigdo is limited to ISO images. Other formats could benefit from their similarities too, e.g. bin/cue CD/DVD images or even big archives. They are all simple formats with blocks of bytes scattered inside, but the formats differ enough that it is difficult to support them all with a single tool.

I found that using several different approaches makes it possible to deduplicate all of them successfully. But combining them manually is tedious and error-prone. So here is an idea: use some existing framework to automate such tasks. For example, it could try several delta compression approaches, such as jigdo and rdiff, on a limited set of selected container files and keep the resulting patch, so that the originals can later be restored automatically from their deduplicated copies.
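To make the idea concrete, here is a minimal sketch of such a "try several approaches, keep the best patch" loop. It is only an illustration under stated assumptions: the two strategies are pure-Python stand-ins (zlib with a preset dictionary, and fixed-size block matching), not jigdo or rdiff themselves, and every name in it (`STRATEGIES`, `best_patch`, `restore`, `BLOCK`) is hypothetical rather than part of rmlint or any existing tool.

```python
import hashlib
import struct
import zlib

BLOCK = 4096  # assumed block size for the block-matching stand-in


def zdict_encode(reference: bytes, target: bytes) -> bytes:
    # Stand-in strategy 1: deflate the target with the reference as a preset
    # dictionary. zlib only uses the last 32 KiB of the dictionary.
    comp = zlib.compressobj(level=9, zdict=reference[-32768:])
    return comp.compress(target) + comp.flush()


def zdict_decode(reference: bytes, patch: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=reference[-32768:])
    return decomp.decompress(patch) + decomp.flush()


def block_encode(reference: bytes, target: bytes) -> bytes:
    # Stand-in strategy 2: replace block-aligned chunks of the target that
    # also occur block-aligned in the reference with "copy" instructions.
    index = {}
    for off in range(0, len(reference) - BLOCK + 1, BLOCK):
        digest = hashlib.sha256(reference[off:off + BLOCK]).digest()
        index.setdefault(digest, off)
    ops, literal = [], bytearray()
    for off in range(0, len(target), BLOCK):
        chunk = target[off:off + BLOCK]
        ref_off = index.get(hashlib.sha256(chunk).digest())
        if ref_off is not None and reference[ref_off:ref_off + BLOCK] == chunk:
            if literal:  # flush pending literal bytes before the copy
                ops.append(b"D" + struct.pack("<Q", len(literal)) + bytes(literal))
                literal.clear()
            ops.append(b"C" + struct.pack("<Q", ref_off))
        else:
            literal += chunk
    if literal:
        ops.append(b"D" + struct.pack("<Q", len(literal)) + bytes(literal))
    return zlib.compress(b"".join(ops), 9)


def block_decode(reference: bytes, patch: bytes) -> bytes:
    raw, out, pos = zlib.decompress(patch), bytearray(), 0
    while pos < len(raw):
        tag = raw[pos:pos + 1]
        (value,) = struct.unpack_from("<Q", raw, pos + 1)
        pos += 9
        if tag == b"C":    # copy one block from the reference
            out += reference[value:value + BLOCK]
        else:              # literal data stored in the patch
            out += raw[pos:pos + value]
            pos += value
    return bytes(out)


# Hypothetical registry: name -> (encode, decode).
STRATEGIES = {
    "zdict": (zdict_encode, zdict_decode),
    "block": (block_encode, block_decode),
}


def best_patch(reference: bytes, target: bytes):
    """Try every strategy and return (name, patch) for the smallest patch
    that restores the target exactly."""
    best = None
    for name, (encode, decode) in STRATEGIES.items():
        patch = encode(reference, target)
        if decode(reference, patch) != target:
            continue  # never keep a patch that fails to round-trip
        if best is None or len(patch) < len(best[1]):
            best = (name, patch)
    if best is None:
        raise ValueError("no strategy produced a valid patch")
    return best


def restore(reference: bytes, name: str, patch: bytes) -> bytes:
    """Rebuild the original target from the kept reference and its patch."""
    return STRATEGIES[name][1](reference, patch)
```

In this sketch one file of a similar group is kept as the reference, and each other file is replaced by the pair `(name, patch)` returned by `best_patch`; `restore` brings the original back on demand. A real sidekick utility would presumably wrap external tools in the same encode/decode shape and stream from disk instead of holding whole files in memory.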

As this might look too alien to rmlint's current goals, maybe there would be interest in implementing it as a plugin or even a sidekick utility.

This is somewhat related to #355, which was rejected.

sahib commented 3 years ago

As this might look too alien to rmlint's current goals, maybe there would be interest in implementing it as a plugin or even a sidekick utility.

If somebody really needs this util, having a sidekick util makes sense to me. I don't see this as rmlint's core strength.