snowtrack / snowfs

SnowFS v1 - a fast, scalable version control file storage for graphic files :art:
https://www.snowtrack.io
GNU General Public License v3.0

De-duplicate shared bytes as git does for texty files #293

Open danimesq opened 1 year ago

danimesq commented 1 year ago

@sebastianrath I was expecting SnowFS to already have this.

sebastianrath commented 1 year ago

SnowFS supports copy-on-write on certain file systems such as APFS, but it does not yet implement deduplication in the application layer. The main reason is performance: fragmenting binaries into shared pieces can have a higher impact on CPU and I/O. For the first implementation of SnowFS, speed had a higher priority than disk space. However, we are considering adding deduplication as an opt-in feature, since these costs may not matter for every project.

danimesq commented 1 year ago

I'm here cheering for this to become an opt-in feature (personally ASAP but for y'all no pressure)

sebastianrath commented 1 year ago

Could you share some background info? What type of projects would that be beneficial to? How many files, and what are the overall file sizes? Thanks!

danimesq commented 1 year ago

@sebastianrath

> What type of projects would that be beneficial to? How many files, and what are the overall file sizes? Thanks!

To give you an idea, I have many gigabytes of screenshots on both mobile and desktop, and it is sad to know that most of those gigabytes are shared bytes that could be deduplicated.

Imagine a screenshot of a notepad, where most of the pixels are white; all of that could be deduplicated (for example, the Windows Start menu icon in these screenshots wouldn't be stored repeatedly). I imagine GIFs and video file formats use a similar approach for overlapping frames.
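To make the idea above concrete, here is a minimal sketch of block-level deduplication. It is not how SnowFS or Git works; it assumes uncompressed pixel data (compressed formats such as PNG rarely share identical byte runs), and the 4096-byte `CHUNK_SIZE` and the helper `dedup_stats` are hypothetical names chosen for illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def dedup_stats(blobs, chunk_size=CHUNK_SIZE):
    """Return (total_chunks, unique_chunks) across all given byte strings."""
    seen = set()
    total = 0
    for blob in blobs:
        # Split each blob into fixed-size chunks and hash every chunk.
        for offset in range(0, len(blob), chunk_size):
            chunk = blob[offset:offset + chunk_size]
            seen.add(hashlib.sha256(chunk).digest())
            total += 1
    return total, len(seen)


# Two fake "screenshots": mostly white pixels (0xFF) plus one differing region.
white = b"\xff" * (CHUNK_SIZE * 8)
shot_a = white + b"A" * CHUNK_SIZE
shot_b = white + b"B" * CHUNK_SIZE

total, unique = dedup_stats([shot_a, shot_b])
print(f"{total} chunks in total, {unique} unique chunks to store")
```

In this toy case all white chunks collapse into a single stored chunk, so only the differing regions cost extra space.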

danimesq commented 1 year ago

BTW I'm working on a new symlink daemon that will support forming a single file from shared objects. It's here: https://github.com/Floflis/witchlink

danimesq commented 1 year ago

@sebastianrath do you know of any libraries that find duplicate bytes in files and move those duplicates into separate files?

I would love it if Git natively supported more than one object per file, so there wouldn't be "foo", "bar" and "foobar" objects but only "foo" and "bar".
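The "foo" / "bar" / "foobar" idea can be sketched as a content-addressed chunk store, where each file is recorded as a list of chunk hashes rather than as one object. This is only an illustration under stated assumptions: `ChunkStore`, its methods, and the 3-byte chunk size are hypothetical, and fixed-size chunks line up here only because the example is built that way. Real tools typically use content-defined chunking with a rolling hash so that boundaries survive insertions.

```python
import hashlib


class ChunkStore:
    """Hypothetical store: files are recipes of chunk digests, chunks are stored once."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.objects = {}  # digest -> chunk bytes
        self.recipes = {}  # file name -> ordered list of digests

    def put(self, name, data):
        digests = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if an identical one is not already present.
            self.objects.setdefault(digest, chunk)
            digests.append(digest)
        self.recipes[name] = digests

    def get(self, name):
        # Reassemble the file by concatenating its chunks in order.
        return b"".join(self.objects[d] for d in self.recipes[name])


store = ChunkStore(chunk_size=3)
store.put("foo.txt", b"foo")
store.put("bar.txt", b"bar")
store.put("foobar.txt", b"foobar")

print(len(store.objects), "objects stored for 3 files")
print(store.get("foobar.txt"))
```

Here "foobar.txt" adds no new object; it is just a recipe pointing at the existing "foo" and "bar" chunks.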