(draft) Common Bytes - standard for data deduplication

danimesq commented 4 years ago

This is a proposal for a standard for deduplicating the bytes shared between different versions of the same kind of file. It takes inspiration from git objects, but goes further to ensure that the right parts of the content are organized. The goal is not only to deduplicate data, but also to link the same data when it is represented in different kinds of files.

Consider JPG, GIF, PNG, SVG and base64 as different variations of the same file.
Take as an example the single-file HTML saves from dev.to, which easily reached the limit on Pinata. Only their common bytes should be stored instead of being duplicated; common bytes should also be considered when mirroring, and for the read modes from the 2read extension.
Store each line of a text and know the text format of each part (no duplication when comparing the text of a page saved with 2read against its full HTML site mirror); the same applies to PDF/ePub/MOBI.
Use references, for example to record which lines of a single-page HTML file hold the same content as a JS/CSS file; this also works for SVG and base64 images.
Windows and other screenshots: keep the same bytes in shared objects, for example parts of the taskbar and window frame.
Different qualities of the same video; version the content inside video files, detect when frames are similar, and then diff them.
All kinds of compressed files (and partition/disk images), including those supported by 7-Zip and Linux archivers.
DEB, AUR, RPM and other Linux/BSD packages.
MIDI and 8-bit sounds.
Other wave-based audio/music.
.exe, .msi, .appimage and other executables.
Git packs: consider their content, the same as git objects; see http://web.archive.org/web/20191205203745/https://github.com/radicle-dev/radicle/issues/689
New for dedup/plugz/download.json: instead of downloading lots of duplicated AppImage, DEB and RPM files, get their internal files and deduplicate them. Make these packages internally symlink the common files. Browser downloads should also be supported, with an API to get the hashes of the internal files and verify whether the local device already has them.
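
To make the line-level storage and the "does the local device already have it" check more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the `LineStore` class, its method names, and the choice of SHA-256 are hypothetical and not part of any existing IPFS or git API. Each distinct line is stored once under its hash, and a file becomes a manifest (an ordered list of hashes).

```python
import hashlib


class LineStore:
    """Hypothetical content-addressed store: each distinct line is kept once."""

    def __init__(self):
        self.chunks = {}  # hex digest -> line bytes

    def add(self, data):
        """Store a file line by line; return its manifest (ordered digests)."""
        manifest = []
        for line in data.splitlines(keepends=True):
            digest = hashlib.sha256(line).hexdigest()
            self.chunks.setdefault(digest, line)  # no-op if already stored
            manifest.append(digest)
        return manifest

    def rebuild(self, manifest):
        """Recreate the original bytes from a manifest."""
        return b"".join(self.chunks[d] for d in manifest)

    def missing(self, manifest):
        """Digests this store would still need to fetch for a manifest."""
        return [d for d in dict.fromkeys(manifest) if d not in self.chunks]


# A CSS file and a single-page HTML file that inlines the same CSS.
css = b"body { margin: 0; }\nh1 { color: red; }\n"
html = b"<html>\n<style>\n" + css + b"</style>\n</html>\n"

store = LineStore()
css_manifest = store.add(css)
html_manifest = store.add(html)

# The two CSS lines are stored once and referenced by both manifests.
shared = set(css_manifest) & set(html_manifest)
```

A device that already holds the CSS file would only need the four HTML-specific lines, which is what `missing(html_manifest)` reports on a store containing just the CSS. Splitting on lines is the simplest possible chunking; binary formats would need a different chunker.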

danimesq commented 4 years ago

discuss.ipfs post: https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813

danimesq commented 4 years ago

It could also do I/O deduplication, generating different versions of the same file by applying their common bytes.
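
One possible reading of this, sketched under assumptions: keep one base version and describe every other version as "copy these byte ranges from the base, plus these literal bytes". The `make_delta` and `apply_delta` helpers below are hypothetical names, and the matching uses Python's standard `difflib.SequenceMatcher`, which is a swapped-in technique for illustration rather than anything the proposal specifies.

```python
from difflib import SequenceMatcher


def make_delta(base, target):
    """Describe target as copies of byte ranges from base plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # bytes shared with base
        elif j2 > j1:
            ops.append(("data", target[j1:j2]))   # bytes unique to target
        # pure deletions need no op: the base range is simply not copied
    return ops


def apply_delta(base, ops):
    """Generate the target version from the base and its delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


v1 = b"<h1>Title</h1><p>Hello world</p>"
v2 = b"<h1>Title</h1><p>Hello there, world</p>"

delta = make_delta(v1, v2)
regenerated = apply_delta(v1, delta)
```

Only the `data` ops carry new bytes; everything else is a reference into the version that is already stored.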

hsanjuan commented 4 years ago

Thanks for posting to discuss! I'll close this.

ipfs / ipfs

(draft) Common Bytes - standard for data deduplication #444