ipfs / notes

IPFS Collaborative Notebook for Research
MIT License

git-diff feature: Improve efficiency of IPFS for sharing updated file. Decrease file/block duplication #392

Open. lavventura opened 5 years ago

lavventura commented 5 years ago

When a file is updated and re-synced, IPFS should decrease block duplication on nodes all over the world, reduce communication cost (only the updated blocks are downloaded), and save storage (only the updated sections of the file are stored as new blocks).


Example: a file.tar.gz (~100 GB), which contains a data.txt file, is stored in my IPFS repo, pulled from another node (Node-a).

I open the data.txt file, add a single character at a random location in the file (the beginning, middle, or end), compress it again as file.tar.gz, and store it in my IPFS repo. Here the update is only a few kilobytes.

[*] When I delete a single character at the beginning of the file, the hashes of all of its fixed-size blocks are altered, which forces the complete file to be downloaded again.

As a result, when Node-a wants to re-get the updated tar.gz file, a re-sync takes place and the whole file is downloaded all over again. This duplicates blocks (~100 GB in this example) even though the change is only a few kilobytes, and the duplication spreads iteratively across all peers, which is very inefficient and consumes a large amount of storage and additional communication cost over time.
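A minimal, self-contained sketch (a toy stand-in for a fixed-size chunker, not IPFS itself; real IPFS blocks are much larger) shows why a one-byte insertion at the front leaves zero reusable blocks:

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8):
    """Split data at fixed offsets and hash each chunk, mimicking a
    fixed-size chunker in miniature."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

original = bytes(range(256)) * 16  # 4 KiB of sample data
updated = b"X" + original          # one byte inserted at the very front

# Every chunk boundary shifts by one byte, so no chunk content survives.
a, b = fixed_chunks(original), fixed_chunks(updated)
print(f"{len(a & b)} of {len(a)} distinct chunks reused")  # → 0 of 32 distinct chunks reused
```

Every boundary is tied to an absolute offset, so the insertion shifts the content of every chunk and all the hashes change.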


Other cloud services try to solve this problem using block-level file copying. In their case, as in IPFS, a block list is used for the copying; so when a file is updated (a character added at the beginning of the file), Dropbox and OneDrive re-upload the whole file, since the first block's hash changes and that in turn changes the hashes of all subsequent blocks. This doesn't solve the problem.

=> I believe a better solution is to consider the differences between each commit of the file, the approach that git-diff uses. This would upload only the changed (diff) parts of the file, a few kilobytes in the example I give, and the diffed blocks would be merged when other nodes pull that file. The communication cost would then be only a few kilobytes, and only that much data would be added to storage as well.

I know that it would be difficult to re-design IPFS, but this could be done as a wrapper solution that combines IPFS and git, which users could apply to very large files based on their needs.


The IPFS team is considering adding this eventually, but it is not currently treated as a priority. I believe it should be.


Please see the discussions I have already opened, and feel free to add your ideas to them:

=> Does IPFS provide block-level file copying feature?

=> Efficiency of IPFS for sharing updated file

ivan386 commented 5 years ago

Test this: ipfs add -s=rabin filename. Rabin is a different chunking algorithm. It is more effective for big files with small changes.

dbaarda commented 5 years ago

Note that *.tar.gz files are compressed by default in a way that typically means a single byte change in an internal data file will result in nearly every byte changing in the compressed output.

This is why things like rsync cannot efficiently update standard *.gz files. Some (all?) gzip implementations support an --rsyncable argument that sacrifices a little bit of compression to minimise the differences in the compressed output. Interestingly, it does this using something similar to rabin chunking under the hood, though I think it predates rabin.

So you will need to use gzip --rsyncable when creating your tar.gz, together with ipfs add -s rabin, to get any deduping.
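The effect is easy to demonstrate with Python's zlib (which, like stock gzip without --rsyncable, has no rsync-friendly mode): a one-byte edit leaves almost every raw piece of the file intact, but scrambles nearly the entire compressed stream.

```python
import zlib

# ~100 KB of compressible text; the edited copy differs by one byte at the front.
text_a = b"".join(b"line %06d: all work and no play\n" % i for i in range(3000))
text_b = b"!" + text_a[1:]

# Compare fixed 64-byte pieces of the raw and the compressed data.
pieces = lambda d: {d[i:i + 64] for i in range(0, len(d), 64)}

raw_shared = pieces(text_a) & pieces(text_b)
comp_shared = pieces(zlib.compress(text_a)) & pieces(zlib.compress(text_b))

print(len(raw_shared), "raw pieces shared;", len(comp_shared), "compressed pieces shared")
```

All but the first raw piece is shared, while the compressed streams diverge almost immediately: after the first differing symbol, every later bit is shifted, so no chunker downstream can find common blocks.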

MatthewSteeples commented 5 years ago

I know this goes against most of what IPFS plans to do (as it's a layer up), but it would be interesting to do some research into whether "understanding" compressed formats would have any benefit. I'm thinking something along the lines of the following:

  1. For .tar files, IPFS could be smart enough to store each file individually and rebuild the tar container itself when requested. For people using tar files as backups, this may result in a considerable saving if the files don't change much over time.
  2. For .gz files, IPFS could be smart enough to decompress the file before processing. The storage layer could have some form of compression applied to it anyway (so this doesn't take any additional disk space), as could the network layer (so no additional bandwidth is consumed)
  3. For zip files you get a combination of the above. Effectively you treat the zip file as a folder.

These would enable small changes in files to only require that file to be retransferred (as the rest could be rebuilt).
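Point 1 can be sketched with Python's tarfile module (a hypothetical format-aware store, not an existing IPFS API): if each member of an archive is hashed individually, a new version of the archive only introduces new blocks for the members that actually changed.

```python
import hashlib
import io
import tarfile

def make_tar(files: dict) -> bytes:
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def member_hashes(tar_bytes: bytes) -> dict:
    """Hash each file inside the tar individually, the way a format-aware
    store could dedup unchanged members across archive versions."""
    hashes = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                hashes[member.name] = hashlib.sha256(data).hexdigest()
    return hashes

v1 = make_tar({"data.txt": b"A" * 1000, "other.bin": b"B" * 1000})
v2 = make_tar({"data.txt": b"X" + b"A" * 999, "other.bin": b"B" * 1000})

h1, h2 = member_hashes(v1), member_hashes(v2)
changed = [name for name in h1 if h1[name] != h2[name]]
print(changed)  # → ['data.txt']
```

Only the edited member would need to be retransferred; the archive container could be rebuilt from the per-member blocks on request.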

Caveats/downsides:

vmx commented 5 years ago

> 1. For .tar files, IPFS could be smart enough to store each file individually and rebuild the tar container itself when requested. For people using tar files as backups, this may result in a considerable saving if the files don't change much over time.

That reminds me of some work @mib-kd743naq is doing, where you store a hybrid in IPFS which can be served up as individual files, but also as tar archives. I couldn't find a good link to it, but I'm sure @mib-kd743naq can tell more about this.

momack2 commented 4 years ago

@ribasushi - thought this thread might be interesting to you.

RubenKelevra commented 4 years ago

I think the tar command of ipfs should be changed to allow any folder stored via ipfs-mfs to be cat'ed out as a tar archive, and any tar container to be imported as an ipfs-mfs folder.

Compression is a different story, since it would be best to support it on the storage-layer side. If data is stored compressed, it might be wise to be able to export individual files compressed as well, or as a container that supports multiple files with individual compression (tar can't handle this).

Deduplication already works fine; you just have to switch to rabin or the new buzhash. See https://github.com/ipfs/go-ipfs/issues/6841#issuecomment-576747685

So I think this ticket can actually be closed, since it's already implemented :)