lavventura opened 5 years ago
Test this:

```
ipfs add -s=rabin filename
```
Rabin is a different chunking algorithm. It is more effective for big files with small changes.
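To illustrate why a content-defined chunker like rabin helps: chunk boundaries are derived from the data itself (a hash over a small sliding window), so an insertion only disturbs the chunks around the edit and everything after re-aligns. A toy sketch in Python (this is *not* go-ipfs's actual rabin implementation; the window size and mask are made up, and sha256 stands in for a real rolling hash):

```python
import hashlib
import random

WINDOW = 16          # sliding-window size in bytes (illustrative)
MASK = (1 << 9) - 1  # boundary when low 9 hash bits are zero (~512 B average chunks)

def is_boundary(window: bytes) -> bool:
    # A real chunker uses a rolling hash (rabin, buzhash); hashing the whole
    # window is slower but shows the same property: the boundary decision
    # depends only on local content, never on absolute file offsets.
    h = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
    return (h & MASK) == 0

def chunk(data: bytes) -> list:
    out, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        if is_boundary(data[i - WINDOW + 1:i + 1]):
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

original = random.Random(42).randbytes(16384)
edited = original[:5] + b"X" + original[5:]   # insert one byte near the front

a, b = chunk(original), chunk(edited)
shared = len(set(a) & set(b))
print(f"{len(a)} vs {len(b)} chunks, {shared} shared")  # only chunks near the edit differ
```

With a fixed-size splitter the same one-byte insertion would shift every block and nothing would be shared.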
Note that *.tar.gz files are compressed by default in a way that typically means a single byte change in an internal data file will result in nearly every byte changing in the compressed output.
This is why things like rsync cannot efficiently update standard *.gz files. Some (all?) gzip implementations support an `--rsyncable`
argument that sacrifices a little bit of compression to minimise the differences in the compressed output. Interestingly, it does this using something similar to rabin chunking under the hood, though I think it predates rabin.
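The avalanche effect is easy to demonstrate with Python's zlib (DEFLATE, the same algorithm gzip uses): insert one byte near the start of the input and the compressed streams diverge almost immediately. The payload below is made up for illustration:

```python
import zlib

payload = b"The quick brown fox jumps over the lazy dog. " * 4000
tweaked = payload[:1] + b"X" + payload[1:]   # insert a single byte near the front

a = zlib.compress(payload, 9)
b = zlib.compress(tweaked, 9)

# How many leading bytes do the two compressed streams have in common?
common = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y),
              min(len(a), len(b)))
print(f"{len(a)} compressed bytes, streams agree only for the first {common}")
```

This is why rabin chunking over a standard .gz file finds almost nothing to dedupe, while `--rsyncable` localises the damage to the region around the edit.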
So you will need `gzip --rsyncable` when creating your tar.gz and `ipfs add -s rabin` to get any deduping.
I know this goes against most of what IPFS plans to do (as it's a layer up), but it would be interesting to do some research into whether "understanding" compressed formats would have any benefit. I'm thinking something along the lines of the following:
- For .tar files, IPFS could be smart enough to store each file individually and rebuild the tar container itself when requested. For people using tar files as backups, this may result in a considerable saving if the files don't change much over time.

These would enable small changes in files to only require that file to be retransferred (as the rest could be rebuilt).

Caveats/downsides:
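A sketch of that idea with Python's tarfile module: pull the members (header plus body) out of an archive, store them individually, and reconstruct the container byte-for-byte on demand. The in-memory archive and file names are purely illustrative; a real importer would have to pin down the same format details (ustar vs. pax headers, padding) that this sketch pins explicitly:

```python
import io
import tarfile

FMT = tarfile.USTAR_FORMAT  # pin the header format so rebuilds are deterministic

# Build a small archive in memory (a stand-in for the user's real tar file).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=FMT) as tf:
    for name, body in [("a.txt", b"alpha"), ("b.txt", b"bravo")]:
        info = tarfile.TarInfo(name)
        info.size = len(body)
        info.mtime = 0          # fixed metadata keeps the rebuild reproducible
        tf.addfile(info, io.BytesIO(body))
original = buf.getvalue()

# "Import": store each member's header and body separately (these would
# become individual objects in the hypothetical scheme).
members = []
with tarfile.open(fileobj=io.BytesIO(original)) as tf:
    for info in tf.getmembers():
        members.append((info, tf.extractfile(info).read()))

# "Export": rebuild the container from the stored pieces on request.
out = io.BytesIO()
with tarfile.open(fileobj=out, mode="w", format=FMT) as tf:
    for info, body in members:
        tf.addfile(info, io.BytesIO(body))

print(out.getvalue() == original)
```

Because each member is stored separately, editing one file inside the archive only invalidates that member's blocks, not the whole container.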
That reminds me of some work @mib-kd743naq is doing, where you store a hybrid in IPFS which can be served up as individual files, but also as a tar archive. I couldn't find a good link to it, but I'm sure @mib-kd743naq can tell more about this.
@ribasushi - thought this thread might be interesting to you.
I think ipfs's tar command should be changed to allow any folder stored via ipfs-mfs to be cat out as a tar, and each tar container to be imported as an ipfs-mfs folder.
Compression is a different story, since it would be best to support it on the storage-layer side. If data is stored compressed, it might be wise to be able to export individual files compressed as well, or as a container format which supports multiple individually compressed files (tar can't handle this).
Deduplication works already fine, you just have to switch to rabin or the new buzhash. See https://github.com/ipfs/go-ipfs/issues/6841#issuecomment-576747685
So I think this ticket can actually be closed, since it's already implemented :)
When a file is updated and re-synced, only the changed blocks should be transferred and stored. That would reduce block duplication on nodes all over the world, decrease communication cost (only the updated blocks are downloaded), and save storage (only the updated sections of the file are stored as new blocks).
Example: a file.tar.gz (~100 GB) containing a data.txt file is stored in my IPFS repo, pulled from another node (node-a).
I open data.txt, add a single character at a random location (beginning, middle, or end of the file), compress it again as file.tar.gz, and store it in my IPFS repo. The update itself is only a few kilobytes.
[*] When I deleted a single character at the beginning of the file, the hashes of all the 124 kB blocks were altered, which leads to the complete file being downloaded again.
As a result, when node-a wants to re-get the updated tar.gz, a re-sync takes place and the whole file is downloaded all over again. This duplicates blocks (~100 GB in this example) even though the change is only a few kilobytes, and the duplication spreads iteratively to all the peers, which is very inefficient and consumes a large amount of storage and additional communication cost over time.
Other cloud services try to solve this problem using "Block-Level File Copying". In their case, as with IPFS, a fixed block list is used: when a file is updated (a character is added at the beginning of the file), Dropbox or OneDrive will re-upload the whole file, since the first block's hash changes and the shift also changes the hashes of all the subsequent blocks. This doesn't solve the problem.
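The shift problem is easy to see with fixed-size blocks. The block size and data below are made up; the point is that a one-byte insertion at the front changes every block hash:

```python
import hashlib

def block_hashes(data: bytes, size: int = 4096):
    # Fixed-size "block-level copying": hash each size-byte slice.
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

original = bytes(range(256)) * 256        # 64 KiB of sample data
edited = b"X" + original                  # one byte inserted at the very front

a = block_hashes(original)
b = block_hashes(edited)
unchanged = sum(x == y for x, y in zip(a, b))
print(unchanged)  # → 0: every block boundary shifted, so no block hash survives
```

A content-defined chunker (`-s rabin`, or the newer buzhash) picks boundaries from the data itself, so only the blocks around the edit change.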
=> I believe a better solution is to consider diffs between each commit of a file; the approach git-diff uses could be considered. This would upload only the changed (diff) parts of the file, which would be a few kilobytes in the example I gave, and the diffed blocks would be merged when other nodes pull that file. The communication cost would be only a few kilobytes, and the amount of data added to storage would be only a few kilobytes as well.
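A sketch of the git-style delta idea using Python's difflib (git itself uses binary deltas inside packfiles; this is only an illustration on lines, with made-up data). The delta's size tracks the edit, not the file:

```python
import difflib

old = [f"line {i}\n" for i in range(10000)]
new = old[:5000] + ["inserted line\n"] + old[5000:]   # one-line edit

sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
# Only the non-"equal" opcodes need to be transferred to peers:
delta = [(tag, i1, i2, new[j1:j2])
         for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
print(delta)  # a single tiny insert, regardless of the 10,000-line file

# The receiver merges the delta with the blocks it already has:
rebuilt, pos = [], 0
for tag, i1, i2, replacement in delta:
    rebuilt.extend(old[pos:i1])   # unchanged content the receiver already holds
    rebuilt.extend(replacement)   # the few transferred kilobytes
    pos = i2
rebuilt.extend(old[pos:])
print(rebuilt == new)
```

The same shape works for deletes and replaces: each opcode carries only the replacement text plus the offsets at which to apply it.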
I know that it would be difficult to re-design IPFS, but this could be done as a wrapper solution that combines `ipfs` and `git`, which users could adopt for very large files based on their needs. This problem is not currently treated as a priority by the IPFS team, but it should be.
Please see the discussions I have already opened, and feel free to add your ideas to them:
=> Does IPFS provide block-level file copying feature?
=> Efficiency of IPFS for sharing updated file