ipfs / boxo

A set of reference libraries for building IPFS applications and implementations in Go.
https://github.com/ipfs/boxo#readme

content aware chunking #508

Open Jorropo opened 11 months ago

Jorropo commented 11 months ago

One hard thing with deduplication over WORM merkle-tree datastructures like unixfs is that we are extremely limited when finding backrefs. Usually deduplication code is able to search for backrefs freely and fix up the old and new datastructures so that they point to a shared piece of data. Here we can't; consider this case:

# file 1
| chunk1       |
| aaaabbbbcccc | 
# file 2
| chunk1 |
| bbbb   | 

Assuming file 1 already exists, we can't do any deduplication when chunking file 2. We would have needed to chunk file 1 this way originally:

# file 1
|  c1  |  c2  |  c3  |
| aaaa | bbbb | cccc | 

However, when we are chunking file 1, we don't yet know what file 2's structure will be.
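To make the effect of boundary placement concrete, here is a minimal sketch (plain Go, not any boxo API; a truncated sha256 stands in for the CID a leaf block would get) showing that only the finer chunking of file 1 produces a leaf that file 2 can reuse:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafDigest stands in for the CID a unixfs leaf block would get.
func leafDigest(chunk []byte) string {
	sum := sha256.Sum256(chunk)
	return fmt.Sprintf("%x", sum[:4]) // truncated for readability
}

func digests(chunks [][]byte) map[string]bool {
	out := map[string]bool{}
	for _, c := range chunks {
		out[leafDigest(c)] = true
	}
	return out
}

func main() {
	file2 := [][]byte{[]byte("bbbb")}

	// file 1 chunked as one big leaf: no leaf is shared with file 2.
	coarse := [][]byte{[]byte("aaaabbbbcccc")}

	// file 1 chunked on the "interesting" boundaries: the bbbb leaf is
	// byte-identical to file 2's only leaf, so it deduplicates.
	fine := [][]byte{[]byte("aaaa"), []byte("bbbb"), []byte("cccc")}

	for name, chunks := range map[string][][]byte{"coarse": coarse, "fine": fine} {
		shared := 0
		d := digests(chunks)
		for _, c := range file2 {
			if d[leafDigest(c)] {
				shared++
			}
		}
		fmt.Printf("%s chunking of file 1 shares %d leaf block(s) with file 2\n", name, shared)
	}
}
```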


It would be nice to have a chunker which uses application-level knowledge to do this. We could try to match known file formats and change the chunking strategy based on the file. We should apply this logic recursively based on contextual hints.

Various strategies I can think of (mainly container formats):


The chunker would need to be robust and tolerant if a file is incorrectly classified or does not follow the underlying format correctly: it must still produce a valid unixfs file which, once deserialized, gives a byte-for-byte identical copy of the original input. In other words, if someone tries to add an invalid tar file, the chunker must create a unixfs file which encodes this invalid tar file.
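As a sketch of what such a format-aware strategy could look like (a hypothetical helper, not an existing boxo chunker), the function below tries to cut a buffer on tar record boundaries and falls back to fixed-size chunks for anything it cannot parse. Because it only ever splits the original bytes, concatenating the chunks always reproduces the input byte for byte, whether or not the data is actually a valid tar:

```go
package chunker

import (
	"bytes"
	"strconv"
	"strings"
)

const recordSize = 512
const fallbackSize = 256 * 1024 // plain fixed-size chunks when parsing fails

// tarAwareChunks splits data so that each tar header and each entry payload
// (padded to 512 bytes) becomes its own chunk. It never rewrites bytes, so
// bytes.Join(chunks, nil) is always identical to data, valid tar or not.
func tarAwareChunks(data []byte) [][]byte {
	var chunks [][]byte
	rest := data
	for len(rest) >= recordSize {
		header := rest[:recordSize]
		if isZeroBlock(header) {
			break // end-of-archive marker (or not a tar at all); stop parsing
		}
		size, err := parseOctal(header[124 : 124+12])
		if err != nil || size < 0 {
			break // malformed header: stop parsing, fall back below
		}
		padded := (size + recordSize - 1) / recordSize * recordSize
		if padded > int64(len(rest)-recordSize) {
			break // truncated or absurd entry
		}
		payload := int(padded)
		chunks = append(chunks, header)
		if payload > 0 {
			chunks = append(chunks, rest[recordSize:recordSize+payload])
		}
		rest = rest[recordSize+payload:]
	}
	// Whatever we could not (or chose not to) parse is chunked naively.
	for len(rest) > 0 {
		n := fallbackSize
		if n > len(rest) {
			n = len(rest)
		}
		chunks = append(chunks, rest[:n])
		rest = rest[n:]
	}
	return chunks
}

func isZeroBlock(b []byte) bool { return bytes.Count(b, []byte{0}) == len(b) }

func parseOctal(b []byte) (int64, error) {
	s := strings.Trim(string(b), " \x00")
	if s == "" {
		return 0, nil
	}
	return strconv.ParseInt(s, 8, 64)
}
```

Since the chunker only chooses split points, the invariant `bytes.Equal(bytes.Join(tarAwareChunks(data), nil), data)` holds by construction, which is exactly the round-trip guarantee required above.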


This solution is simpler than allowing arbitrary backrefs into other blocks because it ensures files don't grow by adding more backrefs. We don't want to find ourselves in a situation where a file is bigger to download from scratch than a non-deduplicated alternative because we assumed that a bunch of blocks would already be cached. It also does not require synchronizing matches; we can have deduplication without ever comparing old and new versions.


In this case we are not only considering where the leaves are placed but also how these are linked together. If I add a tar, it's not good enough that it uses the same leaves as the file outside of the tar; we want pinning the tar to be equivalent to also pinning the file outside of the tar, so the root CID of the file must be linked from the tar's unixfs DAG.
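A rough sketch of the intended DAG shape (hypothetical node type and labels, not dag-pb; the file name `foo.txt` is only an example): the tar's unixfs root does not merely reuse foo.txt's leaves, it links foo.txt's own root node as a subtree, so a recursive pin of the tar also pins foo.txt's root:

```go
package main

import "fmt"

// node is a stand-in for a unixfs dag-pb node: a label for readability
// plus links to its children.
type node struct {
	label    string
	children []*node
}

// pin walks the DAG recursively, the way a recursive pin would.
func pin(n *node, pinned map[string]bool) {
	pinned[n.label] = true
	for _, c := range n.children {
		pin(c, pinned)
	}
}

func main() {
	// foo.txt added on its own: root -> leaves.
	fooRoot := &node{
		label:    "root-foo.txt",
		children: []*node{{label: "leaf-foo-1"}, {label: "leaf-foo-2"}},
	}

	// archive.tar containing foo.txt: the tar's DAG links fooRoot itself
	// (plus extra leaves for the tar header and padding bytes), instead
	// of only reusing foo.txt's leaf blocks.
	tarRoot := &node{
		label: "root-archive.tar",
		children: []*node{
			{label: "leaf-tar-header"},
			fooRoot,
			{label: "leaf-tar-padding"},
		},
	}

	pinned := map[string]bool{}
	pin(tarRoot, pinned)
	fmt.Println("pinning the tar also pinned root-foo.txt:", pinned["root-foo.txt"])
}
```

This works for unixfs because foo.txt's bytes appear contiguously inside the tar payload, so the sub-DAG's concatenated leaves slot directly into the tar's byte stream.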


The resulting files would be compatible with existing unixfs decoders and would not require out-of-band file-format knowledge; this is purely an improvement in how we convert from non-unixfs to unixfs. The goal is not to add a webm or tar codec which everyone would need to update to understand.

hacdias commented 10 months ago

Related to https://github.com/ipfs/kubo/issues/3604