ipfs / boxo

A set of reference libraries for building IPFS applications and implementations in Go.
https://github.com/ipfs/boxo#readme

content aware chunking #508

Open Jorropo opened 11 months ago

Jorropo commented 11 months ago

One hard thing with deduplication over WORM merkle-tree datastructures like unixfs is that we are extremely limited when finding backrefs. Usually deduplication code is able to search for backrefs freely and fix up the old and new datastructures so that they point to a shared piece of data. Here we can't; consider this case:

# file 1
| chunk1       |
| aaaabbbbcccc | 
# file 2
| chunk1 |
| bbbb   | 

Assuming file 1 already exists, we can't do any deduplication when chunking file 2. We would have needed to chunk file 1 this way originally:

# file 1
|  c1  |  c2  |  c3  |
| aaaa | bbbb | cccc | 

However, when we are chunking file 1, we don't yet know what file 2's structure will be.
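To make the effect of boundary placement concrete, here is a minimal sketch (plain Go, not any boxo API; a truncated sha256 stands in for the CID a leaf block would get) showing that only the finer chunking of file 1 produces a leaf that file 2 can reuse:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafDigest stands in for the CID a unixfs leaf block would get.
func leafDigest(chunk []byte) string {
	sum := sha256.Sum256(chunk)
	return fmt.Sprintf("%x", sum[:4]) // truncated for readability
}

func digests(chunks [][]byte) map[string]bool {
	out := map[string]bool{}
	for _, c := range chunks {
		out[leafDigest(c)] = true
	}
	return out
}

func main() {
	file2 := [][]byte{[]byte("bbbb")}

	// file 1 chunked as one big leaf: no leaf is shared with file 2.
	coarse := [][]byte{[]byte("aaaabbbbcccc")}

	// file 1 chunked on the "interesting" boundaries: the bbbb leaf is
	// byte-identical to file 2's only leaf, so it deduplicates.
	fine := [][]byte{[]byte("aaaa"), []byte("bbbb"), []byte("cccc")}

	for name, chunks := range map[string][][]byte{"coarse": coarse, "fine": fine} {
		shared := 0
		d := digests(chunks)
		for _, c := range file2 {
			if d[leafDigest(c)] {
				shared++
			}
		}
		fmt.Printf("%s chunking of file 1 shares %d leaf block(s) with file 2\n", name, shared)
	}
}
```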


It would be nice to have a chunker which uses application-level knowledge to do this. We could try to match known file formats and change the chunking strategy based on the file. We should apply this logic recursively based on contextual hints.

Various strategies I can think of (mainly container formats):


The chunker would need to be robust and tolerant if a file is incorrectly classified or does not follow the underlying format correctly: it must still produce a valid unixfs file which, once deserialized, gives a byte-for-byte identical copy of the original input. In other words, if someone tries to add an invalid tar file, the chunker must create a unixfs file which encodes this invalid tar file.
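As a sketch of what such a format-aware strategy could look like (a hypothetical helper, not an existing boxo chunker), the function below tries to cut a buffer on tar record boundaries and falls back to fixed-size chunks for anything it cannot parse. Because it only ever splits the original bytes, concatenating the chunks always reproduces the input byte for byte, whether or not the data is actually a valid tar:

```go
package chunker

import (
	"bytes"
	"strconv"
	"strings"
)

const recordSize = 512
const fallbackSize = 256 * 1024 // plain fixed-size chunks when parsing fails

// tarAwareChunks splits data so that each tar header and each entry payload
// (padded to 512 bytes) becomes its own chunk. It never rewrites bytes, so
// bytes.Join(chunks, nil) is always identical to data, valid tar or not.
func tarAwareChunks(data []byte) [][]byte {
	var chunks [][]byte
	rest := data
	for len(rest) >= recordSize {
		header := rest[:recordSize]
		if isZeroBlock(header) {
			break // end-of-archive marker (or not a tar at all); stop parsing
		}
		size, err := parseOctal(header[124 : 124+12])
		if err != nil || size < 0 {
			break // malformed header: stop parsing, fall back below
		}
		padded := (size + recordSize - 1) / recordSize * recordSize
		if padded > int64(len(rest)-recordSize) {
			break // truncated or absurd entry
		}
		payload := int(padded)
		chunks = append(chunks, header)
		if payload > 0 {
			chunks = append(chunks, rest[recordSize:recordSize+payload])
		}
		rest = rest[recordSize+payload:]
	}
	// Whatever we could not (or chose not to) parse is chunked naively.
	for len(rest) > 0 {
		n := fallbackSize
		if n > len(rest) {
			n = len(rest)
		}
		chunks = append(chunks, rest[:n])
		rest = rest[n:]
	}
	return chunks
}

func isZeroBlock(b []byte) bool { return bytes.Count(b, []byte{0}) == len(b) }

func parseOctal(b []byte) (int64, error) {
	s := strings.Trim(string(b), " \x00")
	if s == "" {
		return 0, nil
	}
	return strconv.ParseInt(s, 8, 64)
}
```

Since the chunker only chooses split points, the invariant `bytes.Equal(bytes.Join(tarAwareChunks(data), nil), data)` holds by construction, which is exactly the round-trip guarantee required above.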


This solution is simpler than allowing arbitrary backrefs into other blocks because it ensures files don't grow by adding more backrefs. We don't want to find ourselves in a situation where a file is bigger to download from scratch than a non-deduplicated alternative because we assumed that a bunch of blocks would already be cached. It also does not require synchronizing matches; we can have deduplication without ever comparing old and new versions.


In this case we are not only considering where the leaves are placed but also how these are linked together. If I add a tar, it's not good enough that it uses the same leaves as the file outside of the tar; we want pinning the tar to be equivalent to also pinning the file outside of the tar, so the root CID of the file must be linked from the tar's unixfs DAG.
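A rough sketch of the intended DAG shape (hypothetical node type and labels, not dag-pb; the file name `foo.txt` is only an example): the tar's unixfs root does not merely reuse foo.txt's leaves, it links foo.txt's own root node as a subtree, so a recursive pin of the tar also pins foo.txt's root:

```go
package main

import "fmt"

// node is a stand-in for a unixfs dag-pb node: a label for readability
// plus links to its children.
type node struct {
	label    string
	children []*node
}

// pin walks the DAG recursively, the way a recursive pin would.
func pin(n *node, pinned map[string]bool) {
	pinned[n.label] = true
	for _, c := range n.children {
		pin(c, pinned)
	}
}

func main() {
	// foo.txt added on its own: root -> leaves.
	fooRoot := &node{
		label:    "root-foo.txt",
		children: []*node{{label: "leaf-foo-1"}, {label: "leaf-foo-2"}},
	}

	// archive.tar containing foo.txt: the tar's DAG links fooRoot itself
	// (plus extra leaves for the tar header and padding bytes), instead
	// of only reusing foo.txt's leaf blocks.
	tarRoot := &node{
		label: "root-archive.tar",
		children: []*node{
			{label: "leaf-tar-header"},
			fooRoot,
			{label: "leaf-tar-padding"},
		},
	}

	pinned := map[string]bool{}
	pin(tarRoot, pinned)
	fmt.Println("pinning the tar also pinned root-foo.txt:", pinned["root-foo.txt"])
}
```

This works for unixfs because foo.txt's bytes appear contiguously inside the tar payload, so the sub-DAG's concatenated leaves slot directly into the tar's byte stream.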


The resulting files would be compatible with existing unixfs decoders and would not require out-of-band file-format knowledge; this is purely an improvement in how we convert from non-unixfs to unixfs. The goal is not to add a webm or tar codec which everyone would need to update to understand.

hacdias commented 10 months ago

Related to https://github.com/ipfs/kubo/issues/3604