donothesitate opened this issue 8 years ago
Have a look at https://github.com/ipfs/archives/issues/142
Also, when it comes to compression, ZPAQ and NanoZip beat pcompress by a long shot.
Came across this while looking for semantic chunking
(e.g. images chunked by their r, g, b, alpha planes separately). Wondering if this is supported. The discussions in the archive seem to have stalled.
Is there any way the chunking mechanism can be customized by developers (when using js-ipfs, libp2p, etc.)?
This apparently is being objected to by the go-ipfs developers (@warpfork & @jorropo), who seem to have prior experience that leads them to believe content-dependent chunking does not provide benefits. Details are on the #ipld channel.
My personal opinion is that:
This apparently is being objected to by the go-ipfs developers (warpfork & Jorropo)
Am I a go-ipfs dev now? I'm not and never said I was. Plus, even assuming I was, that's only two people, not the full team.
This apparently is being objected to by ...
Please do not say what other people think on their behalf without quoting them and linking to the full picture. Misrepresenting the ideas of others is very irritating.
I'm actually planning to experiment with some content-based chunking myself soonish.
What I was saying in the conversation we had on Discord is that content-based chunking is very inefficient at reducing bundle sizes and saving space. Compression fills that role far better for less effort.
Content-based chunking is only really effective for file formats that contain other files, where you want to store both the container and its contents: for example, .tar files that contain files, .car files that contain blocks, or .webm files that contain VP9 and AAC. In most other cases it's less effective than wrapping your files in zstd, lzma2, ...
Note that it can have a positive impact on latency if you know the file's access layout and where reads will seek to, but the effect is really tiny given the current block size.
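To make the container-format point above concrete, here is a minimal, hedged Go sketch (the names and file path are made up, and it is not part of any IPFS library) that splits a raw tar stream at entry boundaries so each archive member becomes its own chunk. Real tar streams also have PAX/GNU extended headers and base-256 size fields that this ignores.

```go
// Hypothetical sketch of container-aware chunking for tar: emit each archive
// member (512-byte header + padded data) as its own chunk.
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
)

const blockSize = 512

// nextTarChunk reads one tar entry from r and returns its raw bytes
// (header block plus data padded to a multiple of 512 bytes).
// It returns io.EOF at the end-of-archive marker.
func nextTarChunk(r io.Reader) ([]byte, error) {
	header := make([]byte, blockSize)
	if _, err := io.ReadFull(r, header); err != nil {
		return nil, err
	}
	// Two all-zero blocks mark the end of the archive; stop at the first one.
	if bytes.Equal(header, make([]byte, blockSize)) {
		return nil, io.EOF
	}
	// The entry size is an octal string at offset 124, 12 bytes long.
	sizeField := strings.Trim(string(header[124:136]), " \x00")
	size, err := strconv.ParseInt(sizeField, 8, 64)
	if err != nil {
		return nil, fmt.Errorf("bad tar size field: %w", err)
	}
	padded := (size + blockSize - 1) / blockSize * blockSize
	data := make([]byte, padded)
	if _, err := io.ReadFull(r, data); err != nil {
		return nil, err
	}
	return append(header, data...), nil
}

func main() {
	f, err := os.Open("file.tar") // example input, matching the .tar case above
	if err != nil {
		panic(err)
	}
	defer f.Close()
	for i := 0; ; i++ {
		chunk, err := nextTarChunk(f)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("chunk %d: %d bytes\n", i, len(chunk))
	}
}
```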
@Jorropo the reason I tagged you is so you can correct the record of what you personally said (as you did). Also, I did provide a link to the content on Discord. But enough about that; please presume good intentions until proven otherwise.
I think compression serves only one of the goals (data size, storage/retrieval speed) but not the other (content lineage, large numbers of variations). These are basically different use cases. When we know nothing about the data itself, Rabin-Karp is probably best, followed by some general-purpose compression (ZSTD is great). I posted some Windows Server deduplication stats (it uses Rabin-Karp) elsewhere, and they look pretty good on data that contains VMs, for instance, but also on other non-specific data.
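For reference, here is a hedged Go sketch of the rolling-hash (Rabin-Karp-style) content-defined chunking being discussed; the window size, multiplier, mask, and size bounds are arbitrary illustrative values, not the parameters of any shipping IPFS or Windows Server chunker.

```go
// Content-defined chunking sketch: a polynomial rolling hash over a sliding
// window, with a cut whenever the low bits of the hash are zero.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

const (
	window   = 48            // bytes in the rolling window
	prime    = 16777619      // polynomial multiplier (arbitrary choice)
	mask     = (1 << 13) - 1 // expect a cut roughly every 8 KiB
	minChunk = 2 << 10       // 2 KiB lower bound
	maxChunk = 64 << 10      // 64 KiB upper bound
)

// pow = prime^window, used to cancel the byte that falls out of the window.
var pow = func() uint64 {
	p := uint64(1)
	for i := 0; i < window; i++ {
		p *= prime
	}
	return p
}()

// chunkSizes splits r into content-defined chunks and returns their sizes.
func chunkSizes(r io.Reader) ([]int, error) {
	br := bufio.NewReader(r)
	var sizes []int
	var hash uint64
	win := make([]byte, window)
	n := 0 // bytes in the current chunk so far
	for {
		b, err := br.ReadByte()
		if err == io.EOF {
			if n > 0 {
				sizes = append(sizes, n)
			}
			return sizes, nil
		}
		if err != nil {
			return nil, err
		}
		// Slide the window: the byte stored here window bytes ago drops out.
		old := win[n%window]
		win[n%window] = b
		hash = hash*prime + uint64(b) - uint64(old)*pow
		n++
		// Cut when the low bits of the hash are all zero, within size bounds.
		if (n >= minChunk && hash&mask == 0) || n >= maxChunk {
			sizes = append(sizes, n)
			n = 0
			hash = 0
			win = make([]byte, window) // forget bytes from the previous chunk
		}
	}
}

func main() {
	sizes, err := chunkSizes(os.Stdin)
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes)
}
```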
The more important point I was trying to make is that we need some dials to select codec(s), chunk size, etc. to accommodate different use cases. That also presumes a pluggable codec architecture in all the main stacks, so we can avoid debates about what is and isn't included in the stack itself. I would like to choose the codecs and their parameters to pre-load on my IPFS node.
The drawback is that pluggable non-default codecs partition the data space into those who can read it and those who can't, but this can be remedied by providing codec addresses (which should also be in IPFS, under some trust hierarchies) and a default binary decoder if someone doesn't trust the particular codec or it isn't available.
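To make the "dials" idea slightly more concrete, here is a purely hypothetical Go sketch of a per-node chunking/codec profile; none of these field names or option strings exist in go-ipfs or js-ipfs configuration today.

```go
// Purely hypothetical sketch of the proposed "dials"; not real configuration.
package main

import "fmt"

// ChunkingProfile is what a node operator might pre-load to pick chunking
// and codec behaviour per use case.
type ChunkingProfile struct {
	Chunker      string   // e.g. "size-262144", "rabin", "tar-aware"
	MinChunkSize int      // lower bound for content-defined chunkers
	MaxChunkSize int      // upper bound for content-defined chunkers
	Codecs       []string // pluggable codecs to pre-load, by name or codec address
	Compress     string   // optional per-block compression, e.g. "zstd"
}

func main() {
	archival := ChunkingProfile{
		Chunker:      "rabin",
		MinChunkSize: 64 << 10,
		MaxChunkSize: 1 << 20,
		Codecs:       []string{"dag-cbor", "my-custom-codec"},
		Compress:     "zstd",
	}
	fmt.Printf("%+v\n", archival)
}
```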
@andrey-savov from reading your post and the channel, you seem to be conflating multiple different types of extensibility here: 1) changing how a bag of bytes is chunked up as a file (i.e. chunking) vs. 2) changing how a bag of bytes is interpreted as a DAG (i.e. custom IPLD representations).
Taking a look, there are generally three different extensibility points you could use here. I think they were largely covered in the Matrix thread, but for posterity/future discussion it's likely easier to track them here.
Note that in general developers I've encountered within the IPFS ecosystem try to make things as extensible as they can without making UX/DX miserable. If you find an area is insufficiently extensible and have a concrete proposal for how to make things better feel free to open up an issue focused on your particular proposal.
UnixFS is widely supported all over the IPFS ecosystem. You can write a custom chunker that will take a file and chunk it up in a way that existing IPFS implementations can easily deal with. For example, it's very doable to do myUnixfsChunker file.tar | ipfs dag import and your resulting CID bafymyunixfschunkedtarfile will happily be processed everywhere in the IPFS ecosystem that a fixed-size-chunked UnixFS file would end up.
If you're writing in Go you can even make your chunker fulfill the interfaces from https://github.com/ipfs/go-ipfs-chunker and then try to upstream your changes into projects like go-ipfs. In the meanwhile, even while nothing is upstreamed, your changes are easily usable within the ecosystem.
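As an illustration of that route, here is a hedged Go sketch of a custom splitter, assuming the interface in go-ipfs-chunker is Reader() io.Reader plus NextBytes() ([]byte, error) returning io.EOF when the input is exhausted (worth re-checking against the current repository). The "merge newline-terminated lines up to a minimum size" rule is only a placeholder for whatever content-aware logic you actually want.

```go
// Sketch of a custom chunker written against the Splitter interface from
// https://github.com/ipfs/go-ipfs-chunker (assumed shape, see lead-in).
package mychunker

import (
	"bufio"
	"io"
)

// LineSplitter groups newline-terminated lines into chunks of at least min bytes.
type LineSplitter struct {
	r   io.Reader
	br  *bufio.Reader
	min int
}

func NewLineSplitter(r io.Reader, min int) *LineSplitter {
	if min < 1 {
		min = 1
	}
	return &LineSplitter{r: r, br: bufio.NewReader(r), min: min}
}

// Reader returns the underlying reader, as the Splitter interface expects.
func (s *LineSplitter) Reader() io.Reader { return s.r }

// NextBytes returns the next chunk, or io.EOF once the input is exhausted.
func (s *LineSplitter) NextBytes() ([]byte, error) {
	var chunk []byte
	for len(chunk) < s.min {
		line, err := s.br.ReadBytes('\n')
		chunk = append(chunk, line...)
		if err == io.EOF {
			if len(chunk) == 0 {
				return nil, io.EOF
			}
			return chunk, nil
		}
		if err != nil {
			return nil, err
		}
	}
	return chunk, nil
}
```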
Like the above, this is very doable and you can do myDagCborTarFormatter file.tar | ipfs dag import. Unlike with the UnixFS chunker, you haven't written any code that defines how to render your DAG as a file, so some custom code will be needed to take bafymydagcborformattedtarfile and turn it into a file. This means you'll need to do something like ipfs dag export bafymydagcborformattedtarfile | myDagCborTarFormatter > mytar.tar.
Advantages compared to using UnixFS: You can use IPLD to query things like the headers and other format particulars of your file type.
Disadvantages compared to using UnixFS: Lack of compatibility with built-in tooling (e.g. ipfs get bafymydagcborformattedtarfile or dweb.link/ipfs/bafymydagcborformattedtarfile won't work).
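To illustrate the querying advantage, here is a hypothetical, shape-only Go sketch (every field name is made up) of how a tar archive could be laid out as a custom DAG-CBOR graph, with parsed header fields queryable via IPLD and the untouched bytes kept behind CID links so the original tar can be reassembled.

```go
// Shape-only illustration: one way a tar archive could be modelled as a
// custom DAG, exposing header fields as data and raw bytes as links.
package tardag

import cid "github.com/ipfs/go-cid"

// TarArchive is the root node: an ordered list of entries.
type TarArchive struct {
	Entries []TarEntry
}

// TarEntry exposes interesting header fields and links to the raw blocks.
type TarEntry struct {
	Name      string
	Size      int64
	Mode      int64
	ModTime   int64
	RawHeader cid.Cid   // link to the original 512-byte header block
	Data      []cid.Cid // links to the (chunked) entry contents
}
```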
People definitely use non-UnixFS IPLD regularly. Historically people have gotten the most benefit doing this for data types that aren't well expressed as large single flat files (e.g. things with hash links in them, encrypted/signed pieces, etc.)
Note: There was an early attempt at such a thing with the ipfs tar
command, which represents TAR files using a non-UnixFS type of DAG-PB. This has not taken off, with people preferring to use standard UnixFS importing (even without UnixFS chunking specific to tar files). IMO having a UnixFS tar chunker would likely have much higher adoption, but that's just me 🤷.
Like the above, you can do myTarCodecFormatter file.tar | ipfs dag import. However, unlike the above, you'll likely have to do more work to get ipfs dag export bafymytarfile | myTarCodecFormatter > mytar.tar, since your IPFS node may need to understand how to traverse your CID links to download the whole DAG.
The options here are then to:
1) Write a new codec and make sure it's in the places you want to use it. go-ipfs makes it relatively straightforward to build a custom binary with new codecs in it (https://github.com/ipfs/go-ipfs/blob/master/docs/plugins.md#ipld), however if you want it to be used by default in more existing tooling you'll then have to push for that. For example, the creators of DAG-JOSE recently did this when upstreaming their codecs into go-ipfs and js-ipfs. There are some interesting ideas around using VMs to allow dynamically loading codecs over IPFS, but there's a bunch of technical work that would need to happen to unlock that, which seems like a weird thing to block development of a new file importer over.
2) Utilize your new codec within your own tooling and make some wrapper graph in a common codec like DAG-CBOR to move your data around between existing IPFS tools. This is nice as a bridge while you're working to make your codec more prevalent, but it's not super fun.
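For orientation, here is a toy sketch of what a new codec boils down to: an encode function (node to bytes) and a decode function (bytes to node). The function shapes below follow my understanding of go-ipld-prime's codec.Encoder/codec.Decoder signatures; treat that, and the registration details, as assumptions to verify against the current go-ipld-prime docs. The "format" here is deliberately trivial: it only round-trips bytes nodes, like a private variant of the raw codec.

```go
// Toy illustration of the two functions a new IPLD codec must provide
// (assumed signatures, see lead-in).
package mytoycodec

import (
	"fmt"
	"io"

	"github.com/ipld/go-ipld-prime/datamodel"
)

// Encode writes the payload of a bytes-kind node straight to w.
func Encode(n datamodel.Node, w io.Writer) error {
	if n.Kind() != datamodel.Kind_Bytes {
		return fmt.Errorf("mytoycodec: only bytes nodes are supported")
	}
	b, err := n.AsBytes()
	if err != nil {
		return err
	}
	_, err = w.Write(b)
	return err
}

// Decode reads all of r and assembles it as a bytes node.
func Decode(na datamodel.NodeAssembler, r io.Reader) error {
	b, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	return na.AssignBytes(b)
}
```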
Historically new codecs have mostly been used for compatibility with existing hash-linked formats (e.g. Git), but there are certainly other use cases as well (e.g. DAG-JOSE).
Advantages compared to using a supported codec: You can save some bytes by making more things implied by the codec identifier rather than explicit as fields in some format like DAG-CBOR or DAG-JSON.
Disadvantages compared to using a supported codec: You need to build more tooling with awareness of your new codec.
@aschmahmann Appreciate the thoughtful comment. While I am studying it in more detail to form concrete proposals, I wanted to make some high level comments:
In general, as we move toward decentralization of the Internet, we should rely less and less on centralized decisions, especially made by a group of people. In that respect, maintainers of the IPFS reference implementations should be rule-setters, not arbiters of merit.
There is a need for a content-dependent chunker.
Therefore I welcome objective, non-opinionated discussion, proofs of concept, benchmarks, and other work around this subject.
The open question is what technique, what polynomial (if applicable), and what parameters to use for a given chunker, so that it is performant across the board and effectively helps everyone.
There is also the question of which files/data to chunk with such a chunker, given that compressed data would likely not benefit at all from content-dependent chunking (unless the file is an archive with non-solid compression, or similar). Should this be decided automatically by some heuristic, e.g. use such a chunker for text files that are not minified js/css and a regular chunker otherwise? By file headers?
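As one possible answer to the "by file headers?" question, here is a hedged Go sketch of a magic-byte heuristic: skip content-defined chunking for data that is already compressed and use it everywhere else. The magic numbers are real; the policy and the returned chunker names (which echo go-ipfs's --chunker values) are only illustrative.

```go
// Hedged sketch of a header-based heuristic for choosing a chunker.
package main

import (
	"bytes"
	"fmt"
	"os"
)

var compressedMagic = [][]byte{
	{0x1f, 0x8b},               // gzip
	{0x28, 0xb5, 0x2f, 0xfd},   // zstd
	{0xfd, '7', 'z', 'X', 'Z'}, // xz
	{'P', 'K', 0x03, 0x04},     // zip (a non-solid archive might still be worth chunking)
	{0xff, 0xd8, 0xff},         // jpeg
}

// pickChunker chooses a chunking strategy from the first bytes of a file.
func pickChunker(header []byte) string {
	for _, magic := range compressedMagic {
		if bytes.HasPrefix(header, magic) {
			return "size-262144" // fixed-size: content-defined chunking won't help here
		}
	}
	return "rabin" // content-defined chunking for everything else
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pickchunker <file>")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()
	header := make([]byte, 8)
	n, _ := f.Read(header)
	fmt.Println(pickChunker(header[:n]))
}
```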
This could have a great impact on (distributed) archival of knowledge (think Archive.org, except with dedup, better compression, and easy distribution). It also raises the question of whether chunks should be stored compressed, but that is partially side-tracking this issue.
One reference implementation with focus on storage saving (faster convergence of chunk boundaries): https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape/chunkify.h
Other references: https://en.wikipedia.org/wiki/MinHash https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm https://en.wikipedia.org/wiki/Rolling_hash https://moinakg.github.io/pcompress/