donothesitate opened this issue 8 years ago
Have a look at https://github.com/ipfs/archives/issues/142
Also, when it comes to compression, ZPAQ and NanoZip beat pcompress by a long shot.
Came across this while looking for semantic chunking
(e.g. images chunked by their r, g, b, alpha planes separately). Wondering if this is supported. The discussions in the archive seem to have stalled.
Is there any way the chunking mechanism can be customized by developers (when using js-ipfs, libp2p, etc.)?
This apparently is being objected to by the go-ipfs developers (@warpfork & @jorropo), who seem to have prior experience that leads them to believe content-dependent chunking does not provide benefits. Details are on the #ipld channel.
My personal opinion is that:
This apparently is being objected to by the go-ipfs developers (warpfork & Jorropo)
Am I a go-ipfs dev now? I'm not and never said I was. Plus, even assuming I was, that's only two people, not the full team.
This apparently is being objected to by ...
Please do not say what other people think on their behalf without quoting them and linking to the full picture. Misrepresenting the ideas of others is very irritating.
I'm actually planning to experiment with some content-based chunking myself soonish.
What I was saying in the conversation we had on Discord is that content-based chunking is very inefficient at reducing bundle sizes and saving space. Compression fills that role far better for less effort.
Content-based chunking is only really effective for file formats that contain other files, where you want to store both the container and its contents: for example, .tar files that contain files, .car files that contain blocks, or .webm files that contain VP9 and AAC. In most other cases it's less effective than wrapping your files in zstd, lzma2, ...
Note that it can have a positive impact on latency if you know the file's access layout and where reads will seek to, but the effect is really tiny given the current block size.
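To make the container-format point above concrete, here is a minimal, hedged Go sketch (the names and file path are made up, and it is not part of any IPFS library) that splits a raw tar stream at entry boundaries so each archive member becomes its own chunk. Real tar streams also have PAX/GNU extended headers and base-256 size fields that this ignores.

```go
// Hypothetical sketch of container-aware chunking for tar: emit each archive
// member (512-byte header + padded data) as its own chunk.
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
)

const blockSize = 512

// nextTarChunk reads one tar entry from r and returns its raw bytes
// (header block plus data padded to a multiple of 512 bytes).
// It returns io.EOF at the end-of-archive marker.
func nextTarChunk(r io.Reader) ([]byte, error) {
	header := make([]byte, blockSize)
	if _, err := io.ReadFull(r, header); err != nil {
		return nil, err
	}
	// Two all-zero blocks mark the end of the archive; stop at the first one.
	if bytes.Equal(header, make([]byte, blockSize)) {
		return nil, io.EOF
	}
	// The entry size is an octal string at offset 124, 12 bytes long.
	sizeField := strings.Trim(string(header[124:136]), " \x00")
	size, err := strconv.ParseInt(sizeField, 8, 64)
	if err != nil {
		return nil, fmt.Errorf("bad tar size field: %w", err)
	}
	padded := (size + blockSize - 1) / blockSize * blockSize
	data := make([]byte, padded)
	if _, err := io.ReadFull(r, data); err != nil {
		return nil, err
	}
	return append(header, data...), nil
}

func main() {
	f, err := os.Open("file.tar") // example input, matching the .tar case above
	if err != nil {
		panic(err)
	}
	defer f.Close()
	for i := 0; ; i++ {
		chunk, err := nextTarChunk(f)
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Printf("chunk %d: %d bytes\n", i, len(chunk))
	}
}
```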
@Jorropo the reason I tagged you is so you can correct the record of what you personally said (as you did). Also, I did provide a link to the content on Discord. But enough about that; please presume good intentions until proven otherwise.
I think compression serves only one of the goals (data size, storage/retrieval speed) but not the other (content lineage, large numbers of variations). These are basically different use cases. When we know nothing about the data itself, Rabin-Karp is probably best, followed by some general-purpose compression (ZSTD is great). I posted some Windows Server deduplication stats (it uses Rabin-Karp) elsewhere, and they look pretty good on data that contains VMs, for instance, but also on other non-specific data.
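For reference, here is a hedged Go sketch of the rolling-hash (Rabin-Karp-style) content-defined chunking being discussed; the window size, multiplier, mask, and size bounds are arbitrary illustrative values, not the parameters of any shipping IPFS or Windows Server chunker.

```go
// Content-defined chunking sketch: a polynomial rolling hash over a sliding
// window, with a cut whenever the low bits of the hash are zero.
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

const (
	window   = 48            // bytes in the rolling window
	prime    = 16777619      // polynomial multiplier (arbitrary choice)
	mask     = (1 << 13) - 1 // expect a cut roughly every 8 KiB
	minChunk = 2 << 10       // 2 KiB lower bound
	maxChunk = 64 << 10      // 64 KiB upper bound
)

// pow = prime^window, used to cancel the byte that falls out of the window.
var pow = func() uint64 {
	p := uint64(1)
	for i := 0; i < window; i++ {
		p *= prime
	}
	return p
}()

// chunkSizes splits r into content-defined chunks and returns their sizes.
func chunkSizes(r io.Reader) ([]int, error) {
	br := bufio.NewReader(r)
	var sizes []int
	var hash uint64
	win := make([]byte, window)
	n := 0 // bytes in the current chunk so far
	for {
		b, err := br.ReadByte()
		if err == io.EOF {
			if n > 0 {
				sizes = append(sizes, n)
			}
			return sizes, nil
		}
		if err != nil {
			return nil, err
		}
		// Slide the window: the byte stored here window bytes ago drops out.
		old := win[n%window]
		win[n%window] = b
		hash = hash*prime + uint64(b) - uint64(old)*pow
		n++
		// Cut when the low bits of the hash are all zero, within size bounds.
		if (n >= minChunk && hash&mask == 0) || n >= maxChunk {
			sizes = append(sizes, n)
			n = 0
			hash = 0
			win = make([]byte, window) // forget bytes from the previous chunk
		}
	}
}

func main() {
	sizes, err := chunkSizes(os.Stdin)
	if err != nil {
		panic(err)
	}
	fmt.Println(sizes)
}
```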
The more important point I was trying to make is that we need some dials to select codec(s), chunk size, etc. to accommodate different use cases. That also presumes a pluggable codec architecture in all the main stacks, so we can avoid debates about what is and isn't included in the stack itself. I would like to choose the codecs and their parameters to pre-load on my IPFS node.
The drawback is that pluggable non-default codecs partition the data space into those who can read it and those who can't, but this can be remedied by providing codec addresses (which should also be in IPFS, under some trust hierarchies) and a default binary decoder if someone doesn't trust the particular codec or it isn't available.
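To make the "dials" idea slightly more concrete, here is a purely hypothetical Go sketch of a per-node chunking/codec profile; none of these field names or option strings exist in go-ipfs or js-ipfs configuration today.

```go
// Purely hypothetical sketch of the proposed "dials"; not real configuration.
package main

import "fmt"

// ChunkingProfile is what a node operator might pre-load to pick chunking
// and codec behaviour per use case.
type ChunkingProfile struct {
	Chunker      string   // e.g. "size-262144", "rabin", "tar-aware"
	MinChunkSize int      // lower bound for content-defined chunkers
	MaxChunkSize int      // upper bound for content-defined chunkers
	Codecs       []string // pluggable codecs to pre-load, by name or codec address
	Compress     string   // optional per-block compression, e.g. "zstd"
}

func main() {
	archival := ChunkingProfile{
		Chunker:      "rabin",
		MinChunkSize: 64 << 10,
		MaxChunkSize: 1 << 20,
		Codecs:       []string{"dag-cbor", "my-custom-codec"},
		Compress:     "zstd",
	}
	fmt.Printf("%+v\n", archival)
}
```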
@andrey-savov from reading your post and the channel, you seem to be conflating multiple different types of extensibility here: 1) changing how a bag of bytes is chunked up as a file (i.e. chunking) vs. 2) changing how a bag of bytes is interpreted as a DAG (i.e. custom IPLD representations).
Taking a look, there are generally three different extensibility points you could use here. I think they were largely covered in the Matrix thread, but for posterity/future discussion it's likely easier to track them here.
Note that in general developers I've encountered within the IPFS ecosystem try to make things as extensible as they can without making UX/DX miserable. If you find an area is insufficiently extensible and have a concrete proposal for how to make things better feel free to open up an issue focused on your particular proposal.
UnixFS is widely supported all over the IPFS ecosystem. You can write a custom chunker that will take a file and chunk it up in a way that existing IPFS implementations can easily deal with. For example, it's very doable to do myUnixfsChunker file.tar | ipfs dag import and your resulting CID bafymyunixfschunkedtarfile will happily be processed everywhere in the IPFS ecosystem that a fixed-size-chunked UnixFS file would end up.
If you're writing in Go you can even make your chunker fulfill the interfaces from https://github.com/ipfs/go-ipfs-chunker and then try to upstream your changes into projects like go-ipfs. In the meanwhile, even while nothing is upstreamed, your changes are easily usable within the ecosystem.
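As an illustration of that route, here is a hedged Go sketch of a custom splitter, assuming the interface in go-ipfs-chunker is Reader() io.Reader plus NextBytes() ([]byte, error) returning io.EOF when the input is exhausted (worth re-checking against the current repository). The "merge newline-terminated lines up to a minimum size" rule is only a placeholder for whatever content-aware logic you actually want.

```go
// Sketch of a custom chunker written against the Splitter interface from
// https://github.com/ipfs/go-ipfs-chunker (assumed shape, see lead-in).
package mychunker

import (
	"bufio"
	"io"
)

// LineSplitter groups newline-terminated lines into chunks of at least min bytes.
type LineSplitter struct {
	r   io.Reader
	br  *bufio.Reader
	min int
}

func NewLineSplitter(r io.Reader, min int) *LineSplitter {
	if min < 1 {
		min = 1
	}
	return &LineSplitter{r: r, br: bufio.NewReader(r), min: min}
}

// Reader returns the underlying reader, as the Splitter interface expects.
func (s *LineSplitter) Reader() io.Reader { return s.r }

// NextBytes returns the next chunk, or io.EOF once the input is exhausted.
func (s *LineSplitter) NextBytes() ([]byte, error) {
	var chunk []byte
	for len(chunk) < s.min {
		line, err := s.br.ReadBytes('\n')
		chunk = append(chunk, line...)
		if err == io.EOF {
			if len(chunk) == 0 {
				return nil, io.EOF
			}
			return chunk, nil
		}
		if err != nil {
			return nil, err
		}
	}
	return chunk, nil
}
```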
Like the above, this is very doable and you can do myDagCborTarFormatter file.tar | ipfs dag import. Unlike with the UnixFS chunker, you haven't written any code that defines how to render your DAG as a file, so some custom code will be needed to take bafymydagcborformattedtarfile and turn it into a file. This means you'll need to do something like ipfs dag export bafymydagcborformattedtarfile | myDagCborTarFormatter > mytar.tar.
Advantages compared to using UnixFS: You can use IPLD to query things like the headers and other format particulars of your file type.
Disadvantages compared to using UnixFS: Lack of compatibility with built-in tooling (e.g. ipfs get bafymydagcborformattedtarfile or dweb.link/ipfs/bafymydagcborformattedtarfile won't work).
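To illustrate the querying advantage, here is a hypothetical, shape-only Go sketch (every field name is made up) of how a tar archive could be laid out as a custom DAG-CBOR graph, with parsed header fields queryable via IPLD and the untouched bytes kept behind CID links so the original tar can be reassembled.

```go
// Shape-only illustration: one way a tar archive could be modelled as a
// custom DAG, exposing header fields as data and raw bytes as links.
package tardag

import cid "github.com/ipfs/go-cid"

// TarArchive is the root node: an ordered list of entries.
type TarArchive struct {
	Entries []TarEntry
}

// TarEntry exposes interesting header fields and links to the raw blocks.
type TarEntry struct {
	Name      string
	Size      int64
	Mode      int64
	ModTime   int64
	RawHeader cid.Cid   // link to the original 512-byte header block
	Data      []cid.Cid // links to the (chunked) entry contents
}
```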
People definitely use non-UnixFS IPLD regularly. Historically people have gotten the most benefit doing this for data types that aren't well expressed as large single flat files (e.g. things with hash links in them, encrypted/signed pieces, etc.)
Note: There was an early attempt at such a thing with the ipfs tar
command, which represents TAR files using a non-UnixFS type of DAG-PB. This has not taken off, with people preferring to use standard UnixFS importing (even without UnixFS chunking specific to tar files). IMO having a UnixFS tar chunker would likely have much higher adoption, but that's just me 🤷.
Like the above, you can do myTarCodecFormatter file.tar | ipfs dag import. However, unlike the above, you'll likely have to do more work to get ipfs dag export bafymytarfile | myTarCodecFormatter > mytar.tar, since your IPFS node may need to understand how to traverse your CID links to download the whole DAG.
The options here are then to:
1) Write a new codec and make sure it's in the places you want to use it. go-ipfs makes it relatively straightforward to build a custom binary with new codecs in it (https://github.com/ipfs/go-ipfs/blob/master/docs/plugins.md#ipld), however if you want it to be used by default in more existing tooling you'll then have to push for that. For example, the creators of DAG-JOSE recently did this when upstreaming their codecs into go-ipfs and js-ipfs. There are some interesting ideas around using VMs to allow dynamically loading codecs over IPFS, but there's a bunch of technical work that would need to happen to unlock that, which seems like a weird thing to block development of a new file importer over.
2) Utilize your new codec within your own tooling and make some wrapper graph in a common codec like DAG-CBOR to move your data around between existing IPFS tools. This is nice as a bridge while you're working to make your codec more prevalent, but it's not super fun.
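For orientation, here is a toy sketch of what a new codec boils down to: an encode function (node to bytes) and a decode function (bytes to node). The function shapes below follow my understanding of go-ipld-prime's codec.Encoder/codec.Decoder signatures; treat that, and the registration details, as assumptions to verify against the current go-ipld-prime docs. The "format" here is deliberately trivial: it only round-trips bytes nodes, like a private variant of the raw codec.

```go
// Toy illustration of the two functions a new IPLD codec must provide
// (assumed signatures, see lead-in).
package mytoycodec

import (
	"fmt"
	"io"

	"github.com/ipld/go-ipld-prime/datamodel"
)

// Encode writes the payload of a bytes-kind node straight to w.
func Encode(n datamodel.Node, w io.Writer) error {
	if n.Kind() != datamodel.Kind_Bytes {
		return fmt.Errorf("mytoycodec: only bytes nodes are supported")
	}
	b, err := n.AsBytes()
	if err != nil {
		return err
	}
	_, err = w.Write(b)
	return err
}

// Decode reads all of r and assembles it as a bytes node.
func Decode(na datamodel.NodeAssembler, r io.Reader) error {
	b, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	return na.AssignBytes(b)
}
```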
Historically new codecs have mostly been used for compatibility with existing hash-linked formats (e.g. Git), but there are certainly other use cases as well (e.g. DAG-JOSE).
Advantages compared to using a supported codec: You can save some bytes by making more things implied by the codec identifier rather than explicit as fields in some format like DAG-CBOR or DAG-JSON.
Disadvantages compared to using a supported codec: You need to build more tooling with awareness of your new codec.
@aschmahmann Appreciate the thoughtful comment. While I am studying it in more detail to form concrete proposals, I wanted to make some high level comments:
In general, as we move toward decentralization of the Internet, we should rely less and less on centralized decisions, especially made by a group of people. In that respect, maintainers of the IPFS reference implementations should be rule-setters, not arbiters of merit.
There is a need for a content-dependent chunker.
Therefore I welcome objective, non-opinionated discussion, proofs of concept, benchmarks, and other work around this subject.
The open question is what technique, what polynomial (if applicable), and what parameters to use for a given chunker, so that it is performant across the board and effectively helps everyone.
There is also the question of which files/data to chunk with such a chunker, given that compressed data would likely not benefit at all from content-dependent chunking (unless the file is an archive with non-solid compression, or similar). Should this be decided automatically by some heuristic, e.g. use such a chunker for text files that are not minified js/css and a regular chunker otherwise? By file headers?
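As one possible answer to the "by file headers?" question, here is a hedged Go sketch of a magic-byte heuristic: skip content-defined chunking for data that is already compressed and use it everywhere else. The magic numbers are real; the policy and the returned chunker names (which echo go-ipfs's --chunker values) are only illustrative.

```go
// Hedged sketch of a header-based heuristic for choosing a chunker.
package main

import (
	"bytes"
	"fmt"
	"os"
)

var compressedMagic = [][]byte{
	{0x1f, 0x8b},               // gzip
	{0x28, 0xb5, 0x2f, 0xfd},   // zstd
	{0xfd, '7', 'z', 'X', 'Z'}, // xz
	{'P', 'K', 0x03, 0x04},     // zip (a non-solid archive might still be worth chunking)
	{0xff, 0xd8, 0xff},         // jpeg
}

// pickChunker chooses a chunking strategy from the first bytes of a file.
func pickChunker(header []byte) string {
	for _, magic := range compressedMagic {
		if bytes.HasPrefix(header, magic) {
			return "size-262144" // fixed-size: content-defined chunking won't help here
		}
	}
	return "rabin" // content-defined chunking for everything else
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pickchunker <file>")
		return
	}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()
	header := make([]byte, 8)
	n, _ := f.Read(header)
	fmt.Println(pickChunker(header[:n]))
}
```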
This could have a great impact on (distributed) archival of knowledge (think Archive.org, except with dedup, better compression, and easy distribution). It also raises the question of whether chunks should be stored compressed, but that is partially side-tracking this issue.
One reference implementation with focus on storage saving (faster convergence of chunk boundaries): https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape/chunkify.h
Other references: https://en.wikipedia.org/wiki/MinHash https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm https://en.wikipedia.org/wiki/Rolling_hash https://moinakg.github.io/pcompress/