multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.

feat: add MSET "hashing" function spec #272

Closed Jorropo closed 1 year ago

Jorropo commented 2 years ago

The goal of this new value "hashing" function is to encode padding data blocks very cheaply.

My planned use case is to save ~512 bytes per entry in .tar files, because each entry has 512 zeros.

rvagg commented 2 years ago

@Jorropo could you explain the use case a bit more? What is the nature of the tar files where (a) you're using multicodec and (b) you have duplicate blank entries?

Can you expand on other possible uses of this? As you know, the real estate in the single-byte range is quite valuable, and while experimentation should be encouraged, using up an entry down there is a high cost for the table.

Jorropo commented 2 years ago

@rvagg every entry in a tar file includes 512 zero bytes for future-proofing.

The theory is that tar was made for tapes (it's named Tape ARchive, after all) and that moving data around on tape is extremely expensive. So if in the future you wanted to add more header information, you would just write it into that dead zone and keep your data in place.

It turns out almost no one uses this, and most tar files just carry around 512 zeros per entry.

Right now, I can make a single 512-zero raw block and link to that. But that is big and requires hashing zeros (which is wasteful).

At first I just wanted a multihash that zero-fills some length (imagine a varuint in the digest that encodes the number of zeros). However, thinking it through, I realised I could make it more useful at very minimal cost with this scheme.

That means that if someone decided to use 0xdeadbeaf for padding instead, it would work just as well.

Jorropo commented 2 years ago

Note: even assuming you amortize the cost of that one 512-zero block away, you still need to store the hashes that reference it, which are likely going to be 32+ byte digests. The digest for 512 zeros with this multihash is 2 bytes, saving at least 30 bytes per entry.
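As a rough sanity check on those sizes (a sketch only: the exact digest layout is defined in the spec document attached to this PR, not quoted here), a run length of 512 fits in a two-byte varuint:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// A run length of 512 encoded as an unsigned varint takes 2 bytes,
	// versus the 32 bytes of a sha2-256 digest referencing a real block.
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, 512)
	fmt.Printf("varuint(512) = % x (%d bytes)\n", buf[:n], n) // 80 04 (2 bytes)
	fmt.Println("bytes saved per entry vs sha2-256:", 32-n)   // 30
}
```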

rvagg commented 2 years ago

OK, so you want to make a custom chunker specifically for TAR files that creates a DAG where these 512 byte zero blocks are turned into inline-style CIDs that are only ~7 bytes long (0x01557f03800400, or bafkx6a4aaqaa is my calculation of what one of these would look like). This would then place limitations on how the chunker could process the rest of the TAR, by adding awkward boundaries that otherwise wouldn't be there. Have you done much work on a practical implementation to see how much of a saving there might be for real-world, large TAR files? I would assume this breaks down quickly as the size of the entries inside the TAR get smaller because you can't aggregate multiple entries into one because of the chunking boundaries.

Assuming you're going to want to use this for storage in IPFS & Filecoin, it's also going to require support through the stack wherever the DAG needs to be walked (which is a surprising number of places).

@aschmahmann want to add any thoughts in here?

It's an interesting exercise, this really is just about the cost of sacrificing another number under 128 and whether it's worth it.

Jorropo commented 2 years ago

OK, so you want to make a custom chunker specifically for TAR files that creates a DAG where these 512 byte zero blocks are turned into inline-style CIDs that are only ~7 bytes long

Yes.

(0x01557f03800400, or bafkx6a4aaqaa is my calculation of what one of these would look like)

It's actually 6 bytes: 0x01557f02fd03

This would then place limitations on how the chunker could process the rest of the TAR, by adding awkward boundaries that otherwise wouldn't be there.

I don't think these are awkward boundaries, but anyway that sounds off-topic; this sounds like a me problem, not a you problem.

Have you done much work on a practical implementation to see how much of a saving there might be for real-world, large TAR files?

No, but it's easy to compute: ~500 + ~500 bytes per entry (the first is the future-proofing zeros, the second is the 0x400 alignment of headers). The point here is not just to save space but also to save CPU time. Hashing is an expensive operation; zero-filling a piece of memory can be done faster than it would take to hash the received content.
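If someone wants to measure that claim, a pair of Go benchmarks along these lines would do it (illustrative only, not part of this PR):

```go
package mset_test

import (
	"crypto/sha256"
	"testing"
)

// block stands in for one 512-byte tar padding block.
var block = make([]byte, 512)

// Hashing the padding block.
func BenchmarkSha256Of512Zeros(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = sha256.Sum256(block)
	}
}

// Regenerating the padding block by zero-filling a buffer instead.
func BenchmarkZeroFill512(b *testing.B) {
	buf := make([]byte, 512)
	for i := 0; i < b.N; i++ {
		for j := range buf {
			buf[j] = 0
		}
	}
}
```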

Assuming you're going to want to use this for storage in IPFS & Filecoin, it's also going to require support through the stack wherever the DAG needs to be walked (which is a surprising number of places).

I don't think so; those DAGs will be valid UnixFS. The only thing other apps need is to be able to read those multihashes, which just requires an update in go-multihash that then gets bubbled up everywhere. I don't know if this was what you had in mind, but it's a drop-in solution: we don't need to change all the walking code everywhere, just make it use a newer lib version.

It's an interesting exercise, this really is just about the cost of sacrificing another number under 128 and whether it's worth it.

I've tried to make it as versatile as possible; many binary data formats include padding data, and unless you use non-repeating cyclic or random padding, this should be able to take care of it.

This could also be integrated fairly easily into other things like the Rabin chunker; there is no reason my content-aware chunker should be the only program that can take advantage of it.

vmx commented 2 years ago

I want to support experiments, so putting it in some higher range sounds good to me. Though I certainly wouldn't put it into the 1-byte range.

Have you done much work on a practical implementation to see how much of a saving there might be for real-world, large TAR files?

No, but it's easy to compute: ~500 + ~500 bytes per entry (the first is the future-proofing zeros, the second is the 0x400 alignment of headers). The point here is not just to save space but also to save CPU time. Hashing is an expensive operation; zero-filling a piece of memory can be done faster than it would take to hash the received content.

When you know that it is 512 zeros, you could easily hard-code the hash value for common hash functions. This would save the CPU time for the hashing. I also personally prefer that identity CIDs are only used for "data smaller than the actual hash", and here that's not the case. Storage-wise you will get savings, as the data will be stored only once since it always has the same hash. So the overall savings won't be that large.
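A minimal sketch of that hard-coding alternative, assuming sha2-256 and using illustrative names (nothing here comes from go-multihash):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Precompute the digest of the well-known 512-zero padding block once,
// so it never has to be rehashed.
var (
	zeroBlock  = make([]byte, 512)
	zeroDigest = sha256.Sum256(zeroBlock)
)

// isZeroPadding reports whether a digest refers to the 512-zero block.
func isZeroPadding(digest [32]byte) bool {
	return digest == zeroDigest
}

func main() {
	d := sha256.Sum256(bytes.Repeat([]byte{0}, 512))
	fmt.Println(isZeroPadding(d)) // true
}
```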

Jorropo commented 2 years ago

Since it seems like the short varuint is the only blocking issue here: what do you want? A 2- or 3-byte one?

rvagg commented 2 years ago

@Jorropo there's tons of room in the two-byte range; pick something under 0x3fff and you're good to go. If it turns out that this is an absolutely awesome strategy for saving space and CPU and we want to roll it out everywhere, perhaps we allocate a single byte for it to make it even better and migrate it down there (it could end up with two codes, old and new, which would be very unfortunate but not the worst thing if you consider this the prototype phase). I'm very keen to see results from this, keep us in the loop!

rvagg commented 2 years ago

Any last objections before merging?

Jorropo commented 2 years ago

I guess the question isn't for me, but I feel targeted after doing 3 force pushes. :D

Personally I'm good

ribasushi commented 2 years ago

Any last objections before merging?

Not so much an objection as a request for clarification: is this a multicodec for local use, or is there an expectation that "spec compliant cat implementations" will need to be able to understand this? If the latter, I feel like this has not been given sufficient consideration...

ribasushi commented 2 years ago

Moreover, there seems to be no discussion of any limits for implementers. One of the main issues with the identity multihash is that a parsimonious implementation cannot know how much memory to allocate (ideally on-stack) for processing. This proposal has the same conceptual problem: what do I do as an implementation when I receive a "repeat this empty space 16 GiB times"?

rvagg commented 2 years ago

@Jorropo some good questions raised above by @ribasushi that might be good to address.

Regarding the second point on limitations: there is a suite of issues that have come up with identity which have plagued us a bit and maybe wouldn't have if we had specced a common understanding of limitations. Maybe you could put some things in your spec doc that place reasonable limitations on this, since it is going over the same kind of territory.

Jorropo commented 2 years ago

is there an expectation that "spec compliant cat implementations"

Yes, if I can show it is useful in go-ipfs. I would rescope this more precisely as all implementations that implement the identity multihash; if you architect a multihash implementation around encoding hashes (and not around hashes of data), neither of them makes sense.

Moreover, there seems to be no discussion of any limits for implementers. One of the main issues with the identity multihash is that a parsimonious implementation cannot know how much memory to allocate (ideally on-stack) for processing. This proposal has the same conceptual problem: what do I do as an implementation when I receive a "repeat this empty space 16 GiB times"?

Very good question.

What I would propose:

One of the main issues with the identity multihash is that a parsimonious implementation cannot know how much memory to allocate (ideally on-stack) for processing.

I think people would implement that using infinite data structures.

A "repeat this pattern" is one of the easiest infinite data structure to implement. (the only easier one being "repeat this element").

The only case where this is not doable is if your API is made to pass buffers instead of virtual function pointers or streams (in Go terms, []byte vs. io.ReaderAt or io.Reader).
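For example, a minimal Go sketch of such a "repeat this pattern" reader (the type and names here are purely illustrative, not from any existing library):

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// patternReader serves `count` bytes of a repeating, non-empty pattern
// lazily, without ever materializing the whole run in memory.
type patternReader struct {
	pattern []byte
	count   int64 // total bytes remaining
	off     int   // current offset inside pattern
}

func (r *patternReader) Read(p []byte) (int, error) {
	if r.count <= 0 {
		return 0, io.EOF
	}
	if int64(len(p)) > r.count {
		p = p[:r.count]
	}
	n := 0
	for n < len(p) {
		c := copy(p[n:], r.pattern[r.off:])
		r.off = (r.off + c) % len(r.pattern)
		n += c
	}
	r.count -= int64(n)
	return n, nil
}

func main() {
	// The tar case: 512 zero bytes as a lazily generated stream.
	r := &patternReader{pattern: []byte{0}, count: 512}
	var sb strings.Builder
	n, _ := io.Copy(&sb, r)
	fmt.Println(n) // 512
}
```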

If the latter, I feel like this has not been given sufficient consideration...

I don't really know what else there is to consider (limits? OK, more thought is needed there). My personal goal is just to represent runs of zeros cheaply.

I think this is already over-engineered for what it's worth. Assuming 32-byte digests, you can use any pattern up to 23 bytes long and still fill 2**63-1 bytes (a repeat count up to 2**63-1 needs a 9-byte varuint, leaving 23 of the 32 bytes for the pattern). Except for cyclic patterns (which by design are not compressible with simple repeats), I don't know of data formats that use patterns bigger than this. Most of them use zeros, or word-sized (4- or 8-byte) patterns.

I would say that, for a "give me zeros" multihash, it's already quite complex.

The only potential issue I see is that, even assuming no limits, this algorithm has a blind spot (it cannot encode things that have 0 or 1 repeats). So implementations might need to return an inline CID when asked for too few repeats, which might be architecturally challenging.

what do I do as an implementation when I receive a "repeat this empty space 16 GiB times"?

What do you do if you receive a 16 GiB block over the wire? Well, you do that.
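To make the limits question concrete, here is one possible shape for such a guard, with placeholder names and an arbitrary cap (nothing here is specified by this PR):

```go
package main

import (
	"errors"
	"fmt"
)

// Cap on how large a padding run an implementation is willing to expand
// into memory; anything larger should go through a streaming reader.
const maxMaterializedRun = 4 << 20 // 4 MiB, arbitrary example value

var errRunTooLarge = errors.New("padding run exceeds materialization limit; use a streaming reader")

// materializeRun expands a (pattern, repeat count) pair into a byte slice,
// but refuses runs above the configured cap.
func materializeRun(pattern []byte, repeats int64) ([]byte, error) {
	total := int64(len(pattern)) * repeats
	if total > maxMaterializedRun {
		return nil, errRunTooLarge
	}
	out := make([]byte, 0, total)
	for i := int64(0); i < repeats; i++ {
		out = append(out, pattern...)
	}
	return out, nil
}

func main() {
	// The "16 GiB of empty space" case from the question above.
	if _, err := materializeRun([]byte{0}, 16<<30); err != nil {
		fmt.Println(err)
	}
}
```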

BigLep commented 1 year ago

@Jorropo: are there any known pressing use cases driving this work? If not, I'm apt to close this out for now.

Jorropo commented 1 year ago

I have another idea about how to do this.