zero fill multihash and multicodec

mikeal commented 2 years ago

This one is gonna require some thought, so I wanted to get it out there and spend some time discussing before “IPFS thing.”

@ribasushi has been pushing hard to get [Filecoin] piece CID’s used more broadly across the protocols. The benefit of using this over CAR CID’s for multi-block identifiers is that it will work as an inclusion proof down the line.

The problem I’m having fitting it into service layers is the padding (proofs trees have to be calculated over min sizes with a factor of 2, so one unlucky byte over the limit and the payload has to double). We can injest CAR files like w0ah now, but it’s all built on CARs and CAR CIDs (hash of the full CAR) and if the payload potentially doubles in size to cover the padding I can’t switch to piece CID based protocols.

One idea I had is, why not a multihash that is just a 64b integer that validates as true if the payload is entirely zeros of that length? Combined with a zero fill codec, the CID would be a fixed size and you’d just write the length into the multihash.

We could then write that CID into the end of the CAR as a zero length block which any protocol that needs to produce a Piece CID could easily expand.

This would be useful, but I’m a little worried that this will create a vulnerability up the stack given that it compresses a lot of predictable data into payloads and proofs without a user ever actually having to pass the data, which could mean they get to skip all the “work” in a proof. I haven’t considered it enough to know for sure, and am surely lacking the full extent to which it could potentially wreak havoc, which is why I’ve posted the idea for discussion.

willscott commented 2 years ago

if the payload potentially doubles in size to cover the padding I can’t switch to piece CID based protocols.

Is it the memory footprint on the client while receiving that you're worried about here? transport compression should mean that in practice the padding isn't a problem for the data transfer.

ribasushi commented 2 years ago

@mikeal see https://github.com/multiformats/multicodec/pull/272/files?short_path=8af5da8#diff-8af5da83f102e1adbe4b65cfdd2bd6ed91a51c4172c65f8e17fc455c30e98375

I did raise the exact objection of "it opens an amplification attack on receivers needing to allocate absurd amounts of memory"

Broadly I am with @willscott: transport and storage compression are your friend here...

ribasushi commented 2 years ago

@mikeal also - why would you ever store or from-client-transfer the padding? Can you clarify the use case more?

rvagg commented 2 years ago

@mikeal you should read #272 first, which is kind of what this is.

@ribasushi when you fetch a whole "piece" by Piece CID, does that come with or without the fr32 padding included? Are we getting whole power-of-two pieces, or power-of-two-minus-1/127? i.e. doesn't fr32 add another complication here for addressing whole pieces? The Piece CID is a proof across the zero-pad + fr32-pad, so it's not just a matter of being able to address two thing separately, the original CAR isn't even present in the bytes of the whole piece, it's spread out with those extra bits interspersed. The relationship between Piece CID and Payload CID is entirely absent unless it's recorded elsewhere (which I believe it optionally is, correct?).

ribasushi commented 2 years ago

@rvagg no, the point here as a client producing car files to upload to Web3Storage or any other aggregator to establish full proof of custody. I have not written a full spec for this, lack of time but the gist is:

As a client:

I generate my car file via client/browser/whatever, to benefit from the IPFS goodness
I take the payload of the car ( just the blocks, throw away the header ) and feed that into a commp hasher
When my "payload" finishes, I virtually add extra padding that looks like car frames (see below why) to the end of that stream I feed to the commp hasher ensure that the stuff I uploaded is divisible by exactly 127 * 2^n
I hold on to this commp that I myself calculated, it is now as important to me as the payload dag CID itself

Now when an aggregator 2.0 gets all these individual car files that were uploaded it knows to:

order them smallest to largest
add a bogus header with some clear rootbafythisisanunorderedcar
simply concatenate all the "payloads" one after the other including the now reified padding ( you never stored it anywhere prior, it is there just for the car-for-dealmaking )
create commp merkle inclusion proofs for each individual user-car-payload into the larger bytestream

Now you are ready to make a filecoin deal with a bytestream which:

is a properly formed .car file
contains the blocks from each individual upload
is understood by all indexing infrastructure with no modification
has sections of that larger bytestream match perfectly with what the user gave you ahead of time

Result: actual, every 24h proofs that your upload is on the network, since you can trivially see that your own commp is within the larger one by virtue of the inclusion proof that was recorded when everything was smooshed together.

Cue derivative financial instruments over non-interactive proofs that data is alive 😉

mikeal commented 2 years ago

We can’t point to features in the transport layer to solve these sorts of problems in the payload. It works for parts of our system but not others because in a large distributed system like ours you often aren’t accepting payloads in a synchronously scriptable location accepting the data (we often write directly to S3 and then notify async processing of the location).

Given the constraints (thanks all for the pointers to issues with this approach) I think what we’re going to end up with is a relatively complicated but ultimately quite performant slab allocator that you write blocks too. When you “complete” it’ll generate commp (slab is already zero filled) generate a CAR header with commp in it, then combine the header and the slice of the slab to make a CAR w/o zero fill. We’ll use the CAR header to signal “this is a secure transaction that should match this piece CID” and when the system receives the CAR and validates the piece CID (with added zero fill bytes) it’ll send a signed message validating that CAR CID’s block data matches the piece CID the CAR header claimed. Those signed messages will produce a map of CAR CIDs to piece CIDs (and vice versa) we can just maintain and even publish in realtime via the signed messages. That’s enough for us to leverage each identifier when it’s most useful and we’ll be able to keep the padding out of the actual CAR we move around.

ribasushi commented 2 years ago

@mikeal you can not outright 0x00 fill - you need to fill with valid car frames. Something like the following will do: 0x 04 01 55 00 00 - you repeat these 5 bytes over and over again - this is your padding. Without doing so systmes that expect a "well formed car file" will puke / abort early when encountering a stream of 0x00s

rvagg commented 2 years ago

Which happens to be precisely the origin of the ZeroLengthSectionAsEOF option in go-car: https://github.com/ipld/go-car/blob/7ba9372ba205c2baf620c0b104dc08ed80b1dffb/v2/options.go#L66-L74

That only exists for Lotus, for this case, internally; but someone could make the argument that this should be a generally acceptable usage mode outside of Lotus too, even for js-car. It's just kind of annoying since it's a fork in the spec (this option existed on a custom branch of go-car for some time post Lotus launch because it was a fork in the spec that we didn't accept, but Lotus didn't care). And the "systems that expect a 'well formed car file'" doesn't go away, we just end up making those systems aware that "well formed" is not what they thought it was.

ribasushi commented 2 years ago

we just end up making those systems aware that "well formed" is not what they thought it was

This approach requires a flag-day, which is something we are really bad at in the ecosystem.

My point higher up is to create amalgams that are indistinguishable from any other car file structure-wise. Sure they will have a few repeated blocks, but this is inevitable anyway since the contained sub-cars are independent. Remember - the goal here is to establish a straightforward path to claim and verify "X is in Y": the only robust way to do this is to make the claim on bytes, not dags.

Easiest would be to explain with an implementation, but I won't have cycles to get to this maybe until ~Thu... 😿

rvagg commented 2 years ago

04 01 55 00 00 - you repeat these 5 bytes over and over again

you're also assuming that the padding required is going to be a multiple of 5 bytes; making a "well formed" CAR may end up being more complicated than adding empty minimal non-blocks, there may need to be at least one real block with a valid CID for it too, but then that's even less "padding" than the repeated 5 byte patten is.

BigLep commented 1 year ago

Closing due to lack of activity. Please feel free to reopen if this is still a pressing need.

multiformats / multicodec

zero fill multihash and multicodec #275