storacha / w3up

⁂ w3up protocol implementation
https://github.com/storacha-network/specs

Add field to store/add to include commP #353

Open Gozala opened 1 year ago

Gozala commented 1 year ago

We should add a field to the store/add capability in order to include a commP proof with every write.

@ribasushi is there JS code somewhere we could use to compute it on the client side?

@mikeal computing it on the client side is fine, but we can't trust the client and would need to verify, which makes me wonder whether doing it on the client is even worth it; we may only complicate things by having to handle the case where the client provides an incorrect value.
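For concreteness, a rough sketch of what the extension could look like using ucanto's capability/schema helpers. The `piece` field name, its optionality, and the exact schema here are assumptions for illustration, not a settled design:

```js
import { capability, Schema, Link, URI } from '@ucanto/validator'

// Illustrative sketch only: `piece` as an optional field is an assumption.
export const add = capability({
  can: 'store/add',
  with: URI.match({ protocol: 'did:' }),
  nb: Schema.struct({
    link: Link,             // CAR CID of the shard being stored
    size: Schema.integer(), // size of the CAR in bytes
    piece: Link.optional()  // commP (piece CID) computed by the client
  })
})
```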

ribasushi commented 1 year ago

@Gozala the client-side calculation is so that a client has a full chain of custody that starts on their own machine.

The only thing I am aware of as prior art is https://github.com/rvagg/js-fil-utils. @rvagg can you chime in wrt completeness / whether you'd recommend it be used today as-is?

A test corpus is available here: https://github.com/filecoin-project/go-fil-commp-hashhash/tree/master/testdata

Gozala commented 1 year ago

> @Gozala the client-side calculation is so that a client has a full chain of custody that starts on their own machine.

Wouldn't the same exact CAR file produce the same exact commP? If so, I'd argue that a user providing the CAR CID does provide a full chain of custody. We will capture the computed commP in the receipt and sign it, so we can be held accountable if it is computed incorrectly.

ribasushi commented 1 year ago

commP is a stream hash, a bit of a weird one internally, but for an end user it behaves exactly like, say, sha256. What is special about commP is that a user can plug it into a Fil oracle directly, without going through w3s at all, and verify that you did indeed deliver the exact bytes that were uploaded, all the way to the SP. You cannot do this with a CAR CID, nor with sha256, nor with any other hash aside from the fil-native commP.

Hence you want to train your users (at least those that care about Fil) to key everything they upload off commP. Hence the user-side calculation.

mikeal commented 1 year ago

if the client lies about commp, then the data won’t land in a deal.

when the miner notices it doesn't match they invalidate the deal and can notify the broker (spade) of why, and the broker can create a new deal that just drops that data.

the purpose of the cryptography being used here is to give trustless guarantees between actors who don't already have the underlying data to calculate the proof. the moment you accept that some actor other than the miner or the client is "responsible" for validation, you've accepted the permanent need for an actor to sit between them with a full copy of the data.

there may be some communication necessary between actors to repair the deal, but it must be accomplished using the proofs. otherwise, actors between the client and the miner have to validate each proof themselves against the deal data, which will always necessitate an additional copy (additional infra) between the client and filecoin forever, or for as long as you use that means of deal creation.

ribasushi commented 1 year ago

> when the miner notices it doesn't match they invalidate the deal

Better: the chain consensus itself (specifically the PoRep part) does this. CommD in the linked verification function is the "CommP of the entire sector", covering all deals and free space. Every single node in the network ensures that the storage claim is legit when it is made.

Gozala commented 1 year ago

@mikeal so if I understand you correctly, we basically don't do anything with the CommP other than store it in the invocation, correct?

What do we do if two users upload the same CAR but specify different CommPs? Do we not care and just store both?

P.S. I'm less worried about someone lying, and more about a bug in the client and having to deploy updates to clients.

Gozala commented 1 year ago

@yusefnapora found this relevant repo https://github.com/rvagg/rust-fil-commp-generate

ribasushi commented 1 year ago

@Gozala @yusefnapora note: this contains a pretty old copy of the currently used Rust code: https://github.com/filecoin-project/rust-fil-proofs/blob/128f7209/filecoin-proofs/src/api/mod.rs#L322-L353. It will work, but it is possibly missing a lot of the speedups added since.

Gozala commented 1 year ago

@ribasushi thanks for the pointer! I think creating a wasm wrapper library from there is probably the best way to go about it. It is also probably a good self-contained project for hacking on Rust/wasm.

rvagg commented 1 year ago

So https://github.com/rvagg/js-fil-utils does commP. It's written for Node.js files, but that's just because that's all I cared about at the time. It's written for Uint8Array iterables, so you'd just rip out the Node.js imports, make it take a plain async iterable, and replace the crypto with (probably) native browser sha2-256. I wouldn't want to be doing this in the browser over very large assets, but I assume we're talking about relatively small things.

WASM might be overkill? It's really not that complicated a process; maybe WASM gives you a perf boost, but the most expensive thing is calling out to sha2-256. Have a look at the JS code: there are three main components to it, zero-padding to get up to a power of two, Fr32 padding to space out 2 zero bits after every 254 bits, and a merkle tree using a slightly truncated/zeroed sha2-256 result.
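To make those three stages concrete, here is a compact, illustrative sketch. It buffers everything in memory and uses BigInt for the Fr32 step for clarity; real implementations stream and bit-shift, and the resulting 32-byte digest still needs wrapping in a piece CID (fil-commitment-unsealed codec, sha2-256-trunc254-padded multihash):

```js
import { sha256 } from '@noble/hashes/sha256'

// 1. Zero-pad so the Fr32-expanded size is a power of two
//    (minimum piece is 128 bytes, i.e. 127 raw bytes).
function zeroPad (bytes) {
  let size = 127
  while (size < bytes.length) size *= 2
  const out = new Uint8Array(size)
  out.set(bytes)
  return out
}

// 2. Fr32 padding: insert 2 zero bits after every 254 bits. Each 127-byte
//    chunk (1016 bits) becomes four 254-bit field elements = 128 bytes.
function fr32Pad (padded) {
  const out = new Uint8Array((padded.length / 127) * 128)
  const MASK = (1n << 254n) - 1n
  for (let i = 0; i < padded.length / 127; i++) {
    let n = 0n // read 127 bytes as a little-endian integer
    for (let j = 126; j >= 0; j--) n = (n << 8n) | BigInt(padded[i * 127 + j])
    for (let k = 0; k < 4; k++) {
      let fr = (n >> BigInt(254 * k)) & MASK
      for (let j = 0; j < 32; j++) { // write 32 little-endian bytes
        out[i * 128 + k * 32 + j] = Number(fr & 0xffn)
        fr >>= 8n
      }
    }
  }
  return out
}

// 3. Binary merkle tree over 32-byte nodes using sha2-256 truncated to
//    254 bits (top two bits of the last byte zeroed).
function merkleRoot (nodes) {
  while (nodes.length > 32) {
    const next = new Uint8Array(nodes.length / 2)
    for (let i = 0; i < nodes.length; i += 64) {
      const d = sha256(nodes.subarray(i, i + 64))
      d[31] &= 0x3f
      next.set(d, i / 2)
    }
    nodes = next
  }
  return nodes // the 32-byte commP digest
}

const commP = (bytes) => merkleRoot(fr32Pad(zeroPad(bytes)))
```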

Gozala commented 1 year ago

Depends on https://github.com/web3-storage/w3up/issues/712

Gozala commented 1 year ago

I've started modernizing https://github.com/rvagg/js-fil-utils, but as I added tests I discovered that the commP it produces is inconsistent with https://github.com/filecoin-project/go-fil-commp-hashhash/tree/master/cmd/stream-commp#usage-example. I ended up porting some of the code from https://github.com/filecoin-project/go-fil-commp-hashhash/blob/master/commp.go instead. The resulting library has been published at https://github.com/web3-storage/data-segment/ and has test vectors in place (@ribasushi I can share what bugs I found/fixed if you want to backport them).
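For reference, computing a piece CID with the published library looks roughly like this. The `Piece.fromPayload` entry point is taken from a later release of the library, so the exact API surface at the time of this comment may differ:

```js
import { Piece } from '@web3-storage/data-segment'

// `carBytes` stands in for the bytes of a CAR shard.
const carBytes = new Uint8Array(await new Response('hello world').arrayBuffer())
const piece = Piece.fromPayload(carBytes)
console.log(piece.link.toString()) // the piece CID (commP)
```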

Problem

While integrating commP into our client lib, it became clear that computing it in the browser (using the native Web Crypto API) is unacceptably slow: ~26s for a 46MiB file, while the same code with the same input in Node.js takes 0.02s.

My hypothesis is that browsers rate-limit sha-256 compute to prevent malicious actors from doing crypto mining. I did more investigation on the matter; here is some data for the same 46MiB input:

| sha256 impl   | thread | API   | time  |
| ------------- | ------ | ----- | ----- |
| node crypto   | main   | async | 0.02s |
| web crypto    | main   | async | 26s   |
| web crypto    | worker | async | 2s    |
| @noble/hashes | main   | sync  | 0.4s  |
| @noble/hashes | worker | sync  | 0.25s |
| @noble/hashes | main   | async | 0.8s  |

I think the numbers support my hypothesis, although profiler info gathered from browser devtools provided no direct evidence of it. It could be that hopping between the native thread and the JS thread is the culprit instead.
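If anyone wants to reproduce this, a rough micro-benchmark along these lines captures the per-call overhead. commP issues a huge number of small digests while building the merkle tree, so many 64-byte hashes model the workload better than one big digest (the round count here is illustrative; run as an ES module in a browser):

```js
import { sha256 } from '@noble/hashes/sha256'

const pair = new Uint8Array(64) // one pair of 32-byte merkle nodes
const rounds = 100_000          // a 46MiB piece needs roughly 1.5M digests

let t = performance.now()
for (let i = 0; i < rounds; i++) await crypto.subtle.digest('SHA-256', pair)
console.log('web crypto (async):', Math.round(performance.now() - t), 'ms')

t = performance.now()
for (let i = 0; i < rounds; i++) sha256(pair)
console.log('@noble/hashes (sync):', Math.round(performance.now() - t), 'ms')
```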

Conclusions

I think it might be warranted to reconsider the idea of pre-computing the piece CID (a.k.a. commP) in the client and including it with the store/add request. Even if we can ship clients with the @noble/hashes implementation and get negligible overhead, it is no longer clear to me that we want to impose commP pre-computation. Reflecting on this, it would make more sense to me to let the client do that job asynchronously and submit the commP for the uploaded CAR as a content claim.

The tradeoff is that a client may never submit a commP, but it's no longer clear that this is a problem. In fact, I think it may be better to incentivize clients to submit a commP rather than enforcing it; that creates an opportunity to outsource the computation. Furthermore, we could even expire uploads for which a commP has not been provided. Our client libs could also perform the store/add and commP-claim operations concurrently as part of a single upload task, as sketched below.
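A rough sketch of that concurrent flow; the `storeAdd`, `computeCommP`, and `commPClaim` names are placeholders for whatever invocations we settle on, not an existing API:

```js
// Hypothetical client flow: upload the CAR and submit the commP claim
// concurrently; neither blocks the other.
async function upload (car, { storeAdd, computeCommP, commPClaim }) {
  const storing = storeAdd(car)          // store/add starts immediately
  const claiming = computeCommP(car)     // commP is computed in parallel
    .then((piece) => commPClaim(piece))  // then submitted as a content claim
  // A client that never submits the claim simply produces an upload
  // we may later expire.
  return await Promise.all([storing, claiming])
}
```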

mikeal commented 1 year ago

Something we should flag as an ongoing concern that we need to keep an eye on: doing proofs in the browser means we’re being chased by browser security features that are trying to fight crypto mining.

dchoi27 commented 1 year ago

> The tradeoff is that a client may never submit a commP, but it's no longer clear that this is a problem.

Could you elaborate on this (or, if it's already explained somewhere, just share a link to why)? I think we need to make it as easy as possible today for users to have what they need to get data into Filecoin; not enough users care about the gory details of Filecoin. I think we should just ship the efficient CommP generation solution in the client and make it required (though if it's trivial, we can structure sending the CommP to our API as a content claim and make it optional in the future).

Gozala commented 1 year ago

> > The tradeoff is that a client may never submit a commP, but it's no longer clear that this is a problem.
>
> Could you elaborate on this (or, if it's already explained somewhere, just share a link to why)?

All that said, I think I struggle to rationalize introducing barriers in the name of reducing the cost of computation on our end. It seems like we should be giving users a choice to either:

- Compute commP themselves on the client
- Run compute for them at their expense

By decoupling store/add from commP we make the above choice possible. By requiring commP up front we remove that choice and consequently create a barrier to adoption.

dchoi27 commented 1 year ago

Got it, thanks! I think the point about us eventually needing to verify it is a key one. But let's just opt for the simple solution for now that's future-proof: either do it in the client or do it ourselves (if the latter, I'd love to understand the projected compute cost), and don't expose users to this choice. We can still put things in UCAN claims so that if users end up asking us for this choice, we can give it to them (and it sounds like that has additional benefits, like being able to parallelize the client upload experience).

Outsourcing the computation, creating an open market around it, etc. are less important today.

ribasushi commented 1 year ago

Folks... "computing" commp is dirt cheap. Accessing the data is what's expensive. Just compute it in a spot where the data flows through anyway and call it a day.

The original hesitation was around pulling the data back out from "at rest", which is what is expensive, not the computation itself.
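In the client, that could look like the following sketch: digest the bytes as they stream through the upload path instead of re-reading data at rest. Plain sha256 from @noble/hashes stands in for the commP stream hasher (which can be driven incrementally the same way), and `put` is a placeholder for the actual upload call:

```js
import { sha256 } from '@noble/hashes/sha256'

// Hash chunks in passing while forwarding them unchanged to the uploader.
async function uploadWithDigest (source /* ReadableStream<Uint8Array> */, put) {
  const hasher = sha256.create()
  const tap = new TransformStream({
    transform (chunk, controller) {
      hasher.update(chunk)      // digest the chunk as it flows through
      controller.enqueue(chunk) // and pass it along untouched
    }
  })
  await put(source.pipeThrough(tap))
  return hasher.digest() // 32-byte digest once the stream is consumed
}
```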

alanshaw commented 1 year ago

So doing the maths relevant to our shard sizes (using the @noble/hashes sync main-thread time of 0.4s per 46MiB as the base):

~35s overhead on the biggest shard we can send (4GiB).

On a more typical chunk (~128MiB) it would be ~1s.

For most uploads (very small) it will be negligible.

It's not great, but it's not the worst.
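Spelled out, assuming hashing time scales linearly with input size:

```js
const perMiB = 0.4 / 46 // seconds per MiB, from the measurement above
console.log((4096 * perMiB).toFixed(1) + 's') // ≈ 35.6s for a 4GiB shard
console.log((128 * perMiB).toFixed(1) + 's')  // ≈ 1.1s for a ~128MiB shard
```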

@Gozala how about this:

Decouple such that submitting a commP is a separate invocation, but (at least for now) have the client compute and submit it at the same time as the store/add invocation.

That way we allow ourselves flexibility in the event that it becomes untenable to generate in the browser in the future. We can change the client without changing the protocol.

> Folks... "computing" commp is dirt cheap.

@ribasushi I don't understand!? @Gozala is explicitly raising that this is not the case in the browser, even if it's done as the data is being generated/sent.

ribasushi commented 1 year ago

@alanshaw I was responding to

> Run compute for them at their expense

Would you phrase things this way if you were speaking about md5? This phrasing frames things as if the "compute" is significant. I was trying to reframe the discussion as a whole 😅

Gozala commented 1 year ago

> @alanshaw I was responding to
>
> > Run compute for them at their expense

@ribasushi it was a general sentiment aimed at the attitude of "we don't want to do this on the backend because it's expensive". Most services don't care about this; they just charge you for their spend, and I'm suggesting that is how we should approach things as well.

In this specific case, computing commP in the backend will be very fast, and it is probably a good idea to do anyway, to ensure we don't get an invalid one from the client.