feat: blob protocol draft

Gozala commented 3 months ago

Based on #114. The PR is written in UCAN 1.0 format and assumes https://github.com/web3-storage/RFC/pull/12; however, for the immediate implementation I suggest that we instead:

Use blob/add instead of /space/content/add/blob
Use web3.storage/blob/* in place of /service/blob/

I suggest the above because I think it would make more sense to transition to https://github.com/web3-storage/RFC/pull/12 once we have UCAN@1.0 implemented; I suspect links to tasks vs. invocations are going to be a pain to deal with otherwise. This will give us a cleaner break.

In terms of implementing /service/blob/accept, and specifically how the client signals that they've completed the upload, I suggest that we do whichever is the easiest of the following two options:

Make the client send a second blob/add invocation after they've done the upload so we can perform a lookup.
Add another temporary capability with empty output, either very specific like blob/add/poll or very general like invocation/wake { task: CID }.
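
To make the two options easier to compare, here is a minimal sketch of what the client would send in each case. Only the capability names (blob/add, invocation/wake) and the `task` field come from the comment above; the `Invocation` and `BlobRef` types, the helper names, and the `can` / `with` / `nb` layout are assumptions for illustration, not the spec's wire format.

```typescript
// Hypothetical shapes for illustration only; not the format defined by the spec.
type CID = string // stand-in for a real CID type

interface Invocation<Can extends string, Args> {
  can: Can     // capability being invoked
  with: string // resource (assumed here to be the space DID)
  nb: Args     // capability-specific arguments
}

// Assumed description of a blob: its digest and its size in bytes.
interface BlobRef {
  digest: string
  size: number
}

// Option 1: the client re-sends blob/add after the upload,
// so the service can look the blob up and notice that it has arrived.
function repeatAdd(space: string, blob: BlobRef): Invocation<'blob/add', { blob: BlobRef }> {
  return { can: 'blob/add', with: space, nb: { blob } }
}

// Option 2 (general variant): a temporary capability with empty output
// that only points at the task the service should re-check.
function wake(space: string, task: CID): Invocation<'invocation/wake', { task: CID }> {
  return { can: 'invocation/wake', with: space, nb: { task } }
}
```

Option 1 needs no new capability but repeats the blob description; option 2 keeps the signal small and reusable at the cost of one more temporary capability.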

vasco-santos commented 3 months ago

@hannahhoward yes! This is our way forward with write to anywhere to avoid the migration pains. But we need to include this in the write to anywhere ticket.

Gozala commented 3 months ago

Is the intent for blob/ to eventually replace store/?

Do we have the work to implement this tracked anywhere?

I do think we want to some day, but there is no urgency to do so. We do want to introduce new blob capabilities here as they do not imply bucket events and we want to migrate clients to that.

(seems like it's not a trivial amount of work -- should probably ship with spec)

Generally we have been aligning on the spec first and only then implementing it. Sometimes implementation implies changes to the spec, but aligning on the how first has seemed to be a good way to avoid changing code back and forth.

hannahhoward commented 3 months ago

So blob/* is the new write to anywhere api?

This needs clarification in the PR description

hannahhoward commented 3 months ago

Realizing that removing the redelegation requires invoking IPNI/offer and/or store/publish as separate capabilities. Again, this is, I think, the right approach.

If a location is managed by the provider, it must be responsible for keeping location claims, wherever they reside, updated.

Gozala commented 3 months ago

If a location is managed by the provider, it must be responsible for keeping location claims, wherever they reside, updated.

What we settled on in https://github.com/web3-storage/RFC/pull/13 is that our issued location claims will have HTTP URLs with hints of where the content is. Those URLs will simply route / redirect to wherever the content lives in the system.

This enables us to make long-term location commitments while retaining the ability to change the actual site of the content over time.

Does this address the concern you're raising?

Gozala commented 3 months ago

For now, I think we should simply make the audience of the location claim public. The current flow implies a redelegation of authority every time the location changes.

I think this is inaccurate. As mentioned in my prior comment, per https://github.com/web3-storage/RFC/pull/13 we intend to make issued location claims agnostic of content site changes.

I'm concerned we're thinking we can get private data for free here without thinking through the implications. Private retrieval is a huge topic -- I think we should design for it intentionally, rather than throw in features that may create lots of headaches on a theory it will be sufficient for private data. I think privacy should be set at the ingress point (i.e. by the space owner on the blob itself, absent the location, either during blob/add or through some separate update API), not on the location claim itself, which is temporary.

I think there is a misunderstanding here. I also want to assure you that I have been thinking about private retrievals for more than a year now; it's just that there were always higher priorities at play.

I'd like to step back from privacy here for a moment. What is proposed here is not to assume that adding a blob implies you want it to be indexed, advertised and served across all possible channels. Instead, we can issue a commitment to the space itself that we will serve the blob from the given URL throughout the commitment's validity window. This commitment can be published by the user if they want to make that data publicly available. This also creates an opportunity to build a system where our commitments can be verified and we can be held accountable if we fail to uphold them.

In other words, it gives the user a choice to do what they want with our commitment to serve the blob.
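
A rough sketch of that idea, assuming a commitment is a record issued to the space with a URL and a validity window. The `LocationCommitment` type, every field name, and both helper functions are hypothetical; the thread does not define a concrete shape.

```typescript
// Hypothetical record; all field names are assumptions for illustration.
interface LocationCommitment {
  issuer: string     // provider making the commitment
  audience: string   // the space the commitment is issued to
  blob: string       // identifier of the blob being served
  url: string        // URL the provider commits to serve the blob from
  notBefore: number  // start of the validity window (unix seconds)
  expiration: number // end of the validity window (unix seconds)
}

// True while `now` falls inside the commitment's validity window.
function isInForce(c: LocationCommitment, now: number): boolean {
  return now >= c.notBefore && now < c.expiration
}

// Adding a blob does not imply publication: the commitment is only
// handed on for publishing when the space owner asks for it.
function toPublish(
  c: LocationCommitment,
  now: number,
  ownerWantsPublic: boolean
): LocationCommitment | null {
  return ownerWantsPublic && isInForce(c, now) ? c : null
}
```

Because the validity window is explicit, a third party holding a published commitment could check whether the URL still serves the blob inside that window, which is the accountability angle mentioned above.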

hannahhoward commented 3 months ago

Thinking back to https://github.com/web3-storage/RFC/pull/13 I now see this is a bit of a misnomer as a solution.

Yes, it's a public URL that will continue to work, if you go through w3s.link.

It still has an expiration and the data still could move. Presumably if I have https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key and then it gets moved, the service would want to publish at minimum a second claim with a better hint -- i.e. https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region2/bucketName2/key2 -- even if the first continued to be valid. But how does the client know to re-delegate this new claim in order to publish it? Or if the claim expires, and the publisher wants the new claim (potentially still at the same location) re-published, are they in charge of monitoring that?
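
To illustrate the hint mechanics in question, here is a small sketch using the standard `URL` API. The `origin` query parameter and the r2:// hint values mirror the example URLs above; the function names are hypothetical, and whether the real service reads the hint this way is an assumption.

```typescript
// Attach (or replace) the `origin` hint on a claim URL.
// The stable part of the URL (host and path) is left untouched.
function withOriginHint(base: string, origin: string): string {
  const url = new URL(base)
  url.searchParams.set('origin', origin) // value is percent-encoded in the output
  return url.toString()
}

// Read the hint back, or null when the URL carries none.
function readOriginHint(location: string): string | null {
  return new URL(location).searchParams.get('origin')
}

// Two claim URLs address the same content when host and path match,
// regardless of which hint each one carries.
function sameContent(a: string, b: string): boolean {
  const ua = new URL(a)
  const ub = new URL(b)
  return ua.origin === ub.origin && ua.pathname === ub.pathname
}
```

Under this reading, a moved blob yields a second URL for which `sameContent` still holds and only the hint differs, which is why the older claim can remain valid while carrying a stale hint.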

Also, unless I misunderstand, we are trying to get to a public mechanism whereby a client can find the location of the car and its index, and do range reads. How would it do so if the hints become out of date or the claim expires? Would it talk directly to the w3s content claims service? How would it know to do so (assuming it wasn't a web3.storage exclusive retrieval client)? Or would it be forced to fetch from w3s.link instead at that point?

Ultimately, I think we're dealing with two different chains of authority we're trying to smash into one here:

Authority to manage the visibility / discoverability of a piece of data
Authority to move that data around and update its location

I think we should probably take a step back and sync on all this. I feel that these are valid concerns, not just nitpicking, especially as we look forward to the architecture we're building.

My proposal is to hold on this till we can sync during check-in on Thursday (in the interest of not adding meetings, use what's scheduled)

Note: I'd prefer not to mess up @vasco-santos's ability to implement, so I'm ok if he starts working in the meantime. But I think we're not completely in alignment yet.

hannahhoward commented 3 months ago

Documentation about why I removed my objections:

A public location is in fact a commitment to make data available at a given URL for the length of the claim. Therefore, it makes sense that it's ok to let the user choose to then share the claim with the wider audience
https://github.com/web3-storage/RFC/pull/13 is intended to enable this - in the sense that it enables the provider to make a claim that data is available at a given URL, and maintain the URL in a context where it moves around internally
The same PR is not a solution for the point where we have a network of providers. But that's ok -- we can explore other solutions when we get there, either by having those providers publish claims themselves, or by using w3s as a router in claims. There is probably more design work to do here.

Also: I can see a world where the right solution is to have the location claims be issued by the provider, and then retrieved when you do blob/get. I wonder if we should treat "repair" -- where a provider fails to satisfy their claims -- as a separate service.

storacha-network / specs

feat: blob protocol draft #115