storacha-network / w3up

⁂ w3up protocol implementation
https://github.com/storacha-network/specs

A stop-gap solution for blob implementation until invocation spec #1339

Open vasco-santos opened 4 months ago

vasco-santos commented 4 months ago

See blob protocol https://github.com/web3-storage/specs/pull/115

  1. In the meantime I suggest we go with blob/* instead of space/content/*/blob, and web3.storage/blob/* instead of service/blob/*, for new capabilities until the invocation spec, given we are still not at UCAN 1.0 and links to tasks vs invocations are going to be a pain to deal with otherwise ✅
  2. I think we have a similar pattern as w3-filecoin. Here is where my mind is:
     a. blob/add queues for allocation if the content is not already allocated in the given space (the resource of the invocation).
     b. The blobAllocateQueue consumer reads blob allocation requests and self-invokes (issued by w3s) blob/allocate, which, if the space has enough bytes available, inserts an allocationStore record with content + space.
  3. We store this input in our store/table (the counterpart of allocationStore). Looking at blob/allocate, we are missing issuer and size in the args. Size is part of Blob, so everything is fine there. We do have the taskCid though, which would allow us to find out the issuer if needed, but it may be an old artefact and not needed ✅
  4. I assume we are opting, in the implementation, to invoke blob/add a second time once the bytes have been written to the presigned URL. Can you confirm my assumption? ✅ (See the sketch after this list.)
  5. car/multihash checks on the migration path ✅
     a. We currently have carpark-prod-0 in R2/S3. When we perform a store/add we check whether the bytes are already there before asking users to write them.
     b. We now move to writing things keyed by multihash, and I am wondering if you have any thoughts on deduping here? Could we simply test the multihash with the CAR codec and give a location claim?
     c. Actually, the same also happens in our storeTable today. We key it by the link…
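
To make point 4 concrete, this is roughly the client flow I am assuming; invokeBlobAdd and digest below are hypothetical placeholders, not the actual client API:

```ts
// Rough sketch of the client flow in point 4; invokeBlobAdd and digest are
// hypothetical placeholders, not the real w3up client API.
type BlobAddReceipt = {
  // presigned URL (plus headers) to PUT the bytes to, returned while the
  // blob has not been stored yet
  url?: string
  headers?: Record<string, string>
}

async function addBlob(
  bytes: Uint8Array,
  invokeBlobAdd: (blob: { content: Uint8Array; size: number }) => Promise<BlobAddReceipt>,
  digest: (bytes: Uint8Array) => Promise<Uint8Array>
) {
  const blob = { content: await digest(bytes), size: bytes.length }

  // 1st blob/add: the service allocates and, if the bytes are missing,
  // hands back a presigned URL
  const receipt = await invokeBlobAdd(blob)

  if (receipt.url) {
    // the client writes the bytes to the presigned URL
    await fetch(receipt.url, { method: 'PUT', body: bytes, headers: receipt.headers })

    // 2nd blob/add: lets the service know the upload finished so it can
    // verify the bytes and complete the accept flow
    await invokeBlobAdd(blob)
  }
}
```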
Gozala commented 4 months ago

> I assume we are opting, in the implementation, to invoke blob/add a second time once the bytes have been written to the presigned URL. Can you confirm my assumption?

In terms of implementing /service/blob/accept, and specifically how the client signals that they've completed the upload, I suggest we do whichever of the following two options is easiest:

  1. Make the client send a second blob/add invocation after they've finished the upload so we can perform a lookup.
  2. Add another temporary capability with empty output, either something very specific like blob/add/poll or something very general like invocation/wake { task: CID } (rough sketch below).
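
For the general variant, something like the following could work. This is a sketch only: the capability name and its nb fields are placeholders, and the builder shape simply mirrors how our existing capabilities are defined.

```ts
// Sketch only: a temporary capability the client could invoke to poke the
// service after finishing the upload. Name and fields are placeholders.
import { capability, URI, Schema } from '@ucanto/validator'

export const wake = capability({
  can: 'invocation/wake',
  // invoked with the space DID as the resource
  with: URI.match({ protocol: 'did:' }),
  nb: Schema.struct({
    // link to the task (e.g. the original blob/add) being woken up
    task: Schema.link(),
  }),
})
```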
Gozala commented 4 months ago

> In the meantime I suggest we go with blob/* for new capabilities until the invocation spec, given the current implementation limitation with task/delegations. Just not sure if we also want a temporary prefix for the service ones?

The spec is written in the UCAN 1.0 format and assumes https://github.com/web3-storage/RFC/pull/12; however, in terms of the immediate implementation I suggest that we instead:

  1. Use blob/add instead of /space/content/add/blob
  2. Use web3.storage/blob/* in place of /service/blob/

I suggest the above because I think it would make more sense to transition to https://github.com/web3-storage/RFC/pull/12 once we have UCAN@1.0 implemented, because I suspect links to tasks vs invocations are going to be a pain to deal with otherwise. This will give us a cleaner break.
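
Concretely, the interim definitions could look roughly like this; the nb shapes are illustrative only and not meant to be final:

```ts
// Sketch of the interim capability names; nb shapes are illustrative only.
import { capability, URI, Schema } from '@ucanto/validator'

// user-facing: blob/add for now, instead of space/content/add/blob
export const add = capability({
  can: 'blob/add',
  with: URI.match({ protocol: 'did:' }),
  nb: Schema.struct({
    blob: Schema.struct({
      content: Schema.bytes(),  // multihash bytes of the blob
      size: Schema.integer(),
    }),
  }),
})

// service-internal: web3.storage/blob/allocate for now, instead of service/blob/allocate
export const allocate = capability({
  can: 'web3.storage/blob/allocate',
  with: URI.match({ protocol: 'did:' }),
  nb: Schema.struct({
    blob: Schema.struct({
      content: Schema.bytes(),
      size: Schema.integer(),
    }),
    space: Schema.did(),        // space the blob is being allocated in
  }),
})
```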

Gozala commented 4 months ago

> I think we have a similar pattern as w3-filecoin. Here is where my mind is:
> a. blob/add queues for allocation if the content is not already allocated in the given space (the resource of the invocation).
> b. The blobAllocateQueue consumer reads blob allocation requests and self-invokes (issued by w3s) blob/allocate, which, if the space has enough bytes available, inserts an allocationStore record with content + space.

I wrote some comments on how I imagine this could work, but I think we need to try to determine if it is in fact viable.
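
For reference, this is the shape of the flow I picture. Everything here is a hypothetical placeholder (BlobAllocation, AllocationStore, AllocationQueue, hasCapacity), not an existing w3up/w3infra interface:

```ts
// Hypothetical shapes only; the interfaces below are placeholders for
// whatever we end up with in w3infra.

interface BlobAllocation {
  space: string        // did:key of the space (resource of the invocation)
  content: Uint8Array  // multihash of the blob
  size: number         // size in bytes
}

interface AllocationStore {
  exists(space: string, content: Uint8Array): Promise<boolean>
  insert(allocation: BlobAllocation): Promise<void>
}

interface AllocationQueue {
  add(allocation: BlobAllocation): Promise<void>
}

// blob/add handler: queue for allocation only if the content is not already
// allocated in the given space
async function handleBlobAdd(
  allocation: BlobAllocation,
  store: AllocationStore,
  queue: AllocationQueue
) {
  if (!(await store.exists(allocation.space, allocation.content))) {
    await queue.add(allocation)
  }
}

// queue consumer: self-invokes (issued by w3s) blob/allocate; if the space has
// enough bytes available, records the allocation as content + space
async function handleBlobAllocate(
  allocation: BlobAllocation,
  store: AllocationStore,
  hasCapacity: (space: string, size: number) => Promise<boolean>
) {
  if (!(await hasCapacity(allocation.space, allocation.size))) {
    throw new Error('space does not have enough bytes available')
  }
  await store.insert(allocation)
}
```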

Gozala commented 4 months ago

> 3. We store this input in our store/table (the counterpart of allocationStore). Looking at blob/allocate, we are missing issuer and size in the args.

We have the taskCid though, which would allow us to find out those two; however, I would prefer to make them just part of the params instead. Do you think this is reasonable? I honestly do not know why we store the issuer and the size. Size may be nice to have so we can query all items in a space and get the sum of their sizes without HEAD requests, and we actually return it in the proposed receipt. So I think I would definitely add it.

We aren't missing size, because Blob is { size, content }; see:

https://github.com/web3-storage/specs/blob/9cc1b039431ed2853b30e5874c5fe3a1acf7876e/w3-blob.md?plain=1#L263-L267
https://github.com/web3-storage/specs/blob/9cc1b039431ed2853b30e5874c5fe3a1acf7876e/w3-blob.md?plain=1#L188-L191
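
In other words, the Blob argument already carries the size; roughly (a sketch of the shape described in the linked spec lines):

```ts
// Shape of the Blob argument per the linked spec lines (sketch)
type Blob = {
  content: Uint8Array // multihash bytes identifying the blob
  size: number        // size of the blob in bytes
}
```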

I wonder why we care about the issuer, though; I'll try to investigate. Without further context I would argue we should not care who the issuer is as long as the space has authorized them to perform the invocation, but maybe I'm missing some context here?

I would also suggest that we can simply require that the linked cause task be embedded in the invocation; that way we'll have the issuer info (at least before UCAN@1.0). But even then we should probably figure out why we have it in the first place.

Gozala commented 4 months ago

I did a little bit of archaeological digging to try to find out where the issuer is coming from, and it looks like it came from this PR: https://github.com/web3-storage/w3infra/pull/48

That in turn seems to have come from Hugo's table: https://github.com/web3-storage/w3up/blob/cc0f4f44c06a5884c6af50cf739dad6164d8e468/packages/access-api/migrations/0000_create_spaces_table.sql

So I have a feeling that nothing really depends on this and we could probably drop the field, but we would need to change the types and check whether anything depends on that field.

Gozala commented 4 months ago

> car/multihash checks on the migration path
> a. We currently have carpark-prod-0 in R2/S3. When we perform a store/add we check whether the bytes are already there before asking users to write them.
> b. We now move to writing things keyed by multihash, and I am wondering if you have any thoughts on deduping here? Could we simply test the multihash with the CAR codec and give a location claim?
> c. Actually, the same also happens in our storeTable today. We key it by the link…

I'm not sure I fully understand the question here, but I'll try my best to address what I think you're getting at.

a. We can derive the CAR CID from the blob multihash, simply by calling Link.create with CAR.code and the multihash. So we could perform the same check (see the sketch after the list below).
b. Not sure I fully understand, but yes, we could check if we have a CAR for that multihash, and if we do we could issue a location claim that strips off the CID header, basically taking cid.multihash.bytes.
c. I'm not entirely sure what the assumptions in the store table are, but we should:

  1. Consider whether we should reuse the store table or make another one.
  2. We could always map the multihash to a CID to keep the link invariant, but I would suggest using the RAW codec instead of CAR if we go that route.
  3. Consider what queries we run over that table and how those would be affected by non-CAR CIDs.
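
For a, this is the derivation I have in mind (sketch only; the codec codes are assumed constants, 0x0202 for CAR and 0x55 for raw):

```ts
// Sketch: derive lookup CIDs from a blob multihash. Codec codes are assumed
// constants (CAR = 0x0202, raw = 0x55); multihashBytes is the encoded multihash.
import * as Link from 'multiformats/link'
import * as Digest from 'multiformats/hashes/digest'

const CAR_CODE = 0x0202
const RAW_CODE = 0x55

export const cidsForMultihash = (multihashBytes: Uint8Array) => {
  const digest = Digest.decode(multihashBytes)
  return {
    // matches the existing carpark / store-table keys from the old world
    car: Link.create(CAR_CODE, digest),
    // suggested codec for keeping the link invariant going forward
    raw: Link.create(RAW_CODE, digest),
  }
}
```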
vasco-santos commented 4 months ago

Thanks for the follow-ups @Gozala, and for looking into the issuer thing. I think we are quite aligned on 1, 3 and 4 ✅. I will follow up on the remaining discussion points.

vasco-santos commented 4 months ago

On point 5 @Gozala, your assumptions were pretty much right for the first part. So yes, my proposal was to do exactly that, "simply by calling Link.create with CAR.code and multihash", but the general question/discussion topic is IF and HOW we want to deal with the old world.

More specifically, we have two stores at the moment: carpark (the bucket for CAR bytes, keyed as link/link.car) and store-table. If we do not check against these by injecting the CAR codec to compute such a link, we will receive a lot of repeated content. But needing to deal with two versions of the world is also hard and expensive, since for everything we would need to verify whether these tables/buckets hold new or old keys... So we need to define what migration/rollout strategy we would like here. Probably worth involving Hannah, so I will write a more specific issue for this.

EDIT: https://github.com/web3-storage/w3up/issues/1343

vasco-santos commented 4 months ago

@Gozala regarding point 2, as well as your thoughts provided on https://github.com/web3-storage/w3up/pull/1340#discussion_r1536145607:

  1. OK, so we are actually more closely aligned than I thought based on the previous code snippet shared in the spec. As I understood it, that was an example of the more futuristic invocation spec, and not what we can do today, or we would need to block on accept.
  2. I will answer on https://github.com/web3-storage/w3up/pull/1342#pullrequestreview-1957849688