Open vasco-santos opened 4 months ago
I would suggest trying an amortized migration from CAR → Blob. Specifically, I suggest doing the following:
This way the extra costs will be temporary, although sadly they are incurred on every new write, which is not great. Also, I suspect we can manage to do the Dynamo queries without doing one as CAR and another as blob, but I don't believe that would work for S3.
Actually, now that I'm thinking about it, we probably need to move from checking whether we have the CAR/Blob in S3 to checking whether we have it in location claims, don't we? In the future we will not have it in S3 but in R2, so perhaps we should be checking the index instead. We do need to consider, however, that we may have content in S3/R2 before we have it indexed.
Yes, we need to look for claims, whether location or other. But the exact same problem happens there: the dynamo/allocation store has the same thing happening, with a claim for the CarCID or (TBD, we talked about raw, right?) CID for the multihash.
Personally I'd punt on de-duping against old data. There's already a lot more to implement here than I'd imagined and dealing with de-duping might make the code messy and hard to follow and leaves us with dependencies on buckets we may not be using in the future. When we get to the state where we're uploading to a node on a decentralized network de-duping will be on the level of the node you're uploading to, not some global store.
If necessary we can implement de-duping with old data at a later date.
Agree with Alan here. I'm fine with not worrying about deduping for now.
I would rather handle the migration in a script when we feel it's safe to deprecate store/add.
store protocol persisted state
Since we shipped w3up, the `store/*` protocol implementation is backed by two state stores:
- `storeTable`: `space` and `link` (CID with CAR codec)
- `carStoreBucket`: `${link}/${link}.car`
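As a rough illustration of the two stores described above, here is a minimal TypeScript sketch. The `StoreTableRecord` interface and the `carStoreKey` helper are hypothetical names introduced only for this example. Only the `space`/`link` fields and the `${link}/${link}.car` key layout come from the description above.

```typescript
// Hypothetical shape of a storeTable row: a CAR link mapped to a space.
interface StoreTableRecord {
  space: string; // space DID
  link: string;  // CID with CAR codec, as a string
}

// Key layout described for carStoreBucket: `${link}/${link}.car`.
function carStoreKey(link: string): string {
  return `${link}/${link}.car`;
}

// Example values are placeholders, not real identifiers.
const record: StoreTableRecord = { space: "did:key:example", link: "bagbaiera" };
const key = carStoreKey(record.link);
```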
blob protocol persisted state
On the other side, we are now implementing the `blob/*` protocol, which is less opinionated about the bag of blocks ingested. Therefore, the blob protocol receives the `multihash` bytes and returns back `multihash` bytes, even though naturally it will need to encode this multihash internally (for instance in base64).

The blob protocol needs persisted state quite similar to the store protocol. To untie it from the "store" and "car" related namings, at the moment we are using names closer to the blob protocol:
- `allocationStorage` instead of `storeTable`
- `blobStorage` instead of `carStoreBucket`
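To make the "encode this multihash internally" point concrete, here is a minimal sketch that turns multihash bytes into a base64 string usable as a storage key. It assumes standard base64 with padding. The `toBase64` helper and the `AllocationRecord` shape are hypothetical and only illustrate the idea; the actual encoding and record layout are not specified in this issue.

```typescript
// Standard base64 (RFC 4648) with padding, written out to stay dependency-free.
function toBase64(bytes: Uint8Array): string {
  const alphabet =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += alphabet[b0 >> 2];
    out += alphabet[((b0 & 3) << 4) | (b1 >> 4)];
    out += has1 ? alphabet[((b1 & 15) << 2) | (b2 >> 6)] : "=";
    out += has2 ? alphabet[b2 & 63] : "=";
  }
  return out;
}

// Hypothetical allocationStorage record: space plus encoded multihash.
interface AllocationRecord {
  space: string;
  multihash: string; // base64-encoded multihash bytes
}

function allocationRecord(space: string, multihash: Uint8Array): AllocationRecord {
  return { space, multihash: toBase64(multihash) };
}
```

Whatever encoding is chosen, it has to be applied consistently in both `allocationStorage` and `blobStorage` so that lookups by multihash agree.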
Note that the indexing SHOULD be quite similar, and discussing it is likely out of scope for this issue. The main thing is that the index keys will now be different for the same CARs uploaded.
Integrate new world with old world
The main problem we want to solve here is how to make both worlds work together, or if it is actually desired to do so.
When the `store/add` handler is called, the `carStoreBucket` is checked so that we know if that CAR is already being stored. If so, we do not need to receive the bytes. Moreover, we check if `storeTable` has a mapping of the CAR link to that space. Depending on the result of these ops, we can do one of the following:

In the `blob/add` handler, we MUST do the same set of verifications as the ones above. However, we MAY want to continue decoupling both allocating on user space, and requesting bytes to be written for content we have already received as a CAR before.

We can check if we already received a CAR with the same bytes (in other words, we can derive the CAR CID from the multihash by creating a CID with the CAR codec). However, this will also mean:
Note that this will be tied to looking up on the bucket for now, but the same applies to looking for claims for that content.
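Deriving a CAR CID from a multihash, as mentioned above, amounts to prefixing the multihash bytes with the CID version (1) and the CAR multicodec code (0x0202), then base32-encoding the result. The sketch below does this by hand to stay dependency-free. The function names are hypothetical, and in practice a library such as `multiformats` would likely be used instead.

```typescript
// Multicodec code for CAR.
const CAR_CODEC = 0x0202;

// Unsigned varint (LEB128), as used in CID prefixes.
function varint(n: number): number[] {
  const out: number[] = [];
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80);
    n >>>= 7;
  }
  out.push(n);
  return out;
}

// RFC 4648 base32, lowercase, no padding.
function base32(bytes: Uint8Array): string {
  const alphabet = "abcdefghijklmnopqrstuvwxyz234567";
  let bits = 0;
  let value = 0;
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    value = (value << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += alphabet[(value >>> (bits - 5)) & 31];
      bits -= 5;
      value &= (1 << bits) - 1; // keep only the bits not yet emitted
    }
  }
  if (bits > 0) out += alphabet[(value << (5 - bits)) & 31];
  return out;
}

// CIDv1 string: multibase prefix "b" + base32(version, codec, multihash).
function carCidFromMultihash(multihash: Uint8Array): string {
  const prefix = [1].concat(varint(CAR_CODEC));
  const bytes = new Uint8Array(prefix.length + multihash.length);
  bytes.set(prefix, 0);
  bytes.set(multihash, prefix.length);
  return "b" + base32(bytes);
}
```

With a sha2-256 multihash (prefix bytes `0x12 0x20`), the resulting string starts with `bagbaiera`, and a `blob/add` handler could pass it to the existing `${link}/${link}.car` bucket lookup.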
Alternatively, we could just start from scratch with the new bucket in R2/other write targets. This would also tie in nicely with the previous discussions that a new bucket should exist once nucleation happens, instead of having the nucleated entity bill historical content.
Would like your opinions to get to a decision. cc @hannahhoward @alanshaw @Gozala @reidlw