storacha-network / RFC

💬 Request for comment

Leverage multipart blob uploads to remove need for sharding / reassembly #20

Open · Gozala opened this issue 3 months ago

Gozala commented 3 months ago

S3 has multipart uploads that allow users to shard content and upload it in parts, just like we shard into CARs and upload them.

In our case, however, we have to do a whole lot of bookkeeping and reassemble content on reads. This will make even less sense if we transition from UnixFS into the BAO / Blake3 world, where sharding will offer no value, just overhead.

I think it would be interesting to explore how multipart uploads could be leveraged to avoid sharding and whether multipart uploads are supported by other storage providers.
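For reference, the basic S3 multipart flow looks roughly like this with the AWS SDK v3 (bucket/key names and the chunking here are purely illustrative, not our actual layout):

```ts
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
} from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' })
const Bucket = 'example-bucket' // illustrative bucket / key
const Key = 'some-large-blob'

// the blob split into slices; S3 requires 5MiB-5GiB per part (last part may be smaller)
const chunks: Uint8Array[] = []

// 1. initiate the upload and keep the UploadId
const { UploadId } = await s3.send(new CreateMultipartUploadCommand({ Bucket, Key }))

// 2. upload each part and collect the returned ETags
const parts: { ETag?: string, PartNumber: number }[] = []
for (let PartNumber = 1; PartNumber <= chunks.length; PartNumber++) {
  const { ETag } = await s3.send(new UploadPartCommand({
    Bucket, Key, UploadId, PartNumber, Body: chunks[PartNumber - 1],
  }))
  parts.push({ ETag, PartNumber })
}

// 3. complete: S3 assembles the parts into a single object server-side
await s3.send(new CompleteMultipartUploadCommand({
  Bucket, Key, UploadId, MultipartUpload: { Parts: parts },
}))
```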

Gozala commented 3 months ago

Also, this article describes a concatenation strategy using multipart uploads, where objects in S3 can be used as parts without having to re-upload them. I think it is really interesting, as it may allow us to create append-only logs, which could be the basis for queues and a whole lot of other things relevant in distributed systems.
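The concatenation trick boils down to UploadPartCopy: existing objects become parts of a new object via a server-side copy, with no bytes re-uploaded. A sketch under illustrative names (note S3 requires every copied part except the last to be at least 5MiB):

```ts
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCopyCommand,
  CompleteMultipartUploadCommand,
} from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' })
const Bucket = 'example-bucket'  // illustrative
const Key = 'log/combined'       // the concatenated object
const sources = ['log/segment-0', 'log/segment-1', 'log/segment-2']

const { UploadId } = await s3.send(new CreateMultipartUploadCommand({ Bucket, Key }))

// each existing object becomes a part via a server-side copy, no re-upload of bytes
const Parts: { ETag?: string, PartNumber: number }[] = []
for (let i = 0; i < sources.length; i++) {
  const { CopyPartResult } = await s3.send(new UploadPartCopyCommand({
    Bucket, Key, UploadId,
    PartNumber: i + 1,
    CopySource: `${Bucket}/${sources[i]}`,
    // CopySourceRange: 'bytes=0-5242879' would take just a slice of the source
  }))
  Parts.push({ ETag: CopyPartResult?.ETag, PartNumber: i + 1 })
}

await s3.send(new CompleteMultipartUploadCommand({
  Bucket, Key, UploadId, MultipartUpload: { Parts },
}))
```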

hannahhoward commented 3 months ago

My primary question @Gozala -- what's the Filecoin strategy? We have to shard at that level for super large uploads. I always understood 4GB to be a limit that was safe to get in a single Filecoin piece.

Gozala commented 3 months ago

> My primary question @Gozala -- what's the Filecoin strategy? We have to shard at that level for super large uploads. I always understood 4GB to be a limit that was safe to get in a single Filecoin piece.

Currently we submit aggregates that are 16GiB which we chose over 32GiB (both options recommended by spade) to move content faster through the pipeline.

There is some overhead to be considered for the segment index, but otherwise blobs up to (aggregate size - segment index size) could be used without a problem. You are correct, however, that blobs larger than an aggregate would need to be sharded for the Filecoin pipeline, but even then we could provide them to SPs as byte ranges within our large blob.
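Rough arithmetic, just to illustrate (the segment index size here is a placeholder; the real overhead depends on the aggregation format and how many pieces end up in the aggregate):

```ts
const GiB = 1024 ** 3
const aggregateSize = 16 * GiB

// placeholder value: actual segment index overhead varies with piece count
const segmentIndexSize = 16 * 1024 * 1024

// a blob up to this size could go into an aggregate without sharding; anything
// bigger would be split for the Filecoin pipeline only, with each piece
// provided to SPs as a byte range into the original blob
const maxUnshardedBlob = aggregateSize - segmentIndexSize
```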

It is worth calling out the motivating benefits. Currently we create much smaller shards of around 127MiB that are CARs, so for large uploads (e.g. a video stream) we'd end up with a lot of shards, then an index mapping blocks across those shards, and when serving we'd need to reverse all of that. That is a lot of overhead, and you lose the ability to (efficiently) skip forward/backward on a video stream, just to satisfy the first iteration over hash-addressed content. In contrast, if we retain content as is, we remove a lot of the overhead associated with sharding, transcoding and then reconstructing, and gain the ability to skip forward/backward. Better yet, the actual blob could be read right out of the copy that an SP holds.
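To make the serving benefit concrete, reading a slice of the retained blob is just an HTTP range read against the object (names below are hypothetical):

```ts
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' })

// seek within a large video blob by reading only the bytes needed, instead of
// resolving blocks across many CAR shards and reassembling them
const { Body } = await s3.send(new GetObjectCommand({
  Bucket: 'example-bucket',       // hypothetical bucket / key
  Key: 'large-video-blob',
  Range: 'bytes=1048576-2097151', // the second MiB of the blob
}))
```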

hannahhoward commented 3 months ago

Wow, ok. I don't know where I got 4GB as the shard size. 127MiB is super small. And the shard as index into a larger blob is SUPER interesting.

Alright now I really like this. It makes 100% sense we should apply sharding only when needed, and specific to the situation where it's needed, rather than have a fixed sharding all the way through for both Filecoin and the upload part.

So essentially, multipart upload is "temporary sharding" for the upload process, while byte range indexes into a blob for an SP are sharding tailored to filecoin.

That's pretty cool. And we don't really have to go all in on Blake3 to do it -- we could just stop sharding into multiple CAR files.

Ok, well, don't hate me, but while I think this is a great thing to do I also don't think it goes in Write To R2. I am imagining an endeavor, at a later point, we can call "simplifying storage format", which would cover this as well as Blake3 instead of CARs for flat files.

vasco-santos commented 3 months ago

Looks like R2 also supports multipart uploads: https://developers.cloudflare.com/r2/objects/multipart-objects/

What is important to look at is whether presigned URLs support multipart uploads in S3/R2, at least while we use them.
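S3 at least lets you presign the individual UploadPart requests; a sketch with the AWS SDK v3 below (whether R2 accepts the same thing is exactly what we'd need to verify, and the names are illustrative):

```ts
import { S3Client, UploadPartCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

const s3 = new S3Client({ region: 'us-east-1' }) // or an R2 endpoint
const uploadId = 'upload-id-from-create'          // returned by CreateMultipartUpload

// the service presigns one URL per part and hands it to the client,
// which then PUTs the part bytes directly to the bucket
const url = await getSignedUrl(s3, new UploadPartCommand({
  Bucket: 'example-bucket', // illustrative
  Key: 'some-blob',
  UploadId: uploadId,
  PartNumber: 1,
}), { expiresIn: 3600 })

// client side: await fetch(url, { method: 'PUT', body: partBytes })
```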

> Currently we submit aggregates that are 16GiB which we chose over 32GiB (both options recommended by spade) to move content faster through the pipeline.

We submit aggregates that are 26GiB minimum actually :) It is an env VAR in seed.run

> Wow, ok. I don't know where I got 4GB as the shard size. 127MiB is super small. And the shard as index into a larger blob is SUPER interesting.

4GB is the biggest shard size we support, for two main reasons:

127MiB is the client's default chunk size. The user can pass a parameter to make it bigger or smaller; we could actually discuss whether we want to change it. I think we decided on a lower value to make retries on failure faster, allow parallel chunk writes, etc. I think @alanshaw may have thoughts here

alanshaw commented 3 months ago

Ya, 127MB is the default in the client but it is configurable up to ~4GB. Tuning the shard size can allow resuming a failed upload to be quicker, amongst other historical reasons - can elaborate if required...
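For reference, bumping the shard size from the client looks something like this (assuming the current w3up-client API, where uploadFile takes a shardSize option; exact names may differ after the rename):

```ts
import * as Client from '@web3-storage/w3up-client'

const client = await Client.create()
const file = new Blob(['hello world']) // stand-in for a real file

// larger shards mean fewer CARs per upload, at the cost of slower retries on failure
await client.uploadFile(file, {
  shardSize: 512 * 1024 * 1024, // 512MiB instead of the ~127MiB default
})
```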

⚠️ Last time I used R2 multipart it did NOT support a "checksum" for each part. I cannot remember if you can use a checksum for the entire multipart upload. We need to check.
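Against S3 a per-part checksum is just an extra field on UploadPart, as in the sketch below (names and values illustrative). On S3 the object-level checksum of a multipart upload ends up being a composite of the part checksums; whether R2 honours either is what needs checking.

```ts
import { S3Client, UploadPartCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' })
const partBytes = new Uint8Array([1, 2, 3])             // stand-in for real part content
const partSha256 = 'base64-encoded-sha256-of-the-part'  // placeholder

// S3 verifies the supplied checksum against the uploaded bytes and rejects a mismatch
await s3.send(new UploadPartCommand({
  Bucket: 'example-bucket', // illustrative
  Key: 'some-blob',
  UploadId: 'upload-id-from-create',
  PartNumber: 1,
  Body: partBytes,
  ChecksumSHA256: partSha256,
}))
```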

ℹ️ Note: uploaded parts persist in the bucket until a complete/abort request is sent at the end. If the user never sends that final request then I think by default the parts hang around indefinitely. That said, I imagine there is some kind of lifecycle rule you can set to clean them up, but we'd need to check. It does make things a bit more complicated though...
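S3's lifecycle configuration has a rule specifically for this (R2's lifecycle rules document a similar option, but worth verifying). Roughly, with illustrative names:

```ts
import { S3Client, PutBucketLifecycleConfigurationCommand } from '@aws-sdk/client-s3'

const s3 = new S3Client({ region: 'us-east-1' })

// abort any multipart upload still incomplete 7 days after it started,
// which also deletes the orphaned parts
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: 'example-bucket', // illustrative
  LifecycleConfiguration: {
    Rules: [{
      ID: 'abort-stale-multipart-uploads',
      Status: 'Enabled',
      Filter: { Prefix: '' }, // whole bucket
      AbortIncompleteMultipartUpload: { DaysAfterInitiation: 7 },
    }],
  },
}))
```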