n0-computer / iroh

peer-2-peer that just works
https://iroh.computer
Apache License 2.0

`iroh-blobs` store that uses s3 as a storage backend #1956

Open rklaehn opened 10 months ago

rklaehn commented 10 months ago

There is customer demand for an S3 storage backend for iroh-bytes.

There are various use cases. One, from https://www.eqtylab.io/, is to serve from existing S3 buckets. Other projects that might find this useful are s5 (they are doing this already) and bacalhau.

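Questions:

- s3 just for the data itself, or also for the outboard?
- ability to compute outboard for existing s3 asset?
- should this be just a library (some assembly required) or a full iroh feature?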

Note: the current store traits allow importing a file from the local file system. But it seems that for some users this is not wanted. Instead you want to compute an outboard for an existing s3 bucket. That is reasonably easy to do at the library level, but fully integrating this into iroh would take some time.

So maybe the best way to do this in a reasonable time would be a small separate project similar to sendme or dumbpipe?

b5 commented 10 months ago

> s3 just for the data itself, or also for the outboard?

I'd like to see a world where we can do both, but would start with S3 just for the data. If we also store outboards in the bucket, we'd likely want to use the .obao extension convention. Having read/write access to the store would then require write access to the bucket, which will take a while. Let's start with local outboard & remote data.

> ability to compute outboard for existing s3 asset?

Yes. This one is crucial.

> should this be just a library (some assembly required) or a full iroh feature?

We should start it as a library, and only move it down if lots of people want it & we can stabilize the API.

I think the starting point should focus on the readonly paths of the S3 API. After a basic "I give iroh one object URL, iroh computes the outboard & can serve subsequences using that URL", I'd want us to look toward deeper integration with the S3 API for listing bucket objects, possibly creating collections from "folders".
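A minimal sketch of the read-only path described above: turning a requested byte range into a ranged read against the object URL. It assumes 16 KiB chunk groups for the outboard; `GROUP_SIZE`, `align_to_groups`, and `range_header` are hypothetical names for illustration, not part of iroh.

```rust
/// Chunk group size in bytes. 16 KiB is an assumption for this sketch.
const GROUP_SIZE: u64 = 16 * 1024;

/// Expand the half-open byte range `[start, end)` to chunk group
/// boundaries, clamped to the object size. Returns `None` for an
/// empty range or one that starts past the end of the object.
fn align_to_groups(start: u64, end: u64, size: u64) -> Option<(u64, u64)> {
    if start >= end || start >= size {
        return None;
    }
    let end = end.min(size);
    let aligned_start = start / GROUP_SIZE * GROUP_SIZE;
    // Round up to the next group boundary, but never past the object end.
    let aligned_end = ((end + GROUP_SIZE - 1) / GROUP_SIZE * GROUP_SIZE).min(size);
    Some((aligned_start, aligned_end))
}

/// Format an HTTP `Range` header value for the half-open range
/// `[start, end)`. HTTP byte ranges are inclusive on both ends, hence
/// `end - 1`. Requires `start < end`.
fn range_header(start: u64, end: u64) -> String {
    format!("bytes={}-{}", start, end - 1)
}
```

Reading whole chunk groups matters because the hashes in the outboard cover whole groups, so a partial group cannot be verified on its own.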

rklaehn commented 10 months ago

OK, so something like this: I write an S3-compatible store impl with a few extra fns in an example project. The extra fns are about importing from existing S3 buckets (as opposed to importing from the local fs). This allows you to point to an S3 resource and then serve it using the iroh-bytes protocol, similar to sendme send.

Once that is refined, this s3 compatible store impl moves to iroh-bytes under a feature flag.

Then we decide whether to integrate this into iroh itself or not. But I think this would be rather hard to do in a generic way, so let's see.

ppodolsky commented 10 months ago

Just my 50 cents: might it be worth creating a file system implementation and then just mounting S3 through s3fs?

rklaehn commented 10 months ago

That's something you could do now. But it would be limiting in terms of platform compatibility (FUSE is supported only on Linux, with limited support on macOS). Also, I think one of the benefits of this idea is that you can take existing resources on S3 and make them content-addressed and range-queryable retroactively by just computing an outboard. That would not work with the s3fs approach, since that requires iroh to control the resource name.
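A rough sketch of what "not controlling the resource name" could look like: a local index maps a content hash to an object that keeps its original URL, next to a locally stored outboard. All names here (`DataLocation`, `Entry`, `Index`, `import_external`) are hypothetical and not part of the iroh store traits. The hash and outboard are assumed to have been computed elsewhere by reading the object once.

```rust
use std::collections::HashMap;

/// A 32-byte content hash, the size BLAKE3 produces.
type Hash = [u8; 32];

/// Where the bytes of a blob live. Both variants are hypothetical.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataLocation {
    /// A file whose path the store chose itself.
    Owned(String),
    /// An existing object that keeps its original URL.
    External(String),
}

/// What is kept locally for one blob.
#[allow(dead_code)]
#[derive(Debug)]
struct Entry {
    location: DataLocation,
    size: u64,
    /// Outboard bytes (the hash tree), stored locally rather than in the bucket.
    outboard: Vec<u8>,
}

/// Local index from content hash to entry.
#[derive(Default)]
struct Index {
    entries: HashMap<Hash, Entry>,
}

impl Index {
    /// Register an existing object under its content hash without
    /// copying or renaming it.
    fn import_external(&mut self, hash: Hash, url: &str, size: u64, outboard: Vec<u8>) {
        let location = DataLocation::External(url.to_string());
        self.entries.insert(hash, Entry { location, size, outboard });
    }

    /// Look up where the bytes for a hash can be read from.
    fn locate(&self, hash: &Hash) -> Option<&DataLocation> {
        self.entries.get(hash).map(|entry| &entry.location)
    }
}
```

The object stays where it is, so anything else that already reads it by URL keeps working.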

ppodolsky commented 10 months ago

I may have missed something. But why can "computing an outboard" be done for S3 and not for local files? Or why does S3 not require control of the resource name, while the FS approach does?

Xuanwo commented 1 month ago

Hi, I discovered this issue through friends who shared the iroh project. This project is incredibly cool, and I personally find this issue interesting.

I'm from the Apache OpenDAL community, which offers a unified data access layer that lets users retrieve data from diverse storage services seamlessly and efficiently.

Currently, OpenDAL supports S3, GCS, AzBlob, and other major storage services, making it a suitable choice for this project.

I am interested in building a demo based on our existing trait. What are your thoughts?

https://github.com/n0-computer/iroh/blob/39d4bd9c757da0dc7005f97b5c3d588532f48c42/iroh-blobs/src/store/traits.rs#L298-L389

sandreae commented 3 weeks ago

I'm curious if any progress has been made towards an S3-backed blob store implementation? We have a use case for exactly this. We need peers to import blobs to their store from an existing S3 bucket. The calculated BAO outboard will be stored locally (although I'm open to other suggestions...), while the actual bytes should stay in the bucket. When a blob is downloaded from the network, the bytes should end up in the receiving node's own configured bucket. Importantly, the BAO outboards should be calculated incrementally over chunked data, as the blobs are very large and we can't ever have a whole blob on the local filesystem.
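To illustrate the incremental requirement, here is a minimal sketch of hashing a stream through a fixed-size buffer, so memory use stays bounded regardless of blob size. FNV-1a is used purely as a stand-in so the example runs with the standard library alone. A real implementation would feed the same chunks into a BLAKE3/BAO encoder that also emits the outboard. `Fnv1a` and `hash_streaming` are hypothetical names.

```rust
use std::io::{self, Read};

/// 64-bit FNV-1a. A stand-in for BLAKE3, NOT a cryptographic hash.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }

    /// Feed more bytes; can be called any number of times.
    fn update(&mut self, data: &[u8]) {
        for &byte in data {
            self.0 ^= byte as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }

    fn finalize(&self) -> u64 {
        self.0
    }
}

/// Hash everything from `reader` using one reusable buffer of
/// `buf_size` bytes. Returns the hash and the number of bytes read.
/// `reader` could be a wrapper around ranged reads of a bucket object.
fn hash_streaming<R: Read>(mut reader: R, buf_size: usize) -> io::Result<(u64, u64)> {
    assert!(buf_size > 0, "buffer must not be empty");
    let mut buf = vec![0u8; buf_size];
    let mut hasher = Fnv1a::new();
    let mut total = 0u64;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        hasher.update(&buf[..n]);
        total += n as u64;
    }
    Ok((hasher.finalize(), total))
}
```

The result does not depend on the buffer size, which is the property that makes chunked processing safe.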

I saw the example code in iroh-experiments for an S3 BAO store. This is great as a reference, but a full implementation of the store traits will be more involved than what's been done there.

Any efforts towards this from within iroh org or otherwise?