ipfs / notes

IPFS Collaborative Notebook for Research

S3-backed IPFS #214

Closed edsilv closed 4 years ago

edsilv commented 7 years ago

I'm keen to adopt IPFS for some projects I'm working on, but a key requirement for me is reliable, scalable, and cheap storage. Normally I'd use S3, but it appears that it isn't possible to back IPFS with S3 at the current time?

@VictorBjelkholm @flyingzumwalt @lgierth

parkan commented 7 years ago

I believe cubepin may have had some relevant pieces that could potentially be released on their own. The interface I'd love is being able to spin up an image (an AMI?), give it S3 credentials as either env vars or an instance role, and have it transparently use that bucket.

Beyond that, I'm guessing that we're waiting on ipfs-cluster to be a bit more mature -- it seems worthwhile to stabilize the API there before working out consensus, etc.?

Kubuxu commented 7 years ago

We should definitely think about it; we used to have an adapter for this.

One thing you will definitely want when using S3 is an EC2 instance in the same region, so you don't pay the transfer fee twice (once from S3 to EC2, and again from EC2 to the rest of the IPFS network).

flyingzumwalt commented 7 years ago

I've always thought it would be useful if someone published a Docker container with an IPFS pinning service installed on it. Then all you would have to do is tell it where to read/store the data (i.e. an S3 bucket) and configure write permissions. Maybe someone in the community wants to create this?

jbenet commented 7 years ago

Thanks for bumping this -- we really should have this solved by now. It's very much on my mind and may be relevant for this coming sprint.

At the very least, if someone is willing to write a solid implementation of this, Protocol Labs can sponsor it.

shuoy commented 7 years ago

@edsilv, what is your use case? Why would you need an S3 storage layer along with the IPFS protocol (which is more or less p2p-oriented)? I'd love to hear more and see whether it matches some of the use cases I have in mind (and if so, I'd love to discuss how to collaborate).

@jbenet please feel free to jump in and fill in the picture here as well.

edsilv commented 7 years ago

Hi @shuoy,

Primarily I'm thinking about ways to persist 3D models I've optimised for the web, along with their associated assets, but this could apply to any form of content really.

Normally I'd use S3 for storage rather than disk space on a VM, and certainly not one of my personal devices. This is how blokdust.com (a recent project) works, for example. BlokDust runs on a DigitalOcean droplet with one CPU and 1GB of RAM, but by combining S3 for all the static files with Cloudflare's CDN it was able to cope with quite a lot of traffic. As people can save their own "compositions", I wanted to be sure that it would scale indefinitely if needed. The alternative would be provisioning more disk space on DO. I think I did a price comparison at some point, but having made many such "user-generated content" websites in the past, it's almost a no-brainer to use S3.

To bring this back to the 3D models I mentioned: I'm hosting several of these on my personal S3 account right now. Ideally organisations like the British Library and Stanford would be hosting these themselves, but it's useful to have this middle ground where the models are definitely going to be available until those organisations choose to take responsibility for hosting them. When they do, it would be really nice to just give them a hash to pin.

For BlokDust, I could foresee this (or other similar future projects) being more p2p-oriented. I'd probably still want to provide a storage "safety net", however, in the form of an IPFS-wrapped S3 bucket.

shuoy commented 7 years ago

@edsilv thanks for providing the context.

> Normally I'd use S3 for storage rather than disk space on a VM, and certainly not one of my personal devices. This is how blokdust.com (a recent project) works, for example. BlokDust runs on a DigitalOcean droplet with one CPU and 1GB of RAM, but by combining S3 for all the static files with Cloudflare's CDN it was able to cope with quite a lot of traffic. As people can save their own "compositions", I wanted to be sure that it would scale indefinitely if needed. The alternative would be provisioning more disk space on DO. I think I did a price comparison at some point, but having made many such "user-generated content" websites in the past, it's almost a no-brainer to use S3.

Agreed, S3 is the cost-competitive solution for this kind of asset storage.

> To bring this back to the 3D models I mentioned: I'm hosting several of these on my personal S3 account right now. Ideally organisations like the British Library and Stanford would be hosting these themselves, but it's useful to have this middle ground where the models are definitely going to be available until those organisations choose to take responsibility for hosting them. When they do, it would be really nice to just give them a hash to pin.

But I still don't see where IPFS fits into this picture. The British Library could still go the S3 + CDN route, as you do today. I'm trying to understand what extra value IPFS brings in this context. Are you trying to use IPFS to replace the CDN component? Or maybe I missed something?

edsilv commented 7 years ago

@shuoy I guess I'm anticipating the general adoption of IPFS, and in a small way trying to aid in that adoption. As you point out, the current status quo of S3 + CDN is acceptable; however, I believe in the mission of IPFS and genuinely think that once it starts to be adopted it will be an improvement over location addressing in terms of bandwidth efficiency and content availability. I work in the digital humanities, primarily with libraries. Libraries generally do not have the resources of large corporations, but I think the new efficiencies enabled by IPFS will help them maintain their role as independent, publicly funded stores of knowledge. So there's certainly a philosophical component to the value-add. I blame @jbenet ;-)

flyingzumwalt commented 7 years ago

I ran into @b5 this weekend. He might be able to help with this.

whyrusleeping commented 7 years ago

All that's needed for this is to implement the datastore interface using S3. Here are a few reference datastore implementations that might be helpful to look at:
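
For orientation, here is a minimal sketch of what such an S3-backed datastore could look like, assuming the github.com/ipfs/go-datastore interface and aws-sdk-go; the bucket wiring and error handling are illustrative, and the remaining interface methods (Has, Delete, Query, etc.) are omitted:

```go
package s3ds

import (
	"bytes"
	"context"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	ds "github.com/ipfs/go-datastore"
)

// S3Datastore stores each datastore key as one S3 object.
// This is a sketch: it implements only Put and Get.
type S3Datastore struct {
	client *s3.S3
	bucket string
}

func NewS3Datastore(bucket string) (*S3Datastore, error) {
	// Region and credentials come from the environment or instance role.
	sess, err := session.NewSession()
	if err != nil {
		return nil, err
	}
	return &S3Datastore{client: s3.New(sess), bucket: bucket}, nil
}

func (d *S3Datastore) Put(ctx context.Context, key ds.Key, value []byte) error {
	_, err := d.client.PutObjectWithContext(ctx, &s3.PutObjectInput{
		Bucket: aws.String(d.bucket),
		Key:    aws.String(key.String()), // real code would normalize the leading "/"
		Body:   bytes.NewReader(value),
	})
	return err
}

func (d *S3Datastore) Get(ctx context.Context, key ds.Key) ([]byte, error) {
	out, err := d.client.GetObjectWithContext(ctx, &s3.GetObjectInput{
		Bucket: aws.String(d.bucket),
		Key:    aws.String(key.String()),
	})
	if err != nil {
		// A real implementation would distinguish "missing" from other errors.
		return nil, ds.ErrNotFound
	}
	defer out.Body.Close()
	return io.ReadAll(out.Body)
}
```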

b5 commented 7 years ago

Thanks @whyrusleeping for such clear directions! I've taken an initial stab: https://github.com/qri-io/go-ds-s3

Feedback welcome.

kyledrake commented 7 years ago

One quick and dirty way to get S3 backend support is to use an S3-mounted FUSE filesystem for the datastore directory: https://github.com/kahing/goofys

I have successfully done this with IPFS already. The main issue is that performance is not great. This is more a consequence of the nature of S3 (slow HTTP uploads/downloads) than of IPFS or the FUSE interface. Goofys gives you some caching abilities that improve things a bit.

Another issue is cost. S3 follows the usual "cheap in, expensive out" cloud business model, and at $0.09/GB for egress you're going to be in for a world of hurt if you're working with dozens of terabytes without a real budget. Also, because IPFS "shards" files into 256KB chunks, each chunk ends up being a separate call to S3, increasing both latency and cost ($0.004 per 10k requests).
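
To make those numbers concrete, here is a back-of-the-envelope calculation using the prices quoted above (a sketch; check AWS's current price list before relying on it):

```go
package main

import "fmt"

// Rough cost of reading 1 GiB from S3 at the prices quoted above
// ($0.09/GB egress, $0.004 per 10k requests), assuming the default
// 256 KiB chunk size, so each chunk is one GET request.
func main() {
	const (
		chunkKiB    = 256
		gibInKiB    = 1 << 20 // 1 GiB expressed in KiB
		perRequest  = 0.004 / 10000
		egressPerGB = 0.09
	)
	chunks := gibInKiB / chunkKiB // 4096 GETs per GiB
	fmt.Printf("requests: %d, request cost: $%.4f, egress: $%.2f\n",
		chunks, float64(chunks)*perRequest, float64(egressPerGB))
	// Output: requests: 4096, request cost: $0.0016, egress: $0.09
	// Egress dominates reads; request overhead matters more for writes
	// and for many small operations at scale.
}
```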

There are possible performance optimizations to be had here by making a custom datastore just for S3, as the filesystem datastore likely makes assumptions about read-access speed that don't hold for a lot of the data. I wouldn't know without digging into the code to see what the daemon does with the datastore files. Some of the obvious things would be: not using directory sharding (dirs are just special files on S3) and doing straight K/V instead, not doing any access()/stat() calls, and doing chunk uploads/downloads in parallel.
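
As a sketch of the parallel-transfer idea, here is one way to fetch many chunk objects concurrently with aws-sdk-go; the bucket name, keys, and concurrency cap are illustrative:

```go
package main

import (
	"bytes"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// fetchChunks downloads many small objects in parallel, which hides
// much of S3's per-request latency when reading 256 KiB chunks.
func fetchChunks(client *s3.S3, bucket string, keys []string) map[string][]byte {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string][]byte, len(keys))
		sem     = make(chan struct{}, 16) // cap at 16 in-flight GETs
	)
	for _, key := range keys {
		wg.Add(1)
		go func(key string) {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			out, err := client.GetObject(&s3.GetObjectInput{
				Bucket: aws.String(bucket),
				Key:    aws.String(key),
			})
			if err != nil {
				return // a real implementation would surface the error
			}
			defer out.Body.Close()
			var buf bytes.Buffer
			buf.ReadFrom(out.Body)
			mu.Lock()
			results[key] = buf.Bytes()
			mu.Unlock()
		}(key)
	}
	wg.Wait()
	return results
}

func main() {
	sess := session.Must(session.NewSession())
	// Hypothetical bucket and chunk keys, for illustration only.
	_ = fetchChunks(s3.New(sess), "my-ipfs-bucket", []string{"chunk-1", "chunk-2"})
}
```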

If you really want to do this fast, my recommendation right now is to use CephFS as the backend. It requires setting up your own servers, but it has great performance and is well suited to being an IPFS datastore. It's expensive up front, but at scale it's a lot cheaper and faster than S3. I wish someone provided CephFS-as-a-service... Ceph clusters are not trivial to operate in production.

daviddias commented 6 years ago

You can back the IPFS repo with S3 in js-ipfs now; see how at https://github.com/ipfs/js-ipfs/tree/master/examples/custom-ipfs-repo#other-options

bertrandfalguiere commented 4 years ago

There is also a plugin by RTrade that uses their infrastructure but can (and will?) be adapted to use anything else: https://github.com/RTradeLtd/s3x

daviddias commented 4 years ago

Thanks for sharing @bertrandfalguiere!

daviddias commented 4 years ago

And there is also https://github.com/ipfs/go-ds-s3
