ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Option to restrict gateway to only serve locally available / cluster content #5513

Open rotemdan opened 6 years ago

rotemdan commented 6 years ago

I'm investigating the suitability of IPFS as a server side file storage and distribution medium.

I can manage, upload and pin files through the HTTP REST API (:5001). I would like to have the stored files available both through the IPFS network and through HTTP. The gateway seems like an easy, simple solution to provide the files directly through HTTP with good latency to the user (and would possibly be reverse proxied through NGINX and/or a third-party CDN as well).

The only issue is, I couldn't find a way to limit it to only provide locally pinned content. Making a custom intermediate server to filter out requests seems unnecessary and would require maintaining a duplicate (and probably inefficient) index into the IPFS datastore. Since it might serve millions of files, maintaining a gigantic ipns-based ipld document to index the files also seems wasteful and inefficient (and a possible privacy issue if directory content is exposed).

I'm not interested (at this time) in creating a private network, using --offline or using custom bootstrap nodes, since I want the data to be available through the public IPFS network as well.

I believe this "dual-stack" approach might be reasonably classified as "plausible" (given that IPFS matures to the point it provides sufficient value to be used in mainstream projects), so I decided to publish it here as a feature request (in case it is not already available! in that case I'd be really happy to know how to achieve this!)

(Edit: as a natural extension, it would probably also be useful to have an option to only serve content pinned by a cluster of servers -- thus any node in the cluster could act as a restricted IPFS gateway - that only serves content hosted within the cluster itself)

magik6k commented 6 years ago

So I don't think there is an easy way to tell if a block is pinned without loading the whole pinned tree to the memory (which seems to be what GC currently does).

For a simpler solution - we could add an option which makes the gateway not look for blocks in the network and instead only use what is already in the blockstore (so only pinned data and cached content (which can be removed with ipfs repo gc)), essentially making the gateway offline. Would that work for you?

rotemdan commented 6 years ago

For the most part, given that the kind of node I'm describing would be dedicated to only storing content, and not retrieving it from the larger network (possibly aside from replicating other cluster nodes, which I'll describe next), almost all local content would be pinned anyway, so I guess simply checking for local availability, regardless of pin status, could work (if the performance would be significantly better I would probably choose this less restrictive option anyway I guess).

For a cluster, I guess it would mean that the gateway would be effectively "local" in relation to the cluster. I'm not very familiar with IPFS internals but I could imagine that the DHT would be queried in such a way as to constrain the results to "whitelist" only sources originating from within the cluster. In any case, in the vast majority of requests, the hashes would be resolved very quickly, since the nodes would have very good network connectivity to each other, even if geographically disparate. In the rare cases when clients try to "abuse" the gateway by using it as a proxy to the larger IPFS network, the request would simply stall and time out (I'm not sure if there's a DHT timeout setting for this type of query but it could possibly be set reasonably low to mitigate this scenario).

There are some interesting prospects to having something like this. It seems like a relatively simple/cheap way to run a highly-available CDN, where, since each node also acts as gateway, popular content is automatically fetched and cached by other cluster nodes (in addition to client nodes from outside the cluster) (of course all this would only be truly feasible given that the datastore is scalable and performant enough, and the software stable/mature enough etc.)

magik6k commented 6 years ago

For the cluster case - this would probably have to be implemented at bitswap level, where we'd filter from which peers we want to fetch content. We could do that at lower level, but:

rotemdan commented 6 years ago

Thanks for looking at this. I've found an alternative approach that filters URLs using cryptography instead (mainly for the cluster case, since the local-only case is trivial to implement efficiently), though it requires an additional intermediary filtering server (unless IPFS would support it as a part of the CID/URI spec) and has various other limitations:

Instead of links, being, say:

https://my-cdn.com/ipfs/<IPFS-CID>

Have them as

https://my-cdn.com/signed-ipfs/<IPFS-CID>-<HMAC(KEY, IPFS-CID)>

So every request would be required to include a signature that would be verified by an intermediary HTTP server (possibly running on each node), or, as I mentioned, the gateway itself.
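The signed-link scheme above can be sketched in Go roughly as follows. This is only an illustration under assumptions: HMAC-SHA256 with a hex-encoded tag is one possible choice (the comment does not name a hash function or encoding), and the function names, the key, and the example CID are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// signCID returns "<cid>-<tag>", where tag is the hex-encoded
// HMAC-SHA256 of the CID under key. (Hypothetical helper.)
func signCID(key []byte, cid string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(cid))
	return cid + "-" + hex.EncodeToString(mac.Sum(nil))
}

// verifySigned splits the token at its last "-", recomputes the tag
// and compares it in constant time. It returns the CID and true only
// if the tag matches. (Hypothetical helper.)
func verifySigned(key []byte, token string) (string, bool) {
	i := strings.LastIndex(token, "-")
	if i <= 0 {
		return "", false
	}
	cid, tagHex := token[:i], token[i+1:]
	tag, err := hex.DecodeString(tagHex)
	if err != nil {
		return "", false
	}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(cid))
	if !hmac.Equal(tag, mac.Sum(nil)) {
		return "", false
	}
	return cid, true
}

func main() {
	key := []byte("example-secret-key") // assumption: shared by all cluster nodes
	token := signCID(key, "QmExampleCid") // placeholder, not a real CID
	fmt.Println("https://my-cdn.com/signed-ipfs/" + token)

	cid, ok := verifySigned(key, token)
	fmt.Println(cid, ok)
}
```

An intermediary server would call something like verifySigned on the last path segment and forward the request to /ipfs/<IPFS-CID> on the gateway only when it returns true.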

Limitations:

ozars commented 5 years ago

A restricted gateway would be a quite useful feature for mirroring large datasets receiving frequent updates as well.

Is there any endpoint in the API which returns information about whether an object path is pinned (e.g. /api/v0/pin/get)? If so, a reverse proxy to the gateway could be configured to filter requests and accept them only if the requested object is pinned. It would be a lot better if this were implemented in IPFS itself, but this might be a simple workaround until then.
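A rough Go sketch of such a filter, under stated assumptions: it uses the RPC endpoint /api/v0/pin/ls (rather than the /api/v0/pin/get guessed above), called with POST and the arg and type parameters, and it assumes the node answers with a non-200 status when the object is not pinned, so any non-200 response is treated as "not pinned". The helper names and addresses are hypothetical, and only the root CID of the request path is checked.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// cidFromGatewayPath extracts the root CID from "/ipfs/<cid>[/sub/path]".
func cidFromGatewayPath(p string) (string, bool) {
	if !strings.HasPrefix(p, "/ipfs/") {
		return "", false
	}
	cid := strings.SplitN(strings.TrimPrefix(p, "/ipfs/"), "/", 2)[0]
	return cid, cid != ""
}

// pinLsURL builds the pin/ls query for one CID.
func pinLsURL(apiBase, cid string) string {
	q := url.Values{"arg": {cid}, "type": {"all"}}
	return apiBase + "/api/v0/pin/ls?" + q.Encode()
}

// isPinned asks the node's RPC API whether cid is pinned.
// Assumption: a non-200 status means "not pinned" (fail closed).
func isPinned(client *http.Client, apiBase, cid string) (bool, error) {
	resp, err := client.Post(pinLsURL(apiBase, cid), "", nil)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	return resp.StatusCode == http.StatusOK, nil
}

// pinFilter forwards a request to next only if its root CID is pinned.
func pinFilter(client *http.Client, apiBase string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cid, ok := cidFromGatewayPath(r.URL.Path)
		if !ok {
			http.NotFound(w, r)
			return
		}
		pinned, err := isPinned(client, apiBase, cid)
		if err != nil {
			http.Error(w, "pin check failed", http.StatusBadGateway)
			return
		}
		if !pinned {
			http.NotFound(w, r)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	const apiBase = "http://127.0.0.1:5001" // assumption: local API address
	cid, ok := cidFromGatewayPath("/ipfs/QmExampleCid/index.html")
	fmt.Println(cid, ok)
	fmt.Println(pinLsURL(apiBase, cid))
}
```

To serve traffic, pinFilter could wrap a reverse proxy to the gateway (for example one built with httputil.NewSingleHostReverseProxy) and be passed to http.ListenAndServe. Each request costs an extra API round trip, and querying with type=all may be slow on nodes with many pins, so this is a stopgap rather than a replacement for a built-in option.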

Stebalien commented 5 years ago

The next release (which I need to get out the door ASAP...) will have a Gateway.NoFetch option. However, that may not be sufficient for the cluster use-case.

See: https://github.com/ipfs/go-ipfs/pull/5649

kyledrake commented 5 years ago

This is a very good feature to have and I'm glad it's being released soon.

There's going to be a lot of use cases where people want to have an HTTP convenience gateway for their own pinned sites/content, but are unwilling to allow all content from everyone to be served from their HTTP servers as a side consequence.

magik6k commented 5 years ago

Note that a few read-only /api endpoints aren't yet covered by this option - see https://github.com/ipfs/go-ipfs/pull/5649#issuecomment-451337849 for the list

Stebalien commented 5 years ago

@magik6k speaking of which, can you file an issue for that so we don't forget?

magik6k commented 5 years ago

https://github.com/ipfs/go-ipfs/issues/5929

KevinYum commented 2 years ago

Gateway.NoFetch has already been a great move for gateway!

For the cluster use case, how about extending Gateway.NoFetch to something like Gateway.FetchOnlyFromSpecificPeers? My current practice is to use a load balancer in front of the cluster nodes' gateways.

Jorropo commented 2 years ago

@ywk248248 this can be done by setting Routing:none, removing the bootstrap peers and peering with your specific peers.