Make ipfs.dag.export built-in feature of HTTP gateways

lidel commented 4 years ago

tl;dr: this is an attempt to improve how we represent non-unixfs DAGs on the HTTP gateway under /ipfs/ and /ipns/ paths, by doing a CAR export when the data is not raw/unixfs.

Today: only unixfs+raw

Right now the /ipfs/ path on the gateway only supports dag-pb+raw; everything else fails with an unrecognized object type error.

Gateway exposes /api/v0/get so one can read blobs with other DAGs that way, but people rarely use it.

Future: export anything

I believe the HTTP gateways should return every DAG type.

Here is an initial idea: we recently added support for CAR import/export to go-ipfs (https://github.com/ipfs/go-ipfs/issues/6870) – what if we return non-unixfs/raw DAGs as ~~.car~~ .ipfs.dag files?

CAR format makes it easy for people to import/export DAGs
gateway would not need to know how to render specific codec to be useful for data distribution
gateway could return CAR with proper cache control header, namely the immutable hint for everything under /ipfs/
we would set proper content-disposition header to ensure {cid}.ipfs.dag filename and trigger download, so the browser does not try to render the blob
we could support ?download=dag everywhere, so even unixfs DAG could be fetched as CAR from any gateway
- update: ?format=car is more flexible, as it allows for format=tar or json for dag-cbor (https://github.com/ipfs/go-ipfs/pull/8037)
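
The response headers proposed in the list above could be sketched roughly as follows. This is only an illustration: the helper name `car_download_headers`, the generic `application/octet-stream` content type, and the exact `Cache-Control` value are assumptions, not something any gateway is confirmed to send.

```python
def car_download_headers(cid: str, extension: str = "ipfs.dag") -> dict[str, str]:
    """Build headers for a DAG export served under /ipfs/{cid} (hypothetical helper)."""
    return {
        # Generic binary type; an assumption until a CAR-specific type is agreed on.
        "Content-Type": "application/octet-stream",
        # "attachment" asks the browser to download instead of rendering the blob.
        "Content-Disposition": f'attachment; filename="{cid}.{extension}"',
        # Data under /ipfs/ is addressed by hash, so it can be cached as immutable.
        # The max-age value here is an illustrative choice.
        "Cache-Control": "public, max-age=29030400, immutable",
    }


headers = car_download_headers(
    "bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy"
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Mutable /ipns/ paths would need a different cache policy, which this sketch does not cover.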

cc @mikeal @aschmahmann @autonome @Gozala @achingbrain – is this a good idea? any concerns? would PR be accepted?

Future: import DAGs

If we have export, could we also add import? This could be safely enabled on localhost, and people could experiment with this on public gateways (there, it could be guarded by a reverse proxy or some bearer token):

we could improve the concept of a writable gateway to support DAG import via HTTP PUT /ipfs/{cid}
IPNS publishing could be as easy as HTTP PUT /ipns/{libp2p-key}

References

https://docs.ipfs.io/reference/cli/#ipfs-dag-export / https://docs.ipfs.io/reference/cli/#ipfs-dag-import
DAG Import/Export : https://github.com/ipfs/go-ipfs/issues/6870
- .car support pieces left out of 0.5
import/export a DAG from/to a CAR file: https://github.com/ipfs/js-ipfs/issues/2745
?download=true support was recently added in https://github.com/ipfs/go-ipfs/pull/7677
DAR format (streamable + optional index for random access) https://github.com/anorth/go-dar
Example of Content-Location in response to specific Accept in request: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Location#examples

Gozala commented 4 years ago

I like the general idea of allowing data to be fetched and stored without a dedicated HTTP client, which this would enable.

My only (weak) concern would be that this could lead to some unexpected behavior if hitting any non-unixfs path would start a .car download. It might be better to make this more explicit, e.g. via a ?download=car query parameter, or dare I say an Accept header?

Another argument for explicitness is that adding support for another "file"-like codec in the future would then not require a breaking change.

lidel commented 3 years ago

@Gozala good point. I think for non-unixfs DAGs we would return an error by default informing that there is no available preview for CID, but the original DAG can be downloaded if ?download=car or Accept are passed.
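
A rough sketch of that decision, assuming a hypothetical `choose_response` helper. The media type string is the one mentioned as registered at the end of this thread; the substring check on `Accept` is a deliberate simplification, and the error text is made up for illustration.

```python
from urllib.parse import parse_qs

# Media type named later in this thread; treated here as a plain constant.
CAR_TYPE = "application/vnd.ipld.car"


def choose_response(codec: str, query: str, accept: str) -> str:
    """Decide what a gateway would return for a path (illustrative only)."""
    params = parse_qs(query)
    # Explicit opt-in: either the query parameter or the Accept header.
    wants_car = params.get("download") == ["car"] or CAR_TYPE in accept
    if wants_car:
        # Applies to every DAG type, including unixfs.
        return "car"
    if codec in ("dag-pb", "raw"):
        # Existing behavior: render unixfs/raw content as before.
        return "unixfs"
    # No implicit download for other codecs; tell the caller how to opt in.
    return "error: no preview available, retry with ?download=car"


print(choose_response("dag-cbor", "", "text/html"))
print(choose_response("dag-cbor", "download=car", "text/html"))
print(choose_response("dag-pb", "", CAR_TYPE))
```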

Q: any thoughts regarding Content-Type returned with CAR payload? We could return application/octet-stream to indicate binary payload, which will work fine provisionally, but if we want to support Accept we need something specific to CAR that we could add to https://www.iana.org/assignments/media-types/media-types.xhtml at some point in the future.

@ribasushi

does CAR format have mime|media|content-type? If not, any thoughts on what it should look like?
are there any "magic bytes" that can be used by content-type sniffers to identify CAR archive based on its header? adding CAR content-type detection to libraries like file-type (js) and mimetype (go) would improve developer experience

Gozala commented 3 years ago

Q: any thoughts regarding Content-Type returned with CAR payload? We could return application/octet-stream to indicate binary payload, which will work fine provisionally, but if we want to support Accept we need something specific to CAR that we could add to https://www.iana.org/assignments/media-types/media-types.xhtml at some point in the future.

I think using a custom MIME type over application/octet-stream is more useful, as it provides some info beyond it just being binary data. From a browser's standpoint, the primary difference in handling between octet-stream and unknown MIME types was that the latter are sniffed, falling back to the former, while the former is not.

rvagg commented 3 years ago

Comment I made on this elsewhere:

The varint at the front of the CAR format messes this up, so the first byte or two don't conform to a tight pattern.

We did come up with some ideas to get us out of this annoying mess and fix the header in place, but that's a dream for CARv2. For now we're stuck with this design where even the version field comes late in a DAG-CBOR block after a dynamically sized roots field.

This byte pattern is likely to catch most CAR files:

0x??a265726f6f7473

Assuming ?? will fit a varint describing the length of the header, which may break down for a very large CID or multiple CIDs (not supported by go-car though, but supported by JS). That string is basically this: "single byte varint + a CBOR encoded { roots:...". Going beyond that makes more assumptions that start to break down.

A common case will be:

0x3aa265726f6f7473 as a prefix, which will be the case of a single CID with a 256 bit hash and a low multihash number like sha2-256.

but Filecoin uses blake2b-256, which has a higher multihash number, so its CIDs are longer. So they’re going to start with:

0x3ca265726f6f7473 simply because Filecoin CIDs are 2 bytes longer… yay.

So, no “magic number” sadly, we’re lumped with this design for now. But you could approximate if you were keen.
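
The approximation described above could look something like this in Python. It is a heuristic sketch only: `looks_like_car` is a hypothetical name, and the check assumes a header length that fits in a single varint byte, so it will miss CARs with very large or multiple roots.

```python
# CBOR bytes for the start of the header map: map(2), text(5), "roots".
ROOTS_MARKER = bytes.fromhex("a265726f6f7473")


def looks_like_car(data: bytes) -> bool:
    """Guess whether data starts like a CAR file (heuristic, not a guarantee)."""
    if len(data) < 1 + len(ROOTS_MARKER):
        return False
    # A single-byte varint has its high bit clear.
    single_byte_varint = data[0] < 0x80
    return single_byte_varint and data[1 : 1 + len(ROOTS_MARKER)] == ROOTS_MARKER


# Prefixes quoted in the comment above, plus an unrelated PNG signature.
print(looks_like_car(bytes.fromhex("3aa265726f6f7473")))
print(looks_like_car(bytes.fromhex("3ca265726f6f7473")))
print(looks_like_car(b"\x89PNG\r\n\x1a\n"))
```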

achingbrain commented 3 years ago

Part of me would like to see the ipld explorer on the gateway, then a download link/querystring param for either file data or a car file.

That'd give a preview of any DAG type and discourage people from using it as a cheap CDN.

lidel commented 3 years ago

Somewhat related: CAR export at gateways would provide means for Verifiable HTTP Gateway Responses (https://github.com/ipfs/in-web-browsers/issues/128) without the need for exposing DAG metadata as custom HTTP headers.

Exporting the full DAG may not always be the best option performance-wise: in the future we may want to be able to control the depth of a CAR export to facilitate parallel streaming from multiple gateways and/or seeking within media files.

@rvagg are there prior issues/plans about controlling the depth of CAR export?

ec1oud commented 3 years ago

I would just like to have dag get -f cbor working ( ipfs/go-ipfs#4313 )

dag export exports only car format, and dag get exports only json. Either way, the ipfs daemon (or the ipfs embedded implementation) is spending CPU time to convert from native cbor to something else, but cbor itself would be fine as an export format, I think.

warpfork commented 3 years ago

This is very similar to something now also independently conceptualized and proposed as a project in another planning&tracking repo: https://github.com/protocol/web3-dev-team/pull/1

(I'm not sure if either is a superset of the information contained in the other yet, so this isn't a suggestion to close, but they certainly seem related.)

lidel commented 3 years ago

Updated thoughts after recent discussions from https://github.com/ipfs/in-web-browsers/issues/182#issuecomment-814446591 and forward.

I agree with @ec1oud that dag-cbor should be supported natively. We track that in https://github.com/ipfs/in-web-browsers/issues/182
CAR format may change, and I would not like to ossify current version into long-term gateway semantics.
- Perhaps Gateway should avoid talking about specific CAR version, and simply say that it returns a stream with an opaque blob archive compatible with ipfs dag import|export ?
- If go-ipfs commits to guarantee that import/export always work with old formats, this should be enough for most use cases, without being too explicit about the format itself
- Requesting unsupported DAG type would return same bytes as ipfs dag export
- DAG export could be triggered on every path by passing ?download=dag (?download=true already exists and sets appropriate content-disposition header, this would simply change the output format before streaming the response)
- Use the default content-type for binary files (application/octet-stream) and set the filename in content-disposition to {cid}.ipld.dag or {cid}.ipfs.dag

I feel this is a safer way to support DAG export, but lmk if there are any concerns.

anorth commented 3 years ago

Perhaps Gateway should avoid talking about specific CAR version, and simply say that it returns a stream with an opaque blob archive compatible with ipfs dag import|export ?

I think we should specify some format into the gateway semantics, and I think it should be decoupled from ipfs dag (though may coincidentally match initially).

I think maybe we have different use cases in mind. My read of your use case is that someone would fetch a DAG from a gateway and then import it into an ipfs binary for whatever subsequent use.

My perspective is that someone is using the gateway because they're not using an ipfs node. Why wouldn't a node operator just use the IPFS network? No, the gateway is bridging the IPFS network to the older/limited world: Web2 apps, mobile devices, low-power devices etc. They shouldn't need to depend on IPFS code at all, only on IPLD (and maybe IPLD-ish repos that for legacy reasons are in the ipfs GitHub org). So it's no good specifying in terms of what ipfs dag import can or will support.

My canonical use case (hypothetical, because currently impossible) is a Web3 dapp that reads IPLD-native data from the IPFS network, via a gateway for the speed, reliability, pinning etc needs that the specific app has. The envelope and format for the data it fetches can be pretty thin, but needs to be well-defined. The body of a DAR is just one or more DAGs, with the nodes in a well-defined order, deduplicated etc. The index is optional, and probably unnecessary here. We could adjust that body format to suit needs here – I designed it with this use case specifically in mind, so if I was off target I'd want to change anyway.

lidel commented 3 years ago

Why wouldn't a node operator just use the IPFS network? [..] They shouldn't need to depend on IPFS code at all, only on IPLD [..]

My pet use case is thin clients like mobile web browsers and IoT devices using gateway as energy-efficient alternative to p2p transports (which would still be used as fallback, to conserve energy).

Browsers want to support websites and assets loaded over ipfs:// which is a bit more involved than IPLD data structures, and the usual feedback from mobile browsers is that they would like to ensure content integrity (#128), but can't run full libp2p stack due to battery constraints.

Received similar comments about IoT using content-addressing for fetching firmware updates in a trustless manner, without draining battery for usual libp2p transports.

I think we should specify some format into the gateway semantics, and I think it should be decoupled from ipfs dag (though may coincidentally match initially).

If we want to specify format, then we should plan for both import and export from the very start. When we improve the concept of a writable gateway, it could support DAG import via something like HTTP PUT /ipfs/{cid}

Decoupling should be fine, as long as we ensure that ipfs dag import supports gateway responses, and that a writable gateway can process an archive produced by ipfs dag export.

Do you feel this is blocked on DAR replacing CAR, or can we move forward with CAR for now? I'd like to open a PR with spec draft in ipfs/spec for how DAG import/export should work on Gateways, but feel we need to answer the format question before that.

anorth commented 3 years ago

Great, I agree with your points. No, please don't interpret my statements as anything blocking moving forward; but as considerations and options with which you may choose to move forwards, or not. I do think there are significant advantages to the more tightly-defined DAR format over CAR, for these use cases, but I'll be satisfied with a story about how we could upgrade to this in the future (e.g. with content-type headers).

I do still think that the behaviour should be fairly defined, so perhaps would tweak your suggestions to make that more explicit, e.g. with ?download=car and cid.car filename.

raulk commented 3 years ago

I think for non-unixfs DAGs we would return an error by default informing that there is no available preview for CID, but the original DAG can be downloaded if ?download=car or Accept are passed.

This makes sense. If we're providing this option, it should be consistent and predictable for all URLs. So even if you hit a UnixFS node with ?download=car or the right content type in Accept:, I would expect this to result in a CAR download.

Browser detection through user agent is also worth considering, although controversial and potentially too much magic. If what's hitting you is a browser, and you know you can't do anything useful, the gateway could force the CAR download, whether or not the explicit parameter is passed.

raulk commented 3 years ago

I do think there are significant advantages to the more tightly-defined DAR format over CAR, for these use cases

@anorth Mind elaborating on which parts of DAR you consider to be advantageous over CAR with a depth-first, deduplicated traversal logic?

I would imagine the deterministic depth-first nature, and the space efficiency through CID elision, provided that it doesn't make us lose CID roundtripping, which I suspect it does in its current form.

I do think that a DAR is costlier to generate on the server side in terms of memory footprint due to the substitution of CID links for absolute stream offsets, which requires server-side tracking, and does open up security risks IMO if used to implement this use case.

anorth commented 3 years ago

I think the advantages are primarily the tight definition of the format. If some service producing a CAR also enforced depth-first logic and no duplicates as additional semantics, that would cover much of the benefit. A consumer could then rely on and validate that property. The stronger guarantees can be exploited by clients.

This allows, for example, streaming construction of application data. E.g. a web app that builds a data model as a projection from IPLD could build that model incrementally, while validating CIDs, without ever writing blocks to a blockstore. As a direct example, an application could reconstruct a unixfs file's raw data in a streaming fashion. (Hmm, except for blocks referred to twice. Maybe there are cases where deduplication is unwanted?).

Another, less important, advantage of DAR is that it's a deterministic representation given the ordering of roots. And then there's the space efficiency improvement, which is valuable with small node sizes (which we might see in application data, more than unixfs), even after we add a couple of bytes to fix up the CID roundtripping.

It's not CIDs that are substituted, but duplicated block bodies. You're right that the deduplication does require keeping a CID->offset map on the server. Code I've observed that generates CARs does this explicitly from the outside (without the offset). I imagine that if the gateway built CARs on the side it would do the same thing, so I don't think DAR is meaningfully costlier. One could view the deduplication as best-effort: I think it would be fine in principle to cap the size of that map at some large but finite number. But a client fetching a DAG with hundreds of thousands of nodes should probably paginate somehow anyway. A gateway could limit the number of nodes or bytes it will serve in one archive: DAR is resumable in that the stack of CIDs observed as links but not yet received as blocks provides the roots for the remainder of the DAG.
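
To make the ordering, deduplication and resumption ideas concrete, here is a small sketch over an in-memory link table. It is not the DAR or CAR encoding: `export_depth_first`, the `links` dictionary and the `max_blocks` cap are all hypothetical, and it only computes the order in which blocks would be sent.

```python
def export_depth_first(
    links: dict[str, list[str]], roots: list[str], max_blocks: int
) -> tuple[list[str], list[str]]:
    """Return (emitted CIDs in depth-first order, CIDs to use as roots when resuming)."""
    emitted: list[str] = []
    seen: set[str] = set()
    # Reverse so the first root is popped first.
    stack = list(reversed(roots))
    while stack and len(emitted) < max_blocks:
        cid = stack.pop()
        if cid in seen:
            # Deduplication: a block already sent is never sent again.
            continue
        seen.add(cid)
        emitted.append(cid)
        # Push children reversed so they are visited in link order.
        stack.extend(reversed(links.get(cid, [])))
    # Links seen but not yet sent become the roots of the next archive.
    pending = [cid for cid in reversed(stack) if cid not in seen]
    return emitted, list(dict.fromkeys(pending))


# Diamond-shaped DAG: "c" is linked from both "a" and "b".
dag = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
print(export_depth_first(dag, ["root"], max_blocks=10))
first, rest = export_depth_first(dag, ["root"], max_blocks=2)
print(first, rest)
print(export_depth_first(dag, rest, max_blocks=10))
```

Note that a resumed call starts with an empty `seen` set, so in general it may resend blocks that were part of an earlier archive; a client would have to tolerate or filter those.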

olizilla commented 3 years ago

https://github.com/ipfs/go-ipfs/pull/8111 landed to open up the /api/v0/dag/export endpoint as part of the gateway api so we can start experimenting with light clients that fetch and verify data over http. It'll be available on https://ipfs.io once go-ipfs 0.9.0 is ready for testing.

Gozala commented 3 years ago

Browser detection through user agent is also worth considering, although controversial and potentially too much magic. If what's hitting you is a browser, and you know you can't do anything useful, the gateway could force the CAR download, whether or not the explicit parameter is passed.

Browsers tend to send Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 headers (when a page is navigated to), which is a pretty good indication that they want an HTML view. That is why I think the gateway should render something like explore.ipld.io when the Accept header requests HTML / XHTML.
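
A minimal sketch of how a server could read that signal, ranking the media types in an Accept header by their q-values. The function names `parse_accept` and `wants_html` are hypothetical, and the parser ignores parameters other than `q`.

```python
def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media type, q) pairs, highest q first."""
    ranked = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not wanted"
        ranked.append((media_type, q))
    # The sort is stable, so entries with equal q keep their header order.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


def wants_html(header: str) -> bool:
    """True when the most preferred media type is HTML or XHTML."""
    ranked = parse_accept(header)
    return bool(ranked) and ranked[0][0] in ("text/html", "application/xhtml+xml")


browser = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
print(wants_html(browser))
print(wants_html("application/vnd.ipld.car"))
```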

olizilla commented 3 years ago

/api/v0/dag/export is exposed on https://ipfs.io via nginx config. dweb.link will support it when go-ipfs 0.9.0 is deployed. You can now use ipfs-get to fetch a CAR over http and verify the blocks before writing to disk.

npx ipfs-get bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy --output room-guardian.jpg

If you prefer to curl it yourself, then you can pipe it to ipfs-car to verify and unpack it

curl -X POST "https://ipfs.io/api/v0/dag/export?arg=bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy" \
| npx ipfs-car -o room-guardian.jpg

or import it to your local ipfs with

curl -X POST "https://ipfs.io/api/v0/dag/export?arg=bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy" \
| ipfs dag import

lidel commented 2 years ago

The CAR response format for /ipfs/{cid} is ready for review: https://github.com/ipfs/go-ipfs/pull/8758

lidel commented 2 years ago

Done:

go-ipfs 0.13+ shipped with https://github.com/ipfs/go-ipfs/pull/8758
application/vnd.ipld.car and application/vnd.ipld.raw response types are registered at IANA and documented in HTTP Gateway specs (https://github.com/ipfs/specs/pull/283)
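
For illustration, a client request for the CAR response type could be prepared like this. The sketch only builds the request and does not send it; `car_request` is a hypothetical helper, and sending both `?format=car` and the `Accept` header is an assumption made for convenience, since either one is meant to select the format.

```python
from urllib.request import Request


def car_request(gateway: str, cid: str) -> Request:
    """Prepare a GET for /ipfs/{cid} that asks for a CAR response (not sent here)."""
    url = f"{gateway.rstrip('/')}/ipfs/{cid}?format=car"
    return Request(
        url,
        headers={"Accept": "application/vnd.ipld.car"},
        method="GET",
    )


req = car_request(
    "https://ipfs.io",
    "bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy",
)
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the fetch; the returned bytes should still be verified against the CID before use.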

Future work will happen as PRs against Gateway specs (https://github.com/ipfs/specs/pull/283):

leveraging HTTP Caching when CAR response is deterministic
IPLD selector support for fetching a subset of a DAG – https://github.com/ipfs/go-ipfs/issues/8769
Writable gateways will be proposed by Agregore team, we will include CAR import there
