Closed lidel closed 2 years ago
I like the general idea of being able to get / put data without a dedicated HTTP client, which this would enable.
My only (weak) concern is that this could lead to unexpected behavior if hitting any non-unixfs path started a .car download. It might be better to make this more explicit, e.g. via a ?download=car query parameter or, dare I say, an Accept header?
Another argument for explicitness: if we add support for another "file"-like codec in the future, that would not require a breaking change.
@Gozala good point. I think for non-unixfs DAGs we would return an error by default informing that there is no available preview for CID, but the original DAG can be downloaded if ?download=car or Accept are passed.
Q: any thoughts regarding Content-Type returned with CAR payload? We could return application/octet-stream to indicate binary payload, will work fine provisionally, but if we want to support Accept we need something specific to CAR that we could add to https://www.iana.org/assignments/media-types/media-types.xhtml at some point in the future.
@ribasushi
Q: any thoughts regarding Content-Type returned with CAR payload? We could return application/octet-stream to indicate binary payload, will work fine provisionally, but if we want to support Accept we need something specific to CAR that we could add to https://www.iana.org/assignments/media-types/media-types.xhtml at some point in the future.
I think using a custom MIME type over application/octet-stream is more useful, as it provides some info beyond it just being binary data. From the browser's standpoint, the primary difference in handling between octet-stream and unknown MIME types was that the latter is sniffed, falling back to the former, while the former is not.
Comment I made on this elsewhere:
The varint at the front of the CAR format messes this up, so the first byte or two don't conform to a tight pattern.
We did come up with some ideas to get us out of this annoying mess and fix the header in place, but that's a dream for CARv2. For now we're stuck with this design where even the version field comes late in a DAG-CBOR block after a dynamically sized roots field.
This byte pattern is likely to catch most CAR files:
0x??a265726f6f7473
Assuming ?? will fit a varint describing the length of the header, which may break down for a very large CID or multiple CIDs (not supported by go-car though, but supported by JS). That string is basically this: "single byte varint + a CBOR encoded { roots: ...". Going beyond that makes more assumptions that start to break down.
A common case will be:
0x3aa265726f6f7473
as a prefix, which will be the case of a single CID with a 256 bit hash and a low multihash number like sha2-256.
But Filecoin uses blake2b-256, which has a higher multihash number, so its CIDs are longer. So they’re going to start with:
0x3ca265726f6f7473
simply because Filecoin CIDs are 2 bytes longer… yay.
So, no “magic number” sadly, we’re lumped with this design for now. But you could approximate if you were keen.
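The approximation described above can be sketched in a few lines of Python. This is only an illustration of the byte pattern from this comment, not a real CAR parser: the function name `looks_like_car_v1` is made up for the example, and it deliberately handles only the single-byte varint case, so headers with very large or multiple roots would be missed.

```python
# CBOR for the start of the header map: a2 = map with 2 entries,
# 65 = text string of length 5, followed by the ASCII bytes of "roots".
ROOTS_PREFIX = bytes.fromhex("a265726f6f7473")


def looks_like_car_v1(head: bytes) -> bool:
    """Heuristic check of the first bytes of a stream (hypothetical helper).

    Assumes the header length fits in a single-byte varint, as in the
    comment above. A match is a hint, not proof, that the data is a CAR.
    """
    if len(head) < 1 + len(ROOTS_PREFIX):
        return False
    # A set high bit means the varint continues into the next byte,
    # which this sketch intentionally does not handle.
    if head[0] & 0x80:
        return False
    return head[1:1 + len(ROOTS_PREFIX)] == ROOTS_PREFIX


# The two common prefixes mentioned above both match.
print(looks_like_car_v1(bytes.fromhex("3aa265726f6f7473")))
print(looks_like_car_v1(bytes.fromhex("3ca265726f6f7473")))
```

A stream that starts with something unrelated, such as a PNG signature, does not match, but unrelated binary data could still collide with the pattern by chance.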
Part of me would like to see the ipld explorer on the gateway, then a download link/querystring param for either file data or a car file.
That'd give a preview of any DAG type and discourage people from using it as a cheap CDN.
Somewhat related: CAR export at gateways would provide means for Verifiable HTTP Gateway Responses (https://github.com/ipfs/in-web-browsers/issues/128) without the need for exposing DAG metadata as custom HTTP headers.
Exporting the full DAG may not always be the best option performance-wise: in the future we may want to be able to control the depth of CAR export to facilitate parallel streaming from multiple gateways and/or seeking within media files.
@rvagg are there prior issues/plans about controlling the depth of CAR export?
I would just like to have dag get -f cbor working ( ipfs/go-ipfs#4313 )
dag export exports only car format, and dag get exports only json. Either way, the ipfs daemon (or the ipfs embedded implementation) is spending CPU time to convert from native cbor to something else, but cbor itself would be fine as an export format, I think.
This is very similar to something now also independently conceptualized and proposed as a project in another planning&tracking repo: https://github.com/protocol/web3-dev-team/pull/1
(I'm not sure if either is a superset of the information contained in the other yet, so this isn't a suggestion to close, but they certainly seem related.)
Updated thoughts after recent discussions from https://github.com/ipfs/in-web-browsers/issues/182#issuecomment-814446591 and forward.
ipfs dag import|export
? ipfs dag export
?download=dag (?download=true already exists and sets appropriate content-disposition header, this would simply change the output format before streaming the response)
content-type for binary files (application/octet-stream) and set the filename in content-disposition to {cid}.ipld.dag or {cid}.ipfs.dag
I feel this is a safer way to support DAG export, but lmk if there are any concerns.
Perhaps Gateway should avoid talking about specific CAR version, and simply say that it returns a stream with an opaque blob archive compatible with ipfs dag import|export?
I think we should specify some format into the gateway semantics, and I think it should be decoupled from ipfs dag (though may coincidentally match initially).
I think maybe we have different use cases in mind. My read of your use case is that someone would fetch a DAG from a gateway and then import it into an ipfs binary for whatever subsequent use.
My perspective is that someone is using the gateway because they're not using an ipfs node. Why wouldn't a node operator just use the IPFS network? No, the gateway is bridging the IPFS network to the older/limited world: Web2 apps, mobile devices, low-power devices etc. They shouldn't need to depend on IPFS code at all, only on IPLD (and maybe IPLD-ish repos that for legacy reasons are in the ipfs GitHub org). So it's no good specifying in terms of what ipfs dag import can or will support.
My canonical use case (hypothetical, because currently impossible) is a Web3 dapp that reads IPLD-native data from the IPFS network, via a gateway for the speed, reliability, pinning etc needs that the specific app has. The envelope and format for the data it fetches can be pretty thin, but needs to be well-defined. The body of a DAR is just one or more DAGs, with the nodes in a well-defined order, deduplicated etc. The index is optional, and probably unnecessary here. We could adjust that body format to suit needs here – I designed it with this use case specifically in mind, so if I was off target I'd want to change anyway.
Why wouldn't a node operator just use the IPFS network? [..] They shouldn't need to depend on IPFS code at all, only on IPLD [..]
My pet use case is thin clients like mobile web browsers and IoT devices using gateway as energy-efficient alternative to p2p transports (which would still be used as fallback, to conserve energy).
Browsers want to support websites and assets loaded over ipfs:// which is a bit more involved than IPLD data structures, and the usual feedback from mobile browsers is that they would like to ensure content integrity (#128), but can't run full libp2p stack due to battery constraints.
Received similar comments about IoT using content-addressing for fetching firmware updates in a trustless manner, without draining battery for usual libp2p transports.
I think we should specify some format into the gateway semantics, and I think it should be decoupled from ipfs dag (though may coincidentally match initially).
If we want to specify format, then we should plan for both import and export from the very start.
When we improve the concept of a writable gateway, it could support DAG import via something like HTTP PUT /ipfs/{cid}
Decoupling should be fine, as long as we ensure that ipfs dag import supports gateway responses, and that a writable gateway can process an archive produced by ipfs dag export.
Do you feel this is blocked on DAR replacing CAR, or can we move forward with CAR for now?
I'd like to open a PR with spec draft in ipfs/spec for how DAG import/export should work on Gateways, but feel we need to answer the format question before that.
Great, I agree with your points. No, please don't interpret my statements as anything blocking moving forward; but as considerations and options with which you may choose to move forwards, or not. I do think there are significant advantages to the more tightly-defined DAR format over CAR, for these use cases, but I'll be satisfied with a story about how we could upgrade to this in the future (e.g. with content-type headers).
I do still think that the behaviour should be fairly defined, so perhaps would tweak your suggestions to make that more explicit, e.g. with ?download=car and cid.car filename.
I think for non-unixfs DAGs we would return an error by default informing that there is no available preview for CID, but the original DAG can be downloaded if ?download=car or Accept are passed.
This makes sense. If we're providing this option, it should be consistent and predictable for all URLs. So even if you hit a UnixFS node with ?download=car or the right content type in Accept:, I would expect this to result in a CAR download.
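The rule "explicit opt-in, same behaviour for every URL" could look roughly like the Python sketch below. It is an illustration only: `wants_car` is a hypothetical helper, and `CAR_MEDIA_TYPE` is a placeholder value, since no CAR-specific media type has been settled in this thread.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder media type, assumed for illustration only.
CAR_MEDIA_TYPE = "application/vnd.ipld.car"


def wants_car(url: str, accept: str = "") -> bool:
    """Return True when the request explicitly opts in to a CAR response.

    The decision depends only on the ?download=car parameter or the
    Accept header, never on the type of DAG behind the path.
    """
    query = parse_qs(urlsplit(url).query)
    if "car" in query.get("download", []):
        return True
    # Compare media types only, ignoring parameters such as q-values.
    media_types = [part.split(";")[0].strip().lower() for part in accept.split(",")]
    return CAR_MEDIA_TYPE in media_types


print(wants_car("https://gateway.example/ipfs/bafy?download=car"))
print(wants_car("https://gateway.example/ipfs/bafy", accept=CAR_MEDIA_TYPE))
print(wants_car("https://gateway.example/ipfs/bafy", accept="text/html"))
```

Keeping the check independent of the codec means a UnixFS path and a dag-cbor path behave the same way when the opt-in is present.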
Browser detection through user agent is also worth considering, although controversial and potentially too much magic. If what's hitting you is a browser, and you know you can't do anything useful, the gateway could force the CAR download, whether or not the explicit parameter is passed.
I do think there are significant advantages to the more tightly-defined DAR format over CAR, for these use cases
@anorth Mind elaborating on which parts of DAR you consider to be advantageous over CAR with a depth-first, deduplicated traversal logic?
I would imagine the deterministic depth-first nature, and the space efficiency through CID elision, provided that it doesn't make us lose CID roundtripping, which I suspect it does in its current form.
I do think that a DAR is costlier to generate on the server side in terms of memory footprint due to the substitution of CIDs links for absolute stream offsets, which requires server-side tracking, and does open up security risks IMO if used to implement this use case.
I think the advantages are primarily the tight definition of the format. If some service producing a CAR also enforced depth-first logic and no duplicates as additional semantics, that would cover much of the benefit. A consumer could then rely on and validate that property. The stronger guarantees can be exploited by clients.
This allows, for example, streaming construction of application data. E.g. a web app that builds a data model as a projection from IPLD could build that model incrementally, while validating CIDs, without ever writing blocks to a blockstore. A direct example, an application could reconstruct a unixfs file raw data in a streaming fashion. (Hmm, except for blocks referred to twice. Maybe there are cases where deduplication is unwanted?).
Another, less important, advantage of DAR is that it's a deterministic representation given the ordering of roots. And then there's the space efficiency improvement, which is valuable with small node sizes (which we might see in application data, more than unixfs), even after we add a couple of bytes to fix up the CID roundtripping.
It's not CIDs that are substituted, but duplicated block bodies. You're right that the deduplication does require keeping a CID->offset map on the server. Code I've observed that generates CARs does this explicitly from the outside (without the offset). I imagine that if the gateway built CARs on the side it would do the same thing, so I don't think DAR is meaningfully costlier. One could view the deduplication as best-effort: I think it would be fine in principle to cap the size of that map at some large but finite number. But a client fetching a DAG with hundreds of thousands of nodes should probably paginate somehow anyway. A gateway could limit the number of nodes or bytes it will serve in one archive: DAR is resumable in that the stack of CIDs observed as links but not yet received as blocks provides the roots for the remainder of the DAG.
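To make the resumability argument concrete, here is a toy Python sketch of a depth-first, deduplicated export with a node limit. It is not DAR or CAR code: the DAG is modelled as a plain dict from CID strings to lists of linked CIDs, and `export_page` is a name invented for this example.

```python
def export_page(dag, roots, limit):
    """Emit up to `limit` blocks depth-first, each CID at most once.

    Returns (emitted, remaining). `remaining` lists the CIDs that were
    seen as links but not yet emitted, in the order they should be used
    as roots to continue the export.
    """
    seen = set()
    emitted = []
    stack = list(reversed(roots))  # top of the stack is the first root
    while stack and len(emitted) < limit:
        cid = stack.pop()
        if cid in seen:
            continue  # deduplicate: a block body is sent only once
        seen.add(cid)
        emitted.append(cid)
        # Push links in reverse so the first link is visited first.
        stack.extend(reversed(dag[cid]))
    remaining = []
    for cid in reversed(stack):
        if cid not in seen and cid not in remaining:
            remaining.append(cid)
    return emitted, remaining


# Toy DAG: "a" links to "b" and "c", which both link to "d".
dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
first, rest = export_page(dag, ["a"], limit=2)
second, _ = export_page(dag, rest, limit=10)
print(first, rest, second)
```

In this toy the `seen` set is per request, so a resumed page could repeat a block already sent in an earlier page if a later node links back to it; a client would have to tolerate or skip such repeats.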
https://github.com/ipfs/go-ipfs/pull/8111 landed to open up the /api/v0/dag/export endpoint as part of the gateway api so we can start experimenting with light clients that fetch and verify data over http. It'll be available on https://ipfs.io once go-ipfs 0.9.0 is ready for testing.
Browser detection through user agent is also worth considering, although controversial and potentially too much magic. If what's hitting you is a browser, and you know you can't do anything useful, the gateway could force the CAR download, whether or not the explicit parameter is passed.
Browsers tend to send Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 headers (when a page is navigated to), which is a pretty good indication that they want an HTML view. That is why I think the gateway should render something like explore.ipld.io when the Accept header requests html / xhtml.
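A rough Python sketch of that check is below. It assumes a simplified reading of the Accept header (comma-separated media types with an optional q parameter) and is not a complete implementation of HTTP content negotiation; `accepted_types` and `prefers_html` are hypothetical helper names.

```python
def accepted_types(accept: str) -> dict:
    """Parse an Accept header into {media_type: q}, defaulting q to 1.0."""
    result = {}
    for part in accept.split(","):
        fields = [field.strip() for field in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        result[media] = q
    return result


def prefers_html(accept: str) -> bool:
    """True when html or xhtml is listed and nothing else is ranked higher."""
    types = accepted_types(accept)
    html_q = max(types.get("text/html", 0.0),
                 types.get("application/xhtml+xml", 0.0))
    return html_q > 0 and html_q >= max(types.values())


browser = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
print(prefers_html(browser))
print(prefers_html("*/*"))
```

A bare `*/*`, which command-line clients commonly send, does not count as asking for HTML here, so such clients would not be sent to an explorer view.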
/api/v0/dag/export is exposed on https://ipfs.io via nginx config. dweb.link will support it when go-ipfs 0.9.0 is deployed. You can now use ipfs-get to fetch a CAR over http and verify the blocks before writing to disk.
npx ipfs-get bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy --output room-guardian.jpg
If you prefer to curl it yourself, then you can pipe it to ipfs-car to verify and unpack it
curl -X POST "https://ipfs.io/api/v0/dag/export?arg=bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy" \
| npx ipfs-car -o room-guardian.jpg
or import it to your local ipfs with
curl -X POST "https://ipfs.io/api/v0/dag/export?arg=bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy" \
| ipfs dag import
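For clients that cannot shell out to curl, the same request could be built with the Python standard library, as in the sketch below. It only mirrors the curl call above (a POST to /api/v0/dag/export with the CID in `arg`); the helper names and the default output path are made up, and unlike ipfs-get or ipfs-car it does not verify the blocks it downloads.

```python
import shutil
import urllib.request
from urllib.parse import urlencode


def build_export_request(cid: str, gateway: str = "https://ipfs.io") -> urllib.request.Request:
    """Build the POST request used by the curl examples above."""
    url = f"{gateway}/api/v0/dag/export?{urlencode({'arg': cid})}"
    return urllib.request.Request(url, method="POST")


def fetch_car(cid: str, out_path: str = "export.car", gateway: str = "https://ipfs.io") -> None:
    """Stream the raw CAR bytes to disk WITHOUT verifying them."""
    request = build_export_request(cid, gateway)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        shutil.copyfileobj(response, out)


req = build_export_request("bafkreigh2akiscaildcqabsyg3dfr6chu3fgpregiymsck7e7aqa4s52zy")
print(req.get_method(), req.full_url)
```

The resulting file would still need to be checked by something that understands CAR and CIDs before its contents are trusted.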
The CAR response format for /ipfs/{cid} is ready for review: https://github.com/ipfs/go-ipfs/pull/8758
Done:
Future work will happen as PR against Gateway specs (https://github.com/ipfs/specs/pull/283):
Today: only unixfs+raw
Right now the /ipfs/ path on gateway only supports dag-pb + raw, everything else fails with unrecognized object type error:
Gateway exposes /api/v0/get so one can read blobs with other DAGs that way, but people rarely use it.
Future: export anything
I believe the HTTP gateways should return every DAG type.
Here is initial idea: we recently added support for CAR import/export to go-ipfs (https://github.com/ipfs/go-ipfs/issues/6870) – what if we return non-unixfs/raw DAGs as .car .ipfs.dag files?
immutable hint for everything under /ipfs/
{cid}.ipfs.dag filename and trigger download, so the browser does not try to render the blob
?download=dag everywhere, so even unixfs DAG could be fetched as CAR from any gateway
?format=car is more flexible, as it allows for format=tar or json for dag-cbor (https://github.com/ipfs/go-ipfs/pull/8037)
cc @mikeal @aschmahmann @autonome @Gozala @achingbrain – is this a good idea? any concerns? would PR be accepted?
Future: import DAGs
If we have export.. could we also add import? This could be safely enabled on localhost, and people could experiment with this on public gateways (there, it could be guarded by reverse proxy or some bearer token):
HTTP PUT /ipfs/{cid}
HTTP PUT /ipns/{libp2p-key}
References
?download=true support was recently added in https://github.com/ipfs/go-ipfs/pull/7677
Content-Location in response to specific Accept in request: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Location#examples