feat: customizable content type parsing in @helia/verified-fetch

SgtPooki commented 9 months ago

Discussed with @achingbrain due to https://github.com/ipfs/helia/pull/416

Goals

Keep bundle size small
Provide some content-type recognition for VERY COMMON use-cases (see below)
Allow overriding of content-type parsing for more complicated consumer scenarios

Initial design idea

Remove dependency on mime-types, don't depend on file-type

Expose some interface such as createVerifiedFetch(helia, { contentTypeParser: (bytes) => myFn(bytes) }), and provide a default contentTypeParser that determines the content type for the below list only.

We would pass the contentTypeParser function the first block of bytes we receive. Because most of our blocks are 1MB or smaller, we can safely assume that the majority of content types users need to recognize can be determined by looking at that first block.

If the content type is not a recognized type from the below list, we do not set the header (this allows browser sniffing).
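A minimal sketch of what such a default parser could look like, assuming a hypothetical ContentTypeParser signature that takes the first block and returns a MIME string or undefined. The byte signatures come from the list of file signatures referenced in this issue; the names defaultContentTypeParser and SIGNATURES, and the choice of application/x-tar for tar archives, are illustrative assumptions and not part of the library.

```typescript
// Hypothetical signature for the proposed option; not a confirmed library API.
type ContentTypeParser = (bytes: Uint8Array) => string | undefined

// Magic-byte signatures for a deliberately tiny set of types.
const SIGNATURES: Array<{ mime: string, offset: number, magic: number[] }> = [
  // JPEG files start with FF D8 FF.
  { mime: 'image/jpeg', offset: 0, magic: [0xff, 0xd8, 0xff] },
  // PNG files start with an 8-byte signature.
  { mime: 'image/png', offset: 0, magic: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  // POSIX tar archives carry the ASCII string "ustar" at byte offset 257.
  // The MIME name used here is an assumption.
  { mime: 'application/x-tar', offset: 257, magic: [0x75, 0x73, 0x74, 0x61, 0x72] }
]

// Returns a MIME type for recognized signatures, otherwise undefined so the
// caller can leave the content-type header unset.
const defaultContentTypeParser: ContentTypeParser = (bytes) => {
  for (const { mime, offset, magic } of SIGNATURES) {
    if (bytes.length < offset + magic.length) {
      continue // not enough bytes to hold this signature
    }
    if (magic.every((byte, i) => bytes[offset + i] === byte)) {
      return mime
    }
  }
  return undefined
}
```

Anything not matched falls through to undefined, which corresponds to the "do not set it" case above. Video container formats are omitted because their signatures vary by container.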

Supported content types

image/jpeg
image/png
video
tar
[TODO: which types should we support by default]?

References

https://en.wikipedia.org/wiki/List_of_file_signatures

cc @achingbrain @aschmahmann @lidel @2color

SgtPooki commented 9 months ago

FYI, file-type is fairly small at only 26.7kb, compared to the entirety of @helia/verified-fetch (which currently totals 560.2kb).

lidel commented 9 months ago

Providing a way to pass a custom content type sniffer sounds sensible, but it will be a very niche feature request if your default is something comprehensive like file-type with magic bytes sniffing.

I think the question we could ask is when is content-type relevant:

If we use verified-fetch in JS the same way as fetch, the content-type header won't matter. End user will use .json(), .text(), .blob() etc themselves.
If we use verified-fetch in service worker for web gateway implementation, then we pass response to browser renderer directly, and returned content-type matters. In this case hard-coding a few content types won't be enough anyway, and the user wants something more future-proof, like file-type.

That is to say, I think it is sensible to either:

go with file-type everywhere (avoid maintaining "minimal list of types we support", ~5% bundle size increase does not sound like a lot when compared to UX/DX of content-type being taken care of)
OR skip setting content-type by default entirely, and only use it in gateway contexts, in which you use contentTypeParser to pass file-type that does the comprehensive magic bytes sniffing.

achingbrain commented 9 months ago

My feeling is that if we don't need to do content type sniffing then let's not do it.

If we need to do it, we should do the minimum required (e.g. just support detecting a small subset of content types) and provide an extension mechanism for more comprehensive detection.

Given that we're billing this as fetch-like, most people will just do .json(), .blob(), etc and get on with things which suggests that we don't need to detect content types - we just try to process the data as the requested type and fail loudly if we can't.

There are valid use-cases for content detection though (e.g. service worker gateway) so allowing users to configure a mime type sniffer if they need it seems like a good compromise.

2color commented 9 months ago

Mostly agree with @lidel and @achingbrain, though I don't have a strong inclination either way.

If we don't include magic-byte sniffing by default, it should be as easy as possible to configure so it works smoothly in service workers.

SgtPooki commented 9 months ago

Sounds good. I'll get a PR out today that will not set content-type unless passed a function for it.
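A rough sketch of the opt-in behaviour agreed on in this thread, assuming a hypothetical helper named buildResponseHeaders and a hypothetical parser signature. It is not the code from the PR; it only shows the rule that the header stays unset unless the caller supplies a parser and that parser recognizes the bytes.

```typescript
// Hypothetical parser signature; the shape actually used in the PR may differ.
type ContentTypeParser = (bytes: Uint8Array) => string | undefined

// Hypothetical helper: builds response headers from the first block of bytes.
function buildResponseHeaders (
  firstBlock: Uint8Array,
  contentTypeParser?: ContentTypeParser
): Record<string, string> {
  const headers: Record<string, string> = {}

  // No parser configured: do no sniffing at all.
  if (contentTypeParser == null) {
    return headers
  }

  const contentType = contentTypeParser(firstBlock)

  // Parser did not recognize the bytes: leave the header unset.
  if (contentType != null) {
    headers['content-type'] = contentType
  }

  return headers
}
```

A service worker gateway would pass a comprehensive sniffer here, while plain fetch-style consumers would pass nothing and call .json(), .text(), or .blob() themselves.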
