mime-types / ruby-mime-types

Ruby MIME type registry library

How to use MIME::Types.type_for to accurately find the right mime type of file #161

Open HoneyryderChuck opened 2 years ago

HoneyryderChuck commented 2 years ago

Hi, I'm the maintainer of httpx, which optionally uses mime-types, when available, to identify the mime type of a file to be sent in a multipart request; this feature is itself inherited from a similar technique found in shrine.

An issue was recently reported to me: an upload of an mp4 video failed because the content-type set for the file was application/mp4, not video/mp4. The issue surfaces when mime-types is loaded, because application/mp4 is the first option returned:

MIME::Types.type_for("a.mp4") #=>  [#<MIME::Type: application/mp4>, #<MIME::Type: audio/mp4>, #<MIME::Type: video/mp4>, #<MIME::Type: video/vnd.objectvideo>]

Is there an alternative way to get a more accurate mime type? I'm tempted to remove the integration, but not without first exploring the options here. (cc @janko in case you have pointers based on your experience with shrine).

halostatue commented 2 years ago

From mime-types and the filename? Not at this moment. There’s not really a good way to specify the relative priority of types with respect to related types that can have the same extension(s)‡. From the name alone, we can’t know whether video.mp4 is really just audio/mp4 because it contains no video stream.
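Until the library can express such a priority, a caller that knows what it is uploading can impose its own ordering on the candidates. The following is only a sketch under that assumption: `pick_content_type` and `PREFERRED_MEDIA_TYPES` are hypothetical names, not part of mime-types. It works on content-type strings and calls `to_s` on each candidate, so the array returned by `MIME::Types.type_for` should be usable as input.

```ruby
# Hypothetical helper, not part of mime-types: the caller supplies the
# ordering because only the caller knows what kind of file it expects.
PREFERRED_MEDIA_TYPES = %w[video audio image text application].freeze

# Returns the candidate whose top-level media type ranks best in
# `preferred`. Ties keep the original candidate order. Returns nil for
# an empty candidate list.
def pick_content_type(candidates, preferred: PREFERRED_MEDIA_TYPES)
  best = candidates.map(&:to_s).each_with_index.min_by do |type, index|
    rank = preferred.index(type.split("/", 2).first) || preferred.size
    [rank, index]
  end
  best&.first
end

candidates = %w[application/mp4 audio/mp4 video/mp4 video/vnd.objectvideo]
puts pick_content_type(candidates)
puts pick_content_type(candidates, preferred: %w[audio video])
```

This only reorders what the registry already returns. It cannot tell whether a given .mp4 file really contains a video stream.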

From https://www.coolutils.com/Formats/MP4

Here are some file extensions used on files that contain data in the *.mp4 format:

  • .mp4: official extension, for audio, video and advanced content (see above) files
  • .m4a: for audio-only files; can safely be renamed to *.mp4, though opinions differ on the wisdom of this.
  • .m4p: FairPlay protected files
  • .mp4v, .m4v: video-only (sometimes also used for raw mpeg-4 video streams not in the *.mp4 container format)
  • .3gp, .3g2: used by 3G mobile phones, may also store content not specified directly in the *.mp4 specification (H.263, AMR, TX3G)

Marcel has limited ability to understand some magic type definitions from the data (it uses only a fraction of what is in the Tika mime file). Ideally, tools like httpx and shrine shouldn’t make an automatic decision about the type of data being handled, but should defer that decision to the application and its context, because the application might be able to look for magic bytes to determine what the .mp4 file actually contains.
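As an illustration of what such an application-side check could look like, here is a rough sketch that walks the box structure of an MP4 (ISO base media) file and looks at the track handler types. It is not how Marcel, httpx, or shrine work. `each_handler_type` and `sniff_mp4_content_type` are hypothetical names, and the sketch assumes the usual box layout (32-bit size plus four-character type, with `hdlr` boxes nested under `moov`/`trak`/`mdia`) and an IO that supports `size` and `seek`.

```ruby
require "stringio"

# Boxes that only contain other boxes on the path to a track's hdlr box.
CONTAINER_BOXES = %w[moov trak mdia].freeze

# Yields the handler type ("vide", "soun", ...) of every hdlr box found
# between the current position of `io` and byte offset `limit`.
def each_handler_type(io, limit, &block)
  while io.pos + 8 <= limit
    start = io.pos
    header = io.read(8)
    break if header.nil? || header.bytesize < 8

    size, type = header.unpack("Na4")
    if size == 1 # a 64-bit size follows the type
      extended = io.read(8)
      break if extended.nil? || extended.bytesize < 8
      size = extended.unpack1("Q>")
    elsif size.zero? # box runs to the end of the enclosing data
      size = limit - start
    end

    box_end = start + size
    break if size < 8 || box_end > limit # malformed or not an MP4

    if CONTAINER_BOXES.include?(type)
      each_handler_type(io, box_end, &block)
    elsif type == "hdlr"
      # version/flags (4 bytes), pre_defined (4 bytes), handler_type (4 bytes)
      payload = io.read(12)
      yield payload[8, 4] if payload && payload.bytesize == 12
    end
    io.seek(box_end)
  end
end

# Returns "video/mp4" if any track is a video track, "audio/mp4" if only
# sound tracks were found, and nil if no track handler could be read.
def sniff_mp4_content_type(io)
  io.rewind
  handlers = []
  each_handler_type(io, io.size) { |handler| handlers << handler }
  return "video/mp4" if handlers.include?("vide")
  return "audio/mp4" if handlers.include?("soun")

  nil
end
```

With a real file this could be called as `File.open(path, "rb") { |f| sniff_mp4_content_type(f) }`, where `path` is whatever the application is about to upload. When it returns nil, the caller could fall back to the extension-based lookup.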

I have some thoughts on how it might be possible to customize the priority of types by extension, but I’m not sure how this could be done in a backwards-compatible way (this affects the data).

I suspect that I could write something that parses the Tika data format and folds that into mime-types-data, but that would probably require some clarity on licensing (mime-types-data is nominally MIT, but is derived primarily from IANA data). I would probably not add this functionality into MIME::Types directly, as mime-types is optimized for standards conformance rather than performance. Discourse uses mini_mime for performance reasons, which is derived from mime-types-data, but that does not help here:

[1] pry(main)> MiniMime.lookup_by_extension('mp4')
=> #<MiniMime::Info:0x00000001208f0658 @content_type="application/mp4", @encoding="base64", @extension="mp4">