jshttp / mime-db

Media Type Database
MIT License

Missing extensions from vnd.comicbook+zip and vnd.comicbook-rar registered with IANA #321

Open rluetzner opened 7 months ago

rluetzner commented 7 months ago

The two MIME types have clearly defined file extensions.

https://www.iana.org/assignments/media-types/application/vnd.comicbook+zip
https://www.iana.org/assignments/media-types/application/vnd.comicbook-rar

However, I compared these with a few entries that do have extensions listed in src/iana-types.json. Unlike those, these two MIME type definitions give their file extensions in a numbered list, e.g.

Additional information:

1. Deprecated alias names for this type: application/x-cbr
2. Magic number(s): none
3. File extension(s): .cbz
4. Macintosh file type code: N/A
5. Object Identifiers: N/A

(excerpt from vnd.comicbook+zip). I guess the parsing logic needs to be adjusted to match this layout, but I'm not good enough with JS to do that myself.

rluetzner commented 6 months ago

Regexes make my brain hurt. However, I've figured out at least a few things.

  1. The layout from my summary above could be parsed with an older regex /^\s*(?:\d\.\s+)?File extension(?:\(s\)|s|)\s?:\s+(?:\*\.|\.|)([0-9a-z_-]+)\s*(?:\(|$)/im, which was replaced in commit be9ca41d sometime in 2018.
  2. Going by the new variable name and what I've seen, the old regex was not able to parse file extensions with quotes.
  3. The new regex to handle quotes does not work for file extensions that have no quotes, e.g. https://www.iana.org/assignments/media-types/application/atom+xml .
  4. Both regexes fail when multiple comma-separated file extensions are given.
  5. I have no idea how something like https://www.iana.org/assignments/media-types/application/mp4 is parsed, because the two file extensions are given in prose text.
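
Points 1, 2 and 4 can be checked quickly in Node.js. This is only a sketch: the regex is the older one quoted in point 1, while the sample lines and the `matchOld` helper are hand-written assumptions for illustration, not text fetched from IANA or code from mime-db.

```javascript
// The older regex quoted in point 1, copied verbatim from this comment.
const OLD_EXTENSION_REGEX = /^\s*(?:\d\.\s+)?File extension(?:\(s\)|s|)\s?:\s+(?:\*\.|\.|)([0-9a-z_-]+)\s*(?:\(|$)/im

// Hypothetical helper: returns the captured extension, or null on no match.
function matchOld (line) {
  const match = OLD_EXTENSION_REGEX.exec(line)
  return match ? match[1] : null
}

// Hand-written sample lines (assumptions, not real IANA declarations).
const numbered = matchOld('  3. File extension(s): .cbz')
const quoted = matchOld('File extension(s): ".example"')
const commaSeparated = matchOld('File extension(s): .cbr, .cbz')

console.log({ numbered, quoted, commaSeparated })
```

The numbered-list line yields `cbz`, while the quoted and the comma-separated lines do not match at all.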

I've played around a bit with a regex tester and was able to fix some of these things. Making the quotes optional in particular is quite easy. But I'm very uncertain as to how this will affect a full rebuild. There doesn't seem to be a clear scheme to the IANA MIME type declarations, so I don't think there's a way to handle all cases anyway.

rluetzner commented 6 months ago

For what it's worth, here's the regex I came up with that works with and without quoted file extensions:

/^\s*(?:\d\.\s+)?File extension(?:\(s\)|s|)\s?:[\s]*['"]?(?:\.?([0-9a-z_-]+))['"]?$/im

I used https://regex101.com/ to test things and copied the declaration for atom+xml and modified it manually.
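
For anyone who wants to reproduce this outside of regex101, here is a small Node.js sketch. The regex is the one above; the `matchProposed` helper and the sample lines are hand-written assumptions, not copies of the real IANA declarations.

```javascript
// The regex proposed above, copied verbatim.
const PROPOSED_EXTENSION_REGEX = /^\s*(?:\d\.\s+)?File extension(?:\(s\)|s|)\s?:[\s]*['"]?(?:\.?([0-9a-z_-]+))['"]?$/im

// Hypothetical helper: returns the captured extension, or null on no match.
function matchProposed (line) {
  const match = PROPOSED_EXTENSION_REGEX.exec(line)
  return match ? match[1] : null
}

// Hand-written sample lines (assumptions for illustration only).
const unquotedExt = matchProposed('File extension(s): .atom')
const quotedExt = matchProposed('File extension(s): ".atom"')
const numberedExt = matchProposed('3. File extension(s): .cbz')
const multipleExt = matchProposed('File extension(s): .cbr, .cbz')

console.log({ unquotedExt, quotedExt, numberedExt, multipleExt })
```

The first three lines each produce a single extension; the comma-separated line does not match.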

This does not work properly with multiple comma-separated file extensions, but neither do the other regexes, so I'd count it as an improvement.
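
One possible way around the comma limitation would be to split the work into two steps: capture the rest of the "File extension(s)" line first, then split it on commas and clean up each piece. The sketch below is only an illustration of that idea. `EXTENSION_LINE` and `parseExtensions` are hypothetical names, not part of mime-db, and the sample lines are hand-written.

```javascript
// Hypothetical regex: captures everything after "File extension(s):" on that line.
const EXTENSION_LINE = /^[ \t]*(?:\d\.[ \t]+)?File extension(?:\(s\)|s|)[ \t]?:[ \t]*(.*)$/im

// Hypothetical helper: returns an array of extensions without dots or quotes.
function parseExtensions (text) {
  const match = EXTENSION_LINE.exec(text)
  if (!match) return []

  return match[1]
    .split(',')
    .map((part) => part
      .trim()
      .replace(/^['"]|['"]$/g, '') // strip one surrounding quote on each side
      .replace(/^\*?\./, '') // strip a leading "." or "*."
      .toLowerCase())
    // Keep only pieces that look like a bare extension; prose is dropped.
    .filter((ext) => /^[0-9a-z_-]+$/.test(ext))
}

// Hand-written sample lines (assumptions for illustration only).
const multiple = parseExtensions('3. File extension(s): .cbr, .cbz')
const singleQuoted = parseExtensions('File extension(s): ".atom"')
const notApplicable = parseExtensions('File extension(s): N/A')

console.log({ multiple, singleQuoted, notApplicable })
```

This still would not handle extensions given in prose, and a value such as "none" would be returned as if it were an extension, so it would need extra filtering before being used in a full rebuild.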