jshttp / mime-db

Media Type Database
MIT License

Better way to manage extension priority for multiple types #20

Open dougwilson opened 9 years ago

dougwilson commented 9 years ago

We need a better way to manage extension priority for multiple types (i.e. provide an extension -> mime mapping).

We need this because, as we source from more places, other libraries can no longer build this mapping by iterating over the types and accumulating the extensions in a map: the types may not be in the right order of preference.

dougwilson commented 9 years ago

I am debating between two different APIs here:

  1. require('mime-db') will look the exact same, but there will be a new, non-enumerable property require('mime-db').extensions that will contain the extension map. Pro is backward-compatible with all the modules that depend on this library on npm. Con it's mixing things together.
  2. Just make require('mime-db') no longer be a direct mapping to the db.json file, but rather place it at a property (like require('mime-db').types). Pro better separation of concerns and fits ES6 modules a little better. Con requires a major version bump.
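A minimal sketch of the two export shapes, for comparison. The `typesDb` object is a made-up stand-in for db.json, and the contents of the extension map are placeholders; only the property names `extensions` and `types` come from the options above.

```javascript
// Made-up stand-in for the contents of db.json.
const typesDb = { 'text/example': { extensions: ['ex'] } };

// Option 1: the export still looks like db.json, with the extension map
// attached as a non-enumerable property so existing iteration is unaffected.
const option1Export = Object.assign({}, typesDb);
Object.defineProperty(option1Export, 'extensions', {
  value: { ex: 'text/example' },
  enumerable: false
});

// Option 2: the export becomes a wrapper with separate properties.
const option2Export = {
  types: typesDb,
  extensions: { ex: 'text/example' }
};

console.log(Object.keys(option1Export)); // only the mime types are listed
console.log(Object.keys(option2Export)); // 'types' and 'extensions'
```

With option 1, code that loops over `Object.keys(require('mime-db'))` keeps seeing only mime types; with option 2, every consumer has to switch to the `types` property.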

I'm putting this out here in case anyone has opinions on it.

/cc @Fishrock123 @jonathanong @broofa @hueniverse

broofa commented 9 years ago

My $.02: Prioritize at the source level, and be explicit about which source trumps which. E.g. I'd suggest:

  1. Apache
  2. Nginx
  3. Node community submitted extensions

One thing to consider is what happens if one of these datasets (e.g. Apache) is updated to cause a conflict. The priority here would ideally reflect the willingness of organizations to bring their dataset in line with what other orgs are doing.

dougwilson commented 9 years ago

@broofa yes, this is what I'm doing, but in order to provide this organized list for extension -> mime lookup (we already do this for mime -> data), we will have to actually provide the list in some way. Right now the modules on top of this are simply iterating over the db and building the list themselves, but of course that means their list is going to be ordered alphabetically by mime type, rather than by source preference (this is an issue specific to extension -> mime lookups).
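A minimal sketch of that naive approach, assuming a two-entry stand-in for db.json. The type names are made up, and `naiveExtensionMap` is a hypothetical helper, not something a real module exports.

```javascript
// Hypothetical two-entry stand-in for db.json; the type names are made up.
const sampleDb = {
  'application/example-a': { source: 'apache', extensions: ['foo'] },
  'text/example-b': { source: 'iana', extensions: ['foo'] }
};

// Naive reverse map: walk the types in key order and keep the first type seen.
function naiveExtensionMap(db) {
  const map = Object.create(null);
  for (const type of Object.keys(db)) {
    for (const ext of db[type].extensions || []) {
      if (!(ext in map)) map[ext] = type;
    }
  }
  return map;
}

const naiveMap = naiveExtensionMap(sampleDb);
console.log(naiveMap.foo); // application/example-a
```

The winner is decided purely by key order: `application/example-a` comes first in the object, so it takes `foo` regardless of which source each entry came from.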

What I'm currently contemplating is what the API of require('mime-db') should look like when providing this second mapping.

jonathanong commented 9 years ago

@dougwilson you mean something like

// require() style
import types from 'mime-db/types.json'
import extensions from 'mime-db/extensions.json'

// all at once
import { types, extensions } from 'mime-db'

// by default
import mime from 'mime-db'
mime.extensions
mime.types

jonathanong commented 9 years ago

i don't mind breaking changes :D

broofa commented 9 years ago

I'm confused about what the issue is here.

extension priority for multiple types

You mean, if the same extension appears in multiple type definitions? Is that allowed??? In node-mime I disallowed that. (i.e. node.types was not allowed to conflict with mime.types.) I would suggest you do the same here. That's part of the reason for having an explicit priority for the sources, so you can resolve such conflicts as part of building db.json.

Assuming the above, then I don't believe shipping an extensions map is necessary. mime-db already provides a per-type extensions list which is ordered. Just be explicit about the fact the order there is important: "First entry is the default extension for each type".

If your concern is that secondary modules may be building their own extensions map in inconsistent ways, then codify how it should be done as a separate module (mime-db-extensions) that provides the code for doing that.

One reason for this is that redundant data in datasets is presumably an anti-pattern. As long as the extension map can be built dynamically from what's currently in db.json, then I would think encouraging dependent module authors to do so would be a good thing.

... or am I missing something.

dougwilson commented 9 years ago

You mean, if the same extension appears in multiple type definitions?

Correct.

Is that allowed???

It sure is, even within the IANA database itself (which is what we're supposed to be mimicking here). For example, both http://www.iana.org/assignments/media-types/image/vnd.dvb.subtitle and http://www.iana.org/assignments/media-types/text/vnd.dvb.subtitle list .sub as their file extension. There are a bunch more than that, but I knew of that one off the top of my head.

In node-mime I disallowed that. (i.e. node.types was not allowed to conflict with mime.types.) I would suggest you do the same here. That's part of the reason for having an explicit priority for the sources, so you can resolve such conflicts as part of building db.json.

So, there is a fundamental disconnect between this module and some of the things that depend on it: this module provides a mime -> data mapping, but I know modules are trying to build an extension -> mime mapping out of this data. That doesn't work very well, since there is no fundamental reason multiple mime types cannot be mapped to the same file extension.

Just think legacy reasons: at one time, .coffee was text/coffeescript by the community. Now, there is an official IANA registration: application/vnd.coffeescript. From this library's point of view, both db['application/vnd.coffeescript'].extensions and db['text/coffeescript'].extensions should contain coffee, because both those mime types map to that extension.
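To see how often this happens in a given dataset, something like the following could list every extension claimed by more than one type. It is a sketch only: `coffeeDb` is a hand-written excerpt shaped like db.json, and `findSharedExtensions` is a hypothetical helper.

```javascript
// Hand-written excerpt shaped like db.json, based on the .coffee example.
const coffeeDb = {
  'application/vnd.coffeescript': { source: 'iana', extensions: ['coffee'] },
  'text/coffeescript': { extensions: ['coffee'] }
};

// Collect, for every extension, all the mime types that list it,
// then keep only the extensions listed by more than one type.
function findSharedExtensions(db) {
  const byExtension = new Map();
  for (const [type, entry] of Object.entries(db)) {
    for (const ext of entry.extensions || []) {
      if (!byExtension.has(ext)) byExtension.set(ext, []);
      byExtension.get(ext).push(type);
    }
  }
  return [...byExtension].filter(([, types]) => types.length > 1);
}

const shared = findSharedExtensions(coffeeDb);
console.log(shared); // one entry: 'coffee', claimed by both types
```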

Assuming the above, then I don't believe shipping an extensions map is necessary. mime-db already provides a per-type extensions list which is ordered. Just be explicit about the fact the order there is important: "First entry is the default extension for each type".

Yes, this is true, but you cannot take this list and reverse it into an extension -> mime list; it's a pure mime -> extension list.

If your concern is that secondary modules may be building their own extensions map in inconsistent ways, then codify how it should be done as a separate module (mime-db-extensions) that provides the code for doing that.

So, if you want, you can sort of do this by simply looking at the source value when building out your extension -> mime map and make sure you pick the ones out in the correct order by source. Right now it seems it's pretty much happening just alphabetically.

As far as a separate module goes, even doing it based off "source" is not good enough, because that's the source of the mime type, not the source of the mime -> extension mapping. The only way to do it as a separate module correctly is to duplicate this entire module and maintain two things doing the same basic task. The decision needs to be made directly when pulling down from IANA, Apache, etc.; waiting until after db.json is created is too late, because various data is gone (like, what source did the mime -> extension link come from?).

As long as the extension map can be built dynamically from what's currently in db.json, then I would think encouraging dependent module authors to do so would be a good thing.

So, TL;DR what I'm saying is that it's impossible to build a correct extension -> mime mapping from the current db.json format, because it doesn't contain the source for each extension in the list to be able to pick the right reverse mapping.

broofa commented 9 years ago

Just think legacy reasons: at one time, .coffee was ...

Ah, I see. Thanks for clarifying.

it's impossible to build a correct extension -> mime mapping from the current db.json format, because it doesn't contain the source for each extension in the list to be able to pick the right reverse mapping.

I see source properties for every mime type except the ones brought in from custom.json. Assuming undefined source == 'custom', is that sufficient information?

Regardless, the more I think about this, the more I suspect priority is going to be a matter of preference. Someone running on nginx may want those mappings to take precedence over Apache. And do you give IANA types precedence over custom.json types? Hard to say.

Fundamentally, creating extensions at build time requires you to pick one set of priorities. Is that a mistake? Should db.json remain a source-agnostic description of what mime information is out there, and leave prioritization to the secondary modules?

Aside: Following up on my idea of a separate module, what about an API as follows for allowing clients to specify priority?

// Build extension map with default prioritization (['iana', 'apache', 'nginx', 'custom'])
var extensions = require('mime-db-extensions').build();

// Build map with custom priorities
var extensions = require('mime-db-extensions').build(['custom', 'iana', 'nginx', 'apache']);

dougwilson commented 9 years ago

It still does not work, because db.json lost information: it does not contain the source for each extension, only the source for the mime.

Scenario 1

{
  "mime/type": {
    "source": "iana", // (because it's iana registered)
    "extensions": ["foo", "bar"]
  }
}

So in the above, foo was defined by IANA, but bar was defined by Apache. You cannot recover this information from db.json, and so you cannot build your extension -> mime map with the correct ordering according to your preference. Basically this is what I'm trying to solve. Suggestions?
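A sketch of what a priority-based builder would see in this scenario. `buildByTypeSource` is hypothetical, and `other/type` is an extra made-up entry added only to create a conflict over bar.

```javascript
// Scenario 1, plus a made-up Apache-sourced type that also lists "bar".
const scenarioDb = {
  'mime/type': { source: 'iana', extensions: ['foo', 'bar'] },
  'other/type': { source: 'apache', extensions: ['bar'] }
};

// Resolve conflicts using the only source available: the one on the mime type.
function buildByTypeSource(db, priority) {
  const rank = (source) => {
    const index = priority.indexOf(source);
    return index === -1 ? priority.length : index; // unknown sources sort last
  };
  const map = Object.create(null);
  for (const [type, entry] of Object.entries(db)) {
    for (const ext of entry.extensions || []) {
      const current = map[ext];
      if (!current || rank(entry.source) < rank(current.source)) {
        map[ext] = { type, source: entry.source };
      }
    }
  }
  return map;
}

const built = buildByTypeSource(scenarioDb, ['iana', 'apache', 'nginx']);
console.log(built.bar); // type 'mime/type', credited to source 'iana'
```

The builder credits bar to iana because that is the source recorded on `mime/type`, even though in this scenario the bar link itself came from Apache. With the current format there is nothing it could inspect to do better.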

Scenario 2

Also, this is becoming a problem and it's even harder to resolve, and we need a solution:

{
  "mime/type": {
    "source": "iana", // (because it's iana registered)
    "extensions": ["foo"]
  },
  "mime/type2": {
    "source": "iana", // (because it's iana registered)
    "extensions": ["foo"]
  }
}

Well... so there are two MIMEs that are IANA registered, but due to historical reasons, one has the officially registered extension, and the other has the traditional community extension from Apache or nginx. How can a library like mime-db-extensions determine this?

dougwilson commented 9 years ago

For anyone that's following along, I'm looking for at least two votes for doing one of the following:

1. expand the current extensions to add a source

Entries in db.json would look like the following:

{
  "mime/type": {
    "source": "iana", // this is the source for the _mime_
    "extensions": [
      {
        "source": "iana", // this is the source for the extension
        "name": "foo"
      },
      {
        "source": "apache", // this is the source for the extension
        "name": "bar"
      }
    ]
  }
}

2. add a second db to provide extension -> mime mappings

Entries in db.json would stay the same; entries in this new db would look like:

{
  "foo": {
    "source": "iana", // this is the source for the _extension_
    "type": "mime/type"
  },
  "bar": {
    "source": "apache", // this is the source for the _extension_
    "type": "mime/type"
  }
}

Both of these will help libraries trying to build a proper extension -> mime mapping. Neither solution allows clients to specify priority, though, because source is a single string and so loses information: a client cannot say to prefer nginx -> apache -> iana if the recorded source is only iana, because it has no way to know that the extension appeared in both nginx and iana. We would have to make source an array to enable that.
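As a rough sketch of how the two shapes relate, the option 2 db could be derived from option 1 entries with one fixed priority order. `buildExtensionDb` and the priority list are assumptions for illustration, and `mime/type2` is added only to create a conflict over foo.

```javascript
// Entries in the option 1 shape; mime/type2 is added to force a conflict.
const option1Db = {
  'mime/type': {
    source: 'iana',
    extensions: [
      { source: 'iana', name: 'foo' },
      { source: 'apache', name: 'bar' }
    ]
  },
  'mime/type2': {
    source: 'iana',
    extensions: [{ source: 'apache', name: 'foo' }]
  }
};

// Derive the option 2 shape, resolving conflicts by the extension's own source.
function buildExtensionDb(db, priority) {
  const rank = (source) => {
    const index = priority.indexOf(source);
    return index === -1 ? priority.length : index; // unknown sources sort last
  };
  const result = Object.create(null);
  for (const [type, entry] of Object.entries(db)) {
    for (const ext of entry.extensions) {
      const current = result[ext.name];
      if (!current || rank(ext.source) < rank(current.source)) {
        result[ext.name] = { source: ext.source, type };
      }
    }
  }
  return result;
}

const extensionDb = buildExtensionDb(option1Db, ['iana', 'apache', 'nginx']);
console.log(extensionDb.foo.type); // mime/type
console.log(extensionDb.bar.type); // mime/type
```

Here foo resolves to `mime/type` because its foo link is IANA-sourced, while the one on `mime/type2` is Apache-sourced. The limitation described above still applies: each link carries a single source string, so an extension that appears in several sources is only ever seen under one of them.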

jonathanong commented 9 years ago

either one is fine for me. how do you handle "default" mime types, though? just the first extension?

dougwilson commented 9 years ago

how do you handle "default" mime types, though? just the first extension?

I'm not sure what this means. Do you mean what is the "default mime type for a given extension"? for num 2, it's just a straight map. for num 1, it's out of the scope of this library, like it is today in v1.

jonathanong commented 9 years ago

oh fuck. what i want to know is, "what is the default extension for a mime type?"

dougwilson commented 9 years ago

what i want to know is, "what is the default extension for a mime type?"

Gotcha. So that is here already: it is db[type].extensions[0]. The mime-db of today only supports going type -> extension, not the other way around, but I don't think we have to have that limitation :)
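A small sketch of that lookup with a guard for types that have no extensions. `exampleDb` is a made-up excerpt shaped like db.json, and `defaultExtension` is a hypothetical helper.

```javascript
// Made-up excerpt shaped like db.json.
const exampleDb = {
  'text/example': { source: 'iana', extensions: ['ex', 'exm'] },
  'application/example-noext': { source: 'iana' }
};

// The default extension for a type is the first entry of its extensions list.
function defaultExtension(db, type) {
  const entry = db[type];
  if (!entry || !entry.extensions || entry.extensions.length === 0) {
    return undefined; // unknown type, or a type with no extensions
  }
  return entry.extensions[0];
}

console.log(defaultExtension(exampleDb, 'text/example')); // ex
console.log(defaultExtension(exampleDb, 'application/example-noext')); // undefined
```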

jonathanong commented 9 years ago

okay cool. that's all i'm worried about, so i would opt for num 1 unless num 2 handles that.

dougwilson commented 9 years ago

so i would opt for num 1 unless num 2 handles that.

Both provide type -> default extension; in fact, num 2 doesn't touch db.json format at all, as noted in the description :)

broofa commented 9 years ago

Is it appropriate for mime-db to make decisions about conflicts and ambiguity in the dataset? This is the question I'm wrestling with right now.

If the answer is, "no", then wouldn't it make sense for the db.json to retain as much information as possible so downstream projects can use that to act in whatever manner works best for them. And, I believe, that implies limiting the amount of restructuring that's done. I.e. have the data segregated by source at the top level, like so:

[
  {
    "source": "iana",
    "types": {
      "text/foo": {
        "extensions": [...],
        "compressible": true
      },  // etc... other types from IANA
    }
  },
  {
    "source": "apache",
    "types": {
      // etc... other types from Apache
    }
  },
  // etc ... data from other sources
]
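A sketch of how a downstream module might consume such a layout with its own priority order. `extensionMapFrom` is hypothetical, and the type names other than `text/foo` are made up.

```javascript
// Data segregated by source, in the shape shown above.
const datasets = [
  { source: 'iana', types: { 'text/foo': { extensions: ['foo'], compressible: true } } },
  { source: 'apache', types: { 'application/x-foo': { extensions: ['foo', 'bar'] } } }
];

// Visit sources in the caller's preferred order; the first claim on an extension wins.
function extensionMapFrom(allDatasets, priority) {
  const map = Object.create(null);
  for (const name of priority) {
    const dataset = allDatasets.find((candidate) => candidate.source === name);
    if (!dataset) continue; // the caller listed a source we have no data for
    for (const [type, entry] of Object.entries(dataset.types)) {
      for (const ext of entry.extensions || []) {
        if (!(ext in map)) map[ext] = type;
      }
    }
  }
  return map;
}

const ianaFirst = extensionMapFrom(datasets, ['iana', 'apache']);
const apacheFirst = extensionMapFrom(datasets, ['apache', 'iana']);
console.log(ianaFirst.foo);   // text/foo
console.log(apacheFirst.foo); // application/x-foo
```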

There are a couple of advantages to this:

If the answer is, "yes", then ... well... I'm not sure what to do. You can codify how sources should be prioritized ("IANA trumps Apache trumps nginx trumps custom"?) in the build script, but 1. that doesn't solve the problem of inconsistencies w/in a particular source and 2. consumers of mime-db may not agree with that prioritization.

You can make it configurable by downstream modules, but I'm having a hard time convincing myself that's what is needed. The three of us are probably the only people who really care about that debate at the moment. Everyone else probably feels more like, "just fix the damn problem and tell us how it should work." (If/when people have issues with what you decide, maybe they just hack a workaround into their project as needed?)

Or you can make a decision independently for each type/extension. But how/where are such decisions recorded? The fact that "mime/type > mime/type2" has to be recorded somewhere... it essentially becomes part of the custom dataset being maintained in this project. The biggest downside is that this increases the support overhead. You end up dealing with everyone's "this type/extension isn't what I expect!" issues... which is the main reason node-mime has an API for enhancing the mapping information per-project.

(Sorry, I know this doesn't narrow the problem down any, which probably isn't helpful... I'm just regurgitating some thoughts.)

dougwilson commented 9 years ago

I appreciate the feedback. We definitely cannot maintain a manual resolution map, unless someone is going to volunteer to go through all 1.7k entries and create this map and be available once a week to resolve issues from doing pulls. The intent is that everything from remote sources requires no manual intervention.

broofa commented 7 years ago

FWIW, https://github.com/broofa/mime-score is now a thing. It's my best attempt at the logic needed to resolve this issue. It prioritizes by (in decreasing priority) RFC "facet", source, type, and, lastly, string length. (The string length is of debatable merit, but very rarely comes into play)
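To illustrate the idea of comparing by several criteria in decreasing priority, here is a rough sketch. It is not the mime-score implementation: the rank values, the source order, the top-level type order, and all function names are assumptions made up for this example.

```javascript
// Assumed orderings; earlier entries are preferred. Not taken from mime-score.
const SOURCE_ORDER = ['iana', 'apache', 'nginx'];
const TOP_LEVEL_ORDER = ['application', 'text', 'image'];

const indexOrLast = (list, value) => {
  const index = list.indexOf(value);
  return index === -1 ? list.length : index;
};

// Assumed facet ranking based on the subtype prefix.
function facetRank(type) {
  const subtype = type.split('/')[1] || '';
  if (subtype.startsWith('vnd.')) return 1;
  if (subtype.startsWith('prs.')) return 2;
  if (subtype.startsWith('x-') || subtype.startsWith('x.')) return 3;
  return 0; // no facet prefix
}

// Compare by facet, then source, then top-level type, then string length.
function compareCandidates(a, b) {
  return facetRank(a.type) - facetRank(b.type)
    || indexOrLast(SOURCE_ORDER, a.source) - indexOrLast(SOURCE_ORDER, b.source)
    || indexOrLast(TOP_LEVEL_ORDER, a.type.split('/')[0])
       - indexOrLast(TOP_LEVEL_ORDER, b.type.split('/')[0])
    || a.type.length - b.type.length;
}

// The two IANA types that both list .sub, as mentioned earlier in the thread.
const subCandidates = [
  { type: 'image/vnd.dvb.subtitle', source: 'iana' },
  { type: 'text/vnd.dvb.subtitle', source: 'iana' }
];
const ranked = [...subCandidates].sort(compareCandidates);
console.log(ranked[0].type); // text/vnd.dvb.subtitle under the assumed orderings
```

Both candidates tie on facet and source, so the assumed top-level order decides. A different choice of orderings would give a different winner, which is the preference question raised earlier.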

Would it make sense to add this as a score to each mime-type entry? I'm happy to put up a PR if you think this is a good idea.

[Note: As I mentioned in the mime-types PR I just posted, I'm happy to transfer that module to jshttp if it simplifies maintenance concerns.]