Accept slashes in parameter values

gwicke commented 9 years ago

While the RFC is excluding slashes from tokens and thus parameter values, some existing content-type values do contain them, and existing content-type parsers are generally lenient when encountering them.

dougwilson commented 9 years ago

It's highly unlikely I'll accept violating the RFC, as that's the whole point of this module: to parse according to the RFC. Can you at least provide very detailed explanation why you want to violate the RFC?

If you are asking to be more lenient and implement a second parser that is, then that can probably happen. Please link to the source code of other such lenient parsers so we can implement their proven parsing algorithm.

gwicke commented 9 years ago

We have some stored content that has a content-type header with missing quoting in one of its parameters. Without the patch, content-type simply throws an exception when encountering this header. With the patch, the header is parsed correctly, which lets us use content-type to consistently format those headers before exposing them to the outside world.

Examples for lenient parsing in the wild:

Content-type parameter parsing code in Chrome, which mentions being based on similar code from Mozilla. Note that the tokenizer is only looking for delimiters & quoted strings, but does not care about the characters in unquoted strings.
Content-type parameter parsing code in Mozilla. Same principle, focused on delimiters.

Practical example:

Navigate to https://en.wikipedia.org/api/rest_v1/page/html/San_Francisco/632646397 using a web browser (I used Chrome).
In the network console, note that the raw returned content-type is text/html;profile=mediawiki.org/specs/html/1.1.0;charset=utf-8 (missing escaping around the profile parameter value).
In the JS console, check the parsed content type via

> document.contentType
"text/html"

dougwilson commented 9 years ago

Cool. Would you be willing to update your PR? It needs at least two changes:

There are no tests, so they would need to be added.
We will not change the current behavior, unless you are willing to wait an indeterminate amount of time for a major version bump. Plus, if we change this, we'll first need to implement a way in this module for Express.js and all it's related libraries to validate the Content-Type header, as it does today.

This means that either this change needs to be opt in or something else. Let me know what suggestions you have to implement this change in a backwards-compatible way, including keeping the current feature of this module which allows you to do strict validation. The whole reason this does strict validation and we use this in Express.js is to deter people from putting up web servers that serve invalid Content-Type headers in the first place, like that URL you provided is doing.

If we do want to add a non-strict parsing in here, it should at least be an exact copy of a long-standing algorithm, for example, it should perhaps parse exactly how that Chrome one does (and all an entire suite of tests to go along with this new parser, please).

dougwilson commented 9 years ago

Ideally, after the changes here, I can see you being able to do the "Content-Type invalid -> valid fixing" by doing contentType.format(contentType.parse(str, { loose: true })) or something like that.

dougwilson commented 9 years ago

In fact, thinking about it more, I'm open to perhaps making the non-strict parsing the default, but still, would want both strict and non-strict parsing supported with a switch (and, of course, tests).

gwicke commented 9 years ago

@dougwilson: An option sounds good to me, and I also think that implementing the non-strict algorithm as in Chrome & FF makes sense. It will likely also be faster.

I made a note to get back to this, but might not find the time right away.

dougwilson commented 9 years ago

Cool. Yea, this open PR will also put it on my list as well, so it'll likely end up being whichever one of us gets to it first, then :)

dougwilson commented 9 years ago

Also, this could be a temporary work-around to achieve the desired goal, without needing to maintain a custom fork, which should help with keeping timeline pressure at bay:

var contentType = require('content-type')
var rawParameterValueRegExp = /(; *[^=]+=)([^"][^;]*)/g

module.exports = function normalizeContentType(str) {
  try {
    return contentType.format(contentType.parse(str))
  } catch (err) {
    try {
      return contentType.format(contentType.parse(str.replace(rawParameterValueRegExp, '$1"$2"')))
    } catch (e) {
      throw err // original error
    }
  }
}

dougwilson commented 8 years ago

Hi @gwicke I hope it's OK if I close this PR; I know we discussed above what to do, and I feel like we did come up with a good plan: implement the same loose parsing algorithm as Chrome/Firefox/whatever and expose it through a parsing option (perhaps { lax: true }). I don't really want to keep this PR open forever, especially since it's not similar to what we think the ideal solution should be, so doesn't really have value staying open. Please feel free to resubmit a PR with the new implementation at any time :) !

jshttp / content-type

Accept slashes in parameter values #5