jshttp / content-disposition

Create and parse HTTP Content-Disposition header
MIT License
224 stars 43 forks source link

Generate ext-parameter when utf8 and latin1 representations don't match #27

Open jeremyspiegel opened 5 years ago

jeremyspiegel commented 5 years ago

This library will only generate the extended filename parameter if the filename is not representable in ISO-8859-1 (checking against NON_LATIN1_REGEXP).

However, some clients will not try to interpret the filename as ISO-8859-1. Chromium's Content-Disposition parsing code tries to interpret it as UTF-8, a referrer_charset, and the native OS default charset in turn. This ends up sometimes working for normal downloads in Chromium since it passes a referrer_charset that defaults to windows-1252. But this doesn't work for some other locales or cases where this referrer_charset isn't passed (like downloads in Electron).

It seems like it would be useful to generate the extended parameter when the utf8 and latin1 representations don't match.

dougwilson commented 5 years ago

Ah, I see. So the default module of this module is to follow the specification, as in just because Chrome does not follow the actual specification does not make the bug in this module, as of course there are other clients who do follow the spec.

What if instead of just generating a longer string when not actually required this module adds an option to allow you to always generate both, for example, so you can decide what to do (maybe sniff the user-agent header and only do it then, or unconditionally if that is what you want)?

dougwilson commented 5 years ago

Another potentional option is that this module could kick in the use of extension parameter if the filename falls outside of US-ASCII instead of ISO-8859-1, though in order to know if that would actually be effective, we would need a much better understanding of what exactly these buggy clients are doing with the bytes in the file name, as not all single byte character schemes match the US-ASCII set for the first 127 characters.

jeremyspiegel commented 5 years ago

I'll create a Chromium issue as well for this. However, the specification advises to use ASCII rather than ISO-8859-1:

Appendix D.  Advice on Generating Content-Disposition Header Fields

   To successfully interoperate with existing and future user agents,
   senders of the Content-Disposition header field are advised to:
   o  Avoid using non-ASCII characters in the filename parameter.
      Although most existing implementations will decode them as
      ISO-8859-1, some will apply heuristics to detect UTF-8, and thus
      might fail on certain names.

This strategy seems to be followed by Django, Flask, and Rails.

dougwilson commented 5 years ago

Right, probably the reason that is added to the appendix is because there are buggy clients out there who are not following the specification, as they are trying to account for buggy servers who are accidentally generated bad header values in the wrong character set. It is definitely a tricky problem, which is why I generally try to follow the actual standards, since they actually spell out the way to do it.

That is why I also provided the second suggestion above, though. The spec itself clearly spells out that the header is ISO-8859-1 character encoding, regardless of buggy client implementations. So this implementation is not buggy here.

I suggested that maybe we can add options so you can make the decisions you like to better control the behavior that would work better for your case (my first comment). Would that work for you?

jeremyspiegel commented 5 years ago

I appreciate your quick response, and I understand wanting to avoid sending extra data for misbehaving clients, but I think this module would be better if by default it followed the advice of the spec and the behavior of other popular web server frameworks. I understand if you disagree — an option would also be helpful. Thanks!

dougwilson commented 5 years ago

Right. Of course the RFC has specific wording for the spec, like MUST NOT and SHOULD NOT, etc. The appendix is not even a main point of the spec, relegated to a footnote, without even using the standard wording, so it's hard to draw concrete instructions from it.

So to be clear: I'm not saying I am not going to change anything, and your discussion on this is valuable, so please do not feel dismayed.

jeremyspiegel commented 5 years ago

I might be missing something, but I don't actually see the spec use MUST, SHOULD, MAY, etc. in regards to using ISO-8859-1. It seems to only mention ISO-8859-1 when talking about the "filename*" parameter, and how it allows characters outside the ISO-8859-1 character set. This does imply that the "filename" parameter can be ISO-8859-1, but it doesn't actually come out and say that anywhere else, so it seems a little ambiguous to me. However, I do think the appendix is clear about the recommendation to avoid non-ASCII characters.

Here is the Chromium bug I filed: https://bugs.chromium.org/p/chromium/issues/detail?id=1006345.

Thanks again for considering a change here!

dougwilson commented 5 years ago

I might be missing something, but I don't actually see the spec use MUST, SHOULD, MAY, etc. in regards to using ISO-8859-1. It seems to only mention ISO-8859-1 when talking about the "filename*" parameter, and how it allows characters outside the ISO-8859-1 character set. This does imply that the "filename" parameter can be ISO-8859-1, but it doesn't actually come out and say that anywhere else, so it seems a little ambiguous to me.

This is because since this spec is in relation to a HTTP header, the spec for HTTP itself is what declares header values as ISO-8859-1, so it only passively mentions it since it assumes that as prior knowledge to the specific Content-Disposition header specification. You can see they try to provide the needed link a few times in the spec, notably in https://tools.ietf.org/html/rfc6266#appendix-C

By default, HTTP header field parameters cannot carry characters outside the ISO-8859-1 ([ISO-8859-1]) character encoding (see [RFC2616], Section 2.2).

However, I do think the appendix is clear about the recommendation to avoid non-ASCII characters.

Yes, I do not disagree here. I was just noting it was relegated to an appendix, when it could have been made an actual rule by either not defining the ABNF of the the header as

     filename-parm       = "filename" "=" value
     value         = <value, defined in [RFC2616], Section 3.6>
                   ; token | quoted-string

in which case the value is defined in RFC2616 as

       value                   = token | quoted-string
       quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
       qdtext         = <any TEXT except <">>

with TEXT being

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO- 8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

TEXT           = <any OCTET except CTLs,
but including LWS>

or annotating in the main part of the spec that TEXT "MUST NOT" contain characters above 0x7f (which, also, they could have just not used the quoted-string ABNF from the RFC2616).

dougwilson commented 5 years ago

Looking around it seems like Firefox has a similar bug open: https://bugzilla.mozilla.org/show_bug.cgi?id=588409

jeremyspiegel commented 5 years ago

From the new spec:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

I guess Content-Disposition isn't a newly defined field so I'm not sure if this is necessarily relevant, but it's interesting to note.

dougwilson commented 5 years ago

Yea, the CD header existed well before the new specs were written, back in the historical days. The main reason why this module is emitting them was due to users who needed this (back when it was part of expressjs before extracted into it's own module).

Many languages are covered by latin1 and there are clients where when the extended filename parameter is included they fail to parse the header. So back then the compromise was to make it so the full usage of latin1 would result in just the plain filename parameter, allowing things like spanish, french, and more all to generate the header without the extended parameter being there and working for various clients.

There are many more clients out there than web browsers, for example both cURL and wget command line tools parse this header to determine the filename to save to disk.

This is a hard balancing act and so that I don't have to read every client implementation out there, the compromise was fairly straight forward: just follow the spec and since it allows the usage of iso-8859-1 (latin1) in the basic filename parameter it made everyone happy. Apparently for many mant years until now...? Haha

dougwilson commented 5 years ago

And, ultimately, nothing in the specs seems to be pointing to Chrome is doing the correct thing. At best, the specs admit that there are buggy clients out there (like Chrome, I guess).

jeremyspiegel commented 5 years ago

Fair enough, thank you for that explanation.

jeremyspiegel commented 5 years ago

A Chromium developer has responded to the bug:

I'm skeptical that changing behavior here is worth the investment and dealing with the inevitable breakage, unless we want to reach consensus with other browsers. https://tools.ietf.org/html/rfc7230 pretty clearly bans non-ASCII characters in HTTP headers:

 header-field   = field-name ":" OWS field-value OWS

 field-name     = token
 field-value    = *( field-content / obs-fold )
 field-content  = field-vchar [ 1*( SP / HTAB ) field-vchar ]
 field-vchar    = VCHAR / obs-text

From https://tools.ietf.org/html/rfc5234#appendix-B.1 VCHAR = %x21-7E

So we're dealing with violations of the HTTP spec, and the question is just how to deal with them. (obs-fold is related to line folding, and considered obsolete - regardless, not relevant here).