ByteString vs Text

xich commented 11 years ago

Need to carefully examine where it is appropriate to use ByteString and where it is appropriate to use Text. For instance, headers currently return Text values... but are all HTTP headers encode-able as Text?

hdgarrood commented 10 years ago

Are all HTTP headers encode-able as Text?

Is this the same as asking if it's possible to construct a valid HTTP header from any two Text values representing the name and the value? The answer appears to be no, because header names must be tokens: ASCII (0..127) characters, excluding control characters and separators. The value can contain any data, but characters outside ISO-8859-1 are only allowed when encoded in accordance with RFC 2047, and I don't know if HTTP clients can be expected to support it.

Here's the grammar for an HTTP header, from http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2:

message-header = field-name ":" [ field-value ]
field-name     = token
field-value    = *( field-content | LWS )
field-content  = <the OCTETs making up the field-value
                        and consisting of either *TEXT or combinations
                        of token, separators, and quoted-string>

And relevant definitions (all from http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.1):

OCTET         = <any 8-bit sequence of data>
CHAR          = <any US-ASCII character (octets 0 - 127)>
CR            = <US-ASCII CR, carriage return (13)>
LF            = <US-ASCII LF, linefeed (10)>
SP            = <US-ASCII SP, space (32)>
HT            = <US-ASCII HT, horizontal-tab (9)>
CRLF          = CR LF
LWS           = [CRLF] 1*( SP | HT )
token         = 1*<any CHAR except CTLs or separators>
separators    = "(" | ")" | "<" | ">" | "@"
              | "," | ";" | ":" | "\" | <">
              | "/" | "[" | "]" | "?" | "="
              | "{" | "}" | SP | HT
quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext        = <any TEXT except <">>
quoted-pair   = "\" CHAR
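
As a rough illustration of the token rule above, here is a minimal Haskell sketch that checks whether a Text value could serve as a field-name. The names `isTokenChar` and `isValidHeaderName` are hypothetical and are not part of scotty. The sketch also assumes RFC 2616's definition of CTL (octets 0 to 31, plus DEL at 127), which is not quoted in the list above.

```haskell
import Data.Char (ord)
import qualified Data.Text as T

-- | The separators from the grammar above, including SP and HT.
separators :: [Char]
separators = "()<>@,;:\\\"/[]?={} \t"

-- | A token character: a US-ASCII CHAR (0..127) that is neither a CTL
-- (assumed here to be 0..31 and 127, per RFC 2616) nor a separator.
isTokenChar :: Char -> Bool
isTokenChar c = n > 31 && n < 127 && c `notElem` separators
  where
    n = ord c

-- | Hypothetical helper: True when the Text is a non-empty run of token
-- characters, i.e. matches 'field-name = token'.
isValidHeaderName :: T.Text -> Bool
isValidHeaderName name = not (T.null name) && T.all isTokenChar name
```

Under these assumptions, a name like `Content-Type` passes, while an empty name, a name containing a space, or a name containing a non-ASCII character such as `é` is rejected.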

The TEXT rule is only used for descriptive field contents and values that are not intended to be interpreted by the message parser. Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

TEXT = <any OCTET except CTLs,
       but including LWS>

A very abridged summary of RFC 2047: encoded text looks like this:

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

where 'encoding' is either 'Q' (Quoted-Printable) or 'B' (base64). Examples:

The following are examples of message headers containing 'encoded-word's:

   From: =?US-ASCII?Q?Keith_Moore?= <moore@cs.utk.edu>
   To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>
   CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIRARD@vm1.ulg.ac.be>
   Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
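
To make the encoded-word shape concrete, here is a small Haskell sketch of the 'Q' encoding. It is a simplification under stated assumptions: it always uses UTF-8 as the charset (the examples above use ISO-8859-1 and US-ASCII), it escapes every octet that is not an ASCII letter or digit, and it does not enforce RFC 2047's 75-character limit on an encoded-word. The names `qEncodeOctet` and `encodeWordQ` are hypothetical.

```haskell
import Data.Char (chr, intToDigit, isAlphaNum, isAscii, toUpper)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Q-encode one octet: ASCII letters and digits pass through, a space
-- becomes "_", and everything else becomes "=XX" in uppercase hex.
qEncodeOctet :: Word8 -> String
qEncodeOctet w
  | isAscii c && isAlphaNum c = [c]
  | w == 32                   = "_"
  | otherwise                 = ['=', hex (w `div` 16), hex (w `mod` 16)]
  where
    c   = chr (fromIntegral w)
    hex = toUpper . intToDigit . fromIntegral

-- | Hypothetical helper: wrap a Text value as a single UTF-8 'Q'
-- encoded-word. No length limit is enforced.
encodeWordQ :: T.Text -> String
encodeWordQ t =
  "=?UTF-8?Q?" ++ concatMap qEncodeOctet (B.unpack (TE.encodeUtf8 t)) ++ "?="
```

For example, `encodeWordQ (T.pack "Keith Moore")` gives `=?UTF-8?Q?Keith_Moore?=`, and `é` comes out as `=C3=A9` because it occupies two octets in UTF-8, rather than the single `=E9` seen in the ISO-8859-1 examples.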

hdgarrood commented 10 years ago

That doesn't necessarily mean that setHeader should take ByteStrings, though. Maybe the solution is simply to call T.encodeLatin1 on the arguments to setHeader instead of T.encodeUtf8 (which seems to be what currently happens).
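
To show what that would change on the wire, here is a hedged sketch of a Latin-1 encoder next to the UTF-8 one. `encodeLatin1Maybe` and `bothEncodings` are hypothetical helpers written for illustration; the sketch does not assume that the text package provides an encoder under the name used above. It returns Nothing for characters above U+00FF instead of silently truncating them.

```haskell
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: encode as ISO-8859-1, or Nothing if any
-- character lies above U+00FF and so has no Latin-1 representation.
encodeLatin1Maybe :: T.Text -> Maybe B.ByteString
encodeLatin1Maybe t
  | T.all (<= '\xFF') t = Just (B8.pack (T.unpack t))
  | otherwise           = Nothing

-- | The same value under both encodings, as raw octets:
-- (Latin-1 if representable, UTF-8).
bothEncodings :: T.Text -> (Maybe [Word8], [Word8])
bothEncodings t =
  (B.unpack <$> encodeLatin1Maybe t, B.unpack (TE.encodeUtf8 t))
```

For a value such as `André`, the Latin-1 form ends in the single octet 233, while the UTF-8 form ends in the two octets 195 and 169. A value containing `€` (U+20AC) has no Latin-1 form at all, so the caller would have to decide whether to reject it or fall back to something like RFC 2047.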

scotty-web / scotty

ByteString vs Text #42