Closed: zeusdeux closed this issue 3 years ago
HTTP headers are encoded as ISO-8859-1 as per the RFC, not UTF-8.
@pallas if that's the case and I see that both á and ø are included in the ISO-8859-1 character set, why aren't they parsed correctly?
Because they're decoded as ISO-8859-1 but whatever is generating them is actually encoding them as UTF-8. In Python,
>>> "áø".encode('utf-8').decode('iso-8859-1')
'Ã¡Ã¸'
>>> "áø".encode('utf-8') # the application is sending this...
b'\xc3\xa1\xc3\xb8'
>>> "áø".encode('iso-8859-1') # ...but it should be sending this
b'\xe1\xf8'
tl;dr: You can't put UTF-8 in HTTP headers, only ISO-8859-1. That's not a limitation of llhttp, it's part of the standard. The bytes above are not being parsed incorrectly by llhttp; the application generating them is using the wrong encoding.
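If the sender can't be fixed, a receiver that knows the bytes were really UTF-8 can undo the mis-decoding at the application level. This is only a rough sketch, assuming the string came out of an ISO-8859-1 decode; `repair_mojibake` is a hypothetical helper name, not something llhttp or node provides:

```python
def repair_mojibake(value: str) -> str:
    """Re-interpret an ISO-8859-1-decoded string as UTF-8 where possible.

    ISO-8859-1 maps each byte 0x00-0xFF to the code point of the same
    value, so encoding back to ISO-8859-1 recovers the original bytes.
    """
    try:
        raw = value.encode("iso-8859-1")
        return raw.decode("utf-8")
    except UnicodeError:
        # Either the string has characters outside ISO-8859-1, or the
        # bytes are not valid UTF-8; keep the original reading.
        return value


garbled = "áø".encode("utf-8").decode("iso-8859-1")
print(garbled)                   # Ã¡Ã¸
print(repair_mojibake(garbled))  # áø
```

Note that this is a guess about the sender's intent: a value that was genuinely ISO-8859-1 and happens to form valid UTF-8 would be changed too.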
Ah this makes a lot of sense! Thanks @pallas, not just for the explainer but for a TIL! Strings are weird 😅
No problem, happy to help!
Ah gotcha. Thanks!
I see the following passage in both RFC 5987 and RFC 8187:
However, RFC 2231 does not specify a mandatory-to-implement character encoding, making it hard for senders to decide which encoding to use. Thus, recipients implementing this specification MUST support the "UTF-8" character encoding [RFC3629].
Doesn't this imply llhttp should support UTF-8 encoded header values?
No, IMO that's out of scope for the HTTP parser. Q.v. rfc7230§3.2.4
Historically, HTTP has allowed field content with text in the ISO-8859-1 charset, supporting other charsets only through use of RFC2047 encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
Whatever the field value means is for the receiver to figure out anyway, except the ones llhttp already decodes. If you want your application to support encodings that are not ISO-8859-1 or US-ASCII, that's for it to support. Since obs-text can be opaque, it could be actively dangerous to try to decode it in llhttp if it is not RFC8187 encoded but is interpreted as though it were. The suggestion that RFC2047 has been used historically to encode field values is indicative of this, and the semantics of any field not specified by the RFC are at the application level.
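To illustrate why this can live in the application: an RFC 8187 ext-value is itself plain US-ASCII on the wire (charset, optional language tag, then percent-encoded octets), so the parser can pass it through untouched and the receiver decodes it afterwards. A minimal sketch, assuming the parameter value has already been extracted from the header; `decode_ext_value` is a hypothetical name and only UTF-8 is handled:

```python
from urllib.parse import unquote


def decode_ext_value(ext_value: str) -> str:
    """Decode an RFC 8187 ext-value such as UTF-8''%c3%a1%c3%b8.

    Layout: charset "'" [ language ] "'" percent-encoded-octets
    """
    # Raises ValueError if the two single quotes are missing.
    charset, _language, encoded = ext_value.split("'", 2)
    if charset.lower() != "utf-8":
        # RFC 8187 only mandates UTF-8; this sketch supports nothing else.
        raise ValueError(f"unsupported charset: {charset!r}")
    return unquote(encoded, encoding="utf-8", errors="strict")


print(decode_ext_value("UTF-8''%c3%a1%c3%b8"))  # áø
```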
Would the application in the case of llhttp be node? And do I understand correctly that llhttp provides access to header values as raw bytes which node can then choose to interpret as UTF-8 if it so chooses?
Yes, the callbacks provide access to the raw bytes. Note that if the buffers are split, a single field might produce multiple callbacks corresponding to separate underlying buffers.
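A practical consequence is that fragments should be joined as bytes before any decoding, since a multi-byte character can straddle two buffers. A sketch of that pattern in Python; the class and method names are hypothetical stand-ins for whatever the embedding application wires up, not llhttp's actual C callback API:

```python
class HeaderValueCollector:
    """Collects raw header-value fragments and decodes once complete."""

    def __init__(self) -> None:
        self._chunks = []  # raw bytes fragments, in arrival order

    def on_value_fragment(self, data: bytes) -> None:
        # May be called several times for one field if input buffers split.
        self._chunks.append(data)

    def on_value_complete(self, encoding: str = "iso-8859-1") -> str:
        # Join first, decode second: decoding each fragment on its own
        # would fail when a character is split across fragments.
        raw = b"".join(self._chunks)
        self._chunks.clear()
        return raw.decode(encoding)


collector = HeaderValueCollector()
# b"\xc3\xa1\xc3\xb8" (UTF-8 for "áø") split in the middle of a character.
collector.on_value_fragment(b"\xc3")
collector.on_value_fragment(b"\xa1\xc3\xb8")
print(collector.on_value_complete("utf-8"))  # áø
```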
Gotcha. So the way forward would be to get node to expose the header values as Buffers rather than just strings, as it does right now. Do I understand it correctly?
That sounds right to me.
Right, will open an issue there then. Thanks @pallas!
Hi folks! Am I wrong in understanding that llhttp, and thus node, only supports US-ASCII encoded values for all HTTP headers, custom or otherwise? For example,
x-some-header: áø
gets parsed as
x-some-header: Ã¡Ã¸