nodejs / llhttp

Port of http_parser to llparse
http://llhttp.org

Custom HTTP headers with utf8 values #126

Closed zeusdeux closed 3 years ago

zeusdeux commented 3 years ago

Hi folks! Am I wrong in understanding that llhttp and thus node only supports US-ASCII encoded values for all http headers, custom or otherwise?

For example, x-some-header: áø gets parsed as x-some-header: Ã¡Ã¸

pallas commented 3 years ago

HTTP headers are encoded as ISO-8859-1 as per the RFC, not UTF-8.

zeusdeux commented 3 years ago

@pallas if that's the case and I see that both á and ø are included in the ISO-8859-1 character set, why aren't they parsed correctly?

pallas commented 3 years ago

Because they're decoded as ISO-8859-1 but whatever is generating them is actually encoding them as UTF-8. In Python,

>>> "áø".encode('utf-8').decode('iso-8859-1')
'Ã¡Ã¸'
>>> "áø".encode('utf-8') # the application is sending this...
b'\xc3\xa1\xc3\xb8'
>>> "áø".encode('iso-8859-1') # ...but it should be sending this
b'\xe1\xf8'

pallas commented 3 years ago

tl;dr: You can't put UTF-8 in HTTP headers, only ISO-8859-1. That's not a limitation of llhttp, it's part of the standard. The bytes above are not being parsed incorrectly by llhttp, the application generating them is using the wrong encoding.
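
When a receiver knows the sender really wrote UTF-8, the mis-decoded string can be reversed, because ISO-8859-1 maps every byte to exactly one code point and so loses nothing. A minimal sketch, assuming the header value has already been decoded as ISO-8859-1; the function name `recover_utf8` is hypothetical and not part of llhttp or node:

```python
def recover_utf8(latin1_decoded: str) -> str:
    """Undo an ISO-8859-1 decode of bytes that were really UTF-8.

    Assumption: the sender encoded the value as UTF-8. If it did not,
    the final decode raises UnicodeDecodeError.
    """
    # Step 1: map each code point back to the single byte it came from.
    raw = latin1_decoded.encode("iso-8859-1")
    # Step 2: interpret those bytes with the encoding the sender used.
    return raw.decode("utf-8")


# Reproduce the mis-decoding from the example above, then reverse it.
mojibake = "áø".encode("utf-8").decode("iso-8859-1")
print(recover_utf8(mojibake))  # prints áø
```

This only helps when the application can be sure of the sender's encoding; it is not something a parser can safely guess.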

zeusdeux commented 3 years ago

Ah this makes a lot of sense! Thanks @pallas, not just for the explainer but for a TIL! Strings are weird 😅

pallas commented 3 years ago

No problem, happy to help!

zeusdeux commented 3 years ago

Hi @pallas! Quick question — is the RFC you are alluding to rfc5987?

pallas commented 3 years ago

Yes, and rfc8187, which replaced it.

zeusdeux commented 3 years ago

Ah gotcha. Thanks!

I see the following passage in both rfc5987 and rfc8187:

However, RFC 2231 does not specify a mandatory-to-implement character encoding, making it hard for senders to decide which encoding to use. Thus, recipients implementing this specification MUST support the "UTF-8" character encoding [RFC3629].

Doesn't this imply llhttp should support UTF-8 encoded header values?

pallas commented 3 years ago

No, IMO that's out of scope for the HTTP parser. Q.v. rfc7230§3.2.4

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset, supporting other charsets only through use of RFC2047 encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

Whatever the field value means is for the receiver to figure out anyway, except the ones llhttp already decodes. If you want your application to support encodings that are not ISO-8859-1 or US-ASCII, that's for it to support. Since obs-text can be opaque, it could be actively dangerous to try to decode it in llhttp if it is not RFC8187 encoded but is interpreted as though it were. The suggestion that RFC2047 has been used historically to encode field values is indicative of this, and the semantics of any field not specified by the RFC are at the application level.
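
For context, RFC 8187 defines an encoding for individual header parameters (for example `filename*`), not for whole field values, which is why it sits at the application level. A rough sketch of that `charset'language'pct-encoded` form using only the Python standard library; the function names are hypothetical, the language tag is ignored, and only UTF-8 is handled:

```python
from urllib.parse import quote, unquote

# Characters RFC 8187 allows unescaped (attr-char), besides letters and digits.
ATTR_CHAR_EXTRA = "!#$&+-.^_`|~"


def encode_ext_value(text: str) -> str:
    """Build an ext-value with an empty language tag, e.g. UTF-8''%C3%A1."""
    return "UTF-8''" + quote(text, safe=ATTR_CHAR_EXTRA, encoding="utf-8")


def decode_ext_value(ext_value: str) -> str:
    """Decode charset'language'value; this sketch accepts UTF-8 only."""
    charset, _language, encoded = ext_value.split("'", 2)
    if charset.lower() != "utf-8":
        raise ValueError(f"unsupported charset: {charset}")
    # errors="strict" so that invalid UTF-8 is reported, not silently replaced.
    return unquote(encoded, encoding="utf-8", errors="strict")


print(encode_ext_value("áø"))
print(decode_ext_value("UTF-8''%c2%a3%20and%20%e2%82%ac%20rates"))
```

The second call decodes the example parameter value given in the RFC. A real implementation would also need to parse the parameter out of the surrounding field value, which this sketch leaves out.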

zeusdeux commented 3 years ago

Would the application in the case of llhttp be node? And do I understand correctly that llhttp provides access to header values as raw bytes which node can then choose to interpret as UTF-8 if it so chooses?

pallas commented 3 years ago

Yes, the callbacks provide access to the raw bytes. Note that if the buffers are split, a single field might produce multiple callbacks corresponding to separate underlying buffers.
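
One consequence of split buffers is that a multi-byte sequence can be cut in half, so fragments should be joined as bytes before any decoding. A minimal sketch of that accumulation pattern, assuming a fragment callback and a completion callback; the class and method names here are hypothetical Python stand-ins, not llhttp's C API:

```python
class HeaderValueCollector:
    """Collects raw header-value fragments until the value is complete."""

    def __init__(self):
        self._chunks = []  # list of bytes fragments, in arrival order

    def on_header_value(self, fragment: bytes) -> None:
        # May be called several times for one value; never decode here.
        self._chunks.append(fragment)

    def on_header_value_complete(self) -> bytes:
        # Join the fragments and reset for the next header.
        value = b"".join(self._chunks)
        self._chunks.clear()
        return value


# The UTF-8 bytes for "áø", split in the middle of the first character.
collector = HeaderValueCollector()
collector.on_header_value(b"\xc3")
collector.on_header_value(b"\xa1\xc3\xb8")
raw = collector.on_header_value_complete()

# Decoding is the application's choice, made on the whole value.
print(raw.decode("utf-8"))  # prints áø
```

Decoding `b"\xc3"` on its own would raise `UnicodeDecodeError`, which is why the sketch defers decoding until the value is whole.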

zeusdeux commented 3 years ago

Gotcha. So the way forward would be to get node to expose the header values as Buffers rather than just strings as it does right now, do I understand it correctly?

pallas commented 3 years ago

That sounds right to me.

zeusdeux commented 3 years ago

Right, will open an issue there then. Thanks @pallas!