nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.74k stars 29.67k forks source link

Issue with cyrillic symbols in location headers #1693

Closed LavrovArtem closed 8 years ago

LavrovArtem commented 9 years ago

To reproduce this bug, execute the code below and open http://localhost:777/ in the browser. This page redirects to a page from the location header, and browser should open http://localhost:777/?text=что-то%20русское .

In node.js version 0.10.15 and earlier, it works fine, but io.js opens the http://localhost:777/?text=GB>-B>%20@CAA:>5 url.

I think that the issue is linked to the writeHead function that incorrectly encodes headers and sends them to the browser.

var http = require('http');

http.createServer(function(req, res) {
    if (req.url === '/')
        res.writeHead(302, {location:'?text=что-то русское'});

    res.end();
}).listen(777);
vkurchatkin commented 9 years ago

I'm not sure if this is a bug, but it is always better to encode url manually:

var http = require('http');

http.createServer(function(req, res) {
    if (req.url === '/')
        res.writeHead(302, {location:'?text=' + encodeURI('что-то русское')});

    res.end();
}).listen(7777);
bnoordhuis commented 9 years ago

As general advise, you're better off sticking to ASCII in HTTP headers. From RFC 7230:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset, supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset.

Neither MIME (RFC 2047) nor ISO-8859-1 are widely supported. ASCII is the safest bet.

vkurchatkin commented 9 years ago

Something like this happens: new Buffer('что-то русское', 'binary').toString()

LavrovArtem commented 9 years ago

I understand that it would be reasonable to encode cyrillic or go without using these symbols, but we are creating a proxy server, whereas even quite well-known sites are sending headers with cyrillic.

rlidwka commented 9 years ago

whereas even quite well-known sites are sending headers with cyrillic

Do you have an example of a such website? It might be useful to mention a real-life example in the respective test should that be written.

vkurchatkin commented 9 years ago

Adding res._send(''); forces headers to be flushed with correct encoding. Ironically, calling res.flushHeaders() doesn't really flush headers

LavrovArtem commented 9 years ago

@rlidwka Take https://market.yandex.ru/ If you type cyrillic letters into the search box (say, "телефон"), the header will be like http://screencast.com/t/cKzbSW0a

vkurchatkin commented 9 years ago

Another workaround:

var http = require('http');

http.createServer(function(req, res) {
    if (req.url === '/') {
      res.statusCode = 302;
      res.setHeader('location', '?text=что-то русское');
      res.flushHeaders();
    }

    res.end();
}).listen(7777);
ChALkeR commented 9 years ago

@vkurchatkin Does that mean that response.writeHead behaves differently from response.setHeader? How that could be expected?

ChALkeR commented 9 years ago

@LavrovArtem Yandex.Market gives properly encoded url in the Location header for me. What tool are you using on the screenshot? Maybe it decodes the Location header? Have you tried pressing the «[Raw]» button?

ChALkeR commented 9 years ago

Guess this link goes here: https://github.com/joyent/node/issues/25293

LavrovArtem commented 9 years ago

@ChALkeR I used Fiddler2 to view requests. Also, when I debug our proxy-server, I see the response coming from the original site with unencoded location header.

LavrovArtem commented 9 years ago

@vkurchatkin thank you for workarounds

vkurchatkin commented 9 years ago

@ChALkeR totally unexpected, at least for me. see #1695

jasnell commented 9 years ago

Good to have those workarounds but this definitely feels like something that should be fixed. Granted, best practice would be to escape prior to setting but writeHead shouldn't be acting this way

vkurchatkin commented 9 years ago

@jasnell the problem with writeHead is that it doesn't actually write anything to the socket and prevents flushHeaders from doing so. Can you take a look at #1695 ?

vkurchatkin commented 9 years ago

There is a similar problem with parsing:

var http = require('http');

http.createServer(function(req, res) {
  console.log(req.url);
  console.log(req.headers)
  res.end('');
}).listen(3000);

curl -H "X-Test: Тест" http://localhost:3000/?тест

On node 0.10:

/?тест
Тест

On io.js 3.2:

/?тест
ТесÑ

This is a seems like a major regression and I can't think of a proper workaround.

trevnorris commented 9 years ago

@bnoordhuis Quick note. ISO-8859-1 is what's returned by OneByteString(). This started back when v8 switched from String::WriteAscii() to String::WriteOneByte(). And since it would be extra overhead to convert the headers to US-ASCII, we simply leave them latin-1 encoded. Demonstrative example (check X-Test header):

var http = require('http');

http.createServer(function(req, res) {
  console.log(req.headers);
  res.end('bye\n');
}).listen(3000, makeRequest);

var options = {
  port: 3000,
  headers: { 'X-Test': 'Ä' },
};

function makeRequest() {
  http.request(options, function (res) { }).end();
}

Output:

{ 'x-test': 'Ä', host: 'localhost:3000', connection: 'close' }
Fishrock123 commented 9 years ago

Closing as per https://github.com/nodejs/node/pull/2629#issuecomment-140075580.

This is working as intended. The spec only allows ascii for security reasons. :S

ChALkeR commented 9 years ago

@Fishrock123 What about https://github.com/nodejs/node/issues/1693#issuecomment-101726168 then?

Fishrock123 commented 9 years ago

Hmm, I suppose that is an actual bug, but from my knowledge of internals quite difficult to reasonably (performantly) fix.

The entire http headers API (internals) needs a rewrite/cleanup.

trevnorris commented 9 years ago

Discrepancies I'm aware of:

The entire thing needs to be looked at.

Fishrock123 commented 9 years ago

Related: https://github.com/nodejs/node/issues/962

misterdai commented 8 years ago

Just encountered a problem with this. Wanted to apply some headers containing non-ascii characters, to a response that is sitting behind IIS (which is proxying requests). IIS threw a fit over the header values, eventually leading me to use the module mimelib to Mime word encode the header values before using response.setHeader().

ChALkeR commented 8 years ago

Wanted to apply some headers containing non-ascii characters

This is against the HTTP spec, I guess. Edit: no, I am incorrect. The restrictions are a bit different: http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2 http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2

jasnell commented 8 years ago

Just to be clear... RFC 7230 and it's additional docs replace RFC 2616:

https://tools.ietf.org/html/rfc7230#section-3.2 https://tools.ietf.org/html/rfc7230#section-1.2

The specific ABNF for Header field values is:

VCHAR (any visible [USASCII] character)

header-field   = field-name ":" OWS field-value OWS

field-name     = token
field-value    = *( field-content / obs-fold )
field-content  = field-vchar [ 1*( SP / HTAB ) field-vchar ]
field-vchar    = VCHAR / obs-text

obs-text = %x80-FF
obs-fold       = CRLF 1*( SP / HTAB )

Also note the comment in Section 3.2.4:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

Also, from Appendix A.2:

Non-US-ASCII content in header fields ... has been obsoleted and made opaque

For backwards compatibility, header fields are generally always supposed to be treated as ISO-8859-1 but anything outside of the VCHAR (visible US-ASCII) range is obsolete and treated as "opaque". Unfortunately the spec doesn't actually define what "opaque" means in this case. I assume that means the bytes are intended to be uninterpreted (meaning we're not supposed to decode them as UTF-8).

In other words, a header value must contain only bytes in the x20-xFF range, and should ONLY decode bytes x20-x7E while leaving bytes x80-xFF as "raw bytes". This range, of course, does happen to allow for valid UTF8 byte sequences. Interpretation of those, however, is considered "unspecified".

ChALkeR commented 8 years ago

The original testcase should be fixed in recent versions on all supported Node.js branches, it just throws errors now, as it should.

Any reason for keeping this open?

ChALkeR commented 8 years ago

I'm going to close this. Everyone, leave a comment here if you disagree and want this reopened for some reason.

The behaviour is consistent now. https://github.com/nodejs/node/issues/1693#issuecomment-101726168 — this «workaround» no longer works, because it shouldn't. Proper usage looks like this: https://github.com/nodejs/node/issues/1693#issuecomment-101659204 (or just use querystring).

https://github.com/nodejs/node/issues/1693#issuecomment-136427052 is also fixed.