nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org
Other
107.39k stars 29.51k forks source link

Unexpected Parse Error inside HTTP client #15582

Closed Ralle closed 6 years ago

Ralle commented 7 years ago

I am working on a web scraper in Node.js. It has worked fine for over 30000 URLs, but this one seemingly causes an issue in the HTTP stack inside Node.js.

The following code:

const http = require('http');
const url = require('url');

const date = parseInt(new Date / 1000, 10);
const link = `http://www.wc3c.net/attachment.php?attachmentid=33111&d=${date}`;
const options = url.parse(link);
options.headers = { 'Cookie': `bblastvisit=${date}` };

http.get(
    options,
    (res) => {
        const rawData = [];
        res.on('data', (chunk) => console.log('Data', chunk.length));
        res.on('end', () => console.log('Done'));
    }
);

Causes the following error:

events.js:182
      throw er; // Unhandled 'error' event
      ^

Error: Parse Error
    at Socket.socketOnData (_http_client.js:454:20)
    at emitOne (events.js:115:13)
    at Socket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:266:12)
    at readableAddChunk (_stream_readable.js:253:11)
    at Socket.Readable.push (_stream_readable.js:211:10)
    at TCP.onread (net.js:585:20)

The weird thing is, this request works fine in the browser and also in cURL, so I am expecting something to be wrong in the Node.js HTTP stack.

It also works if you change the attachment ID to something else. All IDs preceding it have worked.

8.5.0 and 6.9.1

macOS Sierra 10.12.6 (16G29)

mscdex commented 7 years ago

cURL is probably more lenient. Can you provide the actual URL that is problematic?

FWIW you should be able to catch that error by adding an 'error' handler on res IIRC.

Ralle commented 7 years ago

This is the URL, but you have to open it twice because the first view is an anti-hotlinking measure. http://www.wc3c.net/attachment.php?attachmentid=33111&d=1241580637

The measure is the bblastvisit cookie being a unix timestamp less than 120 seconds old.

neroux commented 7 years ago

The error seems to be _HPE_INVALIDCONSTANT. There were similar reports already at nodejs/node-v0.x-archive#4863

I couldnt fully debug it yet (especially the HTTP parsing part) but it seems it fails after it downloaded the response body. Maybe it got confused with the headers how it should determine the end of the stream and/or there is an off-by-one issue. All speculation so far though.

Following is the response header, particularly the empty content encoding stands out

HTTP/1.1 200 OK
Date: Sun, 24 Sep 2017 15:17:58 GMT
Server: Apache/2.2.22 (Fedora)
X-Powered-By: PHP/5.3.20
Content-Encoding:
Cache-control: max-age=31536000
Expires: Mon, 24 Sep 2018 15:17:58 GMT
Last-Modified: Wed, 06 May 2009 03:30:37 GMT
ETag: "33111"
Content-disposition: inline; filename="Longinus Spear.w3x"
Content-transfer-encoding: binary
Content-Length: 33358
Connection: close
Content-Type: unknown/unknown
mscdex commented 7 years ago

Ok I think what is happening here is that the Content-Length is incorrect. It seems to be 31 bytes or so fewer than the actual content, so node is probably trying to parse that last 31 bytes as another HTTP response. REDbot agrees with this.

neroux commented 7 years ago

@mscdex Is the file you downloaded smaller? When I download it the content length would match.

mscdex commented 7 years ago

@neroux No, it's larger. Content-Length reports 33358, but the body is ~33389 IIRC.

Also, the Content-Encoding header shouldn't even be sent in the response because we aren't sending an Accept-Encoding header in the request. The server (probably more specifically the PHP script) is just generating a bad response in many ways.

neroux commented 7 years ago

@mscdex Alright, the other way round then :). Well, in my case it does match. The file is 33,358 bytes when downloaded via a browser.

But that is what the browser did. Should the body actually be larger (and it is ignored by the browser) your earlier assumption - that the remaining data is considered a second response - might very well apply.

mscdex commented 7 years ago

@neroux It could very well be that the browser is just truncating anything after 33358 bytes. Browsers are typically very lenient as well.

bnoordhuis commented 6 years ago

Closing, seems this thread ran its course.

mattpr commented 4 years ago

FYI same issue can be reproduced if the content-type header in an http response provides an unknown/invalid/unsupported charset.

This may be fine behaviour. node's http client speaking the charset of the http response is probably a requirement for node's http client to properly handle the response. However a more specific error message (e.g. "server provided content in unknown charset: ____" would be more useful for troubleshooting when this comes up.

e.g. http response headers:

Content-Type: text/plain; charset=ISO-8859-2

will result in

Error: Parse Error
      at Socket.socketOnData (_http_client.js:442:20)
      at Socket.emit (events.js:198:13)
      at addChunk (_stream_readable.js:288:12)
      at readableAddChunk (_stream_readable.js:269:11)
      at Socket.Readable.push (_stream_readable.js:224:10)
      at TCP.onStreamRead (internal/stream_base_commons.js:94:17) bytesParsed: 0, code: 'HPE_CB_headers_complete'

while the following is fine

Content-Type: text/plain; charset=ISO-8859-1

(node v10.16.3)

mscdex commented 4 years ago

@mattpr Node doesn't care about Content-Type specifically, it only cares that valid header characters are used throughout the header. There must be something else that is different between the two responses that is causing the parse error.

sam-github commented 4 years ago

@mattpr can you provide the complete text of the response that does not parse?