tj / node-querystring

querystring parser for node and the browser - supporting nesting (used by Express, Connect, etc)
MIT License

moved: Body parsing bug due to special characters/encoding? #7

Open tj opened 13 years ago

tj commented 13 years ago

I was working on a bookmarklet that, among other things, form-posts the title of whatever page you're on to my server running Express, and I'm seeing Connect's body parser choke on some pages from Amazon.

Here's a super simple test case:

https://gist.github.com/947895

Run that website locally, drag the bookmarklet to your toolbar, and click it on any of the provided Amazon links. You should see an error message like this one:

URIError: URI malformed
    at decodeURIComponent (native)
    at /usr/local/lib/node/.npm/qs/0.1.0/package/lib/querystring.js:28:18
    at Array.reduce (native)
    at /usr/local/lib/node/.npm/qs/0.1.0/package/lib/querystring.js:27:6
    at IncomingMessage.<anonymous> (/usr/local/lib/node/.npm/connect/1.3.0/package/lib/middleware/bodyParser.js:74:15)
    at IncomingMessage.emit (events.js:61:17)
    at HTTPParser.onMessageComplete (http.js:132:23)
    at Socket.ondata (http.js:1007:22)
    at Socket._onReadable (net.js:677:27)
    at IOWatcher.onReadable [as callback] (net.js:177:10)

This happens on Amazon pages where the title has special characters, like é or ü. You can change the title of an Amazon page (e.g. by setting document.title in the console) to just é, for example, and it will cause the bug.

I've done some investigating and can give you some more info, but at a high level, it seems that the browser in this case encodes the form differently than encodeURIComponent() does, which causes decodeURIComponent() — used by Connect's body parser — to choke.

For example, calling encodeURIComponent() on that é yields %C3%A9 everywhere, but what the server receives in the form body from these Amazon pages is %E9. Attempting to decodeURIComponent() on %E9 causes this error.
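The mismatch can be reproduced in isolation, without the bookmarklet. A minimal sketch for Node or a browser console; the helper name `tryDecode` is made up for illustration and is not part of qs or Connect:

```javascript
// é is U+00E9. UTF-8 encodes it as the two bytes C3 A9,
// while ISO-8859-1 (Latin-1) encodes it as the single byte E9.
var utf8Encoded = encodeURIComponent('\u00e9'); // '%C3%A9'

// Hypothetical helper: report the outcome instead of throwing.
function tryDecode(str) {
  try {
    return { ok: true, value: decodeURIComponent(str) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

var good = tryDecode('%C3%A9'); // valid UTF-8 sequence, decodes to é
var bad = tryDecode('%E9');     // lone E9 is not valid UTF-8, so URIError

console.log(utf8Encoded, good, bad);
```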

I tried making a sample page to reproduce this, but its form post matched encodeURIComponent(). I'm guessing the behavior on Amazon is related to the page's character encoding, but I haven't been able to confirm that, maybe because Express sends a Content-Type header that specifies utf-8.

All said, it seems that Connect's body parser shouldn't break on these encodings. Hope this info helps. Thanks!
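One way a parser could avoid throwing is to catch the URIError and keep the raw value. This is only a sketch of that idea, not what Connect or qs actually do; `safeDecode` and `parseFlat` are hypothetical names, and the parse is flat (no nesting support):

```javascript
// Hypothetical helper, not part of qs or Connect.
function safeDecode(str) {
  // Form bodies encode spaces as '+'.
  var s = str.replace(/\+/g, ' ');
  try {
    return decodeURIComponent(s);
  } catch (err) {
    // Leave input that is not valid UTF-8 percent-encoding as-is.
    if (err instanceof URIError) return s;
    throw err;
  }
}

// Flat parse of an application/x-www-form-urlencoded body.
function parseFlat(body) {
  return body.split('&').reduce(function (out, pair) {
    if (!pair) return out;
    var eq = pair.indexOf('=');
    var key = eq === -1 ? pair : pair.slice(0, eq);
    var val = eq === -1 ? '' : pair.slice(eq + 1);
    out[safeDecode(key)] = safeDecode(val);
    return out;
  }, {});
}

var parsed = parseFlat('title=Caf%E9&note=a+b%C3%A9');
console.log(parsed);
```

With this approach the Latin-1 value comes through undecoded (`Caf%E9`) instead of crashing the request, which leaves the application free to decide how to interpret it.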

tj commented 13 years ago

^ moved from senchalabs/connect

hokaccha commented 13 years ago

I ran into a similar case. When a form is POSTed as Shift_JIS, decodeURIComponent cannot decode the body.

That's because decodeURIComponent only handles UTF-8. Other charsets need an appropriate decoding function.

For example, this is a Shift_JIS decoder library: http://lightbox.on.coocan.jp/ecl_new.txt

How about code like this?
https://github.com/hokaccha/connect/commit/1f7c870ddbdf40978426d60754b31ca1f27e1df2
https://github.com/hokaccha/node-querystring/commit/8c0d5141cc92d385d9777c09f94412a49fffa0ce

Then:

var express = require('express');
express.bodyParser.qs.decoder = UnescapeSJIS;
...

However, I was not able to find an ISO-8859-1 decoder.
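ISO-8859-1 is simple enough to decode by hand, since each byte value maps directly to the Unicode code point with the same number. A minimal sketch; the name `decodeLatin1` is made up here, and a pluggable decoder hook like the one used above is only what the linked commits propose, so treat any wiring as an assumption:

```javascript
// Hypothetical ISO-8859-1 percent-decoder for form-encoded values.
function decodeLatin1(str) {
  return str
    .replace(/\+/g, ' ') // form bodies encode spaces as '+'
    .replace(/%([0-9a-fA-F]{2})/g, function (match, hex) {
      // In ISO-8859-1, byte 0xNN is code point U+00NN.
      return String.fromCharCode(parseInt(hex, 16));
    });
}

var decoded = decodeLatin1('Caf%E9+M%FCller');
console.log(decoded);
```

Note that browsers generally treat pages labeled ISO-8859-1 as windows-1252, which differs in the 0x80 to 0x9F range, so this sketch is only exact for bytes outside that range.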