mjackson / mach

HTTP for JavaScript
810 stars 41 forks source link

Mach 1.0 truncating UTF characters #53

Closed appsforartists closed 9 years ago

appsforartists commented 9 years ago

I'm getting a bunch of garbage characters in my app. I reduced it down to the following test:

var mach = require("mach");

var stack = new mach.stack();
stack.use(mach.charset, "utf-8");

stack.route(
    "*",
    function (connection) {
        connection.html("<!DOCTYPE html><body>This should be a triangle: \"▼\"</body>");
    }
);

mach.serve(stack, 8089);

Here's the hex result, as viewed through Charles:

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Date: Fri, 14 Nov 2014 21:36:45 GMT
Connection: keep-alive
Transfer-Encoding: chunked

00000000  3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e 3c   <!DOCTYPE html><
00000010  62 6f 64 79 3e 54 68 69 73 20 73 68 6f 75 6c 64   body>This should
00000020  20 62 65 20 61 20 74 72 69 61 6e 67 6c 65 3a 20    be a triangle: 
00000030  22 bc 22 3c 2f 62 6f 64 79 3e                     " "</body>      

As you can see, 0x25bc is being returned as 0xbc. In the browser, it is rendered as char code 0xbc if I use no charset and 0xfffd if I do. (0xfffd is literally "replacement character" - e.g. the browser doesn't know WTF it's supposed to be).

I've never used Charles to convert an HTTP response to hex before, but based on what I'm seeing, it looks like only the last two characters of a four-character hex code are being returned:

expected received
0x 25 bc 0x bc
0x 11 cd 0x cd
0x 12 34 0x 34

Any idea what would cause this? I'm certain it worked before the 1.0 push.

mjackson commented 9 years ago

Hmm, this could be due to the fact that we're now using bodec to encode binary data instead of node's Buffer module directly. I did this because I wanted a way to do binary data in the browser without importing a ton of code that we don't need (ala Browserify's Buffer shim).

Maybe this has something to do with that?

mjackson commented 9 years ago

@appsforartists Are you still seeing this issue? If so, I'd love to get a failing test case for it and squash this bug.

appsforartists commented 9 years ago

Yeah I am. I just installed a fresh copy of all my npm dependencies (including mach@1.0.2) on the server, and I still see it. Oddly, I don't always see it on my local machine. Could be worth investigating if there are any differences between the two. I'll let you know.

appsforartists commented 9 years ago

I looked into it. The npm stuff is all identical. The difference is that on the dev server, I inline the scripts (so they are passed through mach). On my local machine, they are served by Webpack, which serves them correctly.

FWIW, here's the result of npm list:

├─┬ mach@1.0.2
│ ├── bodec@0.1.0 (git://github.com/mjackson/bodec#1c31ee1311969f4471e3582cdcefe1a93b711288)
│ ├─┬ bufferedstream@3.0.7
│ │ ├── bodec@0.1.0 (git://github.com/mjackson/bodec#1c31ee1311969f4471e3582cdcefe1a93b711288)
│ │ └── events@1.0.2
│ ├── describe-property@1.0.1
│ ├── object-assign@2.0.0
│ ├── qs@2.3.3
│ ├── redis@0.11.0
│ ├── strftime@0.8.2
│ └── when@3.6.4
appsforartists commented 9 years ago

(FWIW, bodec is now tagged at 1.0.0, so you can update your dependency.)

nicholascloud commented 9 years ago

I am having a similar problem. If I read a JSON file with unicode characters, parse it on the server, then send it to the browser with conn.json(), the browser cannot parse the response. If I serve the file up directly with mach.file, however, everything works fine.

Here is a tiny app that demonstrates the problem: https://dl.dropboxusercontent.com/u/22414/mach-unicode-testcase.zip

Just npm install then browse to http://localhost:5000/index.html

EDIT: To clarify, in the first case the response is delivered to the browser with HTTP status 200, but the body can't be parsed with JSON.parse().

mjackson commented 9 years ago

@appsforartists @nicholascloud thanks for the report. i'll look into this today.

mjackson commented 9 years ago

@appsforartists @nicholascloud Just FYI, this is fixed and published in version 1.1.0. The issue was with how I was treating strings in the BufferedStream library as binary by default, instead of utf-8. Silly me.

Thanks for your help!

appsforartists commented 9 years ago

Thanks @mjackson. Might I recommend augmenting your Message unit tests with non-ASCII characters to prevent regressions? :smiley:

mjackson commented 9 years ago

@appsforartists excellent idea.

nicholascloud commented 9 years ago

@mjackson Excellent :exclamation: