jetti777Ltd / mochiweb

Automatically exported from code.google.com/p/mochiweb

Coalesce UTF-16 surrogate pairs in mochijson2:decode #35

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. mochijson2:decode("{\"bar\":\"foo \\ud834\\udd1e\"}").

What is the expected output? What do you see instead?
That should not result in an error. Code point U+1D11E can validly be sent
as the UTF-16 surrogate pair 0xD834 0xDD1E.
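
For reference, the arithmetic behind that pair can be sketched as follows. This is a minimal illustration of the standard UTF-16 decoding formula, not the attached patch; the module name `surrogate_math` and the function `combine/2` are hypothetical.

```erlang
-module(surrogate_math).
-export([combine/2]).

%% Combine a high (lead) and a low (trail) UTF-16 surrogate into one
%% code point. Each surrogate carries 10 payload bits, and the result
%% is offset by 16#10000 because the BMP is encoded directly.
%% NOTE: hypothetical helper, no range checking is done here.
combine(High, Low) ->
    16#10000 + ((High - 16#D800) bsl 10) + (Low - 16#DC00).
```

With the values from this report, `surrogate_math:combine(16#D834, 16#DD1E)` evaluates to `16#1D11E`.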

Attached is a patch which needs some bounds checking on the hex character
ranges, but otherwise can parse out the code point in question.

Original issue reported on code.google.com by metrofindings@gmail.com on 5 May 2009 at 4:07

GoogleCodeExporter commented 8 years ago

Original comment by metrofindings@gmail.com on 5 May 2009 at 4:11

Attachments:

GoogleCodeExporter commented 8 years ago
This patch contains proper bounds checking: it coalesces a pair only when the
first escape falls in the D800-DBFF (high surrogate) range, as opposed to
combining any two \u escapes that appear together.
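
The kind of range check described here might look like the sketch below. It is an assumption about the general shape of such a check, not the contents of the attachment; `surrogate_check` and `decode_pair/2` are hypothetical names, and the sketch also validates the low surrogate (DC00-DFFF), which the comment above does not mention.

```erlang
-module(surrogate_check).
-export([decode_pair/2]).

%% Only coalesce when the first unit is a high surrogate (D800-DBFF)
%% and the second is a low surrogate (DC00-DFFF). Anything else is
%% reported as an error so the caller can treat the two \u escapes
%% as independent characters.
%% NOTE: hypothetical helper, not part of mochijson2.
decode_pair(High, Low) when High >= 16#D800, High =< 16#DBFF,
                            Low >= 16#DC00, Low =< 16#DFFF ->
    {ok, 16#10000 + ((High - 16#D800) bsl 10) + (Low - 16#DC00)};
decode_pair(_, _) ->
    {error, not_a_surrogate_pair}.
```

Under these assumptions `decode_pair(16#D834, 16#DD1E)` returns `{ok, 16#1D11E}`, while `decode_pair(16#0023, 16#0101)` returns `{error, not_a_surrogate_pair}`.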

Test with:
1> c(mochijson2).
2> mochijson2:decode("{\"foo\":\"\\ud834\\udd1e\"}").
3> mochijson2:decode("{\"foo\":\"\\u0023\\u0101\"}").

Original comment by metrofindings@gmail.com on 5 May 2009 at 9:40

Attachments:

GoogleCodeExporter commented 8 years ago
I should have checked the issues list before embarking on fixing this myself!
I ended up with a shorter patch that relies on xmerl_ucs to calculate the code
point for the surrogate pair, but is otherwise similar.
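
A rough sketch of the xmerl_ucs approach, assuming the two 16-bit units have already been parsed out of the `\uXXXX` escapes. The wrapper module `surrogate_xmerl` and its function names are hypothetical; `xmerl_ucs:from_utf16be/1` and `xmerl_ucs:to_utf8/1` are the OTP functions doing the work.

```erlang
-module(surrogate_xmerl).
-export([combine/2, to_utf8_bytes/2]).

%% Pack both units into a big-endian UTF-16 binary and let xmerl_ucs
%% decode it into a single code point.
%% NOTE: hypothetical wrapper; crashes with badmatch if the binary
%% does not decode to exactly one code point.
combine(High, Low) ->
    [CodePoint] = xmerl_ucs:from_utf16be(<<High:16/big-unsigned-integer,
                                           Low:16/big-unsigned-integer>>),
    CodePoint.

%% Convenience: return the UTF-8 bytes for the decoded code point.
to_utf8_bytes(High, Low) ->
    xmerl_ucs:to_utf8(combine(High, Low)).
```

This delegates the surrogate arithmetic to OTP instead of hand-written bit shifts, at the cost of depending on the xmerl application.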

This issue prevents CouchDB from replicating documents containing Unicode
characters outside the BMP, because encode() escapes them as surrogate pairs,
but decode() can't handle that format.
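
To illustrate the encode side of that mismatch, here is a sketch of how a code point outside the BMP becomes a pair of `\uXXXX` escapes. This is not mochijson2's actual encoder; `surrogate_escape`, `split/1` and `escape/1` are hypothetical names used for illustration only.

```erlang
-module(surrogate_escape).
-export([split/1, escape/1]).

%% Split a supplementary-plane code point into {High, Low} surrogates.
%% NOTE: hypothetical helper, only defined for 16#10000..16#10FFFF.
split(CP) when CP >= 16#10000, CP =< 16#10FFFF ->
    V = CP - 16#10000,
    {16#D800 bor (V bsr 10), 16#DC00 bor (V band 16#3FF)}.

%% Render the pair as two JSON \uXXXX escapes (lowercase hex,
%% zero-padded to four digits).
escape(CP) ->
    {High, Low} = split(CP),
    lists:flatten(io_lib:format("\\u~4.16.0b\\u~4.16.0b", [High, Low])).
```

For U+1D11E, `split/1` gives `{16#D834, 16#DD1E}` and `escape/1` gives the twelve-character string `\ud834\udd1e`, which is exactly the input that decode() rejected in this report.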

Original comment by adam.kocoloski@gmail.com on 5 Jun 2009 at 3:08

Attachments:

GoogleCodeExporter commented 8 years ago
applied in r108

Original comment by bob.ippo...@gmail.com on 28 Sep 2009 at 7:12