haskell / aeson

A fast Haskell JSON library

"Invalid UTF-8 stream" error on valid characters #476

Closed: vshabanov closed this issue 7 years ago

vshabanov commented 7 years ago

Aeson fails to decode some valid characters, for example U+1F3FF:

> JSON.eitherDecode "\"\\ud83c\\udfff\"" :: Either String String
Left "Error in $: Failed reading: Cannot decode input: Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream"

I hacked together a test to find out which other valid characters are affected (https://gist.github.com/vshabanov/4653f07311fc61bc397cc53db98f2407). Here are the results:

10000;LINEAR B SYLLABLE B008 A;Lo;0;L;;;;;N;;;;;
10400;DESERET CAPITAL LETTER LONG I;Lu;0;L;;;;;N;;;;10428;
10800;CYPRIOT SYLLABLE A;Lo;0;R;;;;;N;;;;;
10C00;OLD TURKIC LETTER ORKHON A;Lo;0;R;;;;;N;;;;;
11000;BRAHMI SIGN CANDRABINDU;Mc;0;L;;;;;N;;;;;
11400;NEWA LETTER A;Lo;0;L;;;;;N;;;;;
11C00;BHAIKSUKI LETTER A;Lo;0;L;;;;;N;;;;;
12000;CUNEIFORM SIGN A;Lo;0;L;;;;;N;;;;;
12400;CUNEIFORM NUMERIC SIGN TWO ASH;Nl;0;L;;;;2;N;;;;;
13000;EGYPTIAN HIEROGLYPH A001;Lo;0;L;;;;;N;;;;;
133FF;EGYPTIAN HIEROGLYPH Z015E;Lo;0;L;;;;;N;;;;;
13400;EGYPTIAN HIEROGLYPH Z015F;Lo;0;L;;;;;N;;;;;
14400;ANATOLIAN HIEROGLYPH A001;Lo;0;L;;;;;N;;;;;
16800;BAMUM LETTER PHASE-A NGKUE MFON;Lo;0;L;;;;;N;;;;;
17000;<Tangut Ideograph, First>;Lo;0;L;;;;;N;;;;;
18800;TANGUT COMPONENT-001;Lo;0;L;;;;;N;;;;;
1B000;KATAKANA LETTER ARCHAIC E;Lo;0;L;;;;;N;;;;;
1BC00;DUPLOYAN LETTER H;Lo;0;L;;;;;N;;;;;
1D000;BYZANTINE MUSICAL SYMBOL PSILI;So;0;L;;;;;N;;;;;
1D400;MATHEMATICAL BOLD CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D7FF;MATHEMATICAL MONOSPACE DIGIT NINE;Nd;0;EN;<font> 0039;9;9;9;N;;;;;
1D800;SIGNWRITING HAND-FIST INDEX;So;0;L;;;;;N;;;;;
1E000;COMBINING GLAGOLITIC LETTER AZU;Mn;230;NSM;;;;;N;;;;;
1E800;MENDE KIKAKUI SYLLABLE M001 KI;Lo;0;R;;;;;N;;;;;
1F000;MAHJONG TILE EAST WIND;So;0;ON;;;;;N;;;;;
1F3FF;EMOJI MODIFIER FITZPATRICK TYPE-6;Sk;0;ON;;;;;N;;;;;
1F400;RAT;So;0;ON;;;;;N;;;;;
1F800;LEFTWARDS ARROW WITH SMALL TRIANGLE ARROWHEAD;So;0;ON;;;;;N;;;;;
20000;<CJK Ideograph Extension B, First>;Lo;0;L;;;;;N;;;;;
2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;
F0000;<Plane 15 Private Use, First>;Co;0;L;;;;;N;;;;;
100000;<Plane 16 Private Use, First>;Co;0;L;;;;;N;;;;;

There is a clear pattern here, and it becomes even clearer if you check every possible UTF-16 character (not only the ones assigned in Unicode 9): https://gist.github.com/vshabanov/4653f07311fc61bc397cc53db98f2407#file-output2-txt
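One way to see the pattern is to compute the UTF-16 surrogate pair for each failing code point. The sketch below is not taken from the gist; `surrogatePair`, `lowSurrogateOnEdge` and `reported` are names made up for illustration. It uses the standard UTF-16 encoding rule and checks whether the low surrogate lands exactly on U+DC00 or U+DFFF, the two ends of the low-surrogate range.

```haskell
import Data.Bits (shiftR, (.&.))

-- | Split a code point above U+FFFF into its UTF-16 (high, low) surrogates.
-- Returns Nothing for code points that need no surrogate pair or are out of range.
surrogatePair :: Int -> Maybe (Int, Int)
surrogatePair cp
  | cp < 0x10000 || cp > 0x10FFFF = Nothing
  | otherwise = Just (0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF))
  where
    v = cp - 0x10000

-- | True when the low surrogate sits on either end of the range DC00..DFFF.
-- (Hypothetical helper, named here for illustration only.)
lowSurrogateOnEdge :: Int -> Bool
lowSurrogateOnEdge cp = case surrogatePair cp of
  Just (_, lo) -> lo == 0xDC00 || lo == 0xDFFF
  Nothing      -> False

-- | A sample of the failing code points from the results listed above.
reported :: [Int]
reported =
  [0x10000, 0x133FF, 0x13400, 0x1D7FF, 0x1F3FF, 0x1F400, 0x2F800, 0x100000]
```

In GHCi, `surrogatePair 0x1F3FF` gives the pair (0xD83C, 0xDFFF), which matches the `\ud83c\udfff` escape in the failing example, and `all lowSurrogateOnEdge reported` is `True`. Every code point in the results either is a multiple of 0x400 or ends in 0x3FF modulo 0x400, so each one has a low surrogate at one end of the range.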

I suspect it's related to https://github.com/bos/aeson/blob/master/cbits/unescape_string.c, but I don't fully understand what this code does or how to fix it.

bergmark commented 7 years ago

Can you please try bisecting to check if it is due to that change?

vshabanov commented 7 years ago

Yes, it was caused precisely by commit https://github.com/bos/aeson/commit/2f24e555d86a36fdda6d4cad79976004b382ab3b. It turned out to be a simple off-by-one error. I've opened a pull request that fixes it: https://github.com/bos/aeson/pull/477.

The previous aeson UTF-16 decoder didn't handle the \uFFFF character (the only one it missed). The fixed decoder handles everything.
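To illustrate the kind of off-by-one involved, here is a minimal sketch in Haskell. It is not the code from `cbits/unescape_string.c` or from the pull request. It assumes the bug was a low-surrogate range check that used strict comparisons where inclusive ones were needed, which would be consistent with failures at U+DC00 and U+DFFF only. All names (`isLowSurrogateBuggy`, `isLowSurrogate`, `combine`) are hypothetical.

```haskell
import Data.Bits (shiftL)

-- | Assumed buggy check: strict comparisons wrongly reject both ends
-- of the low-surrogate range (DC00 and DFFF).
isLowSurrogateBuggy :: Int -> Bool
isLowSurrogateBuggy u = u > 0xDC00 && u < 0xDFFF

-- | Correct check: the low-surrogate range DC00..DFFF is inclusive.
isLowSurrogate :: Int -> Bool
isLowSurrogate u = u >= 0xDC00 && u <= 0xDFFF

-- | Combine a high and a low surrogate into a code point, using the
-- supplied low-surrogate predicate. Nothing signals an invalid pair.
combine :: (Int -> Bool) -> Int -> Int -> Maybe Int
combine isLow hi lo
  | hi >= 0xD800 && hi <= 0xDBFF && isLow lo =
      Just (0x10000 + ((hi - 0xD800) `shiftL` 10) + (lo - 0xDC00))
  | otherwise = Nothing
```

With these definitions, `combine isLowSurrogate 0xD83C 0xDFFF` yields the code point 0x1F3FF, while `combine isLowSurrogateBuggy 0xD83C 0xDFFF` yields `Nothing`, mirroring the decode failure reported at the top of this issue.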

bergmark commented 7 years ago

Released in v1.0.2.1!