Do not encode utf8 in document.Accept(writer)

GoogleCodeExporter commented 9 years ago

I wrote some code to test reading and writing Chinese characters.

  // UTF-8 bytes of U+4E2D, null-terminated so EXPECT_STREQ can compare it as a C string
  char json_test_utf8_chinese[] = {(char)228, (char)184, (char)173, '\0'};
  const char* chinese_json = "{\"chinese\": \"\\u4e2d\"}";
  rapidjson::Document document;
  ASSERT_FALSE(document.Parse<0>(chinese_json).HasParseError());
  EXPECT_TRUE(document.HasMember("chinese"));
  LOG(INFO) << document["chinese"].GetString() << std::endl;
  EXPECT_STREQ(json_test_utf8_chinese, document["chinese"].GetString());

  rapidjson::StringBuffer s(0, 65535);
  rapidjson::Writer<rapidjson::StringBuffer> writer(s);
  document.Accept(writer);
  LOG(INFO) << "json size: " << s.GetSize() << " bytes";
  LOG(INFO) << s.GetString();
  google::FlushLogFiles(INFO);

The test passed, but the log showed that the Chinese characters were not encoded in 
"\uXXXX" format.

Log as below:

I1011 17:30:58.483409  2880 main.cpp:2076] 中
I1011 17:30:58.484410  2880 main.cpp:2082] json size: 11 bytes
I1011 17:30:58.484410  2880 main.cpp:2083] {"1":"中"}
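
For reference, the three bytes in json_test_utf8_chinese (228, 184, 173) are the UTF-8 form of U+4E2D, which is why the string comparison holds after the parser decodes the \u4e2d escape. A minimal sketch of that arithmetic follows; DecodeThreeByteUtf8 is a hypothetical helper written for illustration, not part of RapidJSON, and it assumes a well-formed three-byte sequence.

```cpp
#include <cstdint>

// Combine a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
// into its code point. Hypothetical helper; the caller must pass a
// well-formed sequence, no validation is done here.
std::uint32_t DecodeThreeByteUtf8(unsigned char b0, unsigned char b1, unsigned char b2) {
    return ((b0 & 0x0Fu) << 12)   // low 4 bits of the lead byte
         | ((b1 & 0x3Fu) << 6)    // 6 payload bits of the first continuation byte
         | (b2 & 0x3Fu);          // 6 payload bits of the second continuation byte
}
```

Calling DecodeThreeByteUtf8(228, 184, 173) gives 0x4E2D, the same code point written as \u4e2d in the input JSON.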

Original issue reported on code.google.com by zhangpei...@gmail.com on 11 Oct 2012 at 9:39

GoogleCodeExporter commented 9 years ago

I modified Writer::WriteString to support it.
=================================================================
  void WriteString(const Ch* str, SizeType length)  {
    static const char hexDigits[16] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
    static const char escape[256] = {
#define Z16 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
#define B16 'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b'
      //0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
      'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'b', 't', 'n', 'u', 'f', 'r', 'u', 'u', // 00
      'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', 'u', // 10
        0,   0, '"',   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, // 20
      Z16, Z16,                                    // 30~4F
        0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,'\\',   0,   0,   0, // 50
      Z16, Z16, // 60~7F
      B16, B16, B16, B16, B16, B16, B16, B16                // 80~FF
#undef Z16
#undef B16
    };

    os_.Put('\"');
    GenericStringStream<SourceEncoding> is(str);
    while (is.Tell() < length) {
      const Ch c = is.Peek();
      if ((sizeof(Ch) == 1 || (unsigned)c < 256) && escape[(unsigned char)c])  {
        os_.Put('\\');
        if (escape[(unsigned char)c] == 'u') {
          is.Take();
          os_.Put(escape[(unsigned char)c]);
          os_.Put('0');
          os_.Put('0');
          os_.Put(hexDigits[(unsigned char)c >> 4]);
          os_.Put(hexDigits[(unsigned char)c & 0xF]);
        }
        else if (escape[(unsigned char)c] == 'b') {
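          unsigned codepoint;
          if (!SourceEncoding::Decode(is, &codepoint)) {
            // Error: invalid byte sequence; the '\\' above has already been written
          }
          else
          {
            if (codepoint >= 0x10000) {
              // Beyond the BMP: write the lead surrogate, then the trail surrogate below.
              const unsigned s = codepoint - 0x10000;
              const unsigned lead = 0xD800 + (s >> 10);
              os_.Put('u');
              os_.Put(hexDigits[(lead & 0xf000) >> 12]);
              os_.Put(hexDigits[(lead & 0x0f00) >> 8]);
              os_.Put(hexDigits[(lead & 0x00f0) >> 4]);
              os_.Put(hexDigits[(lead & 0x000f) >> 0]);
              os_.Put('\\');
              codepoint = 0xDC00 + (s & 0x3FF);
            }
            os_.Put('u');
            os_.Put(hexDigits[(codepoint & 0xf000) >> 12]);
            os_.Put(hexDigits[(codepoint & 0x0f00) >> 8]);
            os_.Put(hexDigits[(codepoint & 0x00f0) >> 4]);
            os_.Put(hexDigits[(codepoint & 0x000f) >> 0]);
          }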
        }
        else {
          is.Take();
          os_.Put(escape[(unsigned char)c]);
        }
      }
      else
        Transcoder<SourceEncoding, TargetEncoding>::Transcode(is, os_);
    }
    os_.Put('\"');
  }
...
================================================

I think WriteString() should also return a bool to indicate whether the 
write succeeded.
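
The idea of reporting failure through a return value can be sketched outside RapidJSON. The standalone function below is an illustration only: EscapeNonAscii and AppendEscape are hypothetical names, the function works on std::string rather than RapidJSON streams, and it skips the quote and control-character escaping that a real JSON writer also needs. It also does not reject overlong UTF-8 forms.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append "\uXXXX" for one 16-bit code unit (hypothetical helper).
static void AppendEscape(std::string& out, std::uint32_t unit) {
    static const char hex[] = "0123456789ABCDEF";
    out += "\\u";
    out += hex[(unit >> 12) & 0xF];
    out += hex[(unit >> 8) & 0xF];
    out += hex[(unit >> 4) & 0xF];
    out += hex[unit & 0xF];
}

// Copy ASCII bytes through unchanged and escape every non-ASCII code point.
// Returns false on malformed UTF-8; `out` then holds a partial result.
bool EscapeNonAscii(const std::string& in, std::string& out) {
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        if (lead < 0x80) { out += in[i++]; continue; }

        std::size_t extra;   // number of continuation bytes expected
        std::uint32_t cp;    // code point being assembled
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;   // stray continuation byte or invalid lead byte

        if (in.size() - i <= extra) return false;   // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF) return false;            // outside the Unicode range
        i += extra + 1;

        if (cp >= 0x10000) {
            // Outside the BMP: JSON needs a surrogate pair of two escapes.
            const std::uint32_t v = cp - 0x10000;
            AppendEscape(out, 0xD800 + (v >> 10));
            AppendEscape(out, 0xDC00 + (v & 0x3FF));
        } else {
            AppendEscape(out, cp);
        }
    }
    return true;
}
```

With this shape the caller can stop and report an error as soon as the input turns out to be invalid, instead of silently producing broken output.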

Original comment by zhangpei...@gmail.com on 12 Oct 2012 at 1:51

GoogleCodeExporter commented 9 years ago

I think both raw Chinese characters and "\uXXXX" escapes are valid in UTF-8 JSON.

I propose two solutions.
1. Add an option to choose the "\uXXXX" format.
2. Add an encoding that cannot encode Unicode characters directly, such as ASCII.

Discussion is welcome.
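
As a rough sketch of how the second solution could work: the writer asks the target encoding whether it can represent a code point and falls back to "\uXXXX" only when it cannot. All names below (Utf8Target, AsciiTarget, CanEncode, NeedsUnicodeEscape) are hypothetical and chosen for illustration; they are not RapidJSON's API.

```cpp
#include <cstdint>

// Hypothetical target-encoding policies; names are illustrative only.
struct Utf8Target {
    // UTF-8 can represent every code point directly.
    static bool CanEncode(std::uint32_t) { return true; }
};

struct AsciiTarget {
    // ASCII can only represent code points below 0x80.
    static bool CanEncode(std::uint32_t codepoint) { return codepoint < 0x80; }
};

// A writer would emit \uXXXX exactly when the target cannot hold the code point.
template <typename Target>
bool NeedsUnicodeEscape(std::uint32_t codepoint) {
    return !Target::CanEncode(codepoint);
}
```

Under these assumptions, NeedsUnicodeEscape<AsciiTarget>(0x4E2D) is true while NeedsUnicodeEscape<Utf8Target>(0x4E2D) is false, so the choice of output format follows from the template argument instead of a separate runtime flag.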

Original comment by milo...@gmail.com on 14 Nov 2012 at 3:05

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

https://github.com/miloyip/rapidjson/pull/70

Original comment by milo...@gmail.com on 13 Jul 2014 at 5:14

Changed state: Started
