szhambo / rapidjson

Automatically exported from code.google.com/p/rapidjson
MIT License

Encoding conversion #4

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
Currently, the input and output of Reader use the same encoding.

It is often necessary to read a stream in one encoding (e.g. UTF-8) and output 
strings in another (e.g. UTF-16), or conversely to stringify a DOM in one 
encoding (e.g. UTF-16) to an output stream in another (e.g. UTF-8).

The simplest solution is to convert the whole stream into a memory buffer of 
the other encoding. This requires additional memory storage and memory accesses.

Another solution is to convert the input stream into the other encoding before 
passing it to the parser. However, only the characters inside JSON strings 
actually need to be converted; converting the other characters just wastes 
time.

The third solution is to let the parser distinguish between the input and 
output encodings and use an encoding converter only for the characters of JSON 
strings. However, since the converted output may be longer than the original 
input, in situ parsing cannot be permitted in this case.
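
To make the third solution concrete, here is a minimal sketch of the decode-then-encode step a parser could apply to string contents when the input is UTF-8 and the output is UTF-16. The names `DecodeUtf8`, `EncodeUtf16` and `TranscodeUtf8ToUtf16` are hypothetical helpers written for illustration, not rapidjson's API, and the sketch assumes well-formed input with no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not rapidjson API): decodes one code point from
// well-formed UTF-8 starting at s[i] and advances i past it.
// No validation is done; malformed or truncated input is out of scope.
inline std::uint32_t DecodeUtf8(const std::string& s, std::size_t& i) {
    const unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                          // 1-byte sequence (ASCII)
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;  // continuation bytes
    std::uint32_t cp = c & (0x3F >> extra);          // payload bits of lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: appends one code point as UTF-16 code units,
// using a surrogate pair for code points above U+FFFF.
inline void EncodeUtf16(std::uint32_t cp, std::u16string& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Transcodes a whole string one code point at a time: decode with the
// source encoding, encode with the target encoding.
inline std::u16string TranscodeUtf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    for (std::size_t i = 0; i < utf8.size();)
        EncodeUtf16(DecodeUtf8(utf8, i), out);
    return out;
}
```

An ASCII character such as `A` takes one byte in UTF-8 but two bytes in UTF-16, so the transcoded string can outgrow the source buffer, which is why in situ parsing is ruled out here.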

The task is to design a mechanism that generalizes encoding conversion. It 
should support UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. It could also 
support automatic encoding detection via the BOM, at the cost of some overhead 
from dynamic dispatch.
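
As a rough illustration of BOM-based detection, the sketch below inspects the first bytes of a buffer and reports which of the five encodings its BOM indicates. `UtfType`, `BomResult` and `DetectBom` are hypothetical names chosen for this example, and falling back to UTF-8 when no BOM is present is an assumption, not something the issue specifies.

```cpp
#include <cstddef>

// Hypothetical enumeration of the five encodings named in the issue.
enum UtfType { kUtf8, kUtf16LE, kUtf16BE, kUtf32LE, kUtf32BE };

struct BomResult {
    UtfType type;           // detected encoding, or the fallback
    std::size_t bomLength;  // number of bytes to skip before the JSON text
    bool hasBom;            // false when no BOM was recognised
};

// Checks the leading bytes of [p, p + n) against the known BOM patterns.
inline BomResult DetectBom(const unsigned char* p, std::size_t n,
                           UtfType fallback = kUtf8) {
    // The UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE),
    // so the 4-byte patterns must be tested first.
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return {kUtf32LE, 4, true};
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return {kUtf32BE, 4, true};
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return {kUtf8, 3, true};
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return {kUtf16LE, 2, true};
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return {kUtf16BE, 2, true};
    return {fallback, 0, false};  // no BOM: caller decides what to do
}
```

Because the result is only known at run time, every later read has to branch or go through a function pointer on the detected type, which is the dynamic dispatch overhead mentioned above.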

Original issue reported on code.google.com by milo...@gmail.com on 26 Nov 2011 at 4:33

GoogleCodeExporter commented 9 years ago
Reader/Writer can now perform transcoding with Transcoder.
New EncodedInputStream decodes characters from a byte input stream.
New EncodedOutputStream encodes characters to a byte output stream.
New AutoUTFInputStream can be given a UTF encoding at runtime, or detect the 
UTF encoding from the beginning of the stream (BOM and RFC 4627). It then 
dynamically delegates operations to the actual UTF encoding.
New AutoUTFOutputStream can be given a UTF encoding at runtime and optionally 
writes a BOM.
New AutoUTF performs operations according to the UTF encoding type of the 
input/output stream.
All AutoXXX classes can handle UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE.
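
The runtime selection and delegation described above can be pictured with a small self-contained sketch of the output side. This is not the actual AutoUTFOutputStream implementation; `UtfType`, `PutCodePoint`, `PutBom` and the other helpers are hypothetical names, and the sketch simply switches on an encoding chosen at run time and writes the BOM as U+FEFF in that encoding.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical enumeration of the five supported encodings.
enum UtfType { kUtf8, kUtf16LE, kUtf16BE, kUtf32LE, kUtf32BE };

typedef std::vector<unsigned char> Bytes;

inline void PutByte(Bytes& out, std::uint32_t v) {
    out.push_back(static_cast<unsigned char>(v & 0xFF));
}

// Appends the `count` low-order bytes of v in the requested byte order.
inline void PutUnit(Bytes& out, std::uint32_t v, int count, bool littleEndian) {
    for (int i = 0; i < count; ++i)
        PutByte(out, v >> (8 * (littleEndian ? i : count - 1 - i)));
}

inline void PutUtf8(Bytes& out, std::uint32_t cp) {
    if (cp < 0x80) {
        PutByte(out, cp);
    } else if (cp < 0x800) {
        PutByte(out, 0xC0 | (cp >> 6));
        PutByte(out, 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        PutByte(out, 0xE0 | (cp >> 12));
        PutByte(out, 0x80 | ((cp >> 6) & 0x3F));
        PutByte(out, 0x80 | (cp & 0x3F));
    } else {
        PutByte(out, 0xF0 | (cp >> 18));
        PutByte(out, 0x80 | ((cp >> 12) & 0x3F));
        PutByte(out, 0x80 | ((cp >> 6) & 0x3F));
        PutByte(out, 0x80 | (cp & 0x3F));
    }
}

inline void PutUtf16(Bytes& out, std::uint32_t cp, bool littleEndian) {
    if (cp < 0x10000) {
        PutUnit(out, cp, 2, littleEndian);
    } else {  // encode as a surrogate pair
        cp -= 0x10000;
        PutUnit(out, 0xD800 + (cp >> 10), 2, littleEndian);
        PutUnit(out, 0xDC00 + (cp & 0x3FF), 2, littleEndian);
    }
}

// Runtime dispatch: the encoding is a value, not a template parameter,
// so each call pays for a branch on `type`.
inline void PutCodePoint(Bytes& out, UtfType type, std::uint32_t cp) {
    switch (type) {
    case kUtf8:    PutUtf8(out, cp); break;
    case kUtf16LE: PutUtf16(out, cp, true); break;
    case kUtf16BE: PutUtf16(out, cp, false); break;
    case kUtf32LE: PutUnit(out, cp, 4, true); break;
    case kUtf32BE: PutUnit(out, cp, 4, false); break;
    }
}

// The BOM is U+FEFF encoded in the target encoding.
inline void PutBom(Bytes& out, UtfType type) {
    PutCodePoint(out, type, 0xFEFF);
}
```

A caller would write the optional BOM once with `PutBom(out, type)` and then pass every code point through `PutCodePoint(out, type, cp)`. A compile-time encoding parameter avoids the per-call branch, which is the trade-off between the fixed-encoding streams and the AutoXXX classes.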

Original comment by milo...@gmail.com on 3 Dec 2011 at 4:43